The second assessment for the module ‘Data Management & Analytics’ was ‘Data Analytics with R’. The assessment is based on the programming language R, and its first requirement was to complete the online course ‘Try R’, which is run by Code School.
The course is broken down into the following 7 levels:
(1) Using R: the basics of R expressions, creating and accessing variables, and calling functions, plus running pre-made scripts and accessing R’s help functionality.
(2) Vectors: a vector is simply a list of values. I looked at the basics of manipulating vectors (creating and accessing them, doing math with them, and making sequences), at making bar plots and scatter plots with vectors, and at how R treats vectors where one or more values are not available.
(3) Matrices: a matrix is a 2-dimensional array. I looked at how to create matrices from scratch, how to re-shape a vector into a matrix, how to access values within a matrix one-by-one or in groups, and at ways to visualise a matrix’s data.
(4) Summary Statistics: these show how your data points are distributed without the need to look closely at each one. I looked at the functions for mean, median, and standard deviation, as well as ways to display them on graphs.
(5) Factors: these help divide data into groups. I saw how to create them, and how to use them to make plots more readable.
(6) Data Frames: with massive data sets you need powerful tools to organise them, and data frames are what R gives you for that. I learned how to create and access data frames, how to load frames in from files, and how to cobble multiple frames together into a new data set.
(7) Real-World Data: I looked at real-world data sets, tested whether they are correlated using cor.test, and saw how to show that correlation on plots with a linear model.
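To give a flavour of what the seven levels cover, here is a minimal sketch with one small example per level. It is not taken from the ‘Try R’ course itself: all variable names and values are my own illustrations, and level 7 uses the `cars` data set that ships with R rather than any data from this assessment.

```r
# Level 1: expressions, variables and function calls
answer <- 6 * 7
total <- sum(1, 3, 5)

# Level 2: vectors, sequences and missing values
scores <- c(4, 8, 15, NA, 23)
halves <- seq(5, 7, by = 0.5)               # 5.0 5.5 6.0 6.5 7.0
scores_mean <- mean(scores, na.rm = TRUE)   # the NA is dropped before averaging

# Level 3: matrices
grid <- matrix(1:12, nrow = 3, ncol = 4)    # filled column by column
cell <- grid[2, 3]                          # value at row 2, column 3
reshaped <- 1:6
dim(reshaped) <- c(2, 3)                    # re-shape a vector into a 2 x 3 matrix

# Level 4: summary statistics
values <- c(2, 4, 4, 4, 5, 5, 7, 9)
values_mean <- mean(values)
values_median <- median(values)
values_sd <- sd(values)                     # sample standard deviation

# Level 5: factors
sizes <- factor(c("small", "large", "small", "medium"))
size_levels <- levels(sizes)                # levels are sorted alphabetically
size_counts <- table(sizes)                 # how many values fall in each group

# Level 6: data frames, including joining two frames on a shared column
left <- data.frame(id = 1:3, name = c("a", "b", "c"))
right <- data.frame(id = c(1L, 3L), score = c(10, 30))
joined <- merge(left, right, by = "id")     # keeps only ids present in both

# Level 7: correlation test and linear model on the built-in cars data set
ct <- cor.test(cars$speed, cars$dist)
fit <- lm(dist ~ speed, data = cars)
plot(cars$speed, cars$dist, xlab = "Speed", ylab = "Stopping distance")
abline(fit)                                 # draw the fitted line over the scatter plot
```

Running `print(ct)` or `summary(fit)` afterwards shows the full test and model output.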
The image below is a screen grab showing that I completed the R course!
Software and Data Sets
Once I had completed the online course, I needed to install the R software. I first installed R but, while checking out some YouTube tutorials, I saw that RStudio had a nicer interface, so I deleted R and downloaded and installed RStudio for Mac.
With the software up and running and a few online tutorials completed, I needed some interesting data to use as a ‘use case’ with RStudio. I previously undertook a course in criminal and forensic psychology, so I decided to head down the macabre path of data sets related to serial killers, specifically serial murder in the United States of America. Being the legit academic that I am, I took the information from the serial killer database at Radford University, located in Radford, Virginia, United States of America. This is an up-to-date database: ‘As of September 4, 2016, the database contains information on 4,743 serial killers and 13,105 victims of serial killers. The statistics in this report are based on the serial killer definition derived by the FBI at its 2005 symposium in San Antonio, Texas : The unlawful killing of two or more victims by the same offender(s), in separate events.’
The data set I looked at was ‘Serial Killer Frequency by Decade’ (see Figure 1 below).
For the purpose of what I was attempting, I was only interested in the ‘Decade’ and ‘US’ columns. However, I created a spreadsheet with the full data set in Microsoft Excel and saved it as a csv file. Before you can start playing with the data and trying different types of graphs, you must first import the csv data file into RStudio (see Figure 2 below).
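For anyone who prefers to do the import in code rather than through the RStudio menus, a minimal sketch might look like the following. The file name is hypothetical, and so that the example runs on its own it first writes a tiny stand-in csv containing only the ‘Decade’ and ‘US’ columns and the two figures quoted in this post (768 and 669); the real file has more rows and columns.

```r
# Hypothetical file name; in practice this would point at the csv saved from Excel.
csv_path <- file.path(tempdir(), "serial_killers_by_decade.csv")

# Stand-in file so the sketch is self-contained. Only the two values quoted
# in this post are included; the real data set has more rows and columns.
writeLines(c("Decade,US",
             "1980s,768",
             "1990s,669"),
           csv_path)

# Import the csv into a data frame, keeping the decade labels as text.
killers <- read.csv(csv_path, header = TRUE, stringsAsFactors = FALSE)

# Keep only the two columns of interest.
us_by_decade <- killers[, c("Decade", "US")]

str(us_by_decade)   # inspect the column types before plotting
```

With the real file, only `csv_path` would change; the `read.csv` call and the column selection stay the same as long as the column headings are ‘Decade’ and ‘US’.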
The following graphs are what came out of the experimentation and learning curve. I’ll then attempt to show what information can be gleaned from the data set.
The most glaring bit of information from the data set is that in the 1980s the serial killer frequency stands at 768, followed by the 1990s, where it stands at 669. From the 1900s to the 1950s the figure does not rise above 72, and then from the 1960s onwards we see a substantial increase, peaking in the 1980s.
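As a rough illustration of how a graph like this can be produced, here is a minimal bar plot sketch. It is not the exact code behind my graphs: the data frame below holds only the two decade figures quoted above, and the titles, labels and colour are assumptions chosen for the example.

```r
# Only the two figures quoted in the text; the full data set covers more decades.
killers <- data.frame(Decade = c("1980s", "1990s"),
                      US = c(768, 669),
                      stringsAsFactors = FALSE)

# Bar plot of frequency by decade; barplot() returns the bar midpoints.
mids <- barplot(killers$US,
                names.arg = killers$Decade,
                main = "Serial Killer Frequency by Decade (US)",
                xlab = "Decade",
                ylab = "Frequency",
                col = "firebrick")

# Find the peak decade and the drop between the two decades in code.
peak_decade <- killers$Decade[which.max(killers$US)]
drop_80s_to_90s <- killers$US[1] - killers$US[2]
```

Using the imported data frame instead of the hand-typed one would extend the plot to every decade in the csv file.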
What other ideas/concepts could be represented via R Graphics if I had more time?
Well, I would like to attempt to show the correlation between serial killing in popular culture of the 1970s, 1980s and 1990s (i.e. serial killers in cinema, music and books) and the actual real-life crimes of serial killing. Is there a link between the two? Did the increase in numbers demonstrated in my graphs have anything to do with the public interest in the subject in pop culture? (All US based.)
Code School, (2016). ‘Try R’. Available at: https://www.codeschool.com/courses/try-r (Accessed Nov. 2016).
Maamodt.asp.radford.edu, (2016). ‘Serial Killer Statistics’. Available at: http://maamodt.asp.radford.edu/Serial%20Killer%20Information%20Center/Serial%20Killer%20Statistics.pdf (Accessed Nov. 2016).