2  Working with data in R

2.1 By the end of this section, you will be able to:

2.2 In this section, we will use the Tidyverse set of packages

  • A ‘toolkit’ of packages that are very useful for organising and manipulating data
  • We will use the haven package to import SPSS files
  • We will use the dplyr package to organise data
  • Also includes the ggplot2 and tidyr packages, which we will use later

To install:

install.packages("tidyverse")

(See the previous section on installing packages)
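
Installing only needs to happen once, but a package has to be loaded in every new R session before its functions can be used. The following is a minimal sketch, assuming an internet connection is available for the first-time install:

```r
# Install the tidyverse only if it is not already installed
if (!requireNamespace("tidyverse", quietly = TRUE)) {
  install.packages("tidyverse")
}

# Load the core tidyverse packages (dplyr, ggplot2, tidyr, readr and others)
library(tidyverse)
```

Note that haven is installed along with the tidyverse but is not loaded by library(tidyverse), so it needs its own library(haven) call when importing SPSS files.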

2.3 Import data into R from Excel, SPSS and csv files

We can import data from a range of sources using the Import Dataset button in the Environment tab:

Importing data

It is also possible to import data using code, for example:

# importing a .csv file

library(readr)
studentData <- read_csv("Datasets/studentData.csv")

# importing an SPSS file

library(haven)
mySPSSData <- read_sav("datasets/salesData.sav")

Once the data are imported, they will be visible in the environment:

Imported data in the environment
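
To try the import functions without having a data file to hand, one option is to write a small dataset to a temporary file and read it back in. This is only a sketch: the data values and the tempPath name are made up for illustration.

```r
library(readr)

# Hypothetical data, written to a temporary file so the example runs anywhere
tempPath <- tempfile(fileext = ".csv")
write_csv(data.frame(participant = 1:3, happiness = c(9.5, 10.2, 11.1)), tempPath)

# Import the file; show_col_types = FALSE hides the column-type message
studentData <- read_csv(tempPath, show_col_types = FALSE)
studentData

# Excel files can be read in a similar way with the readxl package,
# e.g. read_excel("path/to/file.xlsx"), where the path is a placeholder
```

After running this, studentData appears in the environment in the same way as data imported with the Import Dataset button.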

2.4 Restructuring and reorganising data in R (long versus wide data)

2.5 Understanding objects in R

In R, an object (a word that identifies and stores the value of some data for later use) is anything that is saved to memory. For example, we might do some analysis:

mean(happiness)

However, in the example above, the result would appear in the console but not be saved anywhere. To store the result for reuse later, we save it to an object:

happinessMean <- mean(happiness)

In the above code (reading left to right):

  • We name the object “happinessMean”. This name can be almost anything we want, as long as it is a valid R name (for example, it cannot contain spaces).
  • The arrow means that the result of the code on the right will be saved to the object on the left.
  • The code on the right of the arrow calculates the mean of the happiness data.

When this code is run, happinessMean will be stored in the environment window:

Storing the result of a calculation in the environment

To recall an object from the environment, we can simply type its name. For example:

 happinessMean
[1] 10.29769

It's important to note that anything can be stored as an object in R and recalled later. This includes dataframes, the results of statistical calculations, plots, etc.
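
As a small illustration of storing and reusing objects, the sketch below uses a made-up happiness vector (the values are hypothetical and are not the dataset used in this section):

```r
# Hypothetical happiness scores
happiness <- c(9.5, 10.2, 11.1)

# Store the result of a calculation in an object
happinessMean <- mean(happiness)

# Recall the object by typing its name
happinessMean

# Stored objects can be reused in later calculations
happinessCentred <- happiness - happinessMean
happinessCentred

# List the objects currently saved in the environment
ls()
```

Assigning to a name that already exists overwrites the old value without any warning, so distinct names are usually safer.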

2.6 Identify different data structures and variable types

2.6.1 Data structures (sometimes referred to as “data containers”)

A data structure aggregates data; examples include a vector, list, matrix, or data frame. There are many different types of data structures that R can work with. The most common type for most people tends to be a data frame. A data frame is what you might consider a “normal” 2-dimensional dataset, with rows of data and columns of variables:

A dataframe example

R can also use other data structures.

A vector is a one-dimensional set of values:

# a vector example

scores <- c(1,4,6,8,3,4,6,7)
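
Once a vector exists, individual values can be picked out by position, and functions can be applied to the whole vector at once. A brief sketch using the scores vector from above:

```r
scores <- c(1, 4, 6, 8, 3, 4, 6, 7)

length(scores)       # number of values: 8
scores[2]            # second value: 4
mean(scores)         # mean of all values: 4.875
scores[scores > 5]   # values greater than 5: 6 8 6 7
```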

A matrix is a two-dimensional set of values, with rows and columns. An array extends this to more than two dimensions. The example below is a 3-dimensional array: there are 2 groups, each with 2 rows and 3 columns:

, , 1

     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6

, , 2

     [,1] [,2] [,3]
[1,]    7    9   11
[2,]    8   10   12
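
The printed structure above can be reproduced with the base R array() function, while matrix() builds the two-dimensional case. A minimal sketch (the object names twoDim and threeDim are chosen here for illustration):

```r
# A two-dimensional matrix with 2 rows and 3 columns
twoDim <- matrix(1:6, nrow = 2, ncol = 3)
twoDim

# A three-dimensional structure: 2 rows, 3 columns, 2 groups
threeDim <- array(1:12, dim = c(2, 3, 2))
threeDim

# dim() reports the size of each dimension
dim(twoDim)     # 2 3
dim(threeDim)   # 2 3 2

# Index by row, column, then group
threeDim[2, 3, 2]   # 12
```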

We will primarily work with dataframes (and sometimes vectors), as this is how the data in psychology research is usually structured.

2.6.2 Types of numerical data

With numerical data, there are 4 key data types:

Numerical data types

R can use all of these variable types:

  • Nominal variables are called factors
  • Ordinal variables are called ordered factors
  • Interval and ratio variables are called numeric data; they are stored either as integers (whole numbers only) or as doubles (numbers that can have decimal places)

R can also use other data types that are not numerical, such as text (character) data. Character is a data type representing strings of text.
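
To check how R is storing a variable, the base functions typeof() and class() can be used. The sketch below builds one made-up example of each type; all of the values and object names are hypothetical:

```r
wholeNumbers <- c(1L, 2L, 3L)       # the L suffix marks integers
decimals     <- c(1.5, 2.25, 3)     # stored as doubles
group        <- factor(c("control", "treatment", "control"))
rating       <- factor(c("low", "high", "medium"),
                       levels = c("low", "medium", "high"),
                       ordered = TRUE)
participantNames <- c("Ann", "Bob")

typeof(wholeNumbers)      # "integer"
typeof(decimals)          # "double"
class(group)              # "factor"
is.ordered(rating)        # TRUE
class(participantNames)   # "character"

# Ordered factors can be compared using their level order
rating[1] < rating[2]     # TRUE, because "low" comes before "high"
```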

2.6.3 Convert variables from one data type to another

When we first import data into R, it might not recognise the data types correctly. For example, in the data below, we can see the intervention variable:

participant intervention happiness
          4            2  6.245260
          7            2  8.745944
          9            2  8.906846
         13            2  9.199057
          8            2  9.301780
          5            2  9.381039
         16            1  9.446345
          3            1  9.909773
         18            2 10.017880
         17            2 10.075152

In the intervention variable, the numbers 1 and 2 refer to different intervention groups. The variable should therefore be treated as a factor rather than as a number. To ensure that R understands this, we can resave the intervention variable as a factor using the as.factor() function:

happinessSample$intervention <- as.factor(happinessSample$intervention)
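
The conversion can be checked with levels(). The sketch below uses a short made-up vector of group codes; the labels "groupA" and "groupB" are hypothetical and only illustrate how descriptive names could be attached with factor():

```r
# Hypothetical codes: 1 and 2 identify two intervention groups
intervention <- c(2, 2, 1, 1, 2)

# as.factor() keeps the codes as the level names
interventionFactor <- as.factor(intervention)
levels(interventionFactor)

# factor() can attach descriptive labels (made up here for illustration)
interventionLabelled <- factor(intervention,
                               levels = c(1, 2),
                               labels = c("groupA", "groupB"))
table(interventionLabelled)

# To convert back to numbers, go via character so that the original codes,
# not the internal level positions, are returned
as.numeric(as.character(interventionFactor))
```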

2.7 Working with dataframes

Dataframes (2-dimensional datasets, usually with rows of data and columns of variables) are the standard data format that we are used to (think of how a dataset looks in SPSS or Excel).

In a dataframe, variables are columns and each row usually represents one measurement or one participant.

2.7.1 View dataframe

To view a dataframe, we can click on it in the environment window and it will display:

Clicking on datasets in the environment will open them up for viewing

Viewing a dataframe
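
A dataframe can also be inspected using code. The sketch below assumes a three-row dataframe built from the first three participants shown in this section:

```r
# First three participants from the data shown in this section
happinessSample <- data.frame(
  participant  = 1:3,
  intervention = c(1, 1, 1),
  happiness    = c(11.580517, 11.947034, 9.909773)
)

head(happinessSample, 2)   # show the first two rows
str(happinessSample)       # show column names and data types
dim(happinessSample)       # number of rows and columns

# View(happinessSample)    # opens the viewer in RStudio (interactive sessions only)
```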

2.7.2 Refer to variables (columns) in a dataframe

Columns in a dataframe are accessed using the “$” sign. For example, to access the happiness column in the happinessSample dataframe, we would type:

happinessSample$happiness
 [1] 11.580517 11.947034  9.909773  6.245260  9.381039 11.515421  8.745944
 [8]  9.301780  8.906846 11.011479 10.726459 11.337853  9.199057 11.120169
[15] 11.563120  9.446345 10.075152 10.017880 11.284192 12.638480

As we can see above, the result is then displayed.
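
Because the “$” sign returns the column as a vector, it can be used directly inside other functions. A short sketch, again assuming a three-row dataframe built from the first three participants:

```r
# First three participants from the data shown in this section
happinessSample <- data.frame(
  participant  = 1:3,
  intervention = c(1, 1, 1),
  happiness    = c(11.580517, 11.947034, 9.909773)
)

# Use a column directly inside a function
mean(happinessSample$happiness)

# Select a single value from the column by position
happinessSample$happiness[2]

# Double square brackets with the column name return the same vector
identical(happinessSample$happiness, happinessSample[["happiness"]])
```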

2.8 Order, filter and group data

If you have the tidyverse package loaded, it is easy to organise and filter data.

  • Arrange by happiness score in ascending order

arrange(happinessSample, happiness)

participant intervention happiness
          4            2  6.245260
          7            2  8.745944
          9            2  8.906846
         13            2  9.199057
          8            2  9.301780
          5            2  9.381039
         16            1  9.446345
          3            1  9.909773
         18            2 10.017880
         17            2 10.075152
         11            2 10.726459
         10            2 11.011479
         14            1 11.120169
         19            2 11.284192
         12            1 11.337853
          6            2 11.515421
         15            2 11.563120
          1            1 11.580517
          2            1 11.947034
         20            1 12.638480

  • Arrange by happiness score in descending order

arrange(happinessSample, desc(happiness))

participant intervention happiness
         20            1 12.638480
          2            1 11.947034
          1            1 11.580517
         15            2 11.563120
          6            2 11.515421
         12            1 11.337853
         19            2 11.284192
         14            1 11.120169
         10            2 11.011479
         11            2 10.726459
         17            2 10.075152
         18            2 10.017880
          3            1  9.909773
         16            1  9.446345
          5            2  9.381039
          8            2  9.301780
         13            2  9.199057
          9            2  8.906846
          7            2  8.745944
          4            2  6.245260

  • Show participants with a happiness score of less than 4 (no rows are returned, because no score in this sample is below 4)

filter(happinessSample, happiness < 4)

participant intervention happiness

  • Show Intervention group 2 with happiness scores above 7

filter(happinessSample, happiness > 7 & intervention == 2)

participant intervention happiness
          5            2  9.381039
          6            2 11.515421
          7            2  8.745944
          8            2  9.301780
          9            2  8.906846
         10            2 11.011479
         11            2 10.726459
         13            2  9.199057
         15            2 11.563120
         17            2 10.075152
         18            2 10.017880
         19            2 11.284192

  • Group by intervention and show the mean happiness score

happinessSample %>% group_by(intervention) %>% summarise(mean = mean(happiness))

intervention      mean
           1 11.140025
           2  9.844125
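
These functions can be chained together with the pipe (%>%). The sketch below rebuilds the happinessSample dataframe from the values printed in this section so that it runs on its own; the object names group2Sorted and groupSummary, and the extra n column, are additions made for illustration.

```r
library(dplyr)

# Data reconstructed from the output printed in this section
happinessSample <- data.frame(
  participant  = 1:20,
  intervention = c(1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 1, 2, 1, 2, 2, 2, 1),
  happiness    = c(11.580517, 11.947034, 9.909773, 6.245260, 9.381039,
                   11.515421, 8.745944, 9.301780, 8.906846, 11.011479,
                   10.726459, 11.337853, 9.199057, 11.120169, 11.563120,
                   9.446345, 10.075152, 10.017880, 11.284192, 12.638480)
)

# Keep intervention group 2, then sort from highest to lowest happiness
group2Sorted <- happinessSample %>%
  filter(intervention == 2) %>%
  arrange(desc(happiness))

# Summarise each group: mean happiness and number of participants
groupSummary <- happinessSample %>%
  group_by(intervention) %>%
  summarise(mean = mean(happiness), n = n())

group2Sorted
groupSummary
```

Saving each result to an object, rather than only printing it, means it can be reused in later steps.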

2.9 Create new variables / objects from data

To create new variables from data, we can use the mutate() function.

For example, let’s say we wanted to calculate the difference between each person’s happiness score and the mean happiness score.

We could do the following:

happinessSample %>% mutate(difference = happiness - mean(happiness))
participant intervention happiness difference
          1            1 11.580517  1.2828274
          2            1 11.947034  1.6493438
          3            1  9.909773 -0.3879166
          4            2  6.245260 -4.0524304
          5            2  9.381039 -0.9166514
          6            2 11.515421  1.2177310
          7            2  8.745944 -1.5517460
          8            2  9.301780 -0.9959098
          9            2  8.906846 -1.3908438
         10            2 11.011479  0.7137887
         11            2 10.726459  0.4287693
         12            1 11.337853  1.0401634
         13            2  9.199057 -1.0986329
         14            1 11.120169  0.8224791
         15            2 11.563120  1.2654296
         16            1  9.446345 -0.8513449
         17            2 10.075152 -0.2225381
         18            2 10.017880 -0.2798103
         19            2 11.284192  0.9865019
         20            1 12.638480  2.3407900
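
Note that mutate() returns a new dataframe and does not change the original, so the result needs to be assigned to an object to keep the new column. The sketch below uses only the first four participants (under the hypothetical name happinessSubset), so the differences will not match the table above, which uses the mean of all 20 scores.

```r
library(dplyr)

# First four participants from the data shown in this section
happinessSubset <- data.frame(
  participant  = 1:4,
  intervention = c(1, 1, 1, 2),
  happiness    = c(11.580517, 11.947034, 9.909773, 6.245260)
)

# Assign the result back to the object to keep the new column
happinessSubset <- happinessSubset %>%
  mutate(difference = happiness - mean(happiness))

# Differences from the overall mean sum to (approximately) zero
sum(happinessSubset$difference)

# Combine with group_by() to compare each score with its own group's mean
happinessSubset <- happinessSubset %>%
  group_by(intervention) %>%
  mutate(groupDifference = happiness - mean(happiness)) %>%
  ungroup()

happinessSubset
```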