2  Working with data in RStudio

Working with your data in RStudio is a little different from working with data in a spreadsheet. One crucial difference is that you need to be explicit about what you want to do with your data. In a spreadsheet, you can simply click on a cell and start typing; in R, you write code that tells the software what to do. This can be a bit intimidating at first, but it is also one of the most powerful features of R: it allows you to automate repetitive tasks and perform complex analyses with just a few lines of code.

2.1 Importing data into R

At the end of this section, you will be able to:
  • Load data into R from different file types
  • Understand the structure of data in R

There are a few different ways to load data into R. You can load data from a file on your computer, from a URL, or from a package. You can load data in different file types, such as CSV, Excel, and SPSS files.

Using RStudio, you can load data by clicking on File > Import Dataset. This will open a window where you can select the file that you want to load.

However, you can also load data using code. For example, you can use the read_csv() function from the readr package to load a CSV file into R. You can use the readxl package to load an Excel file, and the haven package to load an SPSS file.


# Load the readr package

library(readr)

# Load a CSV file into R

data <- read_csv("data.csv")

Let’s break this code down:

  • library(readr) loads the readr package into your R session. This package contains the read_csv() function, which you can use to load a CSV file into R.

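  • read_csv("data.csv") reads the CSV file called data.csv into R. Because only a file name is given, R looks for the file in your current working directory. The assignment operator <- stores the result in an object called data.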

When you load data into R, it will be stored as a data frame (read_csv() returns a tibble, which is the tidyverse's version of a data frame). A data frame is a type of object in R that is used to store tabular data. It is similar to a spreadsheet in Excel, with rows and columns.
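
As a minimal sketch of the code-based route for the other file types mentioned above, the example below reads an Excel file with read_excel() from readxl and an SPSS file with read_sav() from haven. To stay runnable without any files of your own, it uses the example workbook that ships with readxl and temporary files written on the spot. The object names (example_df, csv_data, and so on) and the example data are made up for illustration; in practice you would substitute the path to your own file.

```r
library(readr)
library(readxl)
library(haven)

# A small data frame used only to create example files (hypothetical data)
example_df <- data.frame(id = c(1, 2, 3), score = c(10.5, 8.2, 9.9))

# CSV: write a temporary file, then read it back with read_csv()
csv_path <- tempfile(fileext = ".csv")
write_csv(example_df, csv_path)
csv_data <- read_csv(csv_path, show_col_types = FALSE)

# Excel: read the example workbook bundled with the readxl package
xlsx_path <- readxl_example("datasets.xlsx")
excel_data <- read_excel(xlsx_path)

# SPSS: write a temporary .sav file, then read it back with read_sav()
sav_path <- tempfile(fileext = ".sav")
write_sav(example_df, sav_path)
spss_data <- read_sav(sav_path)

# Each result is a data frame (more precisely, a tibble)
is.data.frame(csv_data)
is.data.frame(excel_data)
is.data.frame(spss_data)
```

All three functions follow the same pattern: pass a file path and assign the result to an object.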

2.2 How are data stored in R?

If you worked through the previous section, you should already have some idea how to load data into R. But how are data stored in R? In R, data are stored in objects. An object is a container that holds data. There are several types of objects in R, but the most common ones are:

  • Vectors (e.g., a sequence of numbers)
  • Matrices (e.g., a table of rows and columns, all of the same data type)
  • Data frames (e.g., a table of data where each column represents a variable and each row represents an observation)
  • Lists (e.g., a collection of objects)
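
To make the four object types listed above concrete, here is a short sketch that creates one of each. The object names (ages, m, df, my_list) and their values are made up for illustration.

```r
# A vector: a sequence of values of the same type
ages <- c(25, 30, 35)

# A matrix: rows and columns, all of the same data type
m <- matrix(1:6, nrow = 2, ncol = 3)

# A data frame: each column is a variable, each row is an observation
df <- data.frame(name = c("Alice", "Bob", "Charlie"), age = ages)

# A list: a collection of objects, which may be of different kinds
my_list <- list(ages = ages, m = m, df = df)

# Check what each object is
is.numeric(ages)    # TRUE
is.matrix(m)        # TRUE
is.data.frame(df)   # TRUE
is.list(my_list)    # TRUE

# Dimensions of the matrix: 2 rows, 3 columns
dim(m)
```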

In this section, we will focus on data frames, which are the most common way to store data in R. A data frame is a table of data where each column represents a variable and each row represents an observation. You can think of a data frame as having a structure similar to a spreadsheet.

Data frame with 3 variables/columns

2.3 How do we use data frames in R?

To view the data in a data frame, type the name of the data frame in the console and press Enter. For example, if you have a data frame called my_data, typing my_data prints its contents.

# Load the tidyverse package

library(tidyverse)

# Create a data frame
my_data <- data.frame(
  name = c("Alice", "Bob", "Charlie", "David", "Eve", "Frank"),
  age = c(25, 30, 35, 40, 45, 50),
  height = c(160, 175, 180, 165, 170, 190),
  car = c("Electric", "Petrol", "Electric", "Petrol", "Petrol", "Electric")
)

# View or refer to the data in the data frame

my_data
     name age height      car
1   Alice  25    160 Electric
2     Bob  30    175   Petrol
3 Charlie  35    180 Electric
4   David  40    165   Petrol
5     Eve  45    170   Petrol
6   Frank  50    190 Electric

In the code above, we created a data frame called my_data with four variables: name, age, height, and car. We then typed the name of the object, my_data, to print its contents.

2.4 View or refer to a specific variable in a data frame

To view or refer to a specific variable in a data frame, you can use the $ operator. For example, if you want to view the age variable in the my_data data frame, you can type my_data$age in the console and press Enter.

# View or refer to a specific variable in a data frame

my_data$age
[1] 25 30 35 40 45 50
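
Because my_data$age returns an ordinary vector, it can be passed straight to other functions. The sketch below uses a shortened version of the data frame to stay self-contained. The [[ ]] form is an equivalent base R alternative to $, shown here only for comparison.

```r
# A shortened version of my_data, for illustration
my_data <- data.frame(
  name = c("Alice", "Bob", "Charlie"),
  age = c(25, 30, 35)
)

# Extract the age column as a vector
my_data$age

# Equivalent: extract the column by name with double brackets
my_data[["age"]]

# Use the extracted vector in a calculation
mean(my_data$age)  # 30
```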

2.5 Data types in R

In R, each variable in a data frame has a data type. The most common data types in R are:

  • Numeric: for continuous variables (e.g., age, height)
  • Character: for text (e.g., name)
  • Factor: for categorical variables
  • Logical: for binary variables (TRUE or FALSE)

You can use the str() function to view the structure of a data frame, including the data types of each variable.

# View the structure of a data frame

str(my_data)
'data.frame':   6 obs. of  4 variables:
 $ name  : chr  "Alice" "Bob" "Charlie" "David" ...
 $ age   : num  25 30 35 40 45 50
 $ height: num  160 175 180 165 170 190
 $ car   : chr  "Electric" "Petrol" "Electric" "Petrol" ...

In the code above, we used the str() function to view the structure of the my_data data frame. The output shows the data type of each variable: chr stands for character and num for numeric.
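
If you only want the data type of a single variable, class() is a lighter alternative to str(). The sketch below assumes R 4.0 or later, where data.frame() keeps text as character. The owns_car column is a hypothetical addition, included only so that a logical variable appears.

```r
# A small example data frame (owns_car is a made-up logical column)
my_data <- data.frame(
  name = c("Alice", "Bob"),
  age = c(25, 30),
  owns_car = c(TRUE, FALSE)
)

# Data type of one variable at a time
class(my_data$name)      # "character"
class(my_data$age)       # "numeric"
class(my_data$owns_car)  # "logical"

# Data type of every variable at once
sapply(my_data, class)
```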

2.6 Convert data types in R

You can convert the data type of a variable in R using the family of functions whose names start with as. (for example, as.numeric(), as.character(), and as.factor()). For example, you can convert a character variable to a factor variable using the as.factor() function.

# Convert a character variable to a factor variable

my_data$name <- as.factor(my_data$name)

my_data$car <- as.factor(my_data$car)

In the code above, we converted the name and car variables in the my_data data frame from character variables to factor variables using the as.factor() function.
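
The sketch below shows a few more conversions on a standalone vector, including one common pitfall: calling as.numeric() on a factor returns the internal level codes rather than the labels. The vector car and its values are illustrative.

```r
car <- c("Electric", "Petrol", "Electric")

# Character to factor: the distinct values become the factor's levels
car_factor <- as.factor(car)
levels(car_factor)         # "Electric" "Petrol"

# Factor back to character
as.character(car_factor)

# Character to numeric
as.numeric(c("25", "30"))  # 25 30

# Caution: as.numeric() on a factor returns the internal level codes,
# not the original labels
as.numeric(car_factor)     # 1 2 1
```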

2.7 Subsetting data in R

At the end of this section, you will be able to:
  • Filter data in R
  • Create subsets of data in R

Subsetting data in R means selecting a subset of the data based on certain criteria. For example, you might want to select only the rows where a certain variable is greater than a certain value, or only the columns that contain certain variables.

If we use the my_data data frame from the previous section, we can subset the data to select only the rows where the age variable is greater than 30.

# Filter the data frame to select only the rows where the age variable is greater than 30

# This method uses the dplyr package, which is part of the tidyverse. Be sure to load the tidyverse package if you haven't already.

my_data %>% filter(age > 30)
     name age height      car
1 Charlie  35    180 Electric
2   David  40    165   Petrol
3     Eve  45    170   Petrol
4   Frank  50    190 Electric

Let’s break this code down:

  • my_data is the data frame that we want to subset.
  • %>% is the pipe operator, which is used to pass the data frame to the next function. This allows us to link multiple steps together in a single line of code.
  • filter(age > 30) is the function that filters the data frame to select only the rows where the age variable is greater than 30.

The output of this code will be a new data frame that contains only the rows where the age variable is greater than 30. However, this new data frame will not be saved anywhere, so if you want to keep it, you need to assign it to a new object. To do this, you can use the assignment operator <-.

# Filter the data frame to select only the rows where the age variable is greater than 30 and save the result to a new data frame called new_data

new_data <- my_data %>% filter(age > 30)

In this code, on the left side of the assignment operator <-, we have new_data, which is the name of the new data frame that will contain only the filtered subset of the data (i.e., the rows where the age variable is greater than 30). The difference between this code and the previous code is that we are now saving the result to a new data frame called new_data, instead of just printing it to the console.

We can also combine multiple conditions when subsetting data. For example, we can select only the rows where the age variable is greater than 25 and the height variable is greater than 175.

# Filter the data frame to select only the rows where the age variable is greater than 25 and the height variable is greater than 175

my_data %>% filter(age > 25 & height > 175)
     name age height      car
1 Charlie  35    180 Electric
2   Frank  50    190 Electric
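
Conditions can be combined in other ways as well. As a brief sketch: separating conditions with a comma inside filter() has the same effect as &, the | operator means OR, and == tests for equality. The result names (and_rows, or_rows, electric_rows) are made up for illustration.

```r
library(dplyr)

my_data <- data.frame(
  name = c("Alice", "Bob", "Charlie", "David", "Eve", "Frank"),
  age = c(25, 30, 35, 40, 45, 50),
  height = c(160, 175, 180, 165, 170, 190),
  car = c("Electric", "Petrol", "Electric", "Petrol", "Petrol", "Electric")
)

# Comma-separated conditions are combined with AND (same result as &)
and_rows <- my_data %>% filter(age > 25, height > 175)

# | means OR: keep rows that satisfy at least one of the conditions
or_rows <- my_data %>% filter(age > 45 | height < 165)

# == tests for equality
electric_rows <- my_data %>% filter(car == "Electric")

and_rows       # Charlie and Frank
or_rows        # Alice and Frank
electric_rows  # Alice, Charlie and Frank
```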

There are many other ways to subset and reshape data in R, depending on what you want to do. For example, the dplyr package also provides the select() function to keep specific columns, the arrange() function to sort the rows, and the mutate() function to create new variables. We will cover some of these functions in later sections.
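
As a quick preview, the sketch below applies each of these three functions to my_data. The new column height_m assumes that height is recorded in centimetres, which the data themselves do not state. The result names are made up for illustration.

```r
library(dplyr)

my_data <- data.frame(
  name = c("Alice", "Bob", "Charlie", "David", "Eve", "Frank"),
  age = c(25, 30, 35, 40, 45, 50),
  height = c(160, 175, 180, 165, 170, 190),
  car = c("Electric", "Petrol", "Electric", "Petrol", "Petrol", "Electric")
)

# select(): keep only the name and age columns
selected <- my_data %>% select(name, age)

# arrange(): sort the rows by height, from tallest to shortest
sorted <- my_data %>% arrange(desc(height))

# mutate(): add a new column (assumes height is in centimetres)
with_metres <- my_data %>% mutate(height_m = height / 100)

selected
sorted
with_metres
```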

2.8 Grouping and summarising data in R

At the end of this section, you will be able to:
  • Group data in R
  • Summarise data in R

Grouping and summarising data in R means splitting the data into groups based on one or more variables and then calculating summary statistics for each group. For example, you might want to calculate the mean age of people grouped by the type of car they drive.

If we use the my_data data frame from the previous section, we can group the data by the car variable and then calculate the mean age for each group.

# Group the data frame by the car variable and calculate the mean age for each group

my_data %>% group_by(car) %>% 
  summarise(mean_age = mean(age)) %>%
  ungroup()
# A tibble: 2 × 2
  car      mean_age
  <fct>       <dbl>
1 Electric     36.7
2 Petrol       38.3

Let’s break this code down:

  • my_data is the data frame that we want to group and summarise.
  • %>% is the pipe operator, which is used to pass the data frame to the next function. This allows us to link multiple steps together in a single chain of code.
  • group_by(car) is the function that groups the data frame by the car variable.
  • summarise(mean_age = mean(age)) is the function that calculates the mean age for each group of cars. mean_age is the name of the new variable that will contain the mean age for each group.
  • ungroup() is the function that removes the grouping from the data frame. Here it is optional, because summarise() by default drops the last grouping variable and car is the only one, so the result is already ungrouped. It matters when you group by more than one variable, and it is good practice to ungroup the data frame after you have finished summarising it.

The output of this code will be a new data frame that contains the mean age for each group of cars. The car variable is the grouping variable, and the mean_age variable is the summary statistic that we calculated for each group.

You can also calculate other summary statistics, such as the median, standard deviation, minimum, and maximum, using the summarise() function. To calculate several summary statistics at the same time, list them inside summarise(), separated by commas, each with its own name. For example, you can calculate the mean and standard deviation of the age variable for each group of cars.

# Group the data frame by the car variable and calculate the mean and standard deviation of the age variable for each group

my_data %>% group_by(car) %>% 
  summarise(mean_age = mean(age), sd_age = sd(age)) %>%
  ungroup()
# A tibble: 2 × 3
  car      mean_age sd_age
  <fct>       <dbl>  <dbl>
1 Electric     36.7  12.6 
2 Petrol       38.3   7.64

In this code, we calculated the mean and standard deviation of the age variable for each group of cars. mean_age and sd_age are the names of the new variables that contain the mean and standard deviation for each group.
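
The same pattern extends to the other statistics mentioned above. The sketch below computes the group size, median, minimum, and maximum age for each type of car. The column names (n, median_age, min_age, max_age) and the result name age_stats are chosen for illustration.

```r
library(dplyr)

my_data <- data.frame(
  name = c("Alice", "Bob", "Charlie", "David", "Eve", "Frank"),
  age = c(25, 30, 35, 40, 45, 50),
  height = c(160, 175, 180, 165, 170, 190),
  car = c("Electric", "Petrol", "Electric", "Petrol", "Petrol", "Electric")
)

# One row per car type, one column per summary statistic
age_stats <- my_data %>%
  group_by(car) %>%
  summarise(
    n = n(),                   # number of rows in each group
    median_age = median(age),
    min_age = min(age),
    max_age = max(age)
  )

age_stats
```

Each group here contains three people, so n is 3 for both Electric and Petrol.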