3.1 Indexing and subsetting
We can use numeric indexes and the function subset() to extract parts of a vector, matrix, or data frame, generating a different version of the original data set. For instance, we could be interested in analyzing the psychometric properties of an ability test, focusing only on the specific items of a scale. Likewise, we could be interested in plotting and modeling the data of a subgroup of participants. Sometimes, it is advisable to rename variables or to transform them into new ones by recoding them. In all these operations, we need to specify which rows (observations, cases, participants) and columns (the variables of interest) will be selected, deleted, transformed, renamed, or recoded.
3.1.1 Selecting columns (variables)
There are different ways to select a single column (variable).
- We can use the dollar symbol ($) between the data set (placed before the dollar symbol) and the name of the variable that we want to select (placed after the dollar symbol).
- We can use square brackets after the data set's name. Inside the brackets we can include either the position of the variable in the data set (e.g., 4 selects the fourth variable) or the name of the variable within quotation marks (e.g., 'sex').
- We can use the classic matrix indexing approach with square brackets. We select all rows by leaving the first dimension (before the comma) blank, and we specify the column after the comma (e.g., in the second dimension, a 4 selects the data from the fourth column; i.e., the variable sex).
affective.dis$sex
affective.dis[4]
affective.dis['sex']
affective.dis[ , 4]
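The four forms above do not all return the same kind of object, which matters when the result is passed to another function. The following is a minimal sketch, assuming a hypothetical toy data frame named toy that copies the first six rows of affective.dis as printed in this section; it is not the original data set.

```r
# Hypothetical toy data: the first six rows of affective.dis as printed in this section
toy <- data.frame(
  id = 1:6,
  treatment = rep(1, 6),
  year = rep(2020, 6),
  sex = c("male", "female", "male", "female", "male", "female"),
  depression = c(22, 19, 14, 23, 16, 11),
  life.satis = c(53.50930, 42.10586, 22.73856, 18.42421, 68.70935, 40.22030),
  alone = c(0, 0, 1, 0, 1, 1)
)

by_dollar   <- toy$sex      # returns the column itself (a vector)
by_position <- toy[4]       # returns a data frame with one column
by_name     <- toy['sex']   # returns a data frame with one column
by_matrix   <- toy[ , 4]    # returns the column itself (a vector)

is.data.frame(by_dollar)            # FALSE
is.data.frame(by_position)          # TRUE
identical(by_dollar, by_matrix)     # TRUE
identical(by_position, by_name)     # TRUE
```

In short, the dollar symbol and the matrix-style index with a comma give back the values of the variable, whereas single brackets without a comma keep the data frame structure.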
Sometimes, we are interested in selecting more than one variable. To do so, we can use the function c() with the indexes or names of the variables. To drop one or more variables we can use the minus sign before the indexes.
head(affective.dis[ , c(1, 2, 4, 5, 7)])
## id treatment sex depression alone
## 1 1 1 male 22 0
## 2 2 1 female 19 0
## 3 3 1 male 14 1
## 4 4 1 female 23 0
## 5 5 1 male 16 1
## 6 6 1 female 11 1
head(affective.dis[ , -7])
## id treatment year sex depression life.satis
## 1 1 1 2020 male 22 53.50930
## 2 2 1 2020 female 19 42.10586
## 3 3 1 2020 male 14 22.73856
## 4 4 1 2020 female 23 18.42421
## 5 5 1 2020 male 16 68.70935
## 6 6 1 2020 female 11 40.22030
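The same selections can also be written with variable names. The sketch below assumes a hypothetical toy data frame named toy built from the six rows printed above. Note that the minus sign only works with numeric indexes; to drop variables by name, one option (an illustration, not the only way) is a logical vector built with names() and %in%.

```r
# Hypothetical toy data: the first six rows of affective.dis as printed in this section
toy <- data.frame(
  id = 1:6,
  treatment = rep(1, 6),
  year = rep(2020, 6),
  sex = c("male", "female", "male", "female", "male", "female"),
  depression = c(22, 19, 14, 23, 16, 11),
  life.satis = c(53.50930, 42.10586, 22.73856, 18.42421, 68.70935, 40.22030),
  alone = c(0, 0, 1, 0, 1, 1)
)

# Select several variables by name
selected <- toy[ , c('id', 'sex', 'depression')]

# Drop the third and seventh variables (year and alone) by index
dropped_by_index <- toy[ , -c(3, 7)]

# Drop the same variables by name: keep the columns whose name is NOT in the list
dropped_by_name <- toy[ , !(names(toy) %in% c('year', 'alone'))]

names(selected)
names(dropped_by_index)
```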
3.1.2 Selecting rows (observations)
To select observations, cases, or participants (i.e., rows) we will use the approach for indexing matrices. This time, the blank space will be left after the comma (i.e., columns) and we will select the observations by indexing the space before the comma (i.e., rows).
affective.dis[12:15, ]
## id treatment year sex depression life.satis alone
## 12 12 1 2020 female 22 48.86415 0
## 13 13 2 2020 male 11 31.99603 0
## 14 14 2 2020 female 8 29.28087 0
## 15 15 2 2020 male 5 64.95737 0
affective.dis[c(1, 18, 36), ]
## id treatment year sex depression life.satis alone
## 1 1 1 2020 male 22 53.50930 0
## 18 18 2 2020 female 10 31.53965 0
## 36 36 3 2020 female 4 41.11086 0
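A common slip when selecting rows is to forget the comma, in which case R interprets the index as a column selection. The sketch below illustrates this with a hypothetical toy data frame named toy (the first six rows printed earlier in this section, not the full data set); it also shows that the minus sign drops rows in the same way it drops columns.

```r
# Hypothetical toy data: the first six rows of affective.dis as printed in this section
toy <- data.frame(
  id = 1:6,
  treatment = rep(1, 6),
  year = rep(2020, 6),
  sex = c("male", "female", "male", "female", "male", "female"),
  depression = c(22, 19, 14, 23, 16, 11),
  life.satis = c(53.50930, 42.10586, 22.73856, 18.42421, 68.70935, 40.22030),
  alone = c(0, 0, 1, 0, 1, 1)
)

rows_2_to_4   <- toy[2:4, ]   # rows 2 to 4, all columns
without_first <- toy[-1, ]    # all rows except the first one
cols_2_to_4   <- toy[2:4]     # no comma: columns 2 to 4, all rows

dim(rows_2_to_4)     # 3 rows, 7 columns
dim(without_first)   # 5 rows, 7 columns
names(cols_2_to_4)   # "treatment" "year" "sex"
```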
3.1.3 Selecting rows and columns (conditional subsetting)
Although we could use a matrix indexing procedure to select both observations and variables, the use of logical operators is a more efficient approach. For instance, we could be interested in selecting a subset of subjects that were told to exercise daily to treat their depression (treatment = 1).
affective.dis[affective.dis$treatment == 1, ]
affective.dis[affective.dis['treatment'] == 1, ]
We could complicate things further by selecting a subset of participants scoring 42 or more on the Satisfaction with Life scale (life.satis \(\geq\) 42) and living alone (alone = 1).
affective.dis[affective.dis$life.satis >= 42
              & affective.dis$alone == 1, ]
## id treatment year sex depression life.satis alone
## 5 5 1 2020 male 16 68.70935 1
## 8 8 1 2020 female 19 44.29188 1
## 9 9 1 2020 male 17 50.12402 1
## 10 10 1 2020 female 14 54.88683 1
## 16 16 2 2020 female 7 52.87438 1
## 20 20 2 2020 female 7 49.12089 1
## 21 21 2 2020 male 16 61.50143 1
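The function subset(), mentioned at the start of this section, expresses the same condition without repeating the name of the data set. The sketch below assumes a hypothetical toy data frame named toy with the first six rows printed earlier, so it returns fewer rows than the output above. It also shows one practical difference: when the condition evaluates to NA, bracket indexing returns a row filled with NA, whereas subset() drops that row.

```r
# Hypothetical toy data: the first six rows of affective.dis as printed in this section
toy <- data.frame(
  id = 1:6,
  treatment = rep(1, 6),
  year = rep(2020, 6),
  sex = c("male", "female", "male", "female", "male", "female"),
  depression = c(22, 19, 14, 23, 16, 11),
  life.satis = c(53.50930, 42.10586, 22.73856, 18.42421, 68.70935, 40.22030),
  alone = c(0, 0, 1, 0, 1, 1)
)

# Same condition written with brackets and with subset()
by_brackets <- toy[toy$life.satis >= 42 & toy$alone == 1, ]
by_subset   <- subset(toy, life.satis >= 42 & alone == 1)

# subset() can also select columns through its select argument
alone_small <- subset(toy, alone == 1, select = c(id, sex, depression))

# Introduce a missing value (for illustration only) to compare both approaches
toy_na <- toy
toy_na$life.satis[5] <- NA
na_brackets <- toy_na[toy_na$life.satis >= 42 & toy_na$alone == 1, ]  # one row of NA
na_subset   <- subset(toy_na, life.satis >= 42 & alone == 1)          # zero rows
```

With the toy data, both by_brackets and by_subset contain only participant 5, the single case among the first six rows that meets both conditions.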
Logical operators
- x is less than y: x < y
- x is less than or equal to y: x <= y
- x is greater than y: x > y
- x is greater than or equal to y: x >= y
- x is equal to y: x == y
- x is not equal to y: x != y
- Not x: !x
- x AND y: x & y
- x OR y: x | y
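Each of these operators works element by element and returns a vector of TRUE and FALSE values, which is what the square brackets use to keep or discard rows. As a small sketch, the vectors below are hypothetical stand-ins that copy the depression and alone values of the first six participants printed earlier in this section.

```r
# Hypothetical vectors: values of the first six participants printed in this section
depression <- c(22, 19, 14, 23, 16, 11)
alone      <- c(0, 0, 1, 0, 1, 1)

high_dep  <- depression >= 19                  # TRUE TRUE FALSE TRUE FALSE FALSE
not_alone <- alone != 1                        # TRUE TRUE FALSE TRUE FALSE FALSE
negated   <- !(alone == 1)                     # same result as alone != 1

both   <- depression > 15 & alone == 1         # TRUE only when both conditions hold
either <- depression < 15 | alone == 0         # TRUE when at least one condition holds

which(both)    # 5
sum(either)    # 5
```

Wrapping a logical vector in which() returns the positions that are TRUE, and sum() counts them, which is a quick way to check a condition before using it to subset a data frame.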