3.1 Indexing and subsetting

We can use numeric indexes and the function subset() in order to extract parts of a vector, matrix, or data frame, generating a different version of the original data set. For instance, we could be interested in analyzing the psychometric properties of an ability test, focusing only on the specific items of a scale. Likewise, we could be interested in plotting and modeling the data of a subgroup of participants. Sometimes, it is recommended to rename variables or to transform them into new ones by recoding them. In all these operations, we need to specify which rows (observations, cases, participants) and columns (the variables of interest) will be selected, deleted, transformed, renamed, or recoded.

3.1.1 Selecting columns (variables)

There are different ways to select a single column (variable).

We can use the dollar symbol ($) between the data set (placed before the dollar symbol) and the name of the variable that we want to select (placed after the dollar symbol).
We can use square brackets after the data set's name. Inside the brackets we can include either the numeric position of the variable in the data set (e.g., a 4 selects the fourth variable) or the name of the variable within quotation marks (e.g., 'sex'). Unlike the dollar symbol, this form returns a data frame with a single column rather than a vector.
We can use the classic indexing approach to matrices with square brackets. We will select all rows (i.e., in the first dimension or rows, we will leave a blank space before the comma) and the specific vector column after the comma (e.g., in the second dimension or columns, a 4 will select the data from the fourth column; i.e., the variable sex).


affective.dis$sex
affective.dis[4]
affective.dis['sex']
affective.dis[ , 4]
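
The four forms above do not all return the same kind of object. The following sketch illustrates the difference. It uses a small toy data frame called toy, with made-up values, that only mirrors the column names of affective.dis; it is an assumption for illustration and not the real data set.

```r
# Toy data frame with made-up values (NOT the real affective.dis data)
toy <- data.frame(
  id         = 1:6,
  treatment  = c(1, 1, 2, 2, 3, 3),
  year       = 2020,
  sex        = c("male", "female", "male", "female", "male", "female"),
  depression = c(22, 19, 11, 8, 5, 4),
  life.satis = c(53.5, 42.1, 32.0, 29.3, 65.0, 41.1),
  alone      = c(0, 1, 0, 1, 1, 0)
)

# These forms return the column itself (a vector)
v_dollar  <- toy$sex
v_double  <- toy[["sex"]]
v_matrix  <- toy[ , 4]

# These forms return a data frame with one column
d_number  <- toy[4]
d_name    <- toy["sex"]
d_nodrop  <- toy[ , 4, drop = FALSE]  # matrix-style indexing, keeping the data frame

is.data.frame(v_dollar)  # FALSE
is.data.frame(d_number)  # TRUE
```

The distinction matters when the result is passed to a function that expects a vector (e.g., mean()) or a data frame (e.g., head() with column headers).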

Sometimes, we are interested in selecting more than one variable. To do so, we can use the function c() with the indexes or names of the variables. To drop one or more variables, we can place a minus sign before their numeric indexes (the minus sign does not work with variable names).


head(affective.dis[ , c(1, 2, 4, 5, 7)])
##   id treatment    sex depression alone
## 1  1         1   male         22     0
## 2  2         1 female         19     0
## 3  3         1   male         14     1
## 4  4         1 female         23     0
## 5  5         1   male         16     1
## 6  6         1 female         11     1

head(affective.dis[ , -7])
##   id treatment year    sex depression life.satis
## 1  1         1 2020   male         22   53.50930
## 2  2         1 2020 female         19   42.10586
## 3  3         1 2020   male         14   22.73856
## 4  4         1 2020 female         23   18.42421
## 5  5         1 2020   male         16   68.70935
## 6  6         1 2020 female         11   40.22030
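
Selecting and dropping variables by name instead of by position could look like the sketch below. The data frame toy contains made-up values and only mirrors the column names of affective.dis; the object names keep, selected, dropped_one, and dropped_two are likewise chosen for illustration.

```r
# Toy data frame with made-up values (NOT the real affective.dis data)
toy <- data.frame(
  id         = 1:6,
  treatment  = c(1, 1, 2, 2, 3, 3),
  year       = 2020,
  sex        = c("male", "female", "male", "female", "male", "female"),
  depression = c(22, 19, 11, 8, 5, 4),
  life.satis = c(53.5, 42.1, 32.0, 29.3, 65.0, 41.1),
  alone      = c(0, 1, 0, 1, 1, 0)
)

# Select several variables by name
keep     <- c("id", "sex", "depression")
selected <- toy[ , keep]

# Drop one variable by name: keep every name except "alone"
dropped_one <- toy[ , setdiff(names(toy), "alone")]

# Drop several variables by name with a logical index
dropped_two <- toy[ , !(names(toy) %in% c("year", "alone"))]

names(dropped_two)
```

Selecting by name is usually more robust than selecting by position, because the code keeps working if the order of the columns changes.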

3.1.2 Selecting rows (observations)

To select observations, cases, or participants (i.e., rows) we will use the approach for indexing matrices. This time, the blank space will be left after the comma (i.e., columns) and we will select the observations by indexing the space before the comma (i.e., rows).


affective.dis[12:15, ]
##    id treatment year    sex depression life.satis alone
## 12 12         1 2020 female         22   48.86415     0
## 13 13         2 2020   male         11   31.99603     0
## 14 14         2 2020 female          8   29.28087     0
## 15 15         2 2020   male          5   64.95737     0

affective.dis[c(1, 18, 36), ]
##    id treatment year    sex depression life.satis alone
## 1   1         1 2020   male         22   53.50930     0
## 18 18         2 2020 female         10   31.53965     0
## 36 36         3 2020 female          4   41.11086     0
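
The same indexing ideas apply to rows in other ways, as the following sketch suggests: nrow() gives the position of the last observation, and a minus sign drops observations. The data frame toy contains made-up values and only mirrors the column names of affective.dis.

```r
# Toy data frame with made-up values (NOT the real affective.dis data)
toy <- data.frame(
  id         = 1:6,
  treatment  = c(1, 1, 2, 2, 3, 3),
  year       = 2020,
  sex        = c("male", "female", "male", "female", "male", "female"),
  depression = c(22, 19, 11, 8, 5, 4),
  life.satis = c(53.5, 42.1, 32.0, 29.3, 65.0, 41.1),
  alone      = c(0, 1, 0, 1, 1, 0)
)

# A range of consecutive observations
middle <- toy[2:4, ]

# The first and the last observation
first_last <- toy[c(1, nrow(toy)), ]

# Every observation except the first two
without_first_two <- toy[-(1:2), ]

# The row labels of the original data frame are kept in the subset
rownames(without_first_two)
```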

3.1.3 Selecting rows and columns (conditional subsetting)

Although we could use a matrix indexing procedure to select both observations and variables, the use of logical operators is a more efficient approach. For instance, we could be interested in selecting a subset of subjects that were told to exercise daily to treat their depression (treatment = 1).


affective.dis[affective.dis$treatment == 1, ]
affective.dis[affective.dis['treatment'] == 1, ]

We could go one step further and select the participants who score 42 or more on the Satisfaction with Life scale (life.satis $\geq$ 42) and who also live alone (alone = 1).


affective.dis[affective.dis$life.satis >= 42
             & affective.dis$alone == 1, ]
##    id treatment year    sex depression life.satis alone
## 5   5         1 2020   male         16   68.70935     1
## 8   8         1 2020 female         19   44.29188     1
## 9   9         1 2020   male         17   50.12402     1
## 10 10         1 2020 female         14   54.88683     1
## 16 16         2 2020 female          7   52.87438     1
## 20 20         2 2020 female          7   49.12089     1
## 21 21         2 2020   male         16   61.50143     1
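
The function subset() offers an alternative way of writing the same kind of selection, and its select argument lets us choose columns at the same time. The sketch below compares both approaches when the data contain a missing value. The data frame toy contains made-up values (including a missing value introduced on purpose) and only mirrors the column names of affective.dis.

```r
# Toy data frame with made-up values (NOT the real affective.dis data)
toy <- data.frame(
  id         = 1:6,
  treatment  = c(1, 1, 2, 2, 3, 3),
  year       = 2020,
  sex        = c("male", "female", "male", "female", "male", "female"),
  depression = c(22, 19, 11, 8, 5, 4),
  life.satis = c(53.5, NA, 32.0, 29.3, 65.0, 41.1),  # one missing value
  alone      = c(0, 1, 0, 1, 1, 0)
)

# Bracket indexing: a condition that evaluates to NA yields a row filled with NA
by_bracket <- toy[toy$life.satis >= 42 & toy$alone == 1, ]

# subset(): rows where the condition is NA are dropped;
# select picks the columns, and variable names are written without quotes
by_subset <- subset(toy, life.satis >= 42 & alone == 1,
                    select = c(id, sex, life.satis))

# which() keeps only the positions where the condition is TRUE
by_which <- toy[which(toy$life.satis >= 42 & toy$alone == 1),
                c("id", "sex", "life.satis")]

nrow(by_bracket)  # 2 (one row of NAs plus participant 5)
nrow(by_subset)   # 1 (participant 5 only)
```

When the variables in the condition may contain missing values, subset() or which() avoids the extra rows of NA that bracket indexing produces.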

Logical operators

x is less than y: x < y
x is less than or equal to y: x <= y
x is greater than y: x > y
x is greater than or equal to y: x >= y
x is equal to y: x == y
x is not equal to y: x != y
Not x: !x
x AND y: x & y
x OR y: x | y
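
These operators work element by element on vectors, which is why they can be used to index the rows of a data frame. A minimal sketch with two made-up numeric vectors, x and y, chosen only for illustration:

```r
# Two made-up vectors of the same length
x <- c(3, 8, 12)
y <- c(5, 8, 10)

less       <- x < y            # TRUE FALSE FALSE
less_eq    <- x <= y           # TRUE  TRUE FALSE
greater    <- x > y            # FALSE FALSE TRUE
greater_eq <- x >= y           # FALSE  TRUE TRUE
equal      <- x == y           # FALSE  TRUE FALSE
not_equal  <- x != y           # TRUE FALSE  TRUE
negated    <- !(x > y)         # TRUE  TRUE FALSE
both       <- x > 5 & y > 5    # FALSE  TRUE TRUE
either     <- x > 10 | y < 6   # TRUE FALSE  TRUE

# A logical vector can be used directly as an index
x[both]                        # 8 12
```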