library(lsrbook)
9 How does R store data?
You are, I’m sorry to inform you, reading the work-in-progress revision of “Learning Statistics with R”. This chapter is currently a mess, and I don’t recommend reading it.
In this chapter, I’ll talk more about what kinds of variables exist in R, and introduce three new kinds of variables: factors, data frames and formulas. I’ll finish up by talking a little bit about the help documentation in R as well as some other avenues for finding assistance. In general, I’m not trying to be comprehensive in this chapter; I’m trying to make sure that you’ve got the basic foundations needed to tackle the content that comes later in the book. However, a lot of the topics are revisited in more detail later, especially in Chapters @ref(datahandling) and @ref(scripting).
9.1 Factors
Okay, it’s time to start introducing some of the data types that are somewhat more specific to statistics. If you remember back to Chapter @ref(studydesign), when we assign numbers to possible outcomes, these numbers can mean quite different things depending on what kind of variable we are attempting to measure. In particular, we commonly make the distinction between nominal, ordinal, interval and ratio scale data. How do we capture this distinction in R? Currently, we only seem to have a single numeric data type. That’s probably not going to be enough, is it?
A little thought suggests that the numeric variable class in R is perfectly suited for capturing ratio scale data. For instance, if I were to measure response time (RT) for five different events, I could store the data in R like this:
RT <- c(342, 401, 590, 391, 554)
where the data here are measured in milliseconds, as is conventional in the psychological literature. It’s perfectly sensible to talk about “twice the response time”,
2 * RT
[1] 684 802 1180 782 1108
and equally sensible to talk about “the response time plus one second”:
RT + 1000
[1] 1342 1401 1590 1391 1554
And to a lesser extent, the “numeric” class is okay for interval scale data, as long as we remember that multiplication and division aren’t terribly interesting for these sorts of variables. That is, if my IQ score is 110 and yours is 120, it’s perfectly okay to say that you’re 10 IQ points smarter than me¹, but it’s not okay to say that I’m only 92% as smart as you are, because intelligence doesn’t have a natural zero.² We might even be willing to tolerate the use of numeric variables to represent ordinal scale variables, such as those that you typically get when you ask people to rank order items (e.g., like we do in Australian elections), though as we will see R actually has a built-in tool for representing ordinal data (see Section @ref(orderedfactors)). However, when it comes to nominal scale data, using numeric variables becomes completely unacceptable, because almost all of the “usual” rules for what you’re allowed to do with numbers don’t apply to nominal scale data. It is for this reason that R has factors.
9.1.1 Introducing factors
Suppose I was doing a study in which people could belong to one of three different treatment conditions. Each group of people was asked to complete the same task, but each group received different instructions. Not surprisingly, I might want to have a variable that keeps track of what group people were in. So I could type in something like this:
group <- c(1,1,1,2,2,2,3,3,3)
so that group[i] contains the group membership of the i-th person in my study. Clearly, this is numeric data, but equally obviously this is a nominal scale variable. There’s no sense in which “group 1” plus “group 2” equals “group 3”, but nevertheless if I try to do that, R won’t stop me because it doesn’t know any better:
group + 2
[1] 3 3 3 4 4 4 5 5 5
Apparently R seems to think that it’s allowed to invent “group 4” and “group 5”, even though they didn’t actually exist. Unfortunately, R is too stupid to know any better: it thinks that 3 is an ordinary number in this context, so it sees no problem in calculating 3 + 2. But since we’re not that stupid, we’d like to stop R from doing this. We can do so by instructing R to treat group as a factor. This is easy to do using the as.factor() function.³
group <- as.factor(group)
group
[1] 1 1 1 2 2 2 3 3 3
Levels: 1 2 3
It looks more or less the same as before (though it’s not immediately obvious what all that Levels rubbish is about), but if we ask R to tell us what the class of the group variable is now, it’s clear that it has done what we asked:
class(group)
[1] "factor"
Neat. Better yet, now that I’ve converted group to a factor, look what happens when I try to add 2 to it:
group + 2
Warning in Ops.factor(group, 2): '+' not meaningful for factors
[1] NA NA NA NA NA NA NA NA NA
This time even R is smart enough to know that I’m being an idiot, so it tells me off and then produces a vector of missing values (i.e., NA; see Section @ref(specials)).
9.1.2 Labelling the factor levels
I have a confession to make. My memory is not infinite in capacity; and it seems to be getting worse as I get older. So it kind of annoys me when I get data sets where there’s a nominal scale variable called gender, with two levels corresponding to males and females. But when I go to print out the variable I get something like this:
gender
[1] 1 1 1 1 1 2 2 2 2
Levels: 1 2
Okaaaay. That’s not helpful at all, and it makes me very sad. Which number corresponds to the males and which one corresponds to the females? Wouldn’t it be nice if R could actually keep track of this? It’s way too hard to remember which number corresponds to which gender. To fix this problem what we need to do is assign meaningful labels to the different levels of each factor. We can do that like this:
levels(group) <- c("group 1", "group 2", "group 3")
print(group)
[1] group 1 group 1 group 1 group 2 group 2 group 2 group 3 group 3 group 3
Levels: group 1 group 2 group 3
levels(gender) <- c("male", "female")
print(gender)
[1] male male male male male female female female female
Levels: male female
That’s much easier on the eye.
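If you’d rather do the conversion and the labelling in one step, base R’s factor() function accepts levels and labels arguments. The sketch below is only an illustration: the vector name gender_codes is made up, and the coding (1 = male, 2 = female) is an assumption chosen to match the printout above.

```r
# Hypothetical numeric codes; the 1 = male, 2 = female coding is assumed
gender_codes <- c(1, 1, 1, 1, 1, 2, 2, 2, 2)

# Convert to a factor and attach the labels in a single step
gender_labelled <- factor(gender_codes,
                          levels = c(1, 2),
                          labels = c("male", "female"))

levels(gender_labelled)   # the labels, in the order they were given
table(gender_labelled)    # how many observations fall in each level
```

The labels are matched to the levels by position, so it’s worth double checking the order before trusting the result.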
9.1.3 Moving on…
Factors are very useful things, and we’ll use them a lot in this book: they’re the main way to represent a nominal scale variable. And there are lots of nominal scale variables out there. I’ll talk more about factors in Section @ref(orderedfactors), but for now you know enough to be able to get started.
9.2 Data frames
It’s now time to go back and deal with the somewhat confusing thing that happened in Section @ref(loadingcsv) when we tried to open up a CSV file. Apparently we succeeded in loading the data, but it came to us in a very odd looking format. At the time, I told you that this was a data frame. Now I’d better explain what that means.
9.2.1 Introducing data frames
In order to understand why R has created this funny thing called a data frame, it helps to try to see what problem it solves. So let’s go back to the little scenario that I used when introducing factors in Section @ref(factors). In that section I recorded the group and gender for all 9 participants in my study. Let’s also suppose I recorded their ages and their score on “Dan’s Terribly Exciting Psychological Test”:
age <- c(17, 19, 21, 37, 18, 19, 47, 18, 19)
score <- c(12, 10, 11, 15, 16, 14, 25, 21, 29)
Assuming no other variables are in the workspace, if I type show_environment() I get this:
show_environment()
# A tibble: 7 × 3
  variable        class    size
  <chr>           <chr>    <chr>
1 age             numeric  length: 9
2 gender          factor   length: 9
3 group           factor   length: 9
4 has_annotations function <NA>
5 hook_output     function <NA>
6 score           numeric  length: 9
7 status          function <NA>
So, setting aside the three function entries, there are four variables in the workspace: age, gender, group and score. And it just so happens that all four of them are the same size (i.e., they’re all vectors with 9 elements). Aaaand it just so happens that age[1] corresponds to the age of the first person, and gender[1] is the gender of that very same person, etc. In other words, you and I both know that all four of these variables correspond to the same data set, and all four of them are organised in exactly the same way.
However, R doesn’t know this! As far as it’s concerned, there’s no reason why the age variable has to be the same length as the gender variable; and there’s no particular reason to think that age[1] has any special relationship to gender[1] any more than it has a special relationship to gender[4]. In other words, when we store everything in separate variables like this, R doesn’t know anything about the relationships between things. It doesn’t even really know that these variables actually refer to a proper data set. The data frame fixes this: if we store our variables inside a data frame, we’re telling R to treat these variables as a single, fairly coherent data set.
To see how they do this, let’s create one. So how do we create a data frame? One way we’ve already seen: if we import our data from a CSV file, R will store it as a data frame. A second way is to create it directly from some existing variables using the data.frame() function. All you have to do is type a list of variables that you want to include in the data frame. The output of a data.frame() command is, well, a data frame. So, if I want to store all four variables from my experiment in a data frame called expt I can do so like this:
expt <- data.frame ( age, gender, group, score )
expt
  age gender   group score
1  17   male group 1    12
2  19   male group 1    10
3  21   male group 1    11
4  37   male group 2    15
5  18   male group 2    16
6  19 female group 2    14
7  47 female group 3    25
8  18 female group 3    21
9  19 female group 3    29
Note that expt is a completely self-contained variable. Once you’ve created it, it no longer depends on the original variables from which it was constructed. That is, if we make changes to the original age variable, it will not lead to any changes to the age data stored in expt.
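To make the “self-contained” point concrete, here is a minimal sketch using a made-up three-person data set. The names age_raw, score_raw and small are hypothetical, chosen so that they don’t clash with anything in the running example.

```r
# Hypothetical three-person data set, just for illustration
age_raw   <- c(17, 19, 21)
score_raw <- c(12, 10, 11)
small <- data.frame(age_raw, score_raw)

age_raw[1] <- 99   # overwrite the first element of the original vector

age_raw            # the stand-alone vector has changed
small              # the copy stored inside the data frame has not
```

The data frame keeps its own copy of the values, so later edits to the loose vectors never leak into it.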
9.2.2 Pulling out the contents of the data frame using $
At this point, our workspace contains only the one variable, a data frame called expt. But as we can see when we told R to print the variable out, this data frame contains 4 variables, each of which has 9 observations. So how do we get this information out again? After all, there’s no point in storing information if you don’t use it, and there’s no way to use information if you can’t access it. So let’s talk a bit about how to pull information out of a data frame.
The first thing we might want to do is pull out one of our stored variables, let’s say score. One thing you might try to do is ignore the fact that score is locked up inside the expt data frame. For instance, you might try to print it out like this:
score
Error in eval(expr, envir, enclos): object 'score' not found
This doesn’t work, because R doesn’t go “peeking” inside the data frame unless you explicitly tell it to do so. There’s actually a very good reason for this, which I’ll explain in a moment, but for now let’s just assume R knows what it’s doing. How do we tell R to look inside the data frame? As is always the case with R there are several ways. The simplest way is to use the $ operator to extract the variable you’re interested in, like this:
expt$score
[1] 12 10 11 15 16 14 25 21 29
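Since I mentioned that there are several ways to look inside a data frame, here is a short sketch of two alternatives from base R alongside $. To keep it self-contained it uses a cut-down, hypothetical data frame called expt_small rather than the full expt.

```r
# A cut-down, hypothetical version of expt with two of its columns
expt_small <- data.frame(
  age   = c(17, 19, 21),
  score = c(12, 10, 11)
)

expt_small$score          # the $ operator, as in the text
expt_small[["score"]]     # double brackets, with the column name as a string
expt_small[, "score"]     # matrix-style indexing: all rows, one column
```

All three return the same numeric vector; the bracket forms are handy when the column name is itself stored in a variable.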
9.2.3 Getting information about a data frame
One problem that sometimes comes up in practice is that you forget what you called all your variables. Normally you might try to type objects() or show_environment(), but neither of those commands will tell you what the names are for those variables inside a data frame! One way is to ask R to tell you what the names of all the variables stored in the data frame are, which you can do using the names() function:
names(expt)
[1] "age" "gender" "group" "score"
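Besides names(), a few other base R functions give a quick overview of a data frame. The sketch below again uses a small hypothetical data frame (expt_small) so that it runs on its own.

```r
# Hypothetical stand-in for expt
expt_small <- data.frame(
  age   = c(17, 19, 21),
  score = c(12, 10, 11)
)

names(expt_small)   # the variable names
nrow(expt_small)    # number of observations (rows)
ncol(expt_small)    # number of variables (columns)
str(expt_small)     # compact summary: each variable's type and first few values
```

Of these, str() is probably the one to reach for first, since it shows the names, the types and a peek at the data all at once.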
9.2.4 Looking for more on data frames?
There’s a lot more that can be said about data frames: they’re fairly complicated beasts, and the longer you use R the more important it is to make sure you really understand them. We’ll talk a lot more about them in Chapter @ref(datahandling).
9.3 Lists
The next kind of data I want to mention are lists. Lists are an extremely fundamental data structure in R, and as you start making the transition from a novice to a savvy R user you will use lists all the time. I don’t use lists very often in this book – not directly – but most of the advanced data structures in R are built from lists (e.g., data frames are actually a specific type of list). Because lists are so important to how R stores things, it’s useful to have a basic understanding of them. Okay, so what is a list, exactly? Like data frames, lists are just “collections of variables.” However, unlike data frames – which are basically supposed to look like a nice “rectangular” table of data – there are no constraints on what kinds of variables we include, and no requirement that the variables have any particular relationship to one another. In order to understand what this actually means, the best thing to do is create a list, which we can do using the list() function. If I type this as my command:
Dan <- list( age = 34,
             nerd = TRUE,
             parents = c("Joe","Liz")
)
R creates a new list variable called Dan, which is a bundle of three different variables: age, nerd and parents. Notice that the parents variable is longer than the others. This is perfectly acceptable for a list, but it wouldn’t be for a data frame. If we now print out the variable, you can see the way that R stores the list:
print( Dan )
$age
[1] 34
$nerd
[1] TRUE
$parents
[1] "Joe" "Liz"
As you might have guessed from those $ symbols everywhere, the variables are stored in exactly the same way that they are for a data frame (again, this is not surprising: data frames are a type of list). So you will (I hope) be entirely unsurprised and probably quite bored when I tell you that you can extract the variables from the list using the $ operator, like so:
Dan$nerd
[1] TRUE
If you need to add new entries to the list, the easiest way to do so is to again use $, as the following example illustrates. If I type a command like this
Dan$children <- "Alex"
then R adds a new entry called children at the end of the list, and assigns it a value of "Alex". If I were now to print() this list out, you’d see a new entry at the bottom of the printout. Finally, it’s actually possible for lists to contain other lists, so if children were itself a list, it’s quite possible that I would end up using a command like Dan$children$age to find out how old my son is. Or I could try to remember it myself I suppose.
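Here is a small sketch of what a list inside a list might look like. The values for Dan come from the example above, but the child’s age is made up purely for illustration. Note that with children stored as the plain string "Alex", Dan$children$age would produce an error, because $ can’t be used on an ordinary character vector; the inner entry has to be a list.

```r
# Values for Dan come from the text; the child's age below is made up
Dan <- list(age = 34, nerd = TRUE, parents = c("Joe", "Liz"))

# Store the child as a list, so that it can hold several named entries
Dan$children <- list(name = "Alex", age = 3)   # age = 3 is hypothetical

Dan$children$age    # chain $ to reach inside the inner list
names(Dan)          # "children" has been added at the end
```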
9.4 Formulas
The last kind of variable that I want to introduce before finally being able to start talking about statistics is the formula. Formulas were originally introduced into R as a convenient way to specify a particular type of statistical model (see Chapter @ref(regression)) but they’re such handy things that they’ve spread. Formulas are now used in a lot of different contexts, so it makes sense to introduce them early.
Stated simply, a formula object is a variable, but it’s a special type of variable that specifies a relationship between other variables. A formula is specified using the “tilde operator” ~. A very simple example of a formula is shown below:⁴
formula1 <- out ~ pred
formula1
out ~ pred
The precise meaning of this formula depends on exactly what you want to do with it, but in broad terms it means “the out (outcome) variable, analysed in terms of the pred (predictor) variable”. That said, although the simplest and most common form of a formula uses the “one variable on the left, one variable on the right” format, there are others. For instance, the following examples are all reasonably common
formula2 <- out ~ pred1 + pred2   # more than one variable on the right
formula3 <- out ~ pred1 * pred2   # different relationship between predictors
formula4 <- ~ var1 + var2         # a 'one-sided' formula
and there are many more variants besides. Formulas are pretty flexible things, and so different functions will make use of different formats, depending on what the function is intended to do.
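If you’re curious what R actually knows about a formula, a few base R functions let you poke at one. This is just a sketch: the names f and g are hypothetical, and none of the variables mentioned inside the formulas need to exist for it to run.

```r
# Neither out, pred1, pred2, var1 nor var2 needs to exist at this point
f <- out ~ pred1 + pred2   # a two-sided formula
g <- ~ var1 + var2         # a one-sided formula

class(f)      # "formula"
all.vars(f)   # the variable names mentioned in the formula
length(f)     # 3: the ~ itself, the left-hand side and the right-hand side
length(g)     # 2: the ~ itself and the right-hand side
```

Functions that accept formulas use this kind of information to work out which variables to look up when the formula is finally put to use.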
9.5 Generic functions
There’s one really important thing that I omitted when I discussed functions earlier on in Section @ref(usingfunctions), and that’s the concept of a generic function. The two most notable examples that you’ll see in the next few chapters are summary() and plot(), although you’ve already seen an example of one working behind the scenes, and that’s the print() function. The thing that makes generics different from the other functions is that their behaviour changes, often quite dramatically, depending on the class() of the input you give them. The easiest way to explain the concept is with an example. With that in mind, let’s take a closer look at what the print() function actually does. I’ll do this by creating a formula, and printing it out in a few different ways. First, let’s stick with what we know:
my.formula <- blah ~ blah.blah   # create a variable of class "formula"
print( my.formula )              # print it out using the generic print() function
blah ~ blah.blah
So far, there’s nothing very surprising here. But there’s actually a lot going on behind the scenes. When I type print( my.formula ), what actually happens is the print() function checks the class of the my.formula variable. When the function discovers that the variable it’s been given is a formula, it goes looking for a function called print.formula(), and then delegates the whole business of printing out the variable to the print.formula() function.⁵ For what it’s worth, the name for a “dedicated” function like print.formula() that exists only to be a special case of a generic function like print() is a method, and the name for the process in which the generic function passes off all the hard work onto a method is method dispatch. You won’t need to understand the details at all for this book, but you do need to know the gist of it; if only because a lot of the functions we’ll use are actually generics. Anyway, to help expose a little more of the workings to you, let’s bypass the print() function entirely and call the formula method directly:
stats:::print.formula( my.formula ) # call the method directly; it is not exported from stats, hence the :::
blah ~ blah.blah
There’s no difference in the output at all. But this shouldn’t surprise you because it was actually the print.formula() method that was doing all the hard work in the first place. The print() function itself is a lazy bastard that doesn’t do anything other than select which of the methods is going to do the actual printing.
Okay, fair enough, but you might be wondering what would have happened if print.formula() didn’t exist? That is, what happens if there isn’t a specific method defined for the class of variable that you’re using? In that case, the generic function passes off the hard work to a “default” method, whose name in this case would be print.default(). Let’s see what happens if we bypass the print() function, and try to print out my.formula using the print.default() function:
print.default( my.formula ) # print it out using the print.default() method
blah ~ blah.blah
attr(,"class")
[1] "formula"
attr(,".Environment")
<environment: R_GlobalEnv>
Hm. You can kind of see that it is trying to print out the same formula, but there’s a bunch of ugly low-level details that have also turned up on screen. This is because the print.default() method doesn’t know anything about formulas, and doesn’t know that it’s supposed to be hiding the obnoxious internal gibberish that R produces sometimes.
At this stage, this is about as much as we need to know about generic functions and their methods. In fact, you can get through the entire book without learning any more about them than this, so it’s probably a good idea to end this discussion here.
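If you’d like to watch method dispatch happen with something you’ve built yourself, here is a minimal sketch. The class name "temperature" and the variable temp are entirely made up; nothing like them appears elsewhere in the book.

```r
# Create an object and give it a made-up class called "temperature"
temp <- structure(list(value = 21.5), class = "temperature")

# Define a print method; its name follows the pattern generic.class
print.temperature <- function(x, ...) {
  cat("Temperature:", x$value, "degrees\n")
  invisible(x)
}

print(temp)            # print() dispatches to print.temperature()
print(unclass(temp))   # with the class removed, the default method takes over
```

The first call uses our method, while the second shows the plain list underneath, which is the same contrast as between print.formula() and print.default() above.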
9.6 Summary
- Useful things to know about variables. In particular, we talked about special values, element names and classes.
- More complex types of variables. R has a number of important variable types that will be useful when analysing real data. I talked about factors in Section @ref(factors), data frames in Section @ref(dataframes), lists in Section @ref(lists) and formulas in Section @ref(formulas).
- Generic functions. How is it that some functions seem to be able to do lots of different things? Section @ref(generics) tells you how.
1. Taking all the usual caveats that attach to IQ measurement as a given, of course.
2. Or, more precisely, we don’t know how to measure it. Arguably, a rock has zero intelligence. But it doesn’t make sense to say that the IQ of a rock is 0 in the same way that we can say that the average human has an IQ of 100. And without knowing what the IQ value is that corresponds to a literal absence of any capacity to think, reason or learn, then we really can’t multiply or divide IQ scores and expect a meaningful answer.
3. Once again, this is an example of coercing a variable from one class to another. I’ll talk about coercion in more detail in Section @ref(coercion).
4. Note that, when I write out the formula, R doesn’t check to see if the out and pred variables actually exist: it’s only later on when you try to use the formula for something that this happens.
5. For readers with a programming background: what I’m describing is the very basics of how S3 methods work. However, you should be aware that R has two entirely distinct systems for doing object oriented programming, known as S3 and S4. Of the two, S3 is simpler and more informal, whereas S4 supports all the stuff that you might expect of a fully object oriented language. Most of the generics we’ll run into in this book use the S3 system, which is convenient for me because I’m still trying to figure out S4.