Understanding R's Data Structures, from Basic to Advanced, to Use Them Efficiently

### Data structures

You’ve probably used many (if not all) of them before, but you may not have thought deeply about how they are interrelated. In this brief overview, I’ll show you how they fit together as a whole. If you need more details, you can find them in R’s documentation.

R’s base data structures can be organised by their dimensionality (1d, 2d, or nd) and whether they’re homogeneous (all contents must be of the same type) or heterogeneous (the contents can be of different types). This gives rise to the five data types most often used in data analysis:

**Homogeneous:** Atomic vector (1d), Matrix (2d), Array (nd)

**Heterogeneous:** List (1d), Data frame (2d)

Almost all other objects are built upon these foundations. In the OO field guide you’ll see how more complicated objects are built of these simple pieces. Note that R has no 0-dimensional, or scalar types. Individual numbers or strings, which you might think would be scalars, are actually vectors of length one.

Given an object, the best way to understand what data structures it’s composed of is to use `str()`. `str()` is short for structure and it gives a compact, human-readable description of any R data structure.
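As a rough illustration, the sketch below calls `str()` on one small example of each of the five structures. The variable names (`vec`, `mat`, `arr`, `lst`, `dfr`) are placeholders chosen for this example.

```r
# One small example of each base data structure (names are illustrative).
vec <- c(1.5, 2, 3)                          # atomic vector (1d, homogeneous)
mat <- matrix(1:6, nrow = 2)                 # matrix (2d, homogeneous)
arr <- array(1:24, dim = c(2, 3, 4))         # array (nd, homogeneous)
lst <- list(1L, "a", TRUE)                   # list (1d, heterogeneous)
dfr <- data.frame(x = 1:2, y = c("a", "b"))  # data frame (2d, heterogeneous)

# str() prints a compact description of each object in turn.
for (obj in list(vec, mat, arr, lst, dfr)) str(obj)
```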

### Vectors

The basic data structure in R is the vector. Vectors come in two flavours: atomic vectors and lists. They have three common properties:

- Type, `typeof()`, what it is.
- Length, `length()`, how many elements it contains.
- Attributes, `attributes()`, additional arbitrary metadata.

They differ in the types of their elements: all elements of an atomic vector must be the same type, whereas the elements of a list can have different types.
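To make the three properties concrete, here is a minimal sketch that inspects one atomic vector and one list. The objects `atom` and `lst` are made up for the example.

```r
# An atomic vector: every element has the same type.
atom <- c(a = 1, b = 2.5)
typeof(atom)      # "double"
length(atom)      # 2
attributes(atom)  # a list with a single entry, $names

# A list: elements may have different types.
lst <- list(1L, "two", TRUE)
typeof(lst)       # "list"
length(lst)       # 3
attributes(lst)   # NULL, no metadata attached
```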

NB: `is.vector()` does not test if an object is a vector. Instead it returns `TRUE` only if the object is a vector with no attributes apart from names. Use `is.atomic(x) || is.list(x)` to test if an object is actually a vector.
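The pitfall is easy to reproduce. In the sketch below, `is_any_vector()` is a hypothetical helper (not part of base R) that simply wraps the test suggested above.

```r
# Hypothetical helper wrapping the recommended test; not a base R function.
is_any_vector <- function(x) is.atomic(x) || is.list(x)

x <- 1:3
is.vector(x)        # TRUE: no attributes at all

attr(x, "note") <- "some metadata"
is.vector(x)        # FALSE: there is now an attribute other than names
is_any_vector(x)    # TRUE: x is still an atomic vector
```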

### Atomic vectors

There are four common types of atomic vectors that I’ll discuss in detail: logical, integer, double (often called numeric), and character. There are two rare types that I will not discuss further: complex and raw.

Atomic vectors are usually created with `c()`, short for combine:

dbl_var <- c(1, 2.5, 4.5)
# With the L suffix, you get an integer rather than a double
int_var <- c(1L, 6L, 10L)
# Use TRUE and FALSE (or T and F) to create logical vectors
log_var <- c(TRUE, FALSE, T, F)
chr_var <- c("these are", "some strings")

Atomic vectors are always flat, even if you nest `c()`’s:

c(1, c(2, c(3, 4)))

## [1] 1 2 3 4

# the same as
c(1, 2, 3, 4)

## [1] 1 2 3 4

Missing values are specified with `NA`, which is a logical vector of length 1. `NA` will always be coerced to the correct type if used inside `c()`, or you can create `NA`s of a specific type with `NA_real_` (a double vector), `NA_integer_` and `NA_character_`.
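A short check of these claims, as a sketch; `na_types` is just an illustrative variable name.

```r
typeof(NA)             # "logical"
typeof(c(NA, 1L))      # "integer": NA is coerced to match its neighbours
typeof(c(NA, "a"))     # "character"

# The typed constants fix the type up front.
na_types <- sapply(list(NA, NA_integer_, NA_real_, NA_character_), typeof)
na_types               # "logical" "integer" "double" "character"
```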

#### Types and tests

Given a vector, you can determine its type with `typeof()`, or check if it’s a specific type with an “is” function: `is.character()`, `is.double()`, `is.integer()`, `is.logical()`, or, more generally, `is.atomic()`.

int_var <- c(1L, 6L, 10L)
typeof(int_var)

## [1] "integer"

is.integer(int_var)

## [1] TRUE

is.atomic(int_var)

## [1] TRUE

dbl_var <- c(1, 2.5, 4.5)
typeof(dbl_var)

## [1] "double"

is.double(dbl_var)

## [1] TRUE

is.atomic(dbl_var)

## [1] TRUE

NB: `is.numeric()` is a general test for the “numberliness” of a vector and returns `TRUE` for both integer and double vectors. It is not a specific test for double vectors, which are often called numeric.

is.numeric(int_var)

## [1] TRUE

is.numeric(dbl_var)

## [1] TRUE

#### Coercion

All elements of an atomic vector must be the same type, so when you attempt to combine different types they will be **coerced** to the most flexible type. Types from least to most flexible are: logical, integer, double, and character.

For example, combining a character and an integer yields a character:

str(c("a", 1))

## chr [1:2] "a" "1"

Keep the difference between atomic vectors and lists in mind: confusing the two is a common source of mistakes in real projects.

When a logical vector is coerced to an integer or double, `TRUE` becomes 1 and `FALSE` becomes 0. This is very useful in conjunction with `sum()` and `mean()`:

x <- c(FALSE, FALSE, TRUE)
as.numeric(x)

## [1] 0 0 1

# Total number of TRUEs
sum(x)

## [1] 1

# Proportion that are TRUE
mean(x)

## [1] 0.3333333

Coercion often happens automatically. Most mathematical functions (`+`, `log`, `abs`, etc.) will coerce to a double or integer, and most logical operations (`&`, `|`, `any`, etc.) will coerce to a logical. You will usually get a warning message if the coercion might lose information. If confusion is likely, explicitly coerce with `as.character()`, `as.double()`, `as.integer()`, or `as.logical()`.
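The following sketch shows automatic coercion in both directions, plus a lossy explicit coercion that triggers the warning mentioned above. The variable names are illustrative.

```r
# Arithmetic coerces logicals to numbers.
total <- TRUE + TRUE                 # 2 (an integer)

# Logical operators coerce numbers to logicals: 0 is FALSE, non-zero is TRUE.
flags <- c(0, 1, 2) & TRUE           # FALSE TRUE TRUE

# A lossy coercion yields NA and a warning ("NAs introduced by coercion").
parsed <- as.integer(c("1", "2", "three"))   # 1 2 NA

# Being explicit documents the intent.
as.character(c(1L, 2L))              # "1" "2"
```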

### Lists

Lists are different from atomic vectors because their elements can be of any type, including lists. You construct lists by using `list()` instead of `c()`:

x <- list(1:3, "a", c(TRUE, FALSE, TRUE), c(2.3, 5.9))
str(x)

## List of 4
##  $ : int [1:3] 1 2 3
##  $ : chr "a"
##  $ : logi [1:3] TRUE FALSE TRUE
##  $ : num [1:2] 2.3 5.9

Lists are sometimes called **recursive** vectors, because a list can contain other lists. This makes them fundamentally different from atomic vectors.

x <- list(list(list(list())))
str(x)

## List of 1
##  $ :List of 1
##   ..$ :List of 1
##   .. ..$ : list()

is.recursive(x)

## [1] TRUE

`c()` will combine several lists into one. If given a combination of atomic vectors and lists, `c()` will coerce the vectors to lists before combining them. Compare the results of `list()` and `c()`:

x <- list(list(1, 2), c(3, 4))
y <- c(list(1, 2), c(3, 4))
str(x)

## List of 2
##  $ :List of 2
##   ..$ : num 1
##   ..$ : num 2
##  $ : num [1:2] 3 4

str(y)

## List of 4
##  $ : num 1
##  $ : num 2
##  $ : num 3
##  $ : num 4

The `typeof()` of a list is `"list"`. You can test for a list with `is.list()` and coerce to a list with `as.list()`. You can turn a list into an atomic vector with `unlist()`. If the elements of a list have different types, `unlist()` uses the same coercion rules as `c()`.
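A small sketch of these conversions; `mixed`, `flat`, `nums`, and `back` are illustrative names.

```r
mixed <- list(1L, 2.5, "a")
is.list(mixed)                        # TRUE

flat <- unlist(mixed)                 # character wins, as with c()
flat                                  # "1" "2.5" "a"

nums <- unlist(list(1L, 2.5, TRUE))   # no strings here, so double wins
nums                                  # 1.0 2.5 1.0

back <- as.list(1:3)                  # a list of three length-one integer vectors
```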

Lists are used to build up many of the more complicated data structures in R. For example, both data frames (described in data frames) and linear model objects (as produced by `lm()`) are lists:

is.list(mtcars)

## [1] TRUE

mod <- lm(mpg ~ wt, data = mtcars)
is.list(mod)

## [1] TRUE

### Attributes

All objects can have arbitrary additional attributes, used to store metadata about the object. Attributes can be thought of as a named list (with unique names). Attributes can be accessed individually with `attr()` or all at once (as a list) with `attributes()`.

y <- 1:10
attr(y, "my_attribute") <- "This is a vector"
attr(y, "my_attribute")

## [1] "This is a vector"

str(attributes(y))

## List of 1
##  $ my_attribute: chr "This is a vector"

The `structure()` function returns a new object with modified attributes:

structure(1:10, my_attribute = "This is a vector")

##  [1]  1  2  3  4  5  6  7  8  9 10
## attr(,"my_attribute")
## [1] "This is a vector"

By default, most attributes are lost when modifying a vector:

attributes(y[1])

## NULL

attributes(sum(y))

## NULL
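The word "most" matters here. As a sketch of standard base R behaviour (the text above does not spell this out), names survive subsetting while a custom attribute does not:

```r
y <- c(a = 1, b = 2, c = 3)
attr(y, "my_attribute") <- "This is a vector"

"my_attribute" %in% names(attributes(y))   # TRUE

kept <- attributes(y[1:2])   # subsetting keeps the names only
names(kept)                  # "names"

attributes(sum(y))           # NULL: everything is dropped
```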

### Factors

One important use of attributes is to define factors. A factor is a vector that can contain only predefined values, and is used to store categorical data. Factors are built on top of integer vectors using two attributes: the `class`, “factor”, which makes them behave differently from regular integer vectors, and the `levels`, which defines the set of allowed values.

x <- factor(c("a", "b", "b", "a"))
x

## [1] a b b a
## Levels: a b

class(x)

## [1] "factor"

levels(x)

## [1] "a" "b"

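# You can't use values that are not in the levels
x[2] <- "c"

## Warning in `[<-.factor`(`*tmp*`, 2, value = "c"): invalid factor level, NA generated

x

## [1] a    <NA> b    a
## Levels: a b

# NB: before R 4.1.0 you can't combine factors with c(); you get the
# underlying integer codes, as in the output below. Since R 4.1.0, c()
# returns a factor with the combined levels.
c(factor("a"), factor("b"))

## [1] 1 1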

Factors are useful when you know the possible values a variable may take, even if you don’t see all values in a given dataset. Using a factor instead of a character vector makes it obvious when some groups contain no observations:

sex_char <- c("m", "m", "m")
sex_factor <- factor(sex_char, levels = c("m", "f"))
table(sex_char)

## sex_char
## m 
## 3

table(sex_factor)

## sex_factor
## m f 
## 3 0

Sometimes when a data frame is read directly from a file, a column you’d thought would produce a numeric vector instead produces a factor. This is caused by a non-numeric value in the column, often a missing value encoded in a special way like `.` or `-`. To remedy the situation, coerce the vector from a factor to a character vector, and then from a character to a double vector. (Be sure to check for missing values after this process.) Of course, a much better plan is to discover what caused the problem in the first place and fix that; using the `na.strings` argument to `read.csv()` is often a good place to start.

# Reading in "text" instead of from a file here.
# stringsAsFactors = TRUE reproduces the default of R versions before 4.0.0.
z <- read.csv(text = "value\n12\n1\n.\n9", stringsAsFactors = TRUE)
typeof(z$value)

## [1] "integer"

as.double(z$value)

## [1] 3 2 1 4

# Oops, that's not right: 3 2 1 4 are the levels of a factor,
# not the values we read in!
class(z$value)

## [1] "factor"

# We can fix it now:
as.double(as.character(z$value))

## Warning: NAs introduced by coercion

## [1] 12 1 NA 9
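For comparison, here is a minimal sketch of the `na.strings` route on the same inline data. It assumes `.` is the only missing-value marker in the column; `z2` is an illustrative name.

```r
# Same inline data as above, but "." is declared as a missing-value marker.
z2 <- read.csv(text = "value\n12\n1\n.\n9", na.strings = ".")

typeof(z2$value)   # "integer"
z2$value           # 12 1 NA 9
```

Because the column is parsed as numbers from the start, no factor or character round trip is needed.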

Unfortunately, before R 4.0.0 most data loading functions in R automatically converted character vectors to factors. This is suboptimal, because there’s no way for those functions to know the set of all possible levels or their optimal order. Instead, use the argument `stringsAsFactors = FALSE` to suppress this behaviour (it is the default since R 4.0.0), and then manually convert character vectors to factors using your knowledge of the data. A global option, `options(stringsAsFactors = FALSE)`, is available to control this behaviour, but I don’t recommend using it. Changing a global option may have unexpected consequences when combined with other code (either from packages, or code that you’re `source()`ing), and global options make code harder to understand because they increase the number of lines you need to read to understand how a single line of code will behave.

While factors look (and often behave) like character vectors, they are actually integers. Be careful when treating them like strings. Some string methods (like `gsub()` and `grepl()`) will coerce factors to strings, while others (like `nchar()`) will throw an error, and still others (like `c()` before R 4.1.0) will use the underlying integer values. For this reason, it’s usually best to explicitly convert factors to character vectors if you need string-like behaviour. In early versions of R, there was a memory advantage to using factors instead of character vectors, but this is no longer the case.
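A minimal sketch of the explicit conversion recommended above; the factor `f` is made up for the example.

```r
f <- factor(c("low", "high", "low"))

as.integer(f)              # 2 1 2: codes into the sorted levels "high", "low"
as.character(f)            # "low" "high" "low"

# Convert first, then use string functions.
nchar(as.character(f))     # 3 4 3
toupper(as.character(f))   # "LOW" "HIGH" "LOW"
```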

### Matrices and arrays

Adding a `dim` attribute to an atomic vector allows it to behave like a multi-dimensional **array**. A special case of the array is the **matrix**, which has two dimensions. Matrices are used commonly as part of the mathematical machinery of statistics. Arrays are much rarer, but worth being aware of.

Matrices and arrays are created with `matrix()` and `array()`, or by using the assignment form of `dim()`:

# Two scalar arguments to specify rows and columns
a <- matrix(1:6, ncol = 3, nrow = 2)
# One vector argument to describe all dimensions
b <- array(1:12, c(2, 3, 2))

# You can also modify an object in place by setting dim()
c <- 1:6
dim(c) <- c(3, 2)
c

##      [,1] [,2]
## [1,]    1    4
## [2,]    2    5
## [3,]    3    6

dim(c) <- c(2, 3)
c

##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6

`length()` and `names()` have high-dimensional generalisations:

- `length()` generalises to `nrow()` and `ncol()` for matrices, and `dim()` for arrays.
- `names()` generalises to `rownames()` and `colnames()` for matrices, and `dimnames()`, a list of character vectors, for arrays.

length(a)

## [1] 6

nrow(a)

## [1] 2

ncol(a)

## [1] 3

rownames(a) <- c("A", "B")
colnames(a) <- c("a", "b", "c")
a

##   a b c
## A 1 3 5
## B 2 4 6

length(b)

## [1] 12

dim(b)

## [1] 2 3 2

dimnames(b) <- list(c("one", "two"), c("a", "b", "c"), c("A", "B"))
b

## , , A
## 
##     a b c
## one 1 3 5
## two 2 4 6
## 
## , , B
## 
##     a  b  c
## one 7  9 11
## two 8 10 12

`c()` generalises to `cbind()` and `rbind()` for matrices, and to `abind()` (provided by the `abind` package) for arrays. You can transpose a matrix with `t()`; the generalised equivalent for arrays is `aperm()`.
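The base R functions named above can be sketched as follows (`abind()` is left out because it needs an extra package). The objects `m` and `arr` are illustrative.

```r
m <- matrix(1:6, nrow = 2)             # a 2 x 3 matrix

dim(cbind(m, m))                       # 2 6: columns appended
dim(rbind(m, m))                       # 4 3: rows appended
dim(t(m))                              # 3 2: transposed

arr <- array(1:24, dim = c(2, 3, 4))
dim(aperm(arr))                        # 4 3 2: the default reverses the dimensions
dim(aperm(arr, c(2, 1, 3)))            # 3 2 4: swap only the first two
```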

You can test if an object is a matrix or array using `is.matrix()` and `is.array()`, or by looking at the length of the `dim()`. `as.matrix()` and `as.array()` make it easy to turn an existing vector into a matrix or array.
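A brief sketch of these tests and coercions on a plain integer vector; `v`, `col_mat`, and `arr1` are illustrative names.

```r
v <- 1:6
is.matrix(v)                 # FALSE: no dim attribute
length(dim(v))               # 0

col_mat <- as.matrix(v)      # a 6 x 1 column matrix
dim(col_mat)                 # 6 1
is.matrix(col_mat)           # TRUE
is.array(col_mat)            # TRUE: a matrix is a 2d array

arr1 <- as.array(v)          # a 1d array
is.array(arr1)               # TRUE
is.matrix(arr1)              # FALSE: only one dimension
```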

Vectors are not the only 1-dimensional data structure. You can have matrices with a single row or single column, or arrays with a single dimension. They may print similarly, but will behave differently. The differences aren’t too important, but it’s useful to know they exist in case you get strange output from a function (`tapply()` is a frequent offender). As always, use `str()` to reveal the differences.

str(1:3) # 1d vector

## int [1:3] 1 2 3

str(matrix(1:3, ncol = 1)) # column vector

## int [1:3, 1] 1 2 3

str(matrix(1:3, nrow = 1)) # row vector

## int [1, 1:3] 1 2 3

str(array(1:3, 3)) # "array" vector

## int [1:3(1d)] 1 2 3

### Data frames

A data frame is the most common way of storing data in R, and if used systematically makes data analysis easier. Under the hood, a data frame is a list of equal-length vectors. This makes it a 2-dimensional structure, so it shares properties of both the matrix and the list. This means that a data frame has `names()`, `colnames()`, and `rownames()`, although `names()` and `colnames()` are the same thing. The `length()` of a data frame is the length of the underlying list and so is the same as `ncol()`; `nrow()` gives the number of rows.
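These relationships are easy to confirm on a small example. The data frame below is made up for illustration.

```r
df <- data.frame(x = 1:3, y = c("a", "b", "c"))

names(df)                            # "x" "y"
identical(names(df), colnames(df))   # TRUE: the same thing for a data frame
length(df)                           # 2: one list element per column
ncol(df)                             # 2
nrow(df)                             # 3
rownames(df)                         # "1" "2" "3"
```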

As described in subsetting, you can subset a data frame like a 1d structure (where it behaves like a list), or a 2d structure (where it behaves like a matrix).
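As a quick sketch of the two subsetting styles (the data frame is an illustrative example):

```r
df <- data.frame(x = 1:3, y = c("a", "b", "c"))

# List-style (1d) subsetting works on columns.
df[["x"]]        # the integer vector 1 2 3
df["x"]          # a one-column data frame

# Matrix-style (2d) subsetting uses [row, column].
df[2, "x"]       # 2
df[df$x > 1, ]   # rows 2 and 3, all columns
```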

#### Creation

You create a data frame using `data.frame()`, which takes named vectors as input:

df <- data.frame(x = 1:3, y = c("a", "b", "c"))
str(df)

## 'data.frame': 3 obs. of 2 variables:
##  $ x: int 1 2 3
##  $ y: Factor w/ 3 levels "a","b","c": 1 2 3

Beware `data.frame()`’s default behaviour in R versions before 4.0.0, which turns strings into factors (as in the output above). Use `stringsAsFactors = FALSE` to suppress this behaviour; since R 4.0.0 this is the default:

df <- data.frame(
  x = 1:3,
  y = c("a", "b", "c"),
  stringsAsFactors = FALSE
)
str(df)

#### Testing and coercion

Because a `data.frame` is an S3 class, its type reflects the underlying vector used to build it: the list. To check if an object is a data frame, use `class()` or test explicitly with `is.data.frame()`:

typeof(df)

## [1] "list"

class(df)

## [1] "data.frame"

is.data.frame(df)

## [1] TRUE

#### Combining data frames

You can combine data frames using `cbind()` and `rbind()`:

cbind(df, data.frame(z = 3:1))

##   x y z
## 1 1 a 3
## 2 2 b 2
## 3 3 c 1

rbind(df, data.frame(x = 10, y = "z"))

##    x y
## 1  1 a
## 2  2 b
## 3  3 c
## 4 10 z

When combining column-wise, the number of rows must match, but row names are ignored. When combining row-wise, both the number and names of columns must match. Use `plyr::rbind.fill()` to combine data frames that don’t have the same columns.

It’s a common mistake to try and create a data frame by `cbind()`ing vectors together. This doesn’t work because `cbind()` will create a matrix unless one of the arguments is already a data frame. Instead use `data.frame()` directly:

bad <- data.frame(cbind(a = 1:2, b = c("a", "b")))
str(bad)

## 'data.frame': 2 obs. of 2 variables:
##  $ a: Factor w/ 2 levels "1","2": 1 2
##  $ b: Factor w/ 2 levels "a","b": 1 2

good <- data.frame(a = 1:2, b = c("a", "b"), stringsAsFactors = FALSE)
str(good)

## 'data.frame': 2 obs. of 2 variables:
##  $ a: int 1 2
##  $ b: chr "a" "b"

Taken from *Advanced R* by Hadley Wickham.
