5. Factors

A factor is an object that represents values from some specified set of possible levels. For example, a factor Sex might represent one of two values, "Male" or "Female".

Creating factors

Factors may be created with the factor() function, or by converting a character or numeric object with factor() or as.factor(). Factors may also be created by splitting a data object into groups. Each method is illustrated below by creating the following factor:

> age

[1] 20-35yrs 20-35yrs 35-55yrs 35-55yrs 20-35yrs 55+yrs 20-35yrs 35-55yrs

1) Using the factor() function:

> age <- factor(c(1,1,2,2,1,3,1,2), labels=c("20-35yrs","35-55yrs","55+yrs"))

> age

[1] 20-35yrs 20-35yrs 35-55yrs 35-55yrs 20-35yrs 55+yrs 20-35yrs 35-55yrs

2) Converting a numeric object:

> age <- c(1,1,2,2,1,3,1,2)

> age <- factor(age, labels=c("20-35yrs","35-55yrs","55+yrs"))

3) Converting a character object:

> age <- c("20-35yrs","20-35yrs","35-55yrs","35-55yrs","20-35yrs","55+yrs","20-35yrs","35-55yrs")

> age <- as.factor(age)

The function as.factor() may also be used to convert a numeric object to a factor object. However, as.factor() cannot assign labels to the factor levels.
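
The limitation above can be worked around by setting the levels after conversion. The following is a minimal sketch, assuming R (a later dialect of S) rather than the S version described here; the variable name codes is introduced only for illustration.

```r
# Assumption: run in R. as.factor() takes no labels argument,
# so the labels are assigned afterwards through levels().
codes <- c(1, 1, 2, 2, 1, 3, 1, 2)   # hypothetical name for the numeric codes
age <- as.factor(codes)

# The levels start out as the character forms of the distinct codes.
original_levels <- levels(age)

# Relabel the levels in place, in the same order as the sorted codes.
levels(age) <- c("20-35yrs", "35-55yrs", "55+yrs")
print(age)
```

The result should match the factor built with factor() and a labels argument in method 2.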

4) Splitting a data object into groups:

> age <- c(22,31,37,52,27,60,34,53)

> age.groups <- cut(age, breaks=c(20,35,55,80), labels=c("20-35yrs","35-55yrs","55+yrs"))

The function cut() creates an object of mode category. Category objects were used by the previous version of S and have now been replaced by factor objects. The category object age.groups can be converted to a factor object using the function factor().

> age <- factor(age.groups)

The function cut() may also be used to split a data object into groups of equal width:

> age <- c(22,31,37,52,27,60,34,53)

> cut(age, breaks=3)

[1] 1 1 2 3 1 3 1 3
attr(, "levels"):
[1] "Range 1" "Range 2" "Range 3"

A category object is a vector of integers with a levels attribute. The integer 1 corresponds to the label "Range 1", 2 to "Range 2", and so on. When a category object is converted to a factor object, the levels attribute supplies the factor labels.
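
The pairing of integer codes with a levels attribute can be inspected directly. This is a hedged sketch, assuming R, where cut() returns a factor rather than a category object and generates interval labels instead of "Range 1" and so on; the names groups and codes are hypothetical.

```r
# Assumption: run in R, where cut() yields a factor directly.
age <- c(22, 31, 37, 52, 27, 60, 34, 53)
groups <- cut(age, breaks = 3)    # three intervals of equal width

# The integer codes underneath the factor, one per observation.
codes <- as.integer(groups)
print(codes)

# The labels that the codes index into, one per interval.
print(levels(groups))
```

The integer codes should agree with those printed for the category object above, since the same three equal-width intervals are used.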

The function pretty() creates "pretty" break points which can then be used by the function cut() to split the data:

> age <- c(22,31,37,52,27,60,34,53)

> cut(age,pretty(age))

[1] 1 2 2 4 1 4 2 4
attr(, "levels"):
[1] "20+ thru 30" "30+ thru 40" "40+ thru 50" "50+ thru 60"

Creating Ordered Factors

Ordered factors are factors whose levels are taken to be ordered.

1) Converting a numeric object:

> age <- c(1,1,2,2,1,3,1,2)

> age <- ordered(age, levels=c(1,2,3), labels=c("20-35yrs","35-55yrs","55+yrs"))

> age

[1] 20-35yrs 20-35yrs 35-55yrs 35-55yrs 20-35yrs 55+yrs 20-35yrs 35-55yrs

20-35yrs < 35-55yrs < 55+yrs

2) Converting a character object:

> age <- c("20-35yrs","20-35yrs","35-55yrs","35-55yrs","20-35yrs","55+yrs","20-35yrs","35-55yrs")

> ordered(age) <- c("20-35yrs","35-55yrs","55+yrs")

3) Splitting a data object into groups:

When a category object is converted to an ordered factor object, the levels attribute is not used as factor labels. In order to keep the label names, the category object must first be converted to a factor object and then to an ordered factor object.

> age <- c(22,31,37,52,27,60,34,53)

> age <- cut(age, pretty(age))

> age <- factor(age)

> age <- ordered(age)

> age

[1] 20+ thru 30 30+ thru 40 30+ thru 40 50+ thru 60 20+ thru 30 50+ thru 60

[7] 30+ thru 40 50+ thru 60

20+ thru 30 < 30+ thru 40 < 50+ thru 60
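
To illustrate what ordered levels make possible, the sketch below compares observations against a level and takes the largest level present. It assumes R, where ordered factors support comparison operators and max(); the name older is introduced only for this example.

```r
# Assumption: run in R. Build the ordered factor from numeric codes.
age <- ordered(c(1, 1, 2, 2, 1, 3, 1, 2),
               levels = c(1, 2, 3),
               labels = c("20-35yrs", "35-55yrs", "55+yrs"))

# Comparisons follow the level order, not alphabetical order.
older <- age >= "35-55yrs"    # hypothetical name: TRUE for the two upper groups
print(older)

# The highest level that occurs in the data.
print(max(age))
```

The same comparison on an unordered factor is not meaningful, which is the practical difference between the two kinds of object.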

Table

The table() function creates a contingency table.

> table(age)

 20-35yrs 35-55yrs 55+yrs
        4        3      1

Any number of arguments may be given to the table() function; each additional factor adds a dimension to the table:

> sex <- factor(c(1,2,2,1,2,1,2,2), labels=c("Female","Male"))

> table(sex, age)

       20-35yrs 35-55yrs 55+yrs
Female        1        1      1
  Male        3        2      0

The function tapply() applies a function to the values that fall in each cell of a table. Suppose we wish to report the mean systolic blood pressure for persons in each of the age/sex groups:

> systol <- c(118, 125, 128, 127, 110, 140, 130, 120)

> tapply(systol, list(sex, age), mean)

       20-35yrs 35-55yrs 55+yrs
Female 118.0000      127    140
  Male 121.6667      124     NA

The second argument to the tapply() function gives the factors that index the cells over which the mean systolic blood pressures are calculated. The Male, 55+yrs cell is NA because no observations fall in it.
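
The table() and tapply() calls above can be reproduced end to end as follows. This is a sketch assuming R; the data are those used in this section, and the names counts, means and highest are introduced only for illustration.

```r
# Assumption: run in R. Rebuild the two factors and the measurements.
age <- factor(c(1, 1, 2, 2, 1, 3, 1, 2),
              labels = c("20-35yrs", "35-55yrs", "55+yrs"))
sex <- factor(c(1, 2, 2, 1, 2, 1, 2, 2), labels = c("Female", "Male"))
systol <- c(118, 125, 128, 127, 110, 140, 130, 120)

# Cross-classified counts: rows are sex, columns are age group.
counts <- table(sex, age)
print(counts)

# Mean systolic pressure within each sex/age cell; empty cells give NA.
means <- tapply(systol, list(sex, age), mean)
print(means)

# Any summary function can be substituted, for example the maximum.
highest <- tapply(systol, list(sex, age), max)
print(highest)
```

Cells with a count of zero in the table are the ones that come back as NA from tapply().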

Further Reading

John M. Chambers, Trevor J. Hastie, Statistical Models in S, Wadsworth & Brooks/Cole Advanced Books & Software, Pacific Grove, California, 1992, pp. 20, 21, 52-54.

Richard A. Becker, John M. Chambers, Allan R. Wilks, The New S Language: A Programming Environment for Data Analysis and Graphics, Wadsworth & Brooks/Cole Advanced Books & Software, Pacific Grove, California, 1988, pp. 134-138.
