Wednesday, March 20, 2013

resolving factor frustration in R

Factors in R have two components: their "index" (my term; R's documentation calls these the integer codes), which is just a vector of integers, and their "levels", which is a character vector.  Even when the original data were numbers, the levels are stored as character strings.  Matloff explains this well in his book The Art of R Programming, but I'll give my own quick example here.

    > f = factor(c(10,11,12,11))
    > f
    [1] 10 11 12 11
    Levels: 10 11 12

    > levels(f)
    [1] "10" "11" "12"

    > unclass(f)
    [1] 1 2 3 2
    attr(,"levels")
    [1] "10" "11" "12"

    > attributes(f)
    $levels
    [1] "10" "11" "12"

    $class
    [1] "factor"

You can see that R's internal representation of factors is just as I explained above.  That is, there is an integer index vector whose values are positions in the "levels" character vector.  The "levels" character vector is stored as an attribute and is the factor's second component.
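To make the lookup concrete, here is a minimal sketch that rebuilds the factor's values by hand from its two components.  The variable names (codes, lev, rebuilt) are my own, chosen only for illustration.

```r
# Rebuild a factor's values by hand from its two components.
f <- factor(c(10, 11, 12, 11))

codes <- as.integer(f)   # the integer index: 1 2 3 2
lev   <- levels(f)       # the character levels: "10" "11" "12"

# Indexing the levels by the codes "dereferences" the factor.
rebuilt <- lev[codes]
print(rebuilt)

# The hand-built result matches what as.character() gives.
print(identical(rebuilt, as.character(f)))
```

Both components are needed to recover the original values: the codes alone say only which level each element belongs to, not what that level is.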

While perhaps a bit confusing, this is all relatively straightforward.  Where factors can get really frustrating, however, is when different functions use different parts of the factor; some use the de-referenced "levels" of the factor, whereas others use the index values.  This problem is most evident in the as.character() and as.numeric() functions:

    > as.character(f)
    [1] "10" "11" "12" "11"

    > as.numeric(f)
    [1] 1 2 3 2

What's happening here?  as.character() is dereferencing the index and returning the character levels.  On the other hand, as.numeric() is returning the /index vector/ part of the factor.  This is probably not what you'd expect if you have a factor whose levels are numbers, such as the one above, and R gives no warning that the result is the index rather than the values.  When you want to turn your factor into numbers, what you probably want to do is:

    > as.numeric(as.character(f))
    [1] 10 11 12 11

I've been using R for years now, and I finally took the time to look into this and resolve it.  I'm sure I'll run into more unexpected behaviors in the future, but I'm glad to have at least solved this aspect of factor behavior in R.  Do you have any other examples?  Feel free to share them in the comments!