Tuesday, October 23, 2012

Importing Excel data in R with XLConnect

I recently had to migrate to the 64-bit version of R due to the 4 GB memory limit imposed by the 32-bit version.  In doing so, the package I was using for importing Excel files (xlsReadWrite) was rendered incompatible.  Thus, I've moved over to XLConnect: http://cran.r-project.org/package=XLConnect

XLConnect looks very nice: it is fully featured, and is implemented as objects.  It uses a Java library to do the Excel imports, apparently.  This would seem like a detail, but it ends up adding a very important wrinkle.  According to the developer, Java's "ToString" function adds a ".0" to any values that look numeric.  So, even if you tell XLConnect's readWorksheet function to read the column in as a character data type (using the colTypes argument), it will add a ".0" to every value that looks to Java like a number.

Here's an example.  Let's make an excel document called XLConnect_test.xlsx:



Now, let's try to import it via XLConnect:
> require(XLConnect)
> xlc_df = readWorksheetFromFile(file="XLConnect_test.xlsx", sheet=1)
> xlc_df

    Col1 Col2
1    ABC  1.0
2  123.0  2.0
3 123ABC  ABC

> xlc_df = readWorksheetFromFile(file="XLConnect_test.xlsx", sheet=1, colTypes=c("character", "character"))
> xlc_df

    Col1 Col2
1    ABC  1.0
2  123.0  2.0
3 123ABC  ABC
Notice how XLConnect adds ".0" to each value, even though those columns are supposed to be character strings?  That's not good.  The nicely-communicative developer says that they're working on this for a future version.  In the meantime, here's the hack I've written so the rest of us can move forward:
xlc_dot_zero_hack <- function(df) {
    # looks for all numerics that had a ".0" added to them by XLConnect, and removes the ".0"
        return(
            as.data.frame( lapply( df, FUN = function(x) sub("^(\\d*)\\.0$", "\\1", x, perl=TRUE) ) )
        )
}
Just import your Excel data, and then run the function above on your imported dataframe.  Don't forget to reassign the result to a dataframe.  In our example, we would write:
> xlc_df = xlc_dot_zero_hack(xlc_df)
> xlc_df

    Col1 Col2
1    ABC    1
2    123    2
3 123ABC  ABC
Voilà!  Note that if your data has any strings that *actually* have a ".0" at the end, this hack will have unexpected results.  Specifically, it will remove *all* ".0"s that come after a string of digits.

I hope this might help others out there with the same conundrum!


Followers