Union Multiple Data.Frames with Different Column Names

  • 时间: 2018-09-23 06:25:06

(This article was first published on S+/R – Yet Another Blog in Statistical Computing , and kindly contributed toR-bloggers)

On Friday, while working on a project that I needed to union multiple data.frames with different column names, I realized that the base::rbind() function doesn’t take data.frames with different columns names and therefore just quickly drafted a rbind2() function on the fly to get the job done based on the idea of MapReduce that I discussed before ( https://statcompute.wordpress.com/2018/09/08/playing-map-and-reduce-in-r-subsetting ).

rbind2 <- function(lst) {  h <- unique(unlist(lapply(lst, names)))  Reduce(rbind, parallel::mcMap(function(x) {x[, setdiff(h, names(x))] <- NA; return(x)}, lst, mc.cores = length(lst)))}

On Saturday, when I revisited the problem, I found a very good thread on the stackoverflow ( https://stackoverflow.com/questions/3402371/combine-two-data-frames-by-rows-rbind-when-they-have-different-sets-of-columns ) discussing various approaches addressing my problem yesterday. Out of curiosity, I did a comparison between the rbind2() and discussed approaches by combining 8 data.frames each with a million records. As shown in the plot, my homebrew rbind2() function is only marginally faster than the gtools::smartbind() function and the dplyr::bind_rows function is the most efficient.

n <- 1000000d1 <- data.frame(id = 1:n, x1 = 1)d2 <- data.frame(id = 1:n, x2 = 2)d3 <- data.frame(id = 1:n, x3 = 3)d4 <- data.frame(id = 1:n, x4 = 4)d5 <- data.frame(id = 1:n, x5 = 5)d6 <- data.frame(id = 1:n, x6 = 6)d7 <- data.frame(id = 1:n, x7 = 7)d8 <- data.frame(id = 1:n, x8 = 8)microbenchmark::microbenchmark(times = 10,   "homebrew::rbind2"      = {rbind2(list(d1, d2, d3, d4, d5, d6, d7, d8))},  "gtools::smartbind"     = {gtools::smartbind(list = list(d1, d2, d3, d4, d5, d6, d7, d8))},  "dplyr::bind_rows"      = {dplyr::bind_rows(d1, d2, d3, d4, d5, d6, d7, d8)},  "plyr::rbind.fill"      = {plyr::rbind.fill(d1, d2, d3, d4, d5, d6, d7, d8)},  "data.table::rbindlist" = {data.table::rbindlist(list(d1, d2, d3, d4, d5, d6, d7, d8), fill = T)})