How to select individual rows from duplicates based on the highest median in R?
I have a data frame containing gene expression data that looks like the following:
row.names  symbol  sample1  sample2  sample3  sample4
probe1     gene1   1.5      2.8      1.8      3.2
probe2     gene2   2.7      4.5      3.2      5.1
probe3     gene3   1.1      4.7      2.3      5.3
probe4     gene2   1.2      0.9      0.8      1.1
probe5     gene1   3.1      6.1      6.2      4.2
I want to subset the data so that only unique genes remain, and in each case the probe with the highest median is retained, i.e. the data above would become the following:
row.names  symbol  sample1  sample2  sample3  sample4
probe2     gene2   2.7      4.5      3.2      5.1
probe3     gene3   1.1      4.7      2.3      5.3
probe5     gene1   3.1      6.1      6.2      4.2
The data frame has ~40,000 individual probes and ~100 samples.
Does anyone have an idea which commands in R would be suitable for this task?
I wouldn't calculate the medians row by row, but rather use the vectorized rowMedians function from the matrixStats package for that. Then, reorder the result by descending median and select the unique entries per symbol using the data.table package.
    library(data.table)
    library(matrixStats)

    # Row-wise medians over the sample columns (drop row.names and symbol)
    df$medians <- rowMedians(as.matrix(df[-(1:2)]))

    # Sort by descending median, then keep the first row for each symbol
    unique(setDT(df)[order(-medians)], by = "symbol")
    #    row.names symbol sample1 sample2 sample3 sample4 medians
    # 1:    probe5  gene1     3.1     6.1     6.2     4.2    5.15
    # 2:    probe2  gene2     2.7     4.5     3.2     5.1    3.85
    # 3:    probe3  gene3     1.1     4.7     2.3     5.3    3.50
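For anyone who wants to try the idea without installing matrixStats or data.table, the same three steps (compute row medians, sort descending, keep the first row per symbol) can be sketched in base R. This is only a minimal sketch on the example data from the question, not the approach benchmarked in this answer, and it will be slower on large data because it uses apply. The column name `probe` is an assumption standing in for the `row.names` column shown in the question.

```r
# Example data from the question; `probe` is a stand-in name for the
# `row.names` column (an assumption made for this sketch)
df <- data.frame(
  probe   = paste0("probe", 1:5),
  symbol  = c("gene1", "gene2", "gene3", "gene2", "gene1"),
  sample1 = c(1.5, 2.7, 1.1, 1.2, 3.1),
  sample2 = c(2.8, 4.5, 4.7, 0.9, 6.1),
  sample3 = c(1.8, 3.2, 2.3, 0.8, 6.2),
  sample4 = c(3.2, 5.1, 5.3, 1.1, 4.2),
  stringsAsFactors = FALSE
)

# Median of each row over the sample columns only
sample_cols <- setdiff(names(df), c("probe", "symbol"))
df$medians <- apply(as.matrix(df[sample_cols]), 1, median)

# Sort by descending median, then drop later duplicates of each symbol
ordered <- df[order(-df$medians), ]
result <- ordered[!duplicated(ordered$symbol), ]
result
```

Because `duplicated` flags every occurrence after the first, sorting first guarantees that the row kept for each gene is the one with the highest median.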
Some benchmarks
    library(data.table)
    library(matrixStats)
    library(dplyr)

    # Simulated data: 1e5 probes, 100 genes, 100 samples
    set.seed(123)
    bigdf <- data.frame(a = paste0("probe", 1:1e5),
                        symbol = paste0("gene", sample(1e2, 1e5, replace = TRUE)),
                        matrix(sample(1e2, 1e6, replace = TRUE), ncol = 100))
    bigdf2 <- copy(bigdf)
    bigdf3 <- copy(bigdf2)

    # rowMedians + unique (the approach above)
    system.time({
      bigdf$medians <- rowMedians(as.matrix(bigdf[-(1:2)]))
      unique(setDT(bigdf)[order(-medians)], by = "symbol")
    })
    #  user  system elapsed
    #  0.22    0.05    0.26

    # data.table with a per-group apply
    system.time(setDT(bigdf2)[, .SD[which.max(apply(.SD[, -(1:2), with = FALSE], 1, median)), ], by = symbol])
    #  user  system elapsed
    #  5.17    0.01    5.33

    # dplyr with a row-wise apply
    system.time({
      bigdf3$mediancol <- apply(bigdf3[-(1:2)], 1, FUN = median)
      grouped_df <- group_by(bigdf3, symbol)
      filtered_df <- filter(grouped_df, mediancol == max(mediancol))
    })
    #  user  system elapsed
    #  5.15    0.00    5.15
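One thing to keep in mind when comparing these approaches is that they can differ when two probes of the same gene have an identical median: sorting and deduplicating keeps exactly one row per gene, whereas filtering on `mediancol == max(mediancol)` keeps every tied row. The following base R sketch illustrates the difference on hypothetical toy data (the data frame `toy` and its values are made up for illustration and do not come from the question).

```r
# Hypothetical toy data: two probes of geneA share the same median
toy <- data.frame(
  probe   = c("p1", "p2", "p3"),
  symbol  = c("geneA", "geneA", "geneB"),
  medians = c(2, 2, 1),
  stringsAsFactors = FALSE
)

# Order-then-deduplicate: exactly one row per symbol survives
ord <- toy[order(-toy$medians), ]
one_per_gene <- ord[!duplicated(ord$symbol), ]

# Compare against the per-group maximum: every tied row survives
all_ties <- toy[toy$medians == ave(toy$medians, toy$symbol, FUN = max), ]

nrow(one_per_gene)
nrow(all_ties)
```

If strictly one probe per gene is required, the deduplicating variant is the safer choice; if tied probes should all be inspected, the comparison against the group maximum keeps them.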