How to select individual rows from duplicates based on the highest median in R?


I have a dataframe containing gene expression data that looks like the following:

    row.names     symbol     sample1     sample2     sample3     sample4
    probe1        gene1      1.5         2.8         1.8         3.2
    probe2        gene2      2.7         4.5         3.2         5.1
    probe3        gene3      1.1         4.7         2.3         5.3
    probe4        gene2      1.2         0.9         0.8         1.1
    probe5        gene1      3.1         6.1         6.2         4.2

I want to subset the data so that only unique genes remain, and in each case the probe with the highest median is retained, i.e. the data above would become the following:

    row.names     symbol     sample1     sample2     sample3     sample4
    probe2        gene2      2.7         4.5         3.2         5.1
    probe3        gene3      1.1         4.7         2.3         5.3
    probe5        gene1      3.1         6.1         6.2         4.2

The dataframe has ~40,000 individual probes and ~100 samples.

Does anyone have an idea which commands in R would be suitable for this task?

I wouldn't calculate the medians row by row, but would rather use the vectorized rowMedians function from the matrixStats package for that. Then I would reorder the result by decreasing median and select the unique entries per symbol using the data.table package:

    library(data.table)
    library(matrixStats)

    # Median of each row over the sample columns (the first two columns are IDs)
    df$medians <- rowMedians(as.matrix(df[-(1:2)]))

    # Sort by decreasing median, then keep the first row per symbol
    unique(setDT(df)[order(-medians)], by = "symbol")
    #    row.names symbol sample1 sample2 sample3 sample4 medians
    # 1:    probe5  gene1     3.1     6.1     6.2     4.2    5.15
    # 2:    probe2  gene2     2.7     4.5     3.2     5.1    3.85
    # 3:    probe3  gene3     1.1     4.7     2.3     5.3    3.50
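For readers without data.table or matrixStats installed, the same "sort by decreasing median, keep the first row per symbol" idea can be sketched in base R. This is only an illustration, not part of the original answer: it assumes the probe IDs are stored as row names rather than in a column, and it uses apply, which is slower than rowMedians on large data.

```r
# Example data from the question; probe IDs kept as row names (an assumption).
df <- data.frame(
  symbol  = c("gene1", "gene2", "gene3", "gene2", "gene1"),
  sample1 = c(1.5, 2.7, 1.1, 1.2, 3.1),
  sample2 = c(2.8, 4.5, 4.7, 0.9, 6.1),
  sample3 = c(1.8, 3.2, 2.3, 0.8, 6.2),
  sample4 = c(3.2, 5.1, 5.3, 1.1, 4.2),
  row.names = paste0("probe", 1:5)
)

# Row-wise medians over the numeric sample columns only (drop `symbol`).
med <- apply(as.matrix(df[-1]), 1, median)

# Sort rows by decreasing median, then keep the first row seen per symbol.
sorted <- df[order(med, decreasing = TRUE), ]
result <- sorted[!duplicated(sorted$symbol), ]
result
```

On the example data this should retain probe5, probe2 and probe3, the same probes as the data.table version above.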

Some benchmarks:

    library(data.table)
    library(matrixStats)
    library(dplyr)

    set.seed(123)
    bigdf <- data.frame(a = paste0("probe", 1:1e5),
                        symbol = paste0("gene", sample(1e2, 1e5, replace = TRUE)),
                        matrix(sample(1e2, 1e6, replace = TRUE), ncol = 100))
    bigdf2 <- copy(bigdf)
    bigdf3 <- copy(bigdf2)

    # Vectorized rowMedians, then order + unique
    system.time({
      bigdf$medians <- rowMedians(as.matrix(bigdf[-(1:2)]))
      unique(setDT(bigdf)[order(-medians)], by = "symbol")
    })
    # user  system elapsed
    # 0.22    0.05    0.26

    # Per-group apply(..., median) inside data.table
    system.time(setDT(bigdf2)[, .SD[which.max(apply(.SD[, -(1:2), with = FALSE], 1, median)), ], by = symbol])
    # user  system elapsed
    # 5.17    0.01    5.33

    # Row-wise apply(..., median), then dplyr group_by + filter
    system.time({
      bigdf3$mediancol <- apply(bigdf3[-(1:2)], 1, FUN = median)
      grouped_df <- group_by(bigdf3, symbol)
      filtered_df <- filter(grouped_df, mediancol == max(mediancol))
    })
    # user  system elapsed
    # 5.15    0.00    5.15
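One caveat when comparing these variants: they are not strictly equivalent when two probes of the same gene share the maximum median. A filter of the form `mediancol == max(mediancol)` keeps every tied row, whereas the order-then-unique approach keeps exactly one row per symbol. The base R sketch below illustrates the difference on a small made-up example; the gene names and median values are hypothetical.

```r
# Hypothetical medians: the two geneA probes tie for the maximum.
symbol <- c("geneA", "geneA", "geneB")
med    <- c(2.5, 2.5, 1.0)

# Equality-with-group-maximum keeps all tied rows.
keep_all_ties <- med == ave(med, symbol, FUN = max)
sum(keep_all_ties)

# Order by decreasing median, then keep one row per symbol.
ord <- order(med, decreasing = TRUE)
keep_first <- ord[!duplicated(symbol[ord])]
length(keep_first)
```

Here the first approach should retain three rows and the second two, so if exactly one probe per gene is required, the order-then-unique form is the safer choice.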
