parallel processing - Bootstrap a dataset in R


I need to perform bootstrapping on a dataset in R. The data is in the form of a list that contains two matrices and has the following properties:

  • Both matrices are n by m and contain non-negative integers (positive integers, including 0).

    data <- list(a=matrix(,n,m), b=matrix(,n,m)) 
  • A fixed number of marbles, 10000, is distributed over each matrix, i.e., 10000 is divided into n*m parts. In other words, the sum of the entries of each matrix is fixed.

    > sum(data$a)
    [1] 10000
    > sum(data$b)
    [1] 10000
  • The marbles are distributed according to the affinity of the ij-th element for marbles, i.e., how many marbles end up in the ij-th entry of a matrix depends on the probability associated with each cell of that matrix.
  • The probabilities associated with the elements are different for the two matrices.

My goal is to estimate the parameters that lead to the underlying probabilities. The model assumes 2n parameters, where n is the number of rows, with one set for each matrix. The parameters combine in a complex manner, and the two matrices must be analyzed together.

    parameters <- data.frame(a=numeric(n), b=numeric(n))  

Right now, the approach I am using is:

  1. I define a function sgen that takes as input a matrix containing the probabilities associated with the sites, generates a dataset using these probabilities, and returns it.

    sgen <- function(freq) {
      # generate sample
      ...
    }
  2. For the non-parametric bootstrap (which I want to implement now), I run the experiment and calculate the observed probability associated with each ij element by dividing the observed matrices by 10000. Let me call this freq for now. So, freq is a list of two matrices.

    freq <- list(a=data$a/10000, b=data$b/10000) 
  3. Next, I replicate 100 samples of the data by passing freq to sgen.
  4. I pass the replicates to a pre-defined function, analyze, which gives me 100 n by 2 matrices containing the parameters.
  5. Next, I calculate the mean and sd of the entries across these matrices, which gives one n by 2 matrix containing the means and another containing the sds. So, the value of the (1,5)th element of the mean matrix is the mean of the (1,5)th elements of the 100 replicates.
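The five steps above can be sketched in base R as follows. This is only an illustration under assumptions: the body of sgen is not given in the question, so sgen_sketch guesses at it by drawing each matrix from a multinomial distribution with rmultinom, and analyze_sketch is a hypothetical stand-in (row totals) for the real analyze function. The toy dimensions n = 4 and m = 3 are made up as well.

```r
set.seed(1)
n <- 4; m <- 3; total <- 10000

# Toy observed data: two n-by-m count matrices, each summing to `total`.
make_counts <- function() matrix(rmultinom(1, total, rep(1, n * m)), n, m)
data <- list(a = make_counts(), b = make_counts())

# Step 2: observed cell probabilities.
freq <- list(a = data$a / total, b = data$b / total)

# Step 1 (assumed form of sgen): redistribute `total` marbles over the cells
# of each matrix according to the supplied probabilities.
sgen_sketch <- function(freq) {
  lapply(freq, function(p) {
    matrix(rmultinom(1, total, as.vector(p)), nrow(p), ncol(p))
  })
}

# Hypothetical stand-in for analyze(): returns an n-by-2 matrix (row totals).
analyze_sketch <- function(d) cbind(a = rowSums(d$a), b = rowSums(d$b))

# Steps 3-4: 100 replicates, each analysed into an n-by-2 matrix.
# The result is an n x 2 x 100 array.
reps <- replicate(100, analyze_sketch(sgen_sketch(freq)), simplify = "array")

# Step 5: element-wise mean and sd across the 100 replicates.
est_mean <- apply(reps, c(1, 2), mean)
est_sd   <- apply(reps, c(1, 2), sd)
```

Keeping the replicates in a three-dimensional array makes step 5 a single apply call over the first two dimensions.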

While this approach works, I would like to use the boot package in R to do the job. I want this because I can then use the functions in the boot package for later analyses, and that way the essential information is stored in the format of the boot class. Another important reason is that the boot package offers an easy way to make use of the multicore capabilities of the computer. So, can anyone please guide me on how to use boot for this purpose?

You can use the bootstrap function (from the bootstrap package) in the following way (taken from ?bootstrap):

    library(bootstrap)

    # To bootstrap functions of more complex data structures,
    # write theta so that its argument x is the set of observation numbers
    # and simply pass as data to bootstrap the vector 1,2,..n.
    # For example, to bootstrap the
    # correlation coefficient from a set of 15 data pairs:
    xdata <- matrix(rnorm(30),ncol=2)
    n <- 15
    theta <- function(x,xdata){ cor(xdata[x,1],xdata[x,2]) }
    results <- bootstrap(1:n,20,theta,xdata)

Here, theta is the function to bootstrap.

The problem with this approach (I believe) is that theta can only return a vector (not a data frame or matrix of multiple values in one go). So, if the theta function returns anything other than a vector, it might not work.
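If the quantity of interest is naturally an n by 2 matrix, as in the question, one possible workaround is to flatten it into a vector inside the statistic and restore the shape afterwards. A minimal base-R sketch, using a made-up 3 by 2 matrix in place of the output of analyze:

```r
# A made-up 3-by-2 parameter matrix, standing in for the output of analyze().
params <- matrix(c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6), nrow = 3, ncol = 2,
                 dimnames = list(NULL, c("a", "b")))

# Flatten column by column so that a statistic can return a plain vector.
flat <- as.vector(params)

# Restore the original shape (and column names) from the flattened vector.
restored <- matrix(flat, nrow = nrow(params), ncol = ncol(params),
                   dimnames = dimnames(params))

# Check that the round trip loses nothing.
round_trip_ok <- identical(restored, params)
```

Because both as.vector and matrix work column by column, the first n entries of the vector correspond to column a and the next n to column b.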

Update for the boot package:

The approach is similar when using the boot function from the boot package. It takes data, where data is a vector, matrix, or data frame, and statistic, "a function which when applied to data returns a vector containing the statistic(s) of interest." For the non-parametric bootstrap, the statistic function must take (at least) two arguments: the original data, and a vector of indices, frequencies or weights.
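As a minimal illustration of that interface, here is a sketch on made-up data (a 15 by 2 matrix of random normals, not the marble data from the question). The statistic col_means is a hypothetical example that returns a vector of two values, and R = 50 is an arbitrary number of replicates.

```r
library(boot)

set.seed(42)
# Made-up data: 15 observations (rows) of 2 variables (columns).
xdata <- matrix(rnorm(30), ncol = 2)

# Statistic: takes the original data and a vector of row indices,
# and returns a vector (here, the two column means of the resampled rows).
col_means <- function(data, indices) {
  colMeans(data[indices, , drop = FALSE])
}

# R is the number of bootstrap replicates.
res <- boot(xdata, col_means, R = 50)

# res$t0 holds the statistic on the original data;
# res$t holds one row per replicate and one column per returned value.
dim(res$t)
```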

So, the key is to write one function that implements steps 1-5 on a subset of the data given by an index, e.g.:

    theta <- function(data, indices) {
        ## the exact subsetting operation depends on the format of your data
        subset_data = data[indices,]
        ## perform the calculations in steps 1-5 here on subset_data
    }

Then you should be able to call boot with theta like this (R is the number of bootstrap replicates and must be supplied):

    boot(data, theta, R = 100)
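Since the data in the question is a list of two matrices rather than a set of rows, subsetting with data[indices,] does not apply directly. One possible alternative, sketched here under assumptions and not part of the original answer, is to let boot drive the existing simulation through sim = "parametric": ran.gen then plays the role of sgen, and mle carries freq. The names ran_gen_sketch and stat_sketch are hypothetical; the multinomial generator is a guess at what sgen does, and the row-total statistic stands in for the real analyze function.

```r
library(boot)

set.seed(1)
n <- 4; m <- 3; total <- 10000

# Toy observed data: two n-by-m count matrices, each summing to `total`.
make_counts <- function() matrix(rmultinom(1, total, rep(1, n * m)), n, m)
data <- list(a = make_counts(), b = make_counts())
freq <- list(a = data$a / total, b = data$b / total)

# Assumed generator (stand-in for sgen): boot calls it as ran.gen(data, mle),
# so the second argument receives the probabilities in `freq`.
ran_gen_sketch <- function(d, p) {
  lapply(p, function(q) {
    matrix(rmultinom(1, total, as.vector(q)), nrow(q), ncol(q))
  })
}

# Hypothetical stand-in for analyze(), flattened to a vector of length 2n
# because the statistic must return a vector.
stat_sketch <- function(d) as.vector(cbind(rowSums(d$a), rowSums(d$b)))

# With sim = "parametric" the statistic takes only the (simulated) data.
res <- boot(data, stat_sketch, R = 100, sim = "parametric",
            ran.gen = ran_gen_sketch, mle = freq)

# Reshape the per-replicate vectors back into n-by-2 summaries.
est_mean <- matrix(colMeans(res$t), n, 2)
est_sd   <- matrix(apply(res$t, 2, sd), n, 2)
```

To use several cores, boot accepts the arguments parallel and ncpus, for example parallel = "multicore" with ncpus = 2 on systems that support forking; they are left at their defaults here so that the sketch runs anywhere.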
