R - Quantile on a matrix in long format


I am trying to compute quantiles on a matrix that is represented as a data.table in long format (rowid, colid, value). I am currently converting it to a Matrix::sparseMatrix and then computing the quantiles column by column. I am wondering if there is a more efficient way to do this? (Using R 3.2.1 and data.table 1.9.5 from GitHub.)

require(data.table)
require(Matrix)

set.seed(100)
nobs <- 1000    # number of rows in matrix
nvar <- 10      # number of columns in matrix
density <- .1   # fraction of non-zero values in matrix

nrow <- round(density*nobs*nvar)
data.dt <- unique(data.table(obsid=sample(1:nobs, nrow, replace=TRUE),
                             varid=sample(1:nvar, nrow, replace=TRUE)))
data.dt <- data.dt[, value:=runif(.N)]

probs <- c(1,5,10,25,50,75,90,95,100)

# approach 1: build a sparse matrix, then take quantiles column by column
system.time({
  data.mat <- sparseMatrix(i=data.dt[,obsid], j=data.dt[,varid], x=data.dt[,value], dims=c(nobs,nvar))
  quantile1.dt <- data.table(t(sapply(1:nvar, function(n) c(n, quantile(data.mat[,n], probs=probs/100, names=FALSE)))))
  quantile1.dt <- setnames(quantile1.dt, c("varid", sprintf("p%02d",probs)))[order(varid)]
})

# approach 2: stay in long format and pad each group with its implicit zeros
system.time({
  quantile2.dt <- data.dt[, as.list(quantile(c(rep(0,nobs-.N), value), probs=probs/100, names=FALSE)), by=varid]
  quantile2.dt <- setnames(quantile2.dt, c("varid", sprintf("p%02d",probs)))[order(varid)]
})

all.equal(quantile1.dt, quantile2.dt)

Update: I found the answer and wanted to share it, in case someone else finds it useful! The original question used approach 1. A better way to compute the same thing is approach 2. The real advantage of approach 2 is seen when nobs and nvar are large. For example, when nobs=100,000 and nvar=1,000, approach 1 takes 27 sec while approach 2 takes 4 sec!
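Approach 2 relies on the fact that quantile() sorts its input, so the position of the zeros does not matter, only how many there are. The following is a minimal base R sketch of that equivalence on a made-up column; the names idx, val, dense and padded are hypothetical and are not part of the code above.

```r
# Hypothetical toy column: 10 observations, 3 of them non-zero.
nobs <- 10
idx  <- c(2, 5, 9)          # row positions of the stored (non-zero) entries
val  <- c(0.3, 0.7, 0.1)    # the stored values

# Dense version of the column, with zeros in their true positions.
dense <- numeric(nobs)
dense[idx] <- val

# Padded version: all implicit zeros first, then the stored values.
padded <- c(rep(0, nobs - length(val)), val)

probs <- c(0.25, 0.5, 0.75, 1)
q_dense  <- quantile(dense,  probs = probs, names = FALSE)
q_padded <- quantile(padded, probs = probs, names = FALSE)

# Both vectors contain the same multiset of values, so the quantiles agree.
all.equal(q_dense, q_padded)
```

Because only the count of zeros is needed, the long-format table never has to be expanded into a matrix.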

From the description, it is a little hard (for me) to see what you want to do, so I'll make a basic example.

set.seed(100)
nrow <- 10
ncol <- 5
prop <- 0.1
nobs <- round(prop*nrow*ncol)
s1 <- c(5,7,8,8,9) # sample(1:nrow, nobs, replace=TRUE)
s2 <- c(1,3,3,4,4) # sample(1:ncol, nobs, replace=TRUE)

# unique pairs
arr <- unique(array(c(s1,s2), dim=c(nobs,2)))

# random number for each unique pair
s3 <- c(0.1, 0.5, 0.8, 0.2, 0.4) # runif(length(arr[,1]))

# show the data
data.frame(v1=arr[,1], v2=arr[,2], v3=s3)

#   v1 v2  v3
# 1  5  1 0.1
# 2  7  3 0.5
# 3  8  3 0.8
# 4  8  4 0.2
# 5  9  4 0.4

In this case, the sparse matrix representation is:

sm <- sparseMatrix(i=s1, j=s2, x=s3) # since the pairs are unique here

# row 1 corresponds to s1=1, ..., row 9 corresponds to s1=9
# column 1 corresponds to s2=1, ..., column 4 corresponds to s2=4
sm

# [1,] .   . .   .
# [2,] .   . .   .
# [3,] .   . .   .
# [4,] .   . .   .
# [5,] 0.1 . .   .
# [6,] .   . .   .
# [7,] .   . 0.5 .
# [8,] .   . 0.8 0.2
# [9,] .   . .   0.4

The values corresponding to s2=1 are (0,0,0,0,0.1,0,0,0,0,0)', and so on. You can find the quantiles of each of these columns with:

q <- c(0.25, 0.5, 0.75, 1.0) # quantiles

data.table(t(sapply(1:4, function(n) c(n, quantile(sm[,n], q)))))

#    V1 25% 50% 75% 100%
# 1:  1   0   0   0  0.1
# 2:  2   0   0   0  0.0
# 3:  3   0   0   0  0.8
# 4:  4   0   0   0  0.4

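(Note that each column here has only 9 entries where there should be 10. Also notice that if I had used 1:ncol in the sapply() call above, it wouldn't have worked, since sm has only 4 columns. Without a dims argument, sparseMatrix() sizes the matrix from the largest row and column indices it is given. I think that using the sparseMatrix() function for the quantiles might not work for this reason.)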
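The truncation can be avoided by passing the intended shape through the dims argument of sparseMatrix(), which is what approach 1 in the question already does. Here is a small sketch on the same toy data, assuming the Matrix package is installed; nrow_full, ncol_full and sm_full are hypothetical names chosen to avoid masking base functions.

```r
library(Matrix)

# Same toy data as in the example above.
s1 <- c(5, 7, 8, 8, 9)            # row indices
s2 <- c(1, 3, 3, 4, 4)            # column indices
s3 <- c(0.1, 0.5, 0.8, 0.2, 0.4)  # values
nrow_full <- 10                   # intended number of rows
ncol_full <- 5                    # intended number of columns

# Passing dims keeps the trailing all-zero row and column.
sm_full <- sparseMatrix(i = s1, j = s2, x = s3,
                        dims = c(nrow_full, ncol_full))
dim(sm_full)

# Now every column has 10 entries and looping over all 5 columns works.
q <- c(0.25, 0.5, 0.75, 1.0)
res <- t(sapply(seq_len(ncol_full),
                function(n) quantile(sm_full[, n], q, names = FALSE)))
res
```

With the full dimensions in place, each row of res is based on 10 values, including the all-zero fifth column.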

What is the fastest way to do this? Suppose the variables s1, s2, s3, nrow, ncol, arr are defined as above. Suppose you want the quantiles of s3 for s2 = 1. For instance:

tmp <- s2==1
quantile( c( s3[tmp], rep(0, nrow-sum(tmp)) ), q)

This kind of approach is potentially better, but I think that for large data sets the sparseMatrix approach should work well.
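The single-column snippet can be extended to every column without building a matrix. The following is one possible base R sketch, not a benchmarked solution; nrow_full, ncol_full and by_col are hypothetical names. Splitting on a factor with explicit levels keeps columns that have no stored entries, whereas a grouped call such as data.table's by= only returns groups that actually occur in the data.

```r
# Same toy data as in the example above.
s2 <- c(1, 3, 3, 4, 4)            # column indices
s3 <- c(0.1, 0.5, 0.8, 0.2, 0.4)  # values
nrow_full <- 10                   # intended number of rows
ncol_full <- 5                    # intended number of columns
q <- c(0.25, 0.5, 0.75, 1.0)

# One list element per column; empty columns become numeric(0).
by_col <- split(s3, factor(s2, levels = seq_len(ncol_full)))

# Pad each column with its implicit zeros, then take the quantiles.
res <- t(sapply(by_col, function(v) {
  quantile(c(rep(0, nrow_full - length(v)), v), q, names = FALSE)
}))
res
```

Each row of res corresponds to one column of the matrix, with the row name giving the column index.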

