R - Quantile on a matrix in long format
I am trying to compute quantiles on a matrix that is represented as a data.table in long format (rowid, colid, value). I am converting it to a Matrix::sparseMatrix and then computing the quantiles. I am wondering if there is a more efficient way to do this? (Using R 3.2.1 and data.table 1.9.5 from GitHub.)
require(data.table)
require(Matrix)

set.seed(100)
nobs <- 1000    # number of rows in the matrix
nvar <- 10      # number of columns in the matrix
density <- .1   # fraction of non-zero values in the matrix
nrow <- round(density*nobs*nvar)
data.dt <- unique(data.table(obsid=sample(1:nobs,nrow,replace=TRUE), varid=sample(1:nvar,nrow,replace=TRUE)))
data.dt <- data.dt[, value:=runif(.N)]
probs <- c(1,5,10,25,50,75,90,95,100)

# Approach 1: build a sparse matrix, then take quantiles column by column
system.time({
  data.mat <- sparseMatrix(i=data.dt[,obsid], j=data.dt[,varid], x=data.dt[,value], dims=c(nobs,nvar))
  quantile1.dt <- data.table(t(sapply(1:nvar, function(n) c(n,quantile(data.mat[,n], probs=probs/100, names=FALSE)))))
  quantile1.dt <- setnames(quantile1.dt, c("varid",sprintf("p%02d",probs)))[order(varid)]
})

# Approach 2: pad each column's stored values with the implied zeros
system.time({
  quantile2.dt <- data.dt[, as.list(quantile(c(rep(0,nobs-.N), value), probs=probs/100, names=FALSE)), by=varid]
  quantile2.dt <- setnames(quantile2.dt, c("varid",sprintf("p%02d",probs)))[order(varid)]
})

all.equal(quantile1.dt, quantile2.dt)
Update: I found the answer and wanted to share it, in case someone else finds it useful! The original question was Approach 1. A better way to compute the same thing is Approach 2. The real advantage of Approach 2 is seen when nobs and nvar are large. For example, when nobs=100,000 and nvar=1,000, Approach 1 takes 27 sec while Approach 2 takes 4 sec!
From your description, it's a little hard (for me) to see what you wanted to do, so I'll make a basic example.
set.seed(100)
nrow <- 10
ncol <- 5
prop <- 0.1
nobs <- round(prop*nrow*ncol)
s1 <- c(5,7,8,8,9)  # sample(1:nrow, nobs, replace=TRUE)
s2 <- c(1,3,3,4,4)  # sample(1:ncol, nobs, replace=TRUE)
# unique pairs
arr <- unique(array(c(s1,s2), dim=c(nobs,2)))
# random number for each unique pair
s3 <- c(0.1, 0.5, 0.8, 0.2, 0.4)  # runif(length(arr[,1]))
# show the data
data.frame(v1=arr[,1], v2=arr[,2], v3=s3)
#   v1 v2  v3
# 1  5  1 0.1
# 2  7  3 0.5
# 3  8  3 0.8
# 4  8  4 0.2
# 5  9  4 0.4
In this case, the sparse matrix representation is:
sm <- sparseMatrix(i=s1, j=s2, x=s3)  # since the pairs are unique here
# row 1 corresponds to s1=1, ..., row 9 corresponds to s1=9
# column 1 corresponds to s2=1, ..., column 4 corresponds to s2=4
sm
# [1,] .   . .   .
# [2,] .   . .   .
# [3,] .   . .   .
# [4,] .   . .   .
# [5,] 0.1 . .   .
# [6,] .   . .   .
# [7,] .   . 0.5 .
# [8,] .   . 0.8 0.2
# [9,] .   . .   0.4
The values corresponding to s2=1 are (0,0,0,0,0.1,0,0,0,0,0)', and so on. You can find the quantiles of each of these columns with:
q <- c(0.25, 0.5, 0.75, 1.0)  # quantiles
data.table(t(sapply(1:4, function(n) c(n,quantile(sm[,n], q)))))
#    V1 25% 50% 75% 100%
# 1:  1   0   0   0  0.1
# 2:  2   0   0   0  0.0
# 3:  3   0   0   0  0.8
# 4:  4   0   0   0  0.4
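(Note that here each column has only 9 entries where there should be 10, because sm has only 9 rows. Also notice that if I had used 1:ncol in the sapply() function above, it wouldn't have worked, since sm has only 4 columns. I think using the sparseMatrix() function for the quantiles might not work for this reason.)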
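One possible way around the missing row and column is to pass the intended dimensions explicitly through the dims argument of sparseMatrix(), as the question's own code does. The sketch below assumes the toy values from this answer; sm_full and res are names introduced here for illustration only.

```r
library(Matrix)

# Toy data from the example above
nrow <- 10
ncol <- 5
s1 <- c(5, 7, 8, 8, 9)            # row indices
s2 <- c(1, 3, 3, 4, 4)            # column indices
s3 <- c(0.1, 0.5, 0.8, 0.2, 0.4)  # stored values
q  <- c(0.25, 0.5, 0.75, 1.0)     # quantiles

# Fixing dims keeps row 10 and column 5 even though they hold no stored values
sm_full <- sparseMatrix(i = s1, j = s2, x = s3, dims = c(nrow, ncol))
dim(sm_full)

# With the full shape, looping over 1:ncol no longer fails
res <- t(sapply(seq_len(ncol), function(n) c(n, quantile(sm_full[, n], q))))
res
```

With the dimensions fixed, every column has nrow entries, so the all-zero fifth column gets a row of zero quantiles instead of raising a subscript error.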
What is the fastest way to do this? Suppose the variables above (s1, s2, s3, nrow, ncol, arr) are defined as above. Suppose you want the quantiles of s3 for s2 = 1. For instance:
tmp <- s2==1
quantile( c( s3[tmp], rep(0, nrow-sum(tmp)) ), q)
This kind of approach is potentially better, but I think that for large data sets the sparseMatrix approach should work well.
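As a rough sketch of how the single-column snippet above might be extended to every column, the following base R code pads each column's stored values with the implied zeros and checks the result against a dense matrix. It assumes the toy values from this answer; col_quantile, padded, dense and reference are hypothetical names introduced here, not part of the original post.

```r
# Toy data from the example above
nrow <- 10
ncol <- 5
s1 <- c(5, 7, 8, 8, 9)            # row indices
s2 <- c(1, 3, 3, 4, 4)            # column indices
s3 <- c(0.1, 0.5, 0.8, 0.2, 0.4)  # stored values
q  <- c(0.25, 0.5, 0.75, 1.0)     # quantiles

# Hypothetical helper: quantiles of column j from its stored values
# plus the zeros that the long format leaves out
col_quantile <- function(j) {
  tmp <- s2 == j
  quantile(c(s3[tmp], rep(0, nrow - sum(tmp))), q, names = FALSE)
}

# One row per column, including column 5, which has no stored values
padded <- t(vapply(seq_len(ncol), col_quantile, numeric(length(q))))

# Dense reference: fill a full matrix and take the column quantiles directly
dense <- matrix(0, nrow = nrow, ncol = ncol)
dense[cbind(s1, s2)] <- s3
reference <- t(apply(dense, 2, quantile, probs = q, names = FALSE))

all.equal(padded, reference)
```

Iterating over seq_len(ncol) rather than over the column ids present in the data is a deliberate choice here: a grouped computation that only visits the ids that occur (such as data.table's by=varid) would return no row for a column that has no stored values.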