apache spark - mapPartitions returns empty array


I have the following RDD with 4 partitions:

val rdd = sc.parallelize(1 to 20, 4)

Now I try to call mapPartitions on it:

scala> rdd.mapPartitions(x => { println(x.size); x }).collect
5
5
5
5
res98: Array[Int] = Array()

Why does it return an empty array? The anonymous function simply returns the same iterator it received, so how can it yield an empty array? The interesting part is that if I remove the println statement, it indeed returns a non-empty array:

scala> rdd.mapPartitions(x => { x }).collect
res101: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20)

This I don't understand. How can the presence of a println (which only prints the size of the iterator) affect the final outcome of the function?

That's because x is a TraversableOnce, which means it gets traversed by the call to size and is then returned back... empty.
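The same one-shot behavior can be reproduced with a plain Scala Iterator, no Spark required (a minimal sketch):

```scala
// An Iterator is TraversableOnce: it can be walked exactly one time.
val it = Iterator(1, 2, 3, 4, 5)

// size traverses (and thereby exhausts) the iterator...
val n = it.size
println(n) // 5

// ...so anything that reads it afterwards sees nothing left.
println(it.toList) // List()
```

This is exactly what happens inside mapPartitions: by the time Spark consumes the returned iterator, println(x.size) has already drained it.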

You can work around it in a number of ways; here is one:

rdd.mapPartitions(x => {
  val list = x.toList
  println(list.size)
  list.toIterator
}).collect
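The workaround works because a List, unlike an Iterator, can be traversed any number of times, so a fresh iterator over it can be handed back to Spark. The same pattern in plain Scala:

```scala
// Materialize the one-shot iterator into a List first.
val it = Iterator(1, 2, 3)
val list = it.toList

// The List can be inspected without consuming anything...
println(list.size) // 3

// ...and a fresh iterator over it is returned downstream.
val out = list.iterator
println(out.toList) // List(1, 2, 3)
```

Note that toList materializes the whole partition in memory, so this is only appropriate when each partition fits comfortably on the executor.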
