Apache Spark - mapPartitions returns an empty array
I have the following RDD with 4 partitions:

val rdd = sc.parallelize(1 to 20, 4)
Now I call mapPartitions on it:

scala> rdd.mapPartitions(x => { println(x.size); x }).collect
5
5
5
5
res98: Array[Int] = Array()
Why does it return an empty array? The anonymous function simply returns the same iterator it received, so how can the result be empty? Interestingly, if I remove the println statement, it does return a non-empty array:

scala> rdd.mapPartitions(x => { x }).collect
res101: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20)

This is what I don't understand. How can the presence of a println (which only prints the size of the iterator) affect the final outcome of the function?
That's because x is a TraversableOnce (an Iterator): calling size on it traverses it completely, so by the time it is returned it has already been consumed and yields nothing... hence the empty result.
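The same behaviour can be reproduced with a plain Scala Iterator, no Spark needed:

```scala
// Iterators are TraversableOnce: any full traversal (size, sum, foreach, ...)
// consumes the elements, and the iterator cannot be replayed afterwards.
val it = Iterator(1, 2, 3, 4, 5)
println(it.size)    // 5 -- this traverses (and exhausts) the iterator
println(it.toList)  // List() -- nothing left to collect
```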
You can work around it in a number of ways; here is one:

rdd.mapPartitions(x => { val list = x.toList; println(list.size); list.toIterator }).collect
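Another option (a sketch, using the standard library's Iterator.duplicate rather than anything Spark-specific) is to split the iterator in two: one copy to count, one to return. Fully consuming the first copy still buffers the elements internally, so this does not save memory over toList, but it keeps the returned value an iterator:

```scala
// Iterator.duplicate yields two iterators over the same elements.
// Inside mapPartitions this would look like:
//   rdd.mapPartitions { x => val (a, b) = x.duplicate; println(a.size); b }
val (forSize, forResult) = Iterator(1, 2, 3, 4, 5).duplicate
println(forSize.size)     // 5 -- exhausts only the first copy
println(forResult.toList) // List(1, 2, 3, 4, 5) -- second copy is intact
```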