Explain the aggregate functionality in Spark
I am looking for some better explanation of the aggregate functionality that is available via spark in python.
The example I have is as follows (using pyspark from Spark 1.2.0 version)
sc.parallelize([1,2,3,4]).aggregate( (0, 0), (lambda acc, value: (acc + value, acc + 1)), (lambda acc1, acc2: (acc1 + acc2, acc1 + acc2)))
I get the expected result
(10,4) which is sum of
1+2+3+4 and 4 elements. If I change the initial value passed to the aggregate function to
(0,0) I get the following result
sc.parallelize([1,2,3,4]).aggregate( (1, 0), (lambda acc, value: (acc + value, acc + 1)), (lambda acc1, acc2: (acc1 + acc2, acc1 + acc2)))
The value increases by 9. If I change it to
(2,0), the value goes to
(28,4) and so on.
Can someone explain to me how this value is calculated? I expected the value to go up by 1 not by 9, expected to see
(11,4) instead I am seeing
I don't have enough reputation points to comment on the previous answer by Maasg. Actually the zero value should be 'neutral' towards the seqop, meaning it wouldn't interfere with the seqop result, like 0 towards add, or 1 towards *;
You should NEVER try with non-neutral values as it might be applied arbitrary times. This behavior is not only tied to num of partitions.
I tried the same experiment as stated in the question. with 1 partition, the zero value was applied 3 times. with 2 partitions, 6 times. with 3 partitions, 9 times and this will go on.