Efficient hierarchical aggregation in Spark

Deenar Toraskar Tue, 24 Dec 2013 00:34:02 -0800

Hi

I have a requirement to aggregate a large data set in Spark across a
multi-level (25 levels) hierarchy. The data model (simplified) is as follows:


Measures
leafNode       Long
measureType    String
measureValue   Array[Float]

Hierarchy (expanded) - a typical organisation hierarchy.
leafNode       Long /* Account, also called level0Node */
level1Node     Long /* Portfolio */
level2Node     Long /* Sub Fund */
level3Node     Long /* Fund */
level4Node     Long
...
...
level25Node    Long /* Organisation */

Hierarchy (alternative representation)
node
parentNode
hierarchylevel

Output Format
Level          Int /* 0-25 */
Node           Long
measureType    String
measureValue   Array[Float]
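
To make the relationship between the two hierarchy representations concrete,
here is a minimal sketch in plain Python (not Spark) that expands
node/parentNode rows into one ancestor path per leaf, i.e. the expanded form.
The function name and the sample node ids are hypothetical, and it assumes the
hierarchy is acyclic and that the root is the only node without a parent.

```python
from typing import Dict, List


def expand_hierarchy(parent_of: Dict[int, int], leaves: List[int]) -> Dict[int, List[int]]:
    """Map each leaf to [level0Node, level1Node, ..., rootNode]."""
    paths: Dict[int, List[int]] = {}
    for leaf in leaves:
        path = [leaf]
        node = leaf
        # Follow parent pointers until we reach a node with no parent (the root).
        # Assumes no cycles in parent_of; a cycle would loop forever.
        while node in parent_of:
            node = parent_of[node]
            path.append(node)
        paths[leaf] = path
    return paths


# Hypothetical three-level example:
# accounts 1 and 2 -> portfolio 10, account 3 -> portfolio 11, both -> fund 100.
parent_of = {1: 10, 2: 10, 3: 11, 10: 100, 11: 100}
paths = expand_hierarchy(parent_of, [1, 2, 3])
```

The index of a node in each path is its hierarchy level, so the expanded
columns (leafNode, level1Node, ...) can be read straight off the list.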

I can do the aggregation by joining the two RDDs and aggregating one level at
a time. I was wondering whether there is a more efficient way of doing this in
Spark, perhaps a recursive algorithm that traverses the tree?

Currently the measures data set is loaded in batch, but I am working on
getting incremental feeds of measures using Spark Streaming.
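
For reference, the level-at-a-time approach can be sketched in plain Python
(no Spark) as below. This is an illustration under assumptions, not the actual
job: it assumes the aggregate is an element-wise sum of measureValue, that all
leaves sit at the same depth, and the names roll_up and add_vectors are
hypothetical. Each level is computed from the previous level's result rather
than from the raw measures; in Spark each iteration would roughly correspond to
a join against the node/parentNode data followed by a reduceByKey.

```python
from typing import Dict, List, Tuple

Measure = Tuple[int, str, List[float]]  # (leafNode, measureType, measureValue)


def add_vectors(a: List[float], b: List[float]) -> List[float]:
    """Element-wise sum of two equal-length vectors."""
    return [x + y for x, y in zip(a, b)]


def roll_up(measures: List[Measure],
            parent_of: Dict[int, int]) -> Dict[Tuple[int, int, str], List[float]]:
    """Return {(level, node, measureType): summed measureValue}."""
    # Level 0: sum the raw measures per (leafNode, measureType).
    current: Dict[Tuple[int, str], List[float]] = {}
    for leaf, mtype, value in measures:
        key = (leaf, mtype)
        current[key] = add_vectors(current[key], value) if key in current else list(value)

    output: Dict[Tuple[int, int, str], List[float]] = {}
    level = 0
    while current:
        # Emit this level's totals in the output format.
        for (node, mtype), value in current.items():
            output[(level, node, mtype)] = value
        # Re-key each total by its parent node and sum to get the next level.
        nxt: Dict[Tuple[int, str], List[float]] = {}
        for (node, mtype), value in current.items():
            parent = parent_of.get(node)
            if parent is None:  # reached the top of the hierarchy
                continue
            key = (parent, mtype)
            nxt[key] = add_vectors(nxt[key], value) if key in nxt else list(value)
        current = nxt
        level += 1
    return output


# Hypothetical sample data: three accounts, two portfolios, one fund.
parent_of = {1: 10, 2: 10, 3: 11, 10: 100, 11: 100}
measures = [
    (1, "pnl", [1.0, 2.0]),
    (2, "pnl", [3.0, 4.0]),
    (3, "pnl", [5.0, 6.0]),
    (1, "pnl", [0.5, 0.5]),
]
result = roll_up(measures, parent_of)
```

Because each pass only touches the totals of the level below, the amount of
data shrinks as the roll-up moves up the tree; the cost is one pass per level.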

Deenar

