Hi
I have a requirement to aggregate a large data set in Spark across a
multi-level (25 levels) hierarchy. The data model (simplified) is as follows:
Measures
  leafNode      Long
  measureType   String
  measureValue  Array[Float]
Hierarchy (expanded), a typical organisation hierarchy
  leafNode      Long  /* Account, also called level0Node */
  level1Node    Long  /* Portfolio */
  level2Node    Long  /* Sub Fund */
  level3Node    Long  /* Fund */
  level4Node    Long
  ...
  ...
  level25Node   Long  /* Organisation */
Alternative representation
  node
  parentNode
  hierarchyLevel
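To make the relationship between the two representations concrete, here is a minimal sketch in plain Scala (no Spark dependency) that derives the expanded form from the node/parentNode form by walking parent pointers. The names `HierarchyEdge`, `ExpandHierarchy`, `pathToRoot` and `expand` are hypothetical, and the sketch assumes the root has no parent and that leaves are the rows at hierarchy level 0.

```scala
// Hypothetical row type for the node / parentNode / hierarchyLevel representation.
// Assumption: the root row carries no parent (None).
final case class HierarchyEdge(node: Long, parentNode: Option[Long], hierarchyLevel: Int)

object ExpandHierarchy {
  // Walk parent pointers upward from a leaf. The result is ordered leaf-first,
  // so index i holds the level-i ancestor when levels are contiguous.
  def pathToRoot(leaf: Long, parentOf: Map[Long, Long]): Vector[Long] = {
    @annotation.tailrec
    def loop(current: Long, acc: Vector[Long]): Vector[Long] =
      parentOf.get(current) match {
        // Stop if a parent repeats, which guards against cycles in bad data.
        case Some(parent) if !acc.contains(parent) => loop(parent, acc :+ parent)
        case _ => acc
      }
    loop(leaf, Vector(leaf))
  }

  // Build leafNode -> (level0Node, level1Node, ...) for every level-0 row.
  def expand(edges: Seq[HierarchyEdge]): Map[Long, Vector[Long]] = {
    val parentOf = edges.collect { case HierarchyEdge(n, Some(p), _) => n -> p }.toMap
    edges
      .filter(_.hierarchyLevel == 0)
      .map(e => e.node -> pathToRoot(e.node, parentOf))
      .toMap
  }
}
```

With 25 or so levels the walk per leaf is short, so this would typically be done once up front rather than as part of each aggregation run.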
Output format
  Level         Int   /* 0-25 */
  Node          Long
  measureType   String
  measureValue  Array[Float]
I can do the aggregation by joining the two RDDs and aggregating one level at
a time. Is there a more efficient way of doing this in Spark, perhaps a
recursive algorithm that traverses the tree?
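One possible alternative to the level-at-a-time loop, sketched here as an illustration rather than a recommendation: emit one keyed record per ancestor of each measure, then sum by (level, node, measureType) in a single pass. The sketch uses plain Scala collections so it runs without a cluster. `Measure`, `Aggregate`, `RollUp` and the `paths` map (leafNode to its ancestors, ordered leaf-first) are hypothetical names, and the element-wise sum assumes all arrays for a given measure type have the same length.

```scala
// Hypothetical types mirroring the Measures and Output format layouts.
final case class Measure(leafNode: Long, measureType: String, measureValue: Array[Float])
final case class Aggregate(level: Int, node: Long, measureType: String, measureValue: Array[Float])

object RollUp {
  // Element-wise sum. Assumption: both arrays have the same length.
  def add(a: Array[Float], b: Array[Float]): Array[Float] =
    a.zip(b).map { case (x, y) => x + y }

  // paths: leafNode -> ancestors ordered leaf-first, so the index is the level.
  def aggregate(measures: Seq[Measure], paths: Map[Long, Vector[Long]]): Seq[Aggregate] = {
    // Fan each measure out to every level of its leaf's path.
    val keyed = for {
      m <- measures
      (node, level) <- paths.getOrElse(m.leafNode, Vector.empty).zipWithIndex
    } yield ((level, node, m.measureType), m.measureValue)

    // Sum all values that share a (level, node, measureType) key.
    keyed
      .groupBy(_._1)
      .map { case ((level, node, mType), rows) =>
        Aggregate(level, node, mType, rows.map(_._2).reduce((x, y) => add(x, y)))
      }
      .toSeq
  }
}
```

On RDDs the same shape would be a `flatMap` that emits the keyed pairs followed by `reduceByKey` with the element-wise sum, with the hierarchy map shipped to executors via `SparkContext.broadcast`; that assumes the expanded hierarchy fits in memory on each executor. The trade-off is that every measure is replicated once per level (26 times here), in exchange for a single shuffle instead of one per level.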
Currently the measures data set is loaded in batch, but I am working on
getting incremental feeds of measures using Spark Streaming.
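For the incremental case, a rough sketch of how running totals could be maintained, again in plain Scala. It rests on a strong assumption that is not stated above: each incoming measure is an additive delta rather than a replacement value. `IncrementalRollUp`, `Key` and `merge` are hypothetical names.

```scala
object IncrementalRollUp {
  // (level, node, measureType), matching the Output format layout.
  type Key = (Int, Long, String)

  // Element-wise sum. Assumption: both arrays have the same length.
  def add(a: Array[Float], b: Array[Float]): Array[Float] =
    a.zip(b).map { case (x, y) => x + y }

  // Fold one batch of already rolled-up deltas into the running totals.
  // Keys seen for the first time are inserted as-is.
  def merge(totals: Map[Key, Array[Float]],
            batch: Map[Key, Array[Float]]): Map[Key, Array[Float]] =
    batch.foldLeft(totals) { case (acc, (key, delta)) =>
      acc.updated(key, acc.get(key).map(add(_, delta)).getOrElse(delta))
    }
}
```

Because the sum is associative, only the keys touched by a batch need updating. In Spark Streaming, a per-key merge of this kind is the sort of function one would pass to `updateStateByKey`. If the feed carries corrected values instead of deltas, the previous contribution of each leaf would have to be subtracted first, which this sketch does not cover.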
Deenar