Thanks for the advice. I think you're right. I'm not sure we're going to use
HBase but starting by partitioning data into multiple buckets will be a
first step. I'll see how it performs on large datasets.
My original question though was more like: is there a Spark trick I don't
know about?
Curren
It seems you are not reducing the data in size. If you are not, then you are
better off partitioning the data into buckets (folders?) & keeping the data
sorted in those buckets ..
A cleaner approach is to use HBase to keep track of keys: keep adding
keys as you find them & let HBase handle it.
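The key-tracking pattern can be sketched without a live HBase cluster (a minimal stand-in, assuming a put/exists-style table interface; the class and function names are hypothetical, and an in-memory dict plays the role of the HBase table here):

```python
class SeenKeys:
    """Stand-in for an HBase table used purely as a key set.
    With real HBase, `add` would be a put and `seen` an existence check."""

    def __init__(self):
        self._store = {}

    def add(self, key):
        # HBase puts are idempotent per row, so re-adding
        # an already-known key is harmless.
        self._store[key] = True

    def seen(self, key):
        return key in self._store


def new_events(events, table):
    """Yield only events whose key has not been recorded yet,
    registering each key the first time it appears."""
    for key, payload in events:
        if not table.seen(key):
            table.add(key)
            yield key, payload
```

The point of handing this to HBase is that it centralizes the "have I seen this key?" state, so it survives across jobs and scales past what fits in one worker's memory.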
Mayur
(resending this, as a lot of mails seem not to have been delivered)
Hi,
I have some complex behavior I'd like to be advised on, as I'm really new to
Spark.
I'm reading some log files that contain various events. There are two types
of events: parents and children. A child event can only have one pare