Thanks for the advice. I think you're right. I'm not sure we're going to use
HBase, but partitioning the data into multiple buckets will be a first step.
I'll see how it performs on large datasets.

My original question, though, was more like: is there a Spark trick I don't
know about?

Currently, here's what I'm doing:

    JavaPairRDD originalData = ...;

    // Keep only the incomplete records, clean them and cache the result
    JavaPairRDD incompleteData = originalData
        .filter(KeepIncompleteData)
        .map(CleanData)
        .cache();

    // Bring the distinct candidate file paths back to the driver
    List<String> pathList = incompleteData
        .flatMap(GetPossibleConciliationPaths)
        .distinct()
        .collect();

    // Load each candidate file and union everything into a single RDD
    JavaPairRDD conciliationRDD = null;
    for (String filePath : pathList) {
        JavaPairRDD fileData = sc
            .textFile(filePath)
            .flatMap(ProcessData);
        if (conciliationRDD == null) {
            conciliationRDD = fileData;
        } else {
            conciliationRDD = conciliationRDD.union(fileData);
        }
    }

    // saveAsTextFile returns nothing, so the result is not assigned
    originalData
        .filter(KeepCompleteData)
        .union(conciliationRDD.join(incompleteData))
        .saveAsTextFile(dir);

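One possible simplification of the loop above, as a rough sketch only: as far
as I know, textFile accepts a comma-separated list of paths, so the per-file
union could probably be replaced by a single call. The class name
ConciliationPaths, the helper toCommaSeparated and the example paths are
hypothetical, and the Spark call is only shown in a comment so that the
snippet runs without Spark on the classpath.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;

public class ConciliationPaths {

    // Joins the collected paths into one comma-separated string,
    // dropping duplicates while keeping first-seen order.
    static String toCommaSeparated(List<String> paths) {
        return String.join(",", new LinkedHashSet<String>(paths));
    }

    public static void main(String[] args) {
        // Hypothetical paths, for illustration only.
        List<String> pathList = Arrays.asList("/logs/a", "/logs/b", "/logs/a");
        String allPaths = toCommaSeparated(pathList);
        System.out.println(allPaths);

        // Assumption: with a JavaSparkContext sc, the loop would then
        // reduce to something like:
        //   JavaPairRDD conciliationRDD = sc.textFile(allPaths).flatMap(ProcessData);
    }
}
```

This does not remove the collect on the driver; it only avoids building a long
chain of unions, one per file.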
The collect part is what frightens me the most, as there may be a lot of
different paths.

Does that seem fine?

Would an approach with HBase allow me to simply join the incomplete data with
the stored state using a key?

Thanks



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Advanced-log-processing-tp5743p6102.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
