Re: What is most efficient to do a large union and remove duplicates?

2015-06-14 Thread Gavin Yue
Each folder should have no dups; dups only exist across different folders. The logic inside the reduce is to keep only the longest string value for each key. The current problem is that the job exceeds the largest frame size when writing to HDFS: the frame is 500 MB while the setting is 80 MB. Sent from my iPhone
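The exact exception is not quoted in the thread, but on Spark 1.x (current when this was written) a "frame size exceeded" error with a configurable limit in the tens of megabytes usually points at `spark.akka.frameSize`, which is specified in MB. A minimal sketch, assuming that is the setting involved; the class name and jar are placeholders, and 512 is an illustrative value, not a recommendation from the thread:

```shell
# Hedged sketch: raise the Akka frame size above the observed ~500 MB frame.
# spark.akka.frameSize is in MB on Spark 1.x; the job's current value appears
# to be 80. Class/jar names below are made up for the example.
spark-submit \
  --conf spark.akka.frameSize=512 \
  --class com.example.UnionDedup \
  union-dedup-assembly.jar
```

A frame that large often means too much data is moving through a single channel; restructuring the job (for example, the per-folder dedup suggested elsewhere in this thread) sidesteps the limit rather than raising it.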

Re: What is most efficient to do a large union and remove duplicates?

2015-06-14 Thread Josh Rosen
> if (a.length > b.length) { a } else { b }
> )
> nodups.saveAsTextFile("/nodups")
>
> Anything I could do to make this process faster? Right now my process dies
> when outputting the data to HDFS.
>
> Thank you!

Re: What is most efficient to do a large union and remove duplicates?

2015-06-14 Thread ayan guha
Can you do the dedupe locally for each file first and then globally? Also, I did not fully get the logic of the part inside reduceByKey. Can you kindly explain?

On 14 Jun 2015 13:58, "Gavin Yue" wrote:
> I have 10 folders, each with 6000 files. Each folder is roughly 500GB, so
> 5TB of data in total.
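The two-stage idea can be illustrated on plain Scala collections. This is a sketch, not code from the thread: in Spark the same shape would be a `reduceByKey` per folder, then a union of the per-folder results and one global `reduceByKey` with the same merge rule. All names below are made up for the example.

```scala
object TwoStageDedup {
  // The merge rule Gavin describes: keep the longer of two values for a key.
  def keepLonger(a: String, b: String): String =
    if (a.length > b.length) a else b

  // Stage 1: dedupe one folder's (key, value) pairs locally.
  def dedupLocal(pairs: Seq[(String, String)]): Map[String, String] =
    pairs.groupBy(_._1).map { case (k, vs) =>
      k -> vs.map(_._2).reduce(keepLonger)
    }

  // Stage 2: merge the already-deduped per-folder maps with the same rule.
  // Far less data reaches this stage than a single global pass would see.
  def dedupGlobal(folders: Seq[Map[String, String]]): Map[String, String] =
    dedupLocal(folders.flatten)

  def main(args: Array[String]): Unit = {
    val folder1 = Seq("k1" -> "abc", "k2" -> "x")
    val folder2 = Seq("k1" -> "ab", "k3" -> "yyy")
    val merged  = dedupGlobal(Seq(dedupLocal(folder1), dedupLocal(folder2)))
    println(merged("k1")) // the longer value "abc" wins over "ab"
  }
}
```

The benefit depends on how many keys repeat within a folder; here they do not, so the win comes from shrinking what crosses the network before the global merge.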

What is most efficient to do a large union and remove duplicates?

2015-06-13 Thread Gavin Yue
I have 10 folders, each with 6000 files. Each folder is roughly 500GB, so 5TB of data in total. The data is formatted as key \t value. After the union, I want to remove the duplicates among keys, so each key is unique and has only one value. Here is what I am doing. folders = Array("folder1"
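The code sample is cut off in the archive; only the closing fragment survives in the quoted reply. A hedged reconstruction of the pipeline described: the parse and merge functions below follow the message (tab-separated key/value, keep the longest value per key), while the Spark wiring in the comment fills in names the editor made up, not the original code.

```scala
object UnionDedup {
  // "key \t value" -> (key, value); splitting on the first tab is an
  // assumption based on the format described in the message.
  def parseLine(line: String): (String, String) = {
    val i = line.indexOf('\t')
    (line.substring(0, i), line.substring(i + 1))
  }

  // Merge rule quoted in the thread: keep the longest value per key.
  def keepLongest(a: String, b: String): String =
    if (a.length > b.length) a else b

  // Spark wiring (sketch -- requires a SparkContext `sc`; `folders` stands
  // in for the truncated Array("folder1", ...) of the original):
  //
  //   val all    = folders.map(sc.textFile(_)).reduce(_ union _)
  //   val nodups = all.map(parseLine).reduceByKey(keepLongest)
  //   nodups.map { case (k, v) => s"$k\t$v" }.saveAsTextFile("/nodups")
}
```

With 5TB of input, the `reduceByKey` shuffle dominates the cost; a partitioner sized to the cluster (the optional second argument to `reduceByKey`) is worth tuning here.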

View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/What-is-most-efficient-to-do-a-large-union-and-remove-duplicates-tp23303.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.