Each folder should have no duplicates; duplicates only exist across different folders.
The logic inside is to take only the longest string value for each key.
The current problem is exceeding the largest frame size when trying to write to
HDFS: the frame is 500m while the setting is 80m.
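If the failure is Akka's "max allowed frame size" error (an assumption — the exact message is not quoted here), the relevant knob in Spark 1.x is spark.akka.frameSize, whose value is in MB and must be set before the SparkContext is created. A minimal sketch:

```scala
import org.apache.spark.SparkConf

// Assumption: the write fails with Akka's "max allowed frame size" error.
// spark.akka.frameSize is in MB; raising it above the size of the largest
// serialized result may let the job complete.
val conf = new SparkConf()
  .setAppName("union-dedup")
  .set("spark.akka.frameSize", "512") // e.g. 512 MB instead of 80
```

The same setting can also be passed on the command line via `--conf spark.akka.frameSize=512`.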
Sent from my iPhone
ength)
> {a}
> else
> {b}
> }
> )
> nodups.saveAsTextFile("/nodups")
>
> Anything I could do to make this process faster? Right now my process
> dies when outputting the data to HDFS.
>
> Thank you !
>
> --
> V
Can you do the dedup process locally for each file first, and then globally?
Also, I did not fully get the logic of the part inside reduceByKey. Can you
kindly explain?
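The per-folder-then-global idea could look roughly like this (the folder list, the `sc` context, and the longest-value combiner are assumptions pieced together from the rest of the thread):

```scala
// Sketch: dedup each folder locally first, then union the (much smaller)
// per-folder results and dedup globally, keeping the longest value per key.
val longest = (a: String, b: String) => if (a.length > b.length) a else b

// Assumed: folders is a Seq[String] of HDFS paths, sc is the SparkContext.
val perFolder = folders.map { dir =>
  sc.textFile(dir)
    .map { line =>
      val i = line.indexOf('\t')
      (line.take(i), line.drop(i + 1))
    }
    .reduceByKey(longest) // local dedup within one folder
}

val nodups = perFolder.reduce(_ union _).reduceByKey(longest) // global dedup
```

Since each folder is already duplicate-free internally, the first reduceByKey mainly shrinks the data before the expensive global shuffle.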
On 14 Jun 2015 13:58, "Gavin Yue" wrote:
I have 10 folders, each with 6000 files. Each folder is roughly 500GB, so
5TB of data in total.
The data is formatted as key \t value. After the union, I want to remove
the duplicates among keys, so each key should be unique and have only one
value.
Here is what I am doing.
folders = Array("folder1"
S.
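For reference, a minimal end-to-end sketch of the pipeline described above (the tab split and longest-value rule come from this thread; the paths and variable names are mine):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object UnionDedup {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("union-dedup"))

    val folders = (1 to 10).map(i => s"/data/folder$i") // assumed paths

    // Union all folders, parse "key \t value", keep the longest value per key.
    val all = folders.map(sc.textFile(_)).reduce(_ union _)
    val nodups = all
      .map { line =>
        val i = line.indexOf('\t')
        (line.take(i), line.drop(i + 1))
      }
      .reduceByKey((a, b) => if (a.length > b.length) a else b)

    nodups.saveAsTextFile("/nodups")
    sc.stop()
  }
}
```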
Thank you !
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/What-is-most-efficient-to-do-a-large-union-and-remove-duplicates-tp23303.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
---