Question on set membership / diff sync technique in Spark

Natu Lauchande Tue, 26 Jul 2016 06:46:00 -0700

Hi,

I am working on a data pipeline in a Spark Streaming app that regularly
receives data as CSV.

After some enrichment, we send the data to another storage layer (ES in this
case). Some of the records in the incoming CSV might be repeated.

I am trying to devise a strategy based on MD5 hashes of the lines to avoid
processing lines that have already been seen, and I wonder what the best
approach to storing these hashes would be. I would prefer the data to live in
HDFS within the same cluster.
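
To make the idea concrete, here is a minimal sketch in plain Python (not
Spark). It assumes an in-memory set stands in for whatever store on HDFS ends
up holding the hashes; the names line_md5 and filter_unseen and the sample
rows are only illustrative:

```python
import hashlib


def line_md5(line: str) -> str:
    """Return the hex MD5 digest of one CSV line."""
    return hashlib.md5(line.encode("utf-8")).hexdigest()


def filter_unseen(lines, seen_hashes):
    """Keep only lines whose MD5 is not yet in seen_hashes.

    seen_hashes is updated in place, so repeats inside the same
    batch are dropped as well as repeats from earlier batches.
    """
    fresh = []
    for line in lines:
        digest = line_md5(line)
        if digest not in seen_hashes:
            seen_hashes.add(digest)
            fresh.append(line)
    return fresh


# Assumption: in the real pipeline this set would be loaded from and
# written back to persistent storage rather than kept in memory.
seen = set()
batch1 = ["1,alice,10", "2,bob,20", "1,alice,10"]
batch2 = ["2,bob,20", "3,carol,30"]

first = filter_unseen(batch1, seen)
second = filter_unseen(batch2, seen)
```

Only the rows in first and second would go on to enrichment and ES; the open
question is where the contents of seen should live between batches.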

I am considering a couple of formats:
- Parquet
- Sequence Files
- Avro
- Apache Arrow (does not seem to have a production-ready version yet)

Questions:

1. Is there any alternative approach to avoid re-processing the same rows?

2. Which data storage format or technique is best suited to this kind of set
membership operation?

Any help and thoughts are very much welcome.

Thanks in advance,
Natu
