Hi, I am working on a data pipeline in a Spark Streaming app that receives data as a CSV regularly.
After some enrichment we send the data to another storage layer (ES in this case). Some of the records in the incoming CSV might be repeated. I am trying to devise a strategy based on MD5 hashes of the lines to avoid processing lines that have already been seen, and I wonder what the best approach to store these hashes would be. I would prefer the data to live in HDFS within the same cluster.

I am considering a few formats:
- Parquet
- Sequence Files
- Avro
- Apache Arrow (doesn't seem to have a production-ready version yet)

Questions:
1. Is there any alternative approach to avoid re-processing the same rows?
2. Which storage format or technique is best suited to this kind of set membership operation?

Any help and thoughts are very much welcome.

Thanks in advance,
Natu
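
P.S. To make the idea concrete, here is a rough sketch in plain Python (not Spark) of the MD5 filter I have in mind. The function names and the sample rows are made up for illustration, and the in-memory set is only a stand-in for whatever store on HDFS ends up holding the digests.

```python
import hashlib


def line_md5(line: str) -> str:
    """Return the hex MD5 digest of one CSV line."""
    return hashlib.md5(line.encode("utf-8")).hexdigest()


def filter_unseen(lines, seen):
    """Return the lines whose digest is not yet in `seen`, and record them.

    `seen` is a plain set here; in the real pipeline it would be backed by
    the persistent store this question is about (assumption).
    """
    fresh = []
    for line in lines:
        digest = line_md5(line)
        if digest not in seen:
            seen.add(digest)
            fresh.append(line)
    return fresh


# Hypothetical sample batches, including repeats within and across batches.
seen_digests = set()
batch_1 = ["1,alice,10", "2,bob,20", "1,alice,10"]
batch_2 = ["2,bob,20", "3,carol,30"]

first = filter_unseen(batch_1, seen_digests)
second = filter_unseen(batch_2, seen_digests)
```

Only the lines in `first` and `second` would go on to enrichment; the open question is what should replace the in-memory set once the digests no longer fit in memory or need to survive restarts.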