Thanks for the answer! Why sequence files, though? Why not work directly on RDDs? My input files are CSVs and often contain some garbage both at the beginning and at the end of a file. Note that I am working in Python; I am not sure it will be as efficient as intended. Any examples in PySpark will be much welcomed. Thanks!

Lucas
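A minimal PySpark sketch (Spark 1.3 era API) of the approach being discussed: read the files directly with wholeTextFiles, broadcast the special file, and diff per file. The paths, the pipe-separated "ID | VALUE" layout, and the garbage-line heuristic (skip any line that is not two pipe-separated fields with a numeric value) are illustrative assumptions, not details from the thread:

    from pyspark import SparkContext

    sc = SparkContext(appName="small-file-diff")

    def parse_rows(content):
        """Parse one file's text into (id, value) pairs, skipping garbage lines."""
        rows = []
        for line in content.splitlines():
            parts = line.split("|")
            if len(parts) != 2:
                continue  # garbage before/after the table
            key, value = parts[0].strip(), parts[1].strip()
            try:
                rows.append((key, float(value)))
            except ValueError:
                pass  # header row ("ID | VALUE") or other non-numeric junk
        return rows

    # The special file is tiny, so parse it on the driver and broadcast the dict.
    special_content = sc.wholeTextFiles("input/special.csv").values().first()
    special = sc.broadcast(dict(parse_rows(special_content)))

    # One record per input file: (filename, entire file content).
    files = sc.wholeTextFiles("input/files/*.csv")

    def diff_file(pair):
        filename, content = pair
        lookup = special.value
        # File Result = File Special - File N, as in the original question.
        return filename, [(k, lookup[k] - v)
                          for k, v in parse_rows(content) if k in lookup]

    results = files.map(diff_file)

    # For illustration, write one CSV per input file from the driver
    # (assumes a local "output" directory exists); with 10000 files a
    # foreachPartition-based writer would scale better.
    import os
    for filename, diffs in results.collect():
        with open(os.path.join("output", os.path.basename(filename)), "w") as f:
            for k, v in diffs:
                f.write("%s|%s\n" % (k, v))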
From: ayan guha [guha.a...@gmail.com]
Sent: 24 September 2015 00:19
To: Tracewski, Lukasz (KFDB 3)
Cc: user@spark.apache.org
Subject: Re: Join over many small files

I think this can be a good case for using the sequence file format to pack the many files into a few sequence files, with the file name as key and the content as value. Then read them as an RDD and produce tuples like you mentioned (key=fileno+id, value=value). After that, it is a simple map operation to generate the diff (broadcasting the special file is the right idea).

On Thu, Sep 24, 2015 at 7:31 AM, Tracewski, Lukasz <lukasz.tracew...@credit-suisse.com> wrote:

Hi all,

I would like to ask you for advice on how to efficiently perform a join operation in Spark over tens of thousands of tiny files. A single file has a few KB and ~50 rows; in another scenario they might have 200 KB and 2000 rows. To give you an impression of what they look like:

File 01
ID | VALUE
01 | 10
02 | 12
03 | 55
…

File 02
ID | VALUE
01 | 33
02 | 21
03 | 53
…

and so on. ID is unique within a file, but repeats in every file. There is also a special file of the same form:

File Special
ID | VALUE
01 | 21
02 | 23
03 | 54
…

What I would like is to join each of File 01..10000 with File Special and take the difference between the values, i.e. File Result 01 = File Special - File 01:

ID | VALUE
01 | 21-10
02 | 23-12
03 | 54-53
…

and to save each result to a CSV, meaning 10000 new files. What is the best way of doing this? My idea was the following:

1. Read all files with wholeTextFiles, each to a separate partition.
2. Perform a map-side join inside mapPartitions, with the "Special" file as a broadcast variable.

I am on Spark 1.3, but it can be upgraded if needed. Perhaps this could be done better with a dataframe? Then I would create one large dataframe with an additional "filename" key, i.e.:

File | ID | VALUE
01 | 01 | 10
01 | 02 | 12
01 | 03 | 55
02 | 01 | 21
02 | 02 | 23
…

What would then be an efficient way to query such a dataframe? Any advice will be appreciated.

Best regards,
Lucas

--
Best Regards,
Ayan Guha
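For the dataframe variant asked about above, a hedged Spark 1.3 sketch that builds the one large dataframe from the same wholeTextFiles RDD, reusing files, parse_rows and special_content from the earlier snippet; the column names are made up:

    from pyspark.sql import SQLContext

    sqlContext = SQLContext(sc)

    # Flatten (filename, content) into (file, id, value) rows.
    rows = files.flatMap(
        lambda fc: [(fc[0], k, v) for (k, v) in parse_rows(fc[1])])
    df = sqlContext.createDataFrame(rows, ["file", "id", "value"])

    special_df = sqlContext.createDataFrame(
        parse_rows(special_content), ["sid", "svalue"])

    # Join every file's rows against the special file on ID; the diff is
    # File Special - File N, matching the RDD version above.
    result = (df.join(special_df, df.id == special_df.sid)
                .select(df.file, df.id,
                        (special_df.svalue - df.value).alias("diff")))

The awkward part in 1.3 is writing one CSV per value of the file column: as far as I can tell, DataFrameWriter.partitionBy only arrived in later releases, so grouping the result by file on the RDD side (or collecting and writing from the driver, as above) is probably the simpler route.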