Thanks for the answer!

Why sequence files, though? Why not work directly on RDDs? My input files are 
CSVs and often contain some garbage both at the beginning and the end of a file. 
Bear in mind that I am working in Python; I am not sure whether it will be as 
efficient as intended.

Any examples in PySpark would be most welcome.

Thanks!
Lucas

From: ayan guha [mailto:guha.a...@gmail.com]
Sent: 24 September 2015 00:19
To: Tracewski, Lukasz (KFDB 3)
Cc: user@spark.apache.org
Subject: Re: Join over many small files

I think this could be a good case for using the sequence file format to pack the 
many files into a few sequence files, with the file name as key and the content 
as value. Then read it as an RDD and produce tuples like you mentioned 
(key=fileno+id, value=value). After that, it is a simple map operation to 
generate the diff (broadcasting the Special file is the right idea).
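
A minimal PySpark sketch of that idea (untested; the paths, the packing step and 
the parse() helper are placeholders):

from pyspark import SparkContext

sc = SparkContext(appName="seqfile-diff")

# (filename, whole file content) pairs read back from the packed sequence file
packed = sc.sequenceFile("hdfs:///data/packed_csvs")

# Placeholder parser: turn one file's content into [(id, value), ...],
# skipping anything that does not look like an "id,value" row
def parse(content):
    rows = []
    for line in content.splitlines():
        parts = line.split(",")
        if len(parts) == 2 and parts[0].strip().isdigit():
            rows.append((parts[0].strip(), float(parts[1])))
    return rows

# key = (filename, id), value = value, as described above
tuples = packed.flatMap(lambda kv: [((kv[0], rid), v) for rid, v in parse(kv[1])])

# The Special file is tiny, so broadcast it as a plain dict and do a simple map
# (this assumes every id in the data also appears in the Special file)
special_b = sc.broadcast(dict(parse(open("/path/to/special.csv").read())))
diffs = tuples.map(lambda kv: (kv[0], special_b.value[kv[0][1]] - kv[1]))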

On Thu, Sep 24, 2015 at 7:31 AM, Tracewski, Lukasz 
<lukasz.tracew...@credit-suisse.com> wrote:
Hi all,

I would like to ask you for advice on how to efficiently perform a join 
operation in Spark over tens of thousands of tiny files. A single file is a few 
KB with ~50 rows; in another scenario the files might be 200 KB with 2000 rows.

To give you an impression of what they look like:

File 01
ID | VALUE
01 | 10
02 | 12
03 | 55
…

File 02
ID | VALUE
01 | 33
02 | 21
03 | 53
…

and so on… An ID is unique within a file, but repeats across files. There is 
also a Special file of the same form:

File Special
ID | VALUE
01 | 21
02 | 23
03 | 54
…

What I would like is to join File 01..10000 with File Special to get the 
difference between values:

File Result 01 = File Special – File 01
ID | VALUE
01 | 21-10
02 | 23-12
03 | 54-53
…

and then save the results to CSV, meaning 10000 new files. What’s the best way 
of doing this?

My idea was the following:

1. Read all files with wholeTextFiles, each into a separate partition.

2. Perform a map-side join with a broadcast variable inside mapPartitions 
(the “Special” file will be broadcast); a rough sketch of this follows below.
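
A rough sketch of what I have in mind for steps 1 and 2 (untested; the paths and 
the line filtering are guesses on my part):

from pyspark import SparkContext

sc = SparkContext(appName="small-files-mapside-join")

# Collect the Special file to the driver as id -> value and broadcast it
special = dict(sc.textFile("hdfs:///data/special.csv")
                 .map(lambda line: line.split(","))
                 .filter(lambda p: len(p) == 2 and p[0].strip().isdigit())
                 .map(lambda p: (p[0].strip(), float(p[1])))
                 .collect())
special_b = sc.broadcast(special)

# (filename, whole file content) pairs for the tens of thousands of small CSVs
files = sc.wholeTextFiles("hdfs:///data/small_csvs/")

def diff_partition(iterator):
    # Map-side "join": look every id up in the broadcast dict, so no shuffle
    for fname, content in iterator:
        for line in content.splitlines():
            parts = line.split(",")
            if len(parts) == 2 and parts[0].strip().isdigit():
                rid, val = parts[0].strip(), float(parts[1])
                if rid in special_b.value:
                    yield "%s,%s,%s" % (fname, rid, special_b.value[rid] - val)

results = files.mapPartitions(diff_partition)
# saveAsTextFile writes one part file per partition, not one file per input
# CSV, so splitting the output into 10000 separate files is still open
results.saveAsTextFile("hdfs:///data/diffs")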

I am on Spark 1.3, but it can be upgraded if needed. Perhaps this could be done 
better with a DataFrame? Then I would create one large DataFrame with an 
additional “filename” key, i.e.:
File | ID | Value
01 | 01 | 10
01 | 02 | 12
01 | 03 | 55
02 | 01 | 21
02 | 02 | 23
…

What would then be a way to make an efficient query over such a DataFrame?
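
Something along these lines is what I imagine for the DataFrame variant (Spark 
1.3 syntax, untested; rows_rdd and special_rdd are placeholders for the parsed 
CSV data):

from pyspark.sql import SQLContext, Row

sqlContext = SQLContext(sc)

# rows_rdd: RDD of (filename, id, value) tuples, e.g. built via wholeTextFiles
data_df = sqlContext.createDataFrame(
    rows_rdd.map(lambda t: Row(file=t[0], id=t[1], value=t[2])))

# special_rdd: RDD of (id, value) tuples from the Special file
special_df = sqlContext.createDataFrame(
    special_rdd.map(lambda t: Row(id=t[0], special_value=t[1])))

# Plain equi-join on id; the Special table is tiny, so a broadcast join would
# be ideal, but I am not sure how Spark 1.3 picks the join strategy
joined = data_df.join(special_df, data_df["id"] == special_df["id"])

result = joined.select(
    data_df["file"],
    data_df["id"],
    (special_df["special_value"] - data_df["value"]).alias("diff"))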

Any advice will be appreciated.

Best regards,
Lucas




--
Best Regards,
Ayan Guha

