I believe Rishi is correct. I wouldn't rely on that behavior, though - all it would take is for one file to exceed the block size and you'd be setting yourself up for pain. Also, if your files are small - small enough to fit in a single record - you could use SparkContext.wholeTextFiles.
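For what it's worth, a minimal sketch of the wholeTextFiles approach, assuming a spark-shell session where `sc` is the SparkContext (the input path is made up for illustration):

```scala
// wholeTextFiles yields one (filename, fileContents) record per file,
// so each file's data stays together regardless of HDFS block size.
val files = sc.wholeTextFiles("hdfs:///path/to/input/dir")  // hypothetical path

// e.g. split each file's contents back into lines, keyed by filename
val linesByFile = files.mapValues(_.split("\n").toSeq)
```

Keep in mind wholeTextFiles loads each file as a single record, so it's only suitable when every file comfortably fits in memory.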
On Thu, Nov 13, 2014 at 10:11 AM, Rishi Yadav <[email protected]> wrote:

> If your data is in hdfs and you are reading as textFile and each file is
> less than block size, my understanding is it would always have one
> partition per file.
>
> On Thursday, November 13, 2014, Daniel Siegmann <[email protected]> wrote:
>
>> Would it make sense to read each file in as a separate RDD? This way you
>> would be guaranteed the data is partitioned as you expected.
>>
>> Possibly you could then repartition each of those RDDs into a single
>> partition and then union them. I think that would achieve what you expect.
>> But it would be easy to accidentally screw this up (have some operation
>> that causes a shuffle), so I think you're better off just leaving them as
>> separate RDDs.
>>
>> On Wed, Nov 12, 2014 at 10:27 PM, Pala M Muthaia <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> I have a set of input files for a spark program, with each file
>>> corresponding to a logical data partition. What is the API/mechanism to
>>> assign each input file (or a set of files) to a spark partition, when
>>> initializing RDDs?
>>>
>>> When I create a spark RDD pointing to the directory of files, my
>>> understanding is it's not guaranteed that each input file will be treated
>>> as a separate partition.
>>>
>>> My job semantics require that the data is partitioned, and I want to
>>> leverage the partitioning that has already been done, rather than
>>> repartitioning again in the spark job.
>>>
>>> I tried to look this up online but haven't found any pointers so far.
>>>
>>> Thanks
>>> pala
>>
>> --
>> Daniel Siegmann, Software Developer
>> Velos
>> Accelerating Machine Learning
>>
>> 54 W 40th St, New York, NY 10018
>> E: [email protected] W: www.velos.io
>
> --
> - Rishi

--
Daniel Siegmann, Software Developer
Velos
Accelerating Machine Learning

54 W 40th St, New York, NY 10018
E: [email protected] W: www.velos.io
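P.S. Daniel's separate-RDDs suggestion from the thread above might be sketched like this (the paths are assumptions on my part; using coalesce rather than repartition since coalescing down to one partition avoids a shuffle):

```scala
// One RDD per input file, so each file keeps its own logical partition.
val paths = Seq("hdfs:///data/part-0", "hdfs:///data/part-1")  // hypothetical paths
val rdds = paths.map(p => sc.textFile(p))

// If a single RDD is needed, collapse each to one partition and union them.
// Any subsequent operation that causes a shuffle would destroy this layout,
// which is why leaving them as separate RDDs may be safer.
val combined = sc.union(rdds.map(_.coalesce(1)))
```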
