If your data is in HDFS and you are reading it with sc.textFile, and each file is smaller than the HDFS block size, my understanding is you would always get one partition per file.
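To illustrate why that holds, here is a rough sketch (in plain Python, not Spark) of how a FileInputFormat-style split computation behaves; sc.textFile derives its partitions from these splits. The 128 MB block size and the file sizes are made-up numbers, and this ignores details like Hadoop's split-slop factor:

```python
# Sketch of FileInputFormat-style split counting: each file is cut
# into splits of at most the block size, so a file smaller than one
# block yields exactly one split (hence one Spark partition).
# Splits never span file boundaries.

def num_splits(file_size, block_size=128 * 1024 * 1024):
    if file_size == 0:
        return 1  # an empty file still produces one (empty) split
    return -(-file_size // block_size)  # ceiling division

# Three files, all under the 128 MB block size -> one split each.
sizes = [10 * 1024 * 1024, 64 * 1024 * 1024, 127 * 1024 * 1024]
print([num_splits(s) for s in sizes])

# A 300 MB file, by contrast, would be cut into multiple splits.
print(num_splits(300 * 1024 * 1024))
```

The caveat is the converse case: once a file crosses the block size, it gets split across multiple partitions, so one-partition-per-file is only guaranteed for small files.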
On Thursday, November 13, 2014, Daniel Siegmann <[email protected]> wrote:

> Would it make sense to read each file in as a separate RDD? This way you
> would be guaranteed the data is partitioned as you expected.
>
> Possibly you could then repartition each of those RDDs into a single
> partition and then union them. I think that would achieve what you expect.
> But it would be easy to accidentally screw this up (have some operation
> that causes a shuffle), so I think you're better off just leaving them as
> separate RDDs.
>
> On Wed, Nov 12, 2014 at 10:27 PM, Pala M Muthaia <[email protected]> wrote:
>
>> Hi,
>>
>> I have a set of input files for a Spark program, with each file
>> corresponding to a logical data partition. What is the API/mechanism to
>> assign each input file (or a set of files) to a Spark partition when
>> initializing RDDs?
>>
>> When I create a Spark RDD pointing to the directory of files, my
>> understanding is it's not guaranteed that each input file will be treated
>> as a separate partition.
>>
>> My job semantics require that the data is partitioned, and I want to
>> leverage the partitioning that has already been done, rather than
>> repartitioning again in the Spark job.
>>
>> I tried to look this up online but haven't found any pointers so far.
>>
>> Thanks
>> pala
>
> --
> Daniel Siegmann, Software Developer
> Velos
> Accelerating Machine Learning
>
> 54 W 40th St, New York, NY 10018
> E: [email protected]  W: www.velos.io

--
Rishi
