I believe Rishi is correct. I wouldn't rely on that behavior, though - all it would take is for one file to exceed the block size and you'd be setting yourself up for pain. Also, if your files are small - small enough to fit in a single record - you could use SparkContext.wholeTextFiles.
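For what it's worth, a minimal sketch of the wholeTextFiles approach, assuming a spark-shell session where `sc` is the SparkContext (the input path is made up for illustration):

```scala
// wholeTextFiles yields one (filename, fileContents) record per file,
// so each file's data stays together regardless of HDFS block size.
val files = sc.wholeTextFiles("hdfs:///path/to/input/dir")  // hypothetical path

// e.g. split each file's contents back into lines, keyed by filename
val linesByFile = files.mapValues(_.split("\n").toSeq)
```

Keep in mind wholeTextFiles loads each file as a single record, so it's only suitable when every file comfortably fits in memory.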
On Thu, Nov 13, 2014 at 10:11 AM, Rishi Yadav <[email protected]> wrote:

> If your data is in hdfs and you are reading as textFile and each file is
> less than block size, my understanding is it would always have one
> partition per file.
>
> On Thursday, November 13, 2014, Daniel Siegmann <[email protected]> wrote:
>
>> Would it make sense to read each file in as a separate RDD? This way you
>> would be guaranteed the data is partitioned as you expected.
>>
>> Possibly you could then repartition each of those RDDs into a single
>> partition and then union them. I think that would achieve what you expect.
>> But it would be easy to accidentally screw this up (have some operation
>> that causes a shuffle), so I think you're better off just leaving them as
>> separate RDDs.
>>
>> On Wed, Nov 12, 2014 at 10:27 PM, Pala M Muthaia <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> I have a set of input files for a spark program, with each file
>>> corresponding to a logical data partition. What is the API/mechanism to
>>> assign each input file (or a set of files) to a spark partition, when
>>> initializing RDDs?
>>>
>>> When I create a spark RDD pointing to the directory of files, my
>>> understanding is it's not guaranteed that each input file will be treated
>>> as a separate partition.
>>>
>>> My job semantics require that the data is partitioned, and I want to
>>> leverage the partitioning that has already been done, rather than
>>> repartitioning again in the spark job.
>>>
>>> I tried to look this up online but haven't found any pointers so far.
>>>
>>> Thanks
>>> pala
>>
>> --
>> Daniel Siegmann, Software Developer
>> Velos
>> Accelerating Machine Learning
>>
>> 54 W 40th St, New York, NY 10018
>> E: [email protected] W: www.velos.io
>
> --
> - Rishi

--
Daniel Siegmann, Software Developer
Velos
Accelerating Machine Learning

54 W 40th St, New York, NY 10018
E: [email protected] W: www.velos.io
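P.S. Daniel's separate-RDDs suggestion from the thread above might be sketched like this (the paths are assumptions on my part; using coalesce rather than repartition since coalescing down to one partition avoids a shuffle):

```scala
// One RDD per input file, so each file keeps its own logical partition.
val paths = Seq("hdfs:///data/part-0", "hdfs:///data/part-1")  // hypothetical paths
val rdds = paths.map(p => sc.textFile(p))

// If a single RDD is needed, collapse each to one partition and union them.
// Any subsequent operation that causes a shuffle would destroy this layout,
// which is why leaving them as separate RDDs may be safer.
val combined = sc.union(rdds.map(_.coalesce(1)))
```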
