Hi Lan, thanks for the reply. I have tried the following, but it did not increase the number of partitions:

DataFrame df = hiveContext.read().format("orc")
        .load("/hdfs/path/to/orc/files/")
        .repartition(100);

Yes, I have checked in the NameNode UI: the directory contains 12 ORC files/blocks of 128 MB each, so roughly 1 GB compressed on disk, and the data is around 10 GB when decompressed.
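One thing worth noting: repartition(100) inserts a shuffle after the scan, so the first (scan) stage in the Spark UI will still show 12 tasks; only stages downstream of the shuffle get 100 partitions. Below is a minimal sketch for verifying the counts, assuming a Spark 1.5-era setup; the class name and path are placeholders, and the spark.sql.shuffle.partitions setting at the end is what actually sizes the groupBy stage:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.hive.HiveContext;

public class PartitionCheck {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setAppName("PartitionCheck"));
        HiveContext hiveContext = new HiveContext(sc.sc());

        // Scan partitions: one per input split (12 in this case).
        DataFrame df = hiveContext.read().format("orc")
                .load("/hdfs/path/to/orc/files/");
        System.out.println("scan partitions: "
                + df.javaRDD().partitions().size());

        // repartition(n) returns a NEW DataFrame and adds a shuffle; it
        // cannot change the task count of the scan stage itself.
        DataFrame wide = df.repartition(100);
        System.out.println("after repartition: "
                + wide.javaRDD().partitions().size());

        // The groupBy aggregation stage is sized by this setting
        // (default 200), independent of the input partition count.
        hiveContext.setConf("spark.sql.shuffle.partitions", "100");
    }
}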
On Fri, Oct 9, 2015 at 12:43 AM, Lan Jiang <ljia...@gmail.com> wrote:

> Hmm, that’s odd.
>
> You can always use repartition(n) to increase the partition number, but
> then there will be a shuffle. How large are your ORC files? Have you used
> the NameNode UI to check how many HDFS blocks each ORC file has?
>
> Lan
>
>
> On Oct 8, 2015, at 2:08 PM, Umesh Kacha <umesh.ka...@gmail.com> wrote:
>
> Hi Lan, thanks for the response. Yes, I know, and I have confirmed in the
> UI that it has only 12 partitions because of the 12 HDFS blocks, and the
> Hive ORC file stripe size is 33554432.
>
> On Thu, Oct 8, 2015 at 11:55 PM, Lan Jiang <ljia...@gmail.com> wrote:
>
>> The partition number should be the same as the HDFS block number instead
>> of the file number. Did you confirm from the Spark UI that only 12
>> partitions were created? What is your orc.stripe.size?
>>
>> Lan
>>
>>
>> > On Oct 8, 2015, at 1:13 PM, unk1102 <umesh.ka...@gmail.com> wrote:
>> >
>> > Hi, I have the following code, which loads a directory of 12 ORC files
>> > from HDFS. Since the directory contains 12 files, Spark creates 12
>> > partitions by default. The directory is huge: when the ORC files get
>> > decompressed it is around 10 GB. How do I increase the number of
>> > partitions for the code below so that my Spark job runs faster and
>> > does not hang for a long time shuffling 10 GB across only 12
>> > partitions? Please guide.
>> >
>> > DataFrame df =
>> >     hiveContext.read().format("orc").load("/hdfs/path/to/orc/files/");
>> > df.select().groupBy(..)
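Since stripe size came up above: ORC input splits cannot be finer than stripes, so a 32 MB (33554432) stripe bounds how many splits each 128 MB block can yield. One lever that is sometimes suggested, assuming the ORC reader in your Spark version honors the standard Hadoop split settings (this varies by version, so verify the scan-stage task count in the UI afterwards), is to lower the max split size before the read; sc and hiveContext here are the same as in the sketch further up:

// Hedged sketch: request smaller input splits so the scan stage gets
// more tasks. Whether the ORC input format honors this setting depends
// on the Spark/Hive version in use.
sc.hadoopConfiguration().set(
        "mapreduce.input.fileinputformat.split.maxsize",
        "33554432"); // 32 MB, matching the stripe size mentioned above

DataFrame df = hiveContext.read().format("orc")
        .load("/hdfs/path/to/orc/files/");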