Re: How do we control output part files created by Spark job?

Srikanth Fri, 10 Jul 2015 19:12:07 -0700

Is there a join involved in your SQL?
If so, have a look at spark.sql.shuffle.partitions. It defaults to 200, which
would match the roughly 200 part files you are seeing.
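
A minimal sketch of two ways to apply this, assuming Spark 1.4 with Hive support on the classpath. The table names (target_table, source_a, source_b) and the query are hypothetical placeholders, not taken from the original job. As far as I understand, hiveContext.sql("insert into ...") runs the insert eagerly, so a coalesce() chained onto its result comes too late to change the files already written. The file count instead follows the number of partitions in the final stage of the query, which after a join or aggregation is spark.sql.shuffle.partitions.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object ControlPartFiles {
  // Hypothetical query with a join; replace with the real one.
  val selectSql =
    "SELECT a.* FROM source_a a JOIN source_b b ON a.id = b.id"

  // Option 1: lower the number of post-shuffle partitions, then insert.
  // The setting applies to every shuffle in queries run afterwards.
  def insertWithFewerShufflePartitions(hiveContext: HiveContext): Unit = {
    hiveContext.setConf("spark.sql.shuffle.partitions", "6")
    hiveContext.sql("INSERT INTO TABLE target_table " + selectSql)
  }

  // Option 2: run only the SELECT through sql(), coalesce the resulting
  // DataFrame, and write it out explicitly so coalesce() precedes the write.
  def writeCoalesced(hiveContext: HiveContext): Unit = {
    hiveContext.sql(selectSql)
      .coalesce(6)
      .write
      .format("orc")
      .save("/path/in/hdfs")
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("control-part-files"))
    val hiveContext = new HiveContext(sc)
    // Pick one of the two options; they are alternatives, not steps.
    writeCoalesced(hiveContext)
    sc.stop()
  }
}
```

Note that option 1 reduces the parallelism of every shuffle in the query, whereas option 2 only narrows the final stage, so option 2 may be the gentler choice for large inputs.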


Srikanth

On Wed, Jul 8, 2015 at 1:29 AM, Umesh Kacha <umesh.ka...@gmail.com> wrote:

> Hi Srikanth, thanks for the response. I have the following code:
>
> hiveContext.sql("insert into... ").coalesce(6)
>
> The code above does not create 6 part files; it creates around 200 small files.
>
> Please guide. Thanks.
> On Jul 8, 2015 4:07 AM, "Srikanth" <srikanth...@gmail.com> wrote:
>
>> Did you do
>>
>>         yourRdd.coalesce(6).saveAsTextFile()
>>
>>                         or
>>
>>         yourRdd.coalesce(6)
>>         yourRdd.saveAsTextFile()
>> ?
>>
>> Srikanth
>>
>> On Tue, Jul 7, 2015 at 12:59 PM, Umesh Kacha <umesh.ka...@gmail.com>
>> wrote:
>>
>>> Hi, I tried both approaches, df.repartition(6) and df.coalesce(6), but
>>> neither reduces the number of part-xxxxx files. Even after calling these
>>> methods I still see around 200 small part files of about 20 MB each, which
>>> are again ORC files.
>>>
>>>
>>> On Tue, Jul 7, 2015 at 12:52 AM, Sathish Kumaran Vairavelu <
>>> vsathishkuma...@gmail.com> wrote:
>>>
>>>> Try the coalesce function to limit the number of part files.
>>>> On Mon, Jul 6, 2015 at 1:23 PM kachau <umesh.ka...@gmail.com> wrote:
>>>>
>>>>> Hi, I have a couple of Spark jobs which process thousands of files
>>>>> every day. File sizes may vary from MBs to GBs. After a job finishes I
>>>>> usually save the output using the following code:
>>>>>
>>>>> finalJavaRDD.saveAsParquetFile("/path/in/hdfs"); OR
>>>>> dataFrame.write.format("orc").save("/path/in/hdfs") // storing as ORC file, as of Spark 1.4
>>>>>
>>>>> The Spark job creates plenty of small part files in the final output
>>>>> directory. As far as I understand, Spark creates one part file for each
>>>>> partition/task; please correct me if I am wrong. How do we control the
>>>>> number of part files Spark creates? Eventually I would like to create a
>>>>> Hive table on top of this Parquet/ORC directory, and I have heard that
>>>>> Hive is slow when there is a large number of small files.
>>>>> Please guide, I am new to Spark. Thanks in advance.
