Re: Why so many parquet file part when I store data in Alluxio or File?

2016-07-08 Thread Chanh Le
Hi Gene,
Thanks for your support. I agree with you that the number of executors is the cause, but having many parquet files hurts read performance, so I need a way to improve that. My workaround is:
  df.coalesce(1)
  .write.mode(SaveMode.Overwrite).partitionBy("network_id")
  .parquet(s"$alluxioURL/$outFolderName/time=${dailyFormat.print(jobRunTime)}")
I know this is not ideal because it adds a shuffle and costs time, but reads 
improve a lot. Right now, I am using that method to partition my data.

Regards,
Chanh
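
A minimal sketch, not from the original mail: an alternative to coalesce(1) is to repartition by the same column passed to partitionBy, so each network_id is written by a single task instead of funnelling the whole DataFrame through one task. Variable names follow Chanh's snippet above; this was not tested against his setup.

    import org.apache.spark.sql.SaveMode
    import org.apache.spark.sql.functions.col

    // Shuffle so all rows for a given network_id land in one task; the
    // partitioned write then produces roughly one part file per network_id
    // directory instead of one file per executor partition.
    df.repartition(col("network_id"))
      .write.mode(SaveMode.Overwrite)
      .partitionBy("network_id")
      .parquet(s"$alluxioURL/$outFolderName/time=${dailyFormat.print(jobRunTime)}")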


> On Jul 8, 2016, at 8:33 PM, Gene Pang  wrote:
> 
> Hi Chanh,
> 
> You should be able to set the Alluxio block size with:
> 
> sc.hadoopConfiguration.set("alluxio.user.block.size.bytes.default", "256mb")
> 
> I think you have many parquet files because you have many Spark executors 
> writing out their partition of the files.
> 
> Hope that helps,
> Gene
> 
> On Sun, Jul 3, 2016 at 8:02 PM, Chanh Le wrote:
> Hi Gene,
> Could you give some suggestions on that?
> 
> 
> 
>> On Jul 1, 2016, at 5:31 PM, Ted Yu wrote:
>> 
>> The comment from zhangxiongfei was from a year ago.
>> 
>> Maybe something changed since then?
>> 
>> On Fri, Jul 1, 2016 at 12:07 AM, Chanh Le wrote:
>> Hi Ted,
>> I set sc.hadoopConfiguration.setBoolean("fs.hdfs.impl.disable.cache", true)
>> sc.hadoopConfiguration.setLong("fs.local.block.size", 268435456)
>> but it seems not to be working.
>> 
>> 
>> 
>> 
>>> On Jul 1, 2016, at 11:38 AM, Ted Yu wrote:
>>> 
>>> Looking under Alluxio source, it seems only "fs.hdfs.impl.disable.cache" is 
>>> in use.
>>> 
>>> FYI
>>> 
>>> On Thu, Jun 30, 2016 at 9:30 PM, Deepak Sharma wrote:
>>> Ok.
>>> I came across this issue.
>>> Not sure if you already assessed this:
>>> https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-6921 
>>> 
>>> The workaround mentioned may work for you.
>>> 
>>> Thanks
>>> Deepak
>>> 
>>> On 1 Jul 2016 9:34 am, "Chanh Le" wrote:
>>> Hi Deepak,
>>> Thanks for replying. The way I write into Alluxio is
>>> df.write.mode(SaveMode.Append).partitionBy("network_id",
>>> "time").parquet("alluxio://master1:1/FACT_ADMIN_HOURLY")
>>> 
>>> 
>>> I partition by 2 columns and store. I just want the write to automatically
>>> produce file sizes that match what I already set in Alluxio: 512MB per block.
>>> 
>>> 
 On Jul 1, 2016, at 11:01 AM, Deepak Sharma wrote:
 
 Before writing, coalesce your RDD to 1.
 It will create only 1 output file.
 Multiple part files happen because all your executors write their
 partitions to separate part files.
 
 Thanks
 Deepak
 
 On 1 Jul 2016 8:01 am, "Chanh Le" wrote:
 Hi everyone,
 I am using Alluxio for storage, but I am a little confused: I set the
 Alluxio block size to 512MB, yet each file part is only a few KB and
 there are too many parts.
 Is that normal? I want reads to be fast; do that many parts affect the
 read operation?
 How do I set the size of the file parts?
 
 Thanks.
 Chanh
 
 
 
  
 
 
>>> 
>>> 
>> 
>> 
> 
> 



Re: Why so many parquet file part when I store data in Alluxio or File?

2016-07-08 Thread Gene Pang
Hi Chanh,

You should be able to set the Alluxio block size with:

sc.hadoopConfiguration.set("alluxio.user.block.size.bytes.default", "256mb")

I think you have many parquet files because you have many Spark executors
writing out their partition of the files.

Hope that helps,
Gene
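
A minimal sketch, not from the original thread, showing Gene's setting applied before the partitioned write Chanh described. sc, df, and outputPath are assumed names, the real alluxio:// URI is elided in the thread, and whether the resulting part files actually reach the configured block size was not confirmed here.

    import org.apache.spark.sql.SaveMode

    // Ask the Alluxio client to use 256 MB blocks for files written by this job.
    sc.hadoopConfiguration.set("alluxio.user.block.size.bytes.default", "256mb")

    // Placeholder path -- substitute the real alluxio://host:port URI.
    val outputPath = "alluxio://<master-host>:<port>/FACT_ADMIN_HOURLY"

    df.write.mode(SaveMode.Append)
      .partitionBy("network_id", "time")
      .parquet(outputPath)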

On Sun, Jul 3, 2016 at 8:02 PM, Chanh Le  wrote:

> Hi Gene,
> Could you give some suggestions on that?
>
>
>
> On Jul 1, 2016, at 5:31 PM, Ted Yu  wrote:
>
> The comment from zhangxiongfei was from a year ago.
>
> Maybe something changed since then?
>
> On Fri, Jul 1, 2016 at 12:07 AM, Chanh Le  wrote:
>
>> Hi Ted,
>> I set sc.hadoopConfiguration.setBoolean("fs.hdfs.impl.disable.cache",
>> true)
>>
>> sc.hadoopConfiguration.setLong("fs.local.block.size", 268435456)
>>
>> but it seems not to be working.
>>
>> 
>>
>>
>> On Jul 1, 2016, at 11:38 AM, Ted Yu  wrote:
>>
>> Looking under Alluxio source, it seems only "fs.hdfs.impl.disable.cache"
>> is in use.
>>
>> FYI
>>
>> On Thu, Jun 30, 2016 at 9:30 PM, Deepak Sharma 
>> wrote:
>>
>>> Ok.
>>> I came across this issue.
>>> Not sure if you already assessed this:
>>> https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-6921
>>>
>>> The workaround mentioned may work for you.
>>>
>>> Thanks
>>> Deepak
>>> On 1 Jul 2016 9:34 am, "Chanh Le"  wrote:
>>>
 Hi Deepak,
 Thanks for replying. The way I write into Alluxio is
 df.write.mode(SaveMode.Append).partitionBy("network_id", "time")
   .parquet("alluxio://master1:1/FACT_ADMIN_HOURLY")


 I partition by 2 columns and store. I just want the write to automatically
 produce file sizes that match what I already set in Alluxio: 512MB per block.


 On Jul 1, 2016, at 11:01 AM, Deepak Sharma 
 wrote:

 Before writing, coalesce your RDD to 1.
 It will create only 1 output file.
 Multiple part files happen because all your executors write their
 partitions to separate part files.

 Thanks
 Deepak
 On 1 Jul 2016 8:01 am, "Chanh Le"  wrote:

 Hi everyone,
 I am using Alluxio for storage, but I am a little confused: I set the
 Alluxio block size to 512MB, yet each file part is only a few KB and
 there are too many parts.
 Is that normal? I want reads to be fast; do that many parts affect the
 read operation?
 How do I set the size of the file parts?

 Thanks.
 Chanh





 



>>
>>
>
>


Re: Why so many parquet file part when I store data in Alluxio or File?

2016-07-03 Thread Chanh Le
Hi Gene,
Could you give some suggestions on that?



> On Jul 1, 2016, at 5:31 PM, Ted Yu  wrote:
> 
> The comment from zhangxiongfei was from a year ago.
> 
> Maybe something changed since then?
> 
> On Fri, Jul 1, 2016 at 12:07 AM, Chanh Le wrote:
> Hi Ted,
> I set sc.hadoopConfiguration.setBoolean("fs.hdfs.impl.disable.cache", true)
> sc.hadoopConfiguration.setLong("fs.local.block.size", 268435456)
> but it seems not to be working.
> 
> 
> 
> 
>> On Jul 1, 2016, at 11:38 AM, Ted Yu wrote:
>> 
>> Looking under Alluxio source, it seems only "fs.hdfs.impl.disable.cache" is 
>> in use.
>> 
>> FYI
>> 
>> On Thu, Jun 30, 2016 at 9:30 PM, Deepak Sharma wrote:
>> Ok.
>> I came across this issue.
>> Not sure if you already assessed this:
>> https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-6921 
>> 
>> The workaround mentioned may work for you.
>> 
>> Thanks
>> Deepak
>> 
>> On 1 Jul 2016 9:34 am, "Chanh Le" wrote:
>> Hi Deepak,
>> Thanks for replying. The way I write into Alluxio is
>> df.write.mode(SaveMode.Append).partitionBy("network_id",
>> "time").parquet("alluxio://master1:1/FACT_ADMIN_HOURLY")
>> 
>> 
>> I partition by 2 columns and store. I just want the write to automatically
>> produce file sizes that match what I already set in Alluxio: 512MB per block.
>> 
>> 
>>> On Jul 1, 2016, at 11:01 AM, Deepak Sharma wrote:
>>> 
>>> Before writing, coalesce your RDD to 1.
>>> It will create only 1 output file.
>>> Multiple part files happen because all your executors write their
>>> partitions to separate part files.
>>> 
>>> Thanks
>>> Deepak
>>> 
>>> On 1 Jul 2016 8:01 am, "Chanh Le" wrote:
>>> Hi everyone,
>>> I am using Alluxio for storage, but I am a little confused: I set the
>>> Alluxio block size to 512MB, yet each file part is only a few KB and
>>> there are too many parts.
>>> Is that normal? I want reads to be fast; do that many parts affect the
>>> read operation?
>>> How do I set the size of the file parts?
>>> 
>>> Thanks.
>>> Chanh
>>> 
>>> 
>>> 
>>>  
>>> 
>>> 
>> 
>> 
> 
> 



Re: Why so many parquet file part when I store data in Alluxio or File?

2016-06-30 Thread Ted Yu
Looking under Alluxio source, it seems only "fs.hdfs.impl.disable.cache" is
in use.

FYI

On Thu, Jun 30, 2016 at 9:30 PM, Deepak Sharma 
wrote:

> Ok.
> I came across this issue.
> Not sure if you already assessed this:
> https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-6921
>
> The workaround mentioned may work for you.
>
> Thanks
> Deepak
> On 1 Jul 2016 9:34 am, "Chanh Le"  wrote:
>
>> Hi Deepak,
>> Thanks for replying. The way I write into Alluxio is
>> df.write.mode(SaveMode.Append).partitionBy("network_id", "time")
>>   .parquet("alluxio://master1:1/FACT_ADMIN_HOURLY")
>>
>>
>> I partition by 2 columns and store. I just want the write to automatically
>> produce file sizes that match what I already set in Alluxio: 512MB per block.
>>
>>
>> On Jul 1, 2016, at 11:01 AM, Deepak Sharma  wrote:
>>
>> Before writing, coalesce your RDD to 1.
>> It will create only 1 output file.
>> Multiple part files happen because all your executors write their
>> partitions to separate part files.
>>
>> Thanks
>> Deepak
>> On 1 Jul 2016 8:01 am, "Chanh Le"  wrote:
>>
>> Hi everyone,
>> I am using Alluxio for storage, but I am a little confused: I set the
>> Alluxio block size to 512MB, yet each file part is only a few KB and
>> there are too many parts.
>> Is that normal? I want reads to be fast; do that many parts affect the
>> read operation?
>> How do I set the size of the file parts?
>>
>> Thanks.
>> Chanh
>>
>>
>>
>>
>>
>> 
>>
>>
>>


Re: Why so many parquet file part when I store data in Alluxio or File?

2016-06-30 Thread Deepak Sharma
Ok.
I came across this issue.
Not sure if you already assessed this:
https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-6921

The workaround mentioned may work for you.

Thanks
Deepak
On 1 Jul 2016 9:34 am, "Chanh Le"  wrote:

> Hi Deepak,
> Thanks for replying. The way I write into Alluxio is
> df.write.mode(SaveMode.Append).partitionBy("network_id", "time")
>   .parquet("alluxio://master1:1/FACT_ADMIN_HOURLY")
>
>
> I partition by 2 columns and store. I just want the write to automatically
> produce file sizes that match what I already set in Alluxio: 512MB per block.
>
>
> On Jul 1, 2016, at 11:01 AM, Deepak Sharma  wrote:
>
> Before writing, coalesce your RDD to 1.
> It will create only 1 output file.
> Multiple part files happen because all your executors write their
> partitions to separate part files.
>
> Thanks
> Deepak
> On 1 Jul 2016 8:01 am, "Chanh Le"  wrote:
>
> Hi everyone,
> I am using Alluxio for storage, but I am a little confused: I set the
> Alluxio block size to 512MB, yet each file part is only a few KB and
> there are too many parts.
> Is that normal? I want reads to be fast; do that many parts affect the
> read operation?
> How do I set the size of the file parts?
>
> Thanks.
> Chanh
>
>
>
>
>
> 
>
>
>


Re: Why so many parquet file part when I store data in Alluxio or File?

2016-06-30 Thread Chanh Le
Hi Deepak,
Thanks for replying. The way I write into Alluxio is
df.write.mode(SaveMode.Append).partitionBy("network_id", "time")
  .parquet("alluxio://master1:1/FACT_ADMIN_HOURLY")


I partition by 2 columns and store. I just want the write to automatically
produce file sizes that match what I already set in Alluxio: 512MB per block.
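
A minimal sketch, not from the original mail, for checking what a write actually produced: it lists the part files under the output directory with their sizes, so you can see how far they are from the 512MB Alluxio block size. The URI is a placeholder because the real master host/port is elided above, and it assumes the Alluxio Hadoop client is on the classpath so the alluxio:// scheme resolves.

    import java.net.URI
    import org.apache.hadoop.fs.{FileSystem, Path}

    // Placeholder path -- substitute the real alluxio://host:port URI.
    val outPath = "alluxio://<master-host>:<port>/FACT_ADMIN_HOURLY"

    val fs = FileSystem.get(new URI(outPath), sc.hadoopConfiguration)
    val files = fs.listFiles(new Path(outPath), true) // recurse into partition dirs
    while (files.hasNext) {
      val f = files.next()
      if (f.getPath.getName.startsWith("part-"))
        println(f"${f.getPath}%s  ${f.getLen / 1024.0 / 1024.0}%.2f MB")
    }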


> On Jul 1, 2016, at 11:01 AM, Deepak Sharma  wrote:
> 
> Before writing, coalesce your RDD to 1.
> It will create only 1 output file.
> Multiple part files happen because all your executors write their
> partitions to separate part files.
> 
> Thanks
> Deepak
> 
> On 1 Jul 2016 8:01 am, "Chanh Le" wrote:
> Hi everyone,
> I am using Alluxio for storage, but I am a little confused: I set the
> Alluxio block size to 512MB, yet each file part is only a few KB and
> there are too many parts.
> Is that normal? I want reads to be fast; do that many parts affect the
> read operation?
> How do I set the size of the file parts?
> 
> Thanks.
> Chanh
> 
> 
> 
>  
> 
>