Re: SparkSQL overwrite parquet file does not generate _common_metadata

2015-03-27 Thread Pei-Lun Lee
JIRA ticket created at:
https://issues.apache.org/jira/browse/SPARK-6581
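
Side note for anyone else hitting this: one knob that may be worth ruling
out (an assumption on my part, not verified against Hadoop 1.0.4) is
parquet-mr's job-summary switch, which gates whether the summary files get
written at all:

scala> // assumption: parquet-mr's ParquetOutputCommitter only writes
scala> // _metadata/_common_metadata when this flag is true (its default)
scala> sc.hadoopConfiguration.setBoolean("parquet.enable.summary-metadata", true)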

Thanks,
--
Pei-Lun

On Fri, Mar 27, 2015 at 7:03 PM, Cheng Lian  wrote:

>  Thanks for the information. Verified that the _common_metadata and
> _metadata files are missing in this case when using Hadoop 1.0.4. Would you
> mind opening a JIRA for this?
>
> Cheng
>
> On 3/27/15 2:40 PM, Pei-Lun Lee wrote:
>
> I'm using 1.0.4
>
>  Thanks,
> --
> Pei-Lun
>
> On Fri, Mar 27, 2015 at 2:32 PM, Cheng Lian  wrote:
>
>>  Hm, which version of Hadoop are you using? Actually there should also
>> be a _metadata file together with _common_metadata. I was using Hadoop
>> 2.4.1 btw. I'm not sure whether Hadoop version matters here, but I did
>> observe cases where Spark behaves differently because of semantic
>> differences of the same API in different Hadoop versions.
>>
>> Cheng
>>
>> On 3/27/15 11:33 AM, Pei-Lun Lee wrote:
>>
>> Hi Cheng,
>>
>>  on my computer, executing res0.save("xxx",
>> org.apache.spark.sql.SaveMode.Overwrite) produces:
>>
>>  peilunlee@pllee-mini:~/opt/spark-1.3...rc3-bin-hadoop1$ ls -l xxx
>> total 32
>> -rwxrwxrwx  1 peilunlee  staff    0 Mar 27 11:29 _SUCCESS*
>> -rwxrwxrwx  1 peilunlee  staff  272 Mar 27 11:29 part-r-1.parquet*
>> -rwxrwxrwx  1 peilunlee  staff  272 Mar 27 11:29 part-r-2.parquet*
>> -rwxrwxrwx  1 peilunlee  staff  272 Mar 27 11:29 part-r-3.parquet*
>> -rwxrwxrwx  1 peilunlee  staff  488 Mar 27 11:29 part-r-4.parquet*
>>
>>  while res0.save("xxx") produces:
>>
>>  peilunlee@pllee-mini:~/opt/spark-1.3...rc3-bin-hadoop1$ ls -l xxx
>> total 40
>> -rwxrwxrwx  1 peilunlee  staff    0 Mar 27 11:29 _SUCCESS*
>> -rwxrwxrwx  1 peilunlee  staff  250 Mar 27 11:29 _common_metadata*
>> -rwxrwxrwx  1 peilunlee  staff  272 Mar 27 11:29 part-r-1.parquet*
>> -rwxrwxrwx  1 peilunlee  staff  272 Mar 27 11:29 part-r-2.parquet*
>> -rwxrwxrwx  1 peilunlee  staff  272 Mar 27 11:29 part-r-3.parquet*
>> -rwxrwxrwx  1 peilunlee  staff  488 Mar 27 11:29 part-r-4.parquet*
>>
>> On Thu, Mar 26, 2015 at 7:26 PM, Cheng Lian 
>> wrote:
>>
>>>  I couldn’t reproduce this with the following spark-shell snippet:
>>>
>>> scala> import sqlContext.implicits._
>>> scala> Seq((1, 2)).toDF("a", "b")
>>> scala> res0.save("xxx", org.apache.spark.sql.SaveMode.Overwrite)
>>> scala> res0.save("xxx", org.apache.spark.sql.SaveMode.Overwrite)
>>>
>>> The _common_metadata file is typically much smaller than _metadata,
>>> because it doesn’t contain row group information, and thus can be faster to
>>> read than _metadata.
>>>
>>> Cheng
>>>
>>> On 3/26/15 12:48 PM, Pei-Lun Lee wrote:
>>>
>>> Hi,
>>>
>>>  When I save a parquet file with SaveMode.Overwrite, it never generates
>>> _common_metadata, whether it overwrites an existing dir or not.
>>> Is this expected behavior?
>>> And what is the benefit of _common_metadata? Does reading perform
>>> better when it is present?
>>>
>>>  Thanks,
>>> --
>>> Pei-Lun
>>>


Re: SparkSQL overwrite parquet file does not generate _common_metadata

2015-03-27 Thread Cheng Lian
Thanks for the information. Verified that the _common_metadata and
_metadata files are missing in this case when using Hadoop 1.0.4. Would
you mind opening a JIRA for this?
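
In the meantime, a possible workaround (an untested sketch on my side,
assuming the parquet-mr 1.6 classes bundled with Spark 1.3 under the
parquet.hadoop package) is to regenerate the summary files from the
part-file footers by hand:

scala> import org.apache.hadoop.fs.Path
scala> import parquet.hadoop.{ParquetFileReader, ParquetFileWriter}
scala> val conf = sc.hadoopConfiguration
scala> // collect the footer of every part file under the output dir
scala> val footers = ParquetFileReader.readFooters(conf, new Path("xxx"))
scala> // write _metadata and _common_metadata back into that dir
scala> ParquetFileWriter.writeMetadataFile(conf, new Path("xxx"), footers)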


Cheng

On 3/27/15 2:40 PM, Pei-Lun Lee wrote:

I'm using 1.0.4

Thanks,
--
Pei-Lun

On Fri, Mar 27, 2015 at 2:32 PM, Cheng Lian wrote:


Hm, which version of Hadoop are you using? Actually there should
also be a _metadata file together with _common_metadata. I was
using Hadoop 2.4.1 btw. I'm not sure whether Hadoop version
matters here, but I did observe cases where Spark behaves
differently because of semantic differences of the same API in
different Hadoop versions.

Cheng

On 3/27/15 11:33 AM, Pei-Lun Lee wrote:

Hi Cheng,

on my computer, executing res0.save("xxx",
org.apache.spark.sql.SaveMode.Overwrite) produces:

peilunlee@pllee-mini:~/opt/spark-1.3...rc3-bin-hadoop1$ ls -l xxx
total 32
-rwxrwxrwx  1 peilunlee  staff    0 Mar 27 11:29 _SUCCESS*
-rwxrwxrwx  1 peilunlee  staff  272 Mar 27 11:29 part-r-1.parquet*
-rwxrwxrwx  1 peilunlee  staff  272 Mar 27 11:29 part-r-2.parquet*
-rwxrwxrwx  1 peilunlee  staff  272 Mar 27 11:29 part-r-3.parquet*
-rwxrwxrwx  1 peilunlee  staff  488 Mar 27 11:29 part-r-4.parquet*

while res0.save("xxx") produces:

peilunlee@pllee-mini:~/opt/spark-1.3...rc3-bin-hadoop1$ ls -l xxx
total 40
-rwxrwxrwx  1 peilunlee  staff    0 Mar 27 11:29 _SUCCESS*
-rwxrwxrwx  1 peilunlee  staff  250 Mar 27 11:29 _common_metadata*
-rwxrwxrwx  1 peilunlee  staff  272 Mar 27 11:29 part-r-1.parquet*
-rwxrwxrwx  1 peilunlee  staff  272 Mar 27 11:29 part-r-2.parquet*
-rwxrwxrwx  1 peilunlee  staff  272 Mar 27 11:29 part-r-3.parquet*
-rwxrwxrwx  1 peilunlee  staff  488 Mar 27 11:29 part-r-4.parquet*

On Thu, Mar 26, 2015 at 7:26 PM, Cheng Lian wrote:

I couldn’t reproduce this with the following spark-shell snippet:

scala> import sqlContext.implicits._
scala> Seq((1, 2)).toDF("a", "b")
scala> res0.save("xxx", org.apache.spark.sql.SaveMode.Overwrite)
scala> res0.save("xxx", org.apache.spark.sql.SaveMode.Overwrite)

The _common_metadata file is typically much smaller than
_metadata, because it doesn’t contain row group information,
and thus can be faster to read than _metadata.

Cheng

On 3/26/15 12:48 PM, Pei-Lun Lee wrote:


Hi,

When I save a parquet file with SaveMode.Overwrite, it never
generates _common_metadata, whether it overwrites an existing
dir or not.
Is this expected behavior?
And what is the benefit of _common_metadata? Does reading
perform better when it is present?

Thanks,
--
Pei-Lun



Re: SparkSQL overwrite parquet file does not generate _common_metadata

2015-03-26 Thread Pei-Lun Lee
I'm using 1.0.4

Thanks,
--
Pei-Lun

On Fri, Mar 27, 2015 at 2:32 PM, Cheng Lian  wrote:

>  Hm, which version of Hadoop are you using? Actually there should also be
> a _metadata file together with _common_metadata. I was using Hadoop 2.4.1
> btw. I'm not sure whether Hadoop version matters here, but I did observe
> cases where Spark behaves differently because of semantic differences of
> the same API in different Hadoop versions.
>
> Cheng
>
> On 3/27/15 11:33 AM, Pei-Lun Lee wrote:
>
> Hi Cheng,
>
>  on my computer, executing res0.save("xxx",
> org.apache.spark.sql.SaveMode.Overwrite) produces:
>
>  peilunlee@pllee-mini:~/opt/spark-1.3...rc3-bin-hadoop1$ ls -l xxx
> total 32
> -rwxrwxrwx  1 peilunlee  staff    0 Mar 27 11:29 _SUCCESS*
> -rwxrwxrwx  1 peilunlee  staff  272 Mar 27 11:29 part-r-1.parquet*
> -rwxrwxrwx  1 peilunlee  staff  272 Mar 27 11:29 part-r-2.parquet*
> -rwxrwxrwx  1 peilunlee  staff  272 Mar 27 11:29 part-r-3.parquet*
> -rwxrwxrwx  1 peilunlee  staff  488 Mar 27 11:29 part-r-4.parquet*
>
>  while res0.save("xxx") produces:
>
>  peilunlee@pllee-mini:~/opt/spark-1.3...rc3-bin-hadoop1$ ls -l xxx
> total 40
> -rwxrwxrwx  1 peilunlee  staff    0 Mar 27 11:29 _SUCCESS*
> -rwxrwxrwx  1 peilunlee  staff  250 Mar 27 11:29 _common_metadata*
> -rwxrwxrwx  1 peilunlee  staff  272 Mar 27 11:29 part-r-1.parquet*
> -rwxrwxrwx  1 peilunlee  staff  272 Mar 27 11:29 part-r-2.parquet*
> -rwxrwxrwx  1 peilunlee  staff  272 Mar 27 11:29 part-r-3.parquet*
> -rwxrwxrwx  1 peilunlee  staff  488 Mar 27 11:29 part-r-4.parquet*
>
> On Thu, Mar 26, 2015 at 7:26 PM, Cheng Lian  wrote:
>
>>  I couldn’t reproduce this with the following spark-shell snippet:
>>
>> scala> import sqlContext.implicits._
>> scala> Seq((1, 2)).toDF("a", "b")
>> scala> res0.save("xxx", org.apache.spark.sql.SaveMode.Overwrite)
>> scala> res0.save("xxx", org.apache.spark.sql.SaveMode.Overwrite)
>>
>> The _common_metadata file is typically much smaller than _metadata,
>> because it doesn’t contain row group information, and thus can be faster to
>> read than _metadata.
>>
>> Cheng
>>
>> On 3/26/15 12:48 PM, Pei-Lun Lee wrote:
>>
>> Hi,
>>
>>  When I save a parquet file with SaveMode.Overwrite, it never generates
>> _common_metadata, whether it overwrites an existing dir or not.
>> Is this expected behavior?
>> And what is the benefit of _common_metadata? Does reading perform better
>> when it is present?
>>
>>  Thanks,
>> --
>> Pei-Lun
>>


Re: SparkSQL overwrite parquet file does not generate _common_metadata

2015-03-26 Thread Cheng Lian
Hm, which version of Hadoop are you using? Actually there should also be 
a _metadata file together with _common_metadata. I was using Hadoop 
2.4.1 btw. I'm not sure whether Hadoop version matters here, but I did 
observe cases where Spark behaves differently because of semantic 
differences of the same API in different Hadoop versions.
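
To check quickly which version a given spark-shell is actually running
against (assuming the stock Hadoop classes are on the classpath):

scala> // reports the Hadoop version Spark was built with / is running on
scala> org.apache.hadoop.util.VersionInfo.getVersion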


Cheng

On 3/27/15 11:33 AM, Pei-Lun Lee wrote:

Hi Cheng,

on my computer, executing res0.save("xxx",
org.apache.spark.sql.SaveMode.Overwrite) produces:


peilunlee@pllee-mini:~/opt/spark-1.3...rc3-bin-hadoop1$ ls -l xxx
total 32
-rwxrwxrwx  1 peilunlee  staff    0 Mar 27 11:29 _SUCCESS*
-rwxrwxrwx  1 peilunlee  staff  272 Mar 27 11:29 part-r-1.parquet*
-rwxrwxrwx  1 peilunlee  staff  272 Mar 27 11:29 part-r-2.parquet*
-rwxrwxrwx  1 peilunlee  staff  272 Mar 27 11:29 part-r-3.parquet*
-rwxrwxrwx  1 peilunlee  staff  488 Mar 27 11:29 part-r-4.parquet*

while res0.save("xxx") produces:

peilunlee@pllee-mini:~/opt/spark-1.3...rc3-bin-hadoop1$ ls -l xxx
total 40
-rwxrwxrwx  1 peilunlee  staff    0 Mar 27 11:29 _SUCCESS*
-rwxrwxrwx  1 peilunlee  staff  250 Mar 27 11:29 _common_metadata*
-rwxrwxrwx  1 peilunlee  staff  272 Mar 27 11:29 part-r-1.parquet*
-rwxrwxrwx  1 peilunlee  staff  272 Mar 27 11:29 part-r-2.parquet*
-rwxrwxrwx  1 peilunlee  staff  272 Mar 27 11:29 part-r-3.parquet*
-rwxrwxrwx  1 peilunlee  staff  488 Mar 27 11:29 part-r-4.parquet*

On Thu, Mar 26, 2015 at 7:26 PM, Cheng Lian wrote:


I couldn’t reproduce this with the following spark-shell snippet:

scala> import sqlContext.implicits._
scala> Seq((1, 2)).toDF("a", "b")
scala> res0.save("xxx", org.apache.spark.sql.SaveMode.Overwrite)
scala> res0.save("xxx", org.apache.spark.sql.SaveMode.Overwrite)

The _common_metadata file is typically much smaller than
_metadata, because it doesn’t contain row group information, and
thus can be faster to read than _metadata.

Cheng

On 3/26/15 12:48 PM, Pei-Lun Lee wrote:


Hi,

When I save a parquet file with SaveMode.Overwrite, it never
generates _common_metadata, whether it overwrites an existing dir
or not.
Is this expected behavior?
And what is the benefit of _common_metadata? Does reading
perform better when it is present?

Thanks,
--
Pei-Lun



Re: SparkSQL overwrite parquet file does not generate _common_metadata

2015-03-26 Thread Pei-Lun Lee
Hi Cheng,

on my computer, executing res0.save("xxx",
org.apache.spark.sql.SaveMode.Overwrite) produces:

peilunlee@pllee-mini:~/opt/spark-1.3...rc3-bin-hadoop1$ ls -l xxx
total 32
-rwxrwxrwx  1 peilunlee  staff    0 Mar 27 11:29 _SUCCESS*
-rwxrwxrwx  1 peilunlee  staff  272 Mar 27 11:29 part-r-1.parquet*
-rwxrwxrwx  1 peilunlee  staff  272 Mar 27 11:29 part-r-2.parquet*
-rwxrwxrwx  1 peilunlee  staff  272 Mar 27 11:29 part-r-3.parquet*
-rwxrwxrwx  1 peilunlee  staff  488 Mar 27 11:29 part-r-4.parquet*

while res0.save("xxx") produces:

peilunlee@pllee-mini:~/opt/spark-1.3...rc3-bin-hadoop1$ ls -l xxx
total 40
-rwxrwxrwx  1 peilunlee  staff    0 Mar 27 11:29 _SUCCESS*
-rwxrwxrwx  1 peilunlee  staff  250 Mar 27 11:29 _common_metadata*
-rwxrwxrwx  1 peilunlee  staff  272 Mar 27 11:29 part-r-1.parquet*
-rwxrwxrwx  1 peilunlee  staff  272 Mar 27 11:29 part-r-2.parquet*
-rwxrwxrwx  1 peilunlee  staff  272 Mar 27 11:29 part-r-3.parquet*
-rwxrwxrwx  1 peilunlee  staff  488 Mar 27 11:29 part-r-4.parquet*

On Thu, Mar 26, 2015 at 7:26 PM, Cheng Lian  wrote:

>  I couldn’t reproduce this with the following spark-shell snippet:
>
> scala> import sqlContext.implicits._
> scala> Seq((1, 2)).toDF("a", "b")
> scala> res0.save("xxx", org.apache.spark.sql.SaveMode.Overwrite)
> scala> res0.save("xxx", org.apache.spark.sql.SaveMode.Overwrite)
>
> The _common_metadata file is typically much smaller than _metadata,
> because it doesn’t contain row group information, and thus can be faster to
> read than _metadata.
>
> Cheng
>
> On 3/26/15 12:48 PM, Pei-Lun Lee wrote:
>
>   Hi,
>
>  When I save a parquet file with SaveMode.Overwrite, it never generates
> _common_metadata, whether it overwrites an existing dir or not.
> Is this expected behavior?
> And what is the benefit of _common_metadata? Does reading perform better
> when it is present?
>
>  Thanks,
> --
> Pei-Lun


Re: SparkSQL overwrite parquet file does not generate _common_metadata

2015-03-26 Thread Cheng Lian

I couldn’t reproduce this with the following spark-shell snippet:

scala> import sqlContext.implicits._
scala> Seq((1, 2)).toDF("a", "b")
scala> res0.save("xxx", org.apache.spark.sql.SaveMode.Overwrite)
scala> res0.save("xxx", org.apache.spark.sql.SaveMode.Overwrite)

The _common_metadata file is typically much smaller than _metadata, 
because it doesn’t contain row group information, and thus can be faster 
to read than _metadata.
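
As a small concrete illustration of the benefit (assuming Spark 1.3's
sqlContext.parquetFile API, and my understanding that the reader prefers a
summary file for schema discovery when one is present):

scala> // load the directory written above; with _common_metadata present
scala> // the schema can come from the summary file rather than from the
scala> // footer of every individual part file
scala> val df = sqlContext.parquetFile("xxx")
scala> df.printSchema()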


Cheng

On 3/26/15 12:48 PM, Pei-Lun Lee wrote:


Hi,

When I save a parquet file with SaveMode.Overwrite, it never generates
_common_metadata, whether it overwrites an existing dir or not.

Is this expected behavior?
And what is the benefit of _common_metadata? Does reading perform
better when it is present?


Thanks,
--
Pei-Lun




SparkSQL overwrite parquet file does not generate _common_metadata

2015-03-25 Thread Pei-Lun Lee
Hi,

When I save a parquet file with SaveMode.Overwrite, it never generates
_common_metadata, whether it overwrites an existing dir or not.
Is this expected behavior?
And what is the benefit of _common_metadata? Does reading perform better
when it is present?
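
For reference, this is roughly what I'm running in spark-shell (the data
is just a placeholder):

scala> import sqlContext.implicits._
scala> val df = Seq((1, 2)).toDF("a", "b")
scala> // with an explicit SaveMode, no _common_metadata shows up
scala> df.save("xxx", org.apache.spark.sql.SaveMode.Overwrite)
scala> // without one, _common_metadata is generated as expected
scala> df.save("yyy")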

Thanks,
--
Pei-Lun