Re: Do we need schema for Parquet files with Spark?

2016-03-04 Thread Ryan Blue
Hi Ashok,

The schema for your data comes from the data frame you're using in Spark,
and it is resolved against a Hive table schema if you are writing to one.
You don't need to configure encodings because they are selected for your
data automatically. For example, Parquet tries dictionary encoding first
and falls back to a non-dictionary encoding if it looks like the
dictionary would take more space.
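The fallback heuristic can be illustrated with a back-of-the-envelope comparison. This is a pure-Python sketch with made-up byte costs, not Parquet's actual accounting:

```python
def dictionary_wins(values, value_size=8, index_size=1):
    """Rough sketch of the dictionary-vs-plain tradeoff.

    Plain encoding stores every value; dictionary encoding stores each
    distinct value once plus a small per-row index. The byte sizes are
    illustrative assumptions, not what Parquet really measures.
    """
    plain_size = len(values) * value_size
    dict_size = len(set(values)) * value_size + len(values) * index_size
    return dict_size < plain_size

# Low-cardinality column: the dictionary pays for itself.
print(dictionary_wins(["US", "UK", "US", "US", "UK"] * 1000))  # True
# Nearly-unique column: the dictionary would take more space, so fall back.
print(dictionary_wins(list(range(1000))))                      # False
```

If you really do need to disable the dictionary entirely, Parquet's writer exposes the `parquet.enable.dictionary` property, but the automatic selection is usually what you want.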

I recommend writing a data frame out to Parquet and then taking a look at
the result with parquet-tools, which you can download from Maven Central.
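parquet-tools subcommands such as `schema` and `meta` will print the file schema and the per-column encodings. As an even quicker stdlib-only sanity check (no parquet-tools needed), you can verify a file is Parquet at all: every Parquet file begins and ends with the 4-byte magic `PAR1`. The demo file below is a stand-in; a real one would come from `df.write.parquet(...)`.

```python
import os
import tempfile

def looks_like_parquet(path):
    """Check for the 4-byte magic 'PAR1' at both ends of the file."""
    if os.path.getsize(path) < 12:   # magic + footer length + magic
        return False
    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(-4, os.SEEK_END)
        tail = f.read(4)
    return head == b"PAR1" and tail == b"PAR1"

# Stand-in file with the right magic bytes (real files come from Spark).
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"PAR1" + b"\x00" * 16 + b"PAR1")
print(looks_like_parquet(path))  # True
os.remove(path)
```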

rb

On Thu, Mar 3, 2016 at 10:50 PM, ashokkumar rajendran <
ashokkumar.rajend...@gmail.com> wrote:

> Hi Ted,
>
> Thanks for pointing this out. This page lists a mailing list for developers
> but apparently not one for users, so I am including only the developers
> mailing list.
>
> Hi Parquet Team,
>
> Could you please clarify the question below? Please also let me know if
> there is a separate mailing list for users, as opposed to developers.
>
> Regards
> Ashok
>
> On Fri, Mar 4, 2016 at 11:01 AM, Ted Yu  wrote:
>
> > Have you taken a look at https://parquet.apache.org/community/ ?
> >
> > On Thu, Mar 3, 2016 at 7:32 PM, ashokkumar rajendran <
> > ashokkumar.rajend...@gmail.com> wrote:
> >
> >> Hi,
> >>
> >> I am exploring using Apache Parquet with Spark SQL in our project. I
> >> notice that Apache Parquet uses different encodings for different
> >> columns. Dictionary encoding in Parquet looks like a good fit for our
> >> performance needs. I do not see much documentation in Spark or Parquet
> >> on how to configure this. For example, how would Parquet know the
> >> dictionary of words if no schema is provided by the user? Where and how
> >> do I specify my schema / config for the Parquet format?
> >>
> >> I could not find the Apache Parquet mailing list on the official site.
> >> It would be great if anyone could share it as well.
> >>
> >> Regards
> >> Ashok
> >>
> >
> >
>



-- 
Ryan Blue
Software Engineer
Netflix


Re: Do we need schema for Parquet files with Spark?

2016-03-04 Thread ashokkumar rajendran
Thanks for the clarification, Xinh.



On Fri, Mar 4, 2016 at 12:30 PM, Xinh Huynh  wrote:

> Hi Ashok,
>
> On the Spark SQL side, when you create a dataframe, it will have a schema
> (each column has a type such as Int or String). Then, when you save that
> dataframe in Parquet format, Spark translates the dataframe schema into
> Parquet data types (see spark.sql.execution.datasources.parquet). Parquet
> then applies dictionary encoding automatically (where applicable) based on
> the data values; the encoding is not specified by the user. Parquet figures
> out the right encoding to use for you.
>
> Xinh
>
> > On Mar 3, 2016, at 7:32 PM, ashokkumar rajendran <
> ashokkumar.rajend...@gmail.com> wrote:
> >
> > Hi,
> >
> > I am exploring using Apache Parquet with Spark SQL in our project. I
> > notice that Apache Parquet uses different encodings for different
> > columns. Dictionary encoding in Parquet looks like a good fit for our
> > performance needs. I do not see much documentation in Spark or Parquet
> > on how to configure this. For example, how would Parquet know the
> > dictionary of words if no schema is provided by the user? Where and how
> > do I specify my schema / config for the Parquet format?
> >
> > I could not find the Apache Parquet mailing list on the official site.
> > It would be great if anyone could share it as well.
> >
> > Regards
> > Ashok
>


Re: Do we need schema for Parquet files with Spark?

2016-03-03 Thread Xinh Huynh
Hi Ashok,

On the Spark SQL side, when you create a dataframe, it will have a schema
(each column has a type such as Int or String). Then, when you save that
dataframe in Parquet format, Spark translates the dataframe schema into
Parquet data types (see spark.sql.execution.datasources.parquet). Parquet
then applies dictionary encoding automatically (where applicable) based on
the data values; the encoding is not specified by the user. Parquet figures
out the right encoding to use for you.
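For reference, that translation looks roughly like the table below. This is an abbreviated, hand-written subset for illustration; the authoritative mapping lives in Spark's spark.sql.execution.datasources.parquet package.

```python
# Abbreviated sketch of the Spark SQL -> Parquet physical-type mapping.
# Illustrative subset only; complex types (arrays, maps, structs) map to
# Parquet nested groups and are omitted here.
SPARK_TO_PARQUET = {
    "BooleanType": "BOOLEAN",
    "IntegerType": "INT32",
    "LongType":    "INT64",
    "FloatType":   "FLOAT",
    "DoubleType":  "DOUBLE",
    "StringType":  "BINARY (UTF8 annotated)",
}

for spark_type, parquet_type in sorted(SPARK_TO_PARQUET.items()):
    print(f"{spark_type:12s} -> {parquet_type}")
```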

Xinh

> On Mar 3, 2016, at 7:32 PM, ashokkumar rajendran 
>  wrote:
> 
> Hi, 
> 
> I am exploring using Apache Parquet with Spark SQL in our project. I
> notice that Apache Parquet uses different encodings for different
> columns. Dictionary encoding in Parquet looks like a good fit for our
> performance needs. I do not see much documentation in Spark or Parquet
> on how to configure this. For example, how would Parquet know the
> dictionary of words if no schema is provided by the user? Where and how
> do I specify my schema / config for the Parquet format?
>
> I could not find the Apache Parquet mailing list on the official site.
> It would be great if anyone could share it as well.
> 
> Regards
> Ashok




Re: Do we need schema for Parquet files with Spark?

2016-03-03 Thread ashokkumar rajendran
Hi Ted,

Thanks for pointing this out. This page lists a mailing list for developers
but apparently not one for users, so I am including only the developers
mailing list.

Hi Parquet Team,

Could you please clarify the question below? Please also let me know if
there is a separate mailing list for users, as opposed to developers.

Regards
Ashok

On Fri, Mar 4, 2016 at 11:01 AM, Ted Yu  wrote:

> Have you taken a look at https://parquet.apache.org/community/ ?
>
> On Thu, Mar 3, 2016 at 7:32 PM, ashokkumar rajendran <
> ashokkumar.rajend...@gmail.com> wrote:
>
>> Hi,
>>
>> I am exploring using Apache Parquet with Spark SQL in our project. I
>> notice that Apache Parquet uses different encodings for different
>> columns. Dictionary encoding in Parquet looks like a good fit for our
>> performance needs. I do not see much documentation in Spark or Parquet
>> on how to configure this. For example, how would Parquet know the
>> dictionary of words if no schema is provided by the user? Where and how
>> do I specify my schema / config for the Parquet format?
>>
>> I could not find the Apache Parquet mailing list on the official site.
>> It would be great if anyone could share it as well.
>>
>> Regards
>> Ashok
>>
>
>


Re: Do we need schema for Parquet files with Spark?

2016-03-03 Thread Ted Yu
Have you taken a look at https://parquet.apache.org/community/ ?

On Thu, Mar 3, 2016 at 7:32 PM, ashokkumar rajendran <
ashokkumar.rajend...@gmail.com> wrote:

> Hi,
>
> I am exploring using Apache Parquet with Spark SQL in our project. I
> notice that Apache Parquet uses different encodings for different
> columns. Dictionary encoding in Parquet looks like a good fit for our
> performance needs. I do not see much documentation in Spark or Parquet
> on how to configure this. For example, how would Parquet know the
> dictionary of words if no schema is provided by the user? Where and how
> do I specify my schema / config for the Parquet format?
>
> I could not find the Apache Parquet mailing list on the official site.
> It would be great if anyone could share it as well.
>
> Regards
> Ashok
>