What version of Spark were you using? Have you tried increasing --executor-memory?
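
For reference, a minimal sketch of bumping executor memory from code rather than via the --executor-memory flag (the 4g value and app name are placeholders; pick something that fits your cluster):

    import org.apache.spark.{SparkConf, SparkContext}

    // Equivalent to passing --executor-memory 4g to spark-submit.
    val conf = new SparkConf()
      .setAppName("parquet-write")         // placeholder app name
      .set("spark.executor.memory", "4g")  // placeholder size
    val sc = new SparkContext(conf)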

This schema looks pretty normal. And Parquet stores all keys of a map in a single column.

Cheng

On 9/4/15 4:00 PM, Kohki Nishio wrote:
The stack trace is this:
java.lang.OutOfMemoryError: Java heap space
        at parquet.bytes.CapacityByteArrayOutputStream.initSlabs(CapacityByteArrayOutputStream.java:65)
        at parquet.bytes.CapacityByteArrayOutputStream.<init>(CapacityByteArrayOutputStream.java:57)
        at parquet.column.values.rle.RunLengthBitPackingHybridEncoder.<init>(RunLengthBitPackingHybridEncoder.java:125)
        at parquet.column.values.rle.RunLengthBitPackingHybridValuesWriter.<init>(RunLengthBitPackingHybridValuesWriter.java:36)
        at parquet.column.ParquetProperties.getColumnDescriptorValuesWriter(ParquetProperties.java:61)
        at parquet.column.impl.ColumnWriterImpl.<init>(ColumnWriterImpl.java:72)
        at parquet.column.impl.ColumnWriteStoreImpl.newMemColumn(ColumnWriteStoreImpl.java:68)
        at parquet.column.impl.ColumnWriteStoreImpl.getColumnWriter(ColumnWriteStoreImpl.java:56)
        at parquet.io.MessageColumnIO$MessageColumnIORecordConsumer.<init>(MessageColumnIO.java:178)
        at parquet.io.MessageColumnIO.getRecordWriter(MessageColumnIO.java:369)
        at parquet.hadoop.InternalParquetRecordWriter.initStore(InternalParquetRecordWriter.java:108)
        at parquet.hadoop.InternalParquetRecordWriter.<init>(InternalParquetRecordWriter.java:94)
        at parquet.hadoop.ParquetRecordWriter.<init>(ParquetRecordWriter.java:64)
        at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:282)
        at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:252)

It looks like this issue:
https://issues.apache.org/jira/browse/PARQUET-222

Here's the schema I have. I don't think it's all that unusual; maybe the use of Map is causing this. Is it trying to register every key of a map as a separate column?

root
 |-- intId: integer (nullable = false)
 |-- uniqueId: string (nullable = true)
 |-- date1: string (nullable = true)
 |-- date2: string (nullable = true)
 |-- date3: string (nullable = true)
 |-- type: integer (nullable = false)
 |-- cat: string (nullable = true)
 |-- subCat: string (nullable = true)
 |-- unit: map (nullable = true)
 |    |-- key: string
 |    |-- value: double (valueContainsNull = false)
 |-- attr: map (nullable = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)
 |-- price: map (nullable = true)
 |    |-- key: string
 |    |-- value: double (valueContainsNull = false)
 |-- imp1: map (nullable = true)
 |    |-- key: string
 |    |-- value: double (valueContainsNull = false)
 |-- imp2: map (nullable = true)
 |    |-- key: string
 |    |-- value: double (valueContainsNull = false)
 |-- imp3: map (nullable = true)
 |    |-- key: string
 |    |-- value: double (valueContainsNull = false)
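
(For reference, a minimal sketch of how the map columns above would be declared in Spark SQL; the field names are copied from the printSchema output, everything else is illustrative.)

    import org.apache.spark.sql.types._

    // Sketch of the map columns shown above; imp1/imp2/imp3 follow the same
    // MapType(StringType, DoubleType) pattern as unit and price.
    val mapFields = Seq(
      StructField("unit",  MapType(StringType, DoubleType, valueContainsNull = false)),
      StructField("attr",  MapType(StringType, StringType, valueContainsNull = true)),
      StructField("price", MapType(StringType, DoubleType, valueContainsNull = false))
    )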



On Thu, Sep 3, 2015 at 11:27 PM, Cheng Lian <lian.cs....@gmail.com> wrote:

    Could you please provide the full stack trace of the OOM
    exception? Another common cause of Parquet OOM is super wide
    tables, say hundreds or thousands of columns. In that case, the
    number of rows is mostly irrelevant.
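
    If it is the per-column write buffers that blow up, one knob that sometimes
    helps is shrinking the Parquet row group / page sizes before writing; a rough
    sketch (sc / df stand for your SparkContext and DataFrame, sizes and path are
    placeholders):

        // Smaller row groups / pages mean smaller per-column write buffers.
        // parquet.block.size and parquet.page.size are standard parquet-mr settings.
        sc.hadoopConfiguration.setInt("parquet.block.size", 64 * 1024 * 1024) // default 128 MB
        sc.hadoopConfiguration.setInt("parquet.page.size", 512 * 1024)        // default 1 MB
        df.write.parquet("/path/to/output")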

    Cheng


    On 9/4/15 1:24 AM, Kohki Nishio wrote:
    Let's say I have data like this:

       ID  |   Some1   |  Some2    | Some3   | ....
    A00001 | kdsfajfsa | dsafsdafa | fdsfafa  |
    A00002 | dfsfafasd | 23jfdsjkj | 980dfs   |
    A00003 | 99989df   | jksdljas  | 48dsaas  |
       ..
    Z00..  | fdsafdsfa | fdsdafdas | 89sdaff  |

    My understanding is that if I give the column 'ID' to use for
    partitioning, it's going to generate a file per entry since it's
    unique, no? Using JSON, I create 1000 files, split as specified
    by the parallelize parameter. But JSON is large and a bit slow,
    so I'd like to try Parquet to see what happens.
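
    For illustration, roughly the two write paths I mean (df is the loaded
    DataFrame; the paths and the 1000 figure are placeholders):

        // partitionBy("ID") creates one output directory per distinct ID, hence a
        // file per entry; repartitioning to a fixed count bounds the file count.
        df.write.partitionBy("ID").parquet("/path/by-id-out")
        df.repartition(1000).write.parquet("/path/parquet-out")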

    On Wed, Sep 2, 2015 at 11:15 PM, Adrien Mogenet
    <adrien.moge...@contentsquare.com> wrote:

        Any code / Parquet schema to provide? I'm not sure I
        understand which step fails right there...

        On 3 September 2015 at 04:12, Raghavendra Pandey
        <raghavendra.pan...@gmail.com> wrote:

            Did you specify a partitioning column while saving the data?

            On Sep 3, 2015 5:41 AM, "Kohki Nishio" <tarop...@gmail.com> wrote:

                Hello experts,

                I have a huge JSON file (> 40G) and I'm trying to use
                Parquet as the file format. Each entry has a unique
                identifier, but other than that it doesn't have a
                well-balanced value column to partition on. Right now
                it just throws OOM and I can't figure out what to do
                about it.

                It would be ideal if I could provide a partitioner
                based on the unique identifier value, e.g. by computing
                its hash value or something. One option would be to
                produce a hash value and add it as a separate column,
                but that doesn't sound right to me. Are there any
                other ways I can try?
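
                A minimal sketch of that hash-column idea (the column name
                uniqueId, the bucket count and the output path are all
                placeholders):

                    import org.apache.spark.sql.functions.udf

                    // Bucket the unique id into a bounded range so partitionBy
                    // does not create one directory per record.
                    val numBuckets = 200
                    val bucketOf = udf((id: String) =>
                      ((id.hashCode % numBuckets) + numBuckets) % numBuckets)
                    df.withColumn("bucket", bucketOf(df("uniqueId")))
                      .write.partitionBy("bucket")
                      .parquet("/path/to/output")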

                Regards,
--
Kohki Nishio




--
        *Adrien Mogenet*
        Head of Backend/Infrastructure
        adrien.moge...@contentsquare.com
        (+33)6.59.16.64.22
        http://www.contentsquare.com
        50, avenue Montaigne - 75008 Paris




--
Kohki Nishio




--
Kohki Nishio
