Re: Which perform better JSON or convert JSON to parquet format ?

2018-06-15 Thread Dave Challis
One gotcha with JSON vs parquet is that there's a longstanding bug that
causes errors when trying to read from Parquet files containing 0 rows.

For cases where we're converting from datasets that might be empty, we use
JSON, and for everything else, Parquet.


Re: Which perform better JSON or convert JSON to parquet format ?

2018-06-14 Thread Uwe L. Korn
Hello,

You can also use the Python wrapper pyarrow to create nested/JSON-like
structures in Python, for example using
`pyarrow.array([[1, 2], [1], None, [12, 23, 23]])`.

Cheers
Uwe

On Tue, Jun 12, 2018, at 4:45 PM, Lee, David wrote:
> Python supports tabular structures using pyarrow.
> 
> https://arrow.apache.org/docs/python/generated/pyarrow.schema.html
> 
> For nested structures like JSON you have to use C++ (parquet-cpp)
> 
> https://github.com/apache/parquet-cpp
> 
> We need more APIs developed to create nested JSON.


RE: Which perform better JSON or convert JSON to parquet format ?

2018-06-12 Thread Lee, David
Python supports tabular structures using pyarrow.

https://arrow.apache.org/docs/python/generated/pyarrow.schema.html

For nested structures like JSON you have to use C++ (parquet-cpp)

https://github.com/apache/parquet-cpp

We need more APIs developed to create nested JSON.

-----Original Message-----
From: Divya Gehlot [mailto:divya.htco...@gmail.com] 
Sent: Tuesday, June 12, 2018 5:25 AM
To: user@drill.apache.org
Subject: Re: Which perform better JSON or convert JSON to parquet format ?

Hi David,
How do I create the schema first using the Parquet library?
Can you please give an example?

Thanks,
Divya



Re: Which perform better JSON or convert JSON to parquet format ?

2018-06-12 Thread Divya Gehlot
Hi David,
How do I create the schema first using the Parquet library?
Can you please give an example?

Thanks,
Divya

On Tue, 12 Jun 2018 at 00:03, Lee, David  wrote:

> Parquet is faster, especially if you are only looking for a subset of JSON
> objects. Every JSON key / array is treated as a column.
>
> With that said, creating Parquet from JSON is not bulletproof if you have
> really complex JSON which may have NULL values or many optional keys. (Drill
> can't figure out what data type a NULL JSON value is, and it has trouble
> merging optional keys after sampling the first 20,000? records.)
>
> If you are creating Parquet, you should use the Parquet libraries to
> define a consistent schema first. I've pretty much given up trying to
> create Parquet from JSON, which always ends in index-out-of-bounds
> (server-crashing) errors when trying to query the Parquet files.


RE: Which perform better JSON or convert JSON to parquet format ?

2018-06-11 Thread Lee, David
Parquet is faster, especially if you are only looking for a subset of JSON
objects. Every JSON key / array is treated as a column.

With that said, creating Parquet from JSON is not bulletproof if you have
really complex JSON which may have NULL values or many optional keys. (Drill
can't figure out what data type a NULL JSON value is, and it has trouble
merging optional keys after sampling the first 20,000? records.)

If you are creating Parquet, you should use the Parquet libraries to define
a consistent schema first. I've pretty much given up trying to create Parquet
from JSON, which always ends in index-out-of-bounds (server-crashing) errors
when trying to query the Parquet files.
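
One way to read "define a consistent schema first" is to normalize every JSON record against a declared schema before any conversion, so that optional keys are always present and NULLs belong to a field with a known type. A minimal pure-Python sketch follows; the `SCHEMA` dictionary, the field names, and the `normalize` helper are all hypothetical and are not part of Drill or of any Parquet library.

```python
import json

# Hypothetical declared schema: field name -> Python type expected for it.
SCHEMA = {"id": int, "name": str, "score": float}


def normalize(record: dict, schema: dict = SCHEMA) -> dict:
    """Return a record with exactly the schema's keys.

    Missing optional keys and JSON nulls become None, present values are
    coerced to the declared type, and keys not in the schema are dropped,
    so every record ends up with the same shape.
    """
    out = {}
    for key, typ in schema.items():
        value = record.get(key)
        out[key] = None if value is None else typ(value)
    return out


# Three JSON lines with a missing key, an explicit null, a string-typed id,
# and an unexpected extra key.
lines = [
    '{"id": 1, "name": "a", "score": 2}',
    '{"id": 2, "score": null}',
    '{"id": "3", "name": "c", "extra": true}',
]
rows = [normalize(json.loads(line)) for line in lines]
```

After this step every row has the keys `id`, `name`, and `score`, which is the precondition for writing the rows under a single fixed schema.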

-----Original Message-----
From: Ted Dunning [mailto:ted.dunn...@gmail.com] 
Sent: Monday, June 11, 2018 4:47 AM
To: user 
Subject: Re: Which perform better JSON or convert JSON to parquet format ?

Yes. Drill is good at JSON.

But Parquet will be faster during a scan.

Faster may be better. Or other things may be more important.

You have to decide what is important to you. The great virtue of drill is that 
you have the choice.



On Mon, Jun 11, 2018 at 11:06 AM Divya Gehlot 
wrote:

> Thanks to all for your opinions!
> As Drill has been popularised as a complex JSON reader compared to
> other tools in this space, I was wondering whether Drill works better
> with JSON than with Parquet.
>


This message may contain information that is confidential or privileged. If you 
are not the intended recipient, please advise the sender immediately and delete 
this message. See 
http://www.blackrock.com/corporate/en-us/compliance/email-disclaimers for 
further information.  Please refer to 
http://www.blackrock.com/corporate/en-us/compliance/privacy-policy for more 
information about BlackRock’s Privacy Policy.

For a list of BlackRock's office addresses worldwide, see 
http://www.blackrock.com/corporate/en-us/about-us/contacts-locations.

© 2018 BlackRock, Inc. All rights reserved.


Re: Which perform better JSON or convert JSON to parquet format ?

2018-06-11 Thread Ted Dunning
Yes. Drill is good at JSON.

But Parquet will be faster during a scan.

Faster may be better. Or other things may be more important.

You have to decide what is important to you. The great virtue of Drill is
that you have the choice.



On Mon, Jun 11, 2018 at 11:06 AM Divya Gehlot 
wrote:

> Thanks to all for your opinions!
> As Drill has been popularised as a complex JSON reader compared to other
> tools in this space, I was wondering whether Drill works better with JSON
> than with Parquet.
>


Re: Which perform better JSON or convert JSON to parquet format ?

2018-06-11 Thread Divya Gehlot
Thanks to all for your opinions!
As Drill has been popularised as a complex JSON reader compared to other
tools in this space, I was wondering whether Drill works better with JSON
than with Parquet.


Re: Which perform better JSON or convert JSON to parquet format ?

2018-06-11 Thread Ted Dunning
I am going to play the contrarian here.

Parquet is not *always* faster than JSON.

The (almost unique) case where it is better to leave data as JSON (or
whatever) is when the average number of times that a file is read is equal
to or less than roughly 1.

The point is that to read the files n times in Parquet format, you
have to read the JSON once, write the Parquet and then read the Parquet n
times. The cost of reading the JSON n times is simply n times the cost of
reading the JSON (neglecting caches and such). As such, if n <= 1+epsilon,
JSON wins.
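
The arithmetic can be made concrete with a small cost model. The function names and the sample costs below are invented for illustration; only the structure of the comparison comes from the argument above.

```python
def cost_json(n: float, read_json: float) -> float:
    """Total cost of reading the JSON n times."""
    return n * read_json


def cost_parquet(n: float, read_json: float, write_parquet: float,
                 read_parquet: float) -> float:
    """Read the JSON once, write Parquet once, then read Parquet n times."""
    return read_json + write_parquet + n * read_parquet


def break_even_reads(read_json: float, write_parquet: float,
                     read_parquet: float) -> float:
    """Number of reads at which both strategies cost the same.

    Solves n * read_json == read_json + write_parquet + n * read_parquet.
    Requires read_parquet < read_json, otherwise converting never pays off.
    """
    if read_parquet >= read_json:
        raise ValueError("Parquet reads must be cheaper than JSON reads")
    return (read_json + write_parquet) / (read_json - read_parquet)


# Made-up costs in arbitrary units: JSON read 10, Parquet write 2,
# Parquet read 1.
n_star = break_even_reads(10.0, 2.0, 1.0)
```

With these made-up numbers the break-even point is 12/9, about 1.33 reads, which plays the role of the "1+epsilon" above: JSON is cheaper at n = 1 and Parquet is cheaper at n = 2.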

This isn't as strange a case as it might seem. For security logs, it is
common that the files are never read until you need them. That means that n
is nearly zero on average and n << 1 in any case. For incoming data, it is
common that there is an immediate transformation into an alternative form.
That might be pruning data or elaborating or aggregating. The point is that
the original data need not ever be re-written into Parquet format since it
is only ever read once. Transforming the format would waste time and space.

The other case of importance is where the read time is near zero for JSON.
Transforming to any other format will take near zero time and reading from
any other format will also be near zero. The win for transforming will be
near zero as well.

So, having said all that, I agree that reading from Parquet will almost
certainly be faster, and combining a bunch of small JSON files together into
a larger Parquet file will be a real boon for frequently read data. It's just
that faster isn't always better if there is a fixed cost.



On Mon, Jun 11, 2018 at 6:42 AM Padma Penumarthy 
wrote:

> Yes, Parquet is always better, for multiple reasons. With JSON, we have to
> read the whole file from a single reader thread and have to parse it to read
> individual columns. Parquet compresses and encodes data on disk, so we read
> much less data from disk. Drill can read individual columns within each row
> group in parallel. Also, we can leverage features like filter pushdown,
> partition pruning, and the metadata cache for better query performance.
>
> Thanks
> Padma


Re: Which perform better JSON or convert JSON to parquet format ?

2018-06-10 Thread Padma Penumarthy
Yes, Parquet is always better, for multiple reasons. With JSON, we have to
read the whole file from a single reader thread and have to parse it to read
individual columns. Parquet compresses and encodes data on disk, so we read
much less data from disk. Drill can read individual columns within each row
group in parallel. Also, we can leverage features like filter pushdown,
partition pruning, and the metadata cache for better query performance.
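
A toy illustration of the column-access point, in plain Python rather than Drill or Parquet code. The data and function names are invented; the sketch only shows why a row-oriented text format forces every record to be parsed, while a column-oriented layout lets a reader touch just the column it needs.

```python
import json

records = [
    {"id": 1, "city": "Oslo", "temp": 4.5},
    {"id": 2, "city": "Lima", "temp": 19.0},
    {"id": 3, "city": "Pune", "temp": 31.2},
]

# Row-oriented: one JSON document per line, as in a JSON-lines file.
json_lines = [json.dumps(r) for r in records]

# Column-oriented: one list per field, loosely mimicking a columnar file.
columns = {key: [r[key] for r in records] for key in records[0]}


def temps_from_json(lines):
    """Every line must be parsed in full to extract a single field."""
    return [json.loads(line)["temp"] for line in lines]


def temps_from_columns(cols):
    """Only the requested column is accessed; other columns are untouched."""
    return cols["temp"]
```

Both functions return the same values; the difference is how much data each one has to decode to get there.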

Thanks
Padma

> On Jun 10, 2018, at 8:22 PM, Abhishek Girish  wrote:
> 
> I would suggest converting the JSON files to parquet for better
> performance. JSON supports a more free form data model, so that's a
> trade-off you need to consider, in my opinion.



Re: Which perform better JSON or convert JSON to parquet format ?

2018-06-10 Thread Abhishek Girish
I would suggest converting the JSON files to parquet for better
performance. JSON supports a more free form data model, so that's a
trade-off you need to consider, in my opinion.
On Sun, Jun 10, 2018 at 8:08 PM Divya Gehlot 
wrote:

> Hi,
> I am looking for advice regarding the performance of the options below:
> 1. Keep the JSON as is
> 2. Convert the JSON files to Parquet files
>
> My JSON data is not in a fixed format, and file sizes vary from 10 KB
> to 1 MB.
>
> I'd appreciate the community's advice on the above!
>
>
> Thanks,
> Divya
>