Re: Which perform better JSON or convert JSON to parquet format ?
One gotcha with JSON vs. Parquet is that there's a longstanding bug that causes errors when trying to read Parquet files containing 0 rows. For cases where we're converting from datasets that might be empty, we use JSON, and for everything else, Parquet.
Re: Which perform better JSON or convert JSON to parquet format ?
Hello,

You can also use the Python wrapper pyarrow to create nested / JSON-like structures in Python, for example:

`pyarrow.array([[1, 2], [1], None, [12, 23, 23]])`

Cheers,
Uwe

On Tue, Jun 12, 2018, at 4:45 PM, Lee, David wrote:
> Python supports tabular structures using pyarrow. For nested structures
> like JSON you have to use C++ (parquet-cpp).
RE: Which perform better JSON or convert JSON to parquet format ?
Python supports tabular structures using pyarrow:

https://arrow.apache.org/docs/python/generated/pyarrow.schema.html

For nested structures like JSON you have to use C++ (parquet-cpp):

https://github.com/apache/parquet-cpp

We need more APIs developed to create nested JSON.

On Tue, Jun 12, 2018, at 5:25 AM, Divya Gehlot wrote:
> Hi David,
> How do you create the schema first using the parquet library? Can you
> please give an example?

This message may contain information that is confidential or privileged. If you are not the intended recipient, please advise the sender immediately and delete this message. See http://www.blackrock.com/corporate/en-us/compliance/email-disclaimers for further information. Please refer to http://www.blackrock.com/corporate/en-us/compliance/privacy-policy for more information about BlackRock’s Privacy Policy. For a list of BlackRock's office addresses worldwide, see http://www.blackrock.com/corporate/en-us/about-us/contacts-locations. © 2018 BlackRock, Inc. All rights reserved.
Re: Which perform better JSON or convert JSON to parquet format ?
Hi David,

How do you create the schema first using the parquet library? Can you please give an example?

Thanks,
Divya

On Tue, 12 Jun 2018 at 00:03, Lee, David wrote:
> Parquet is faster especially if you are only looking for a subset of
> json objects. Every JSON key / array is treated as a column.
RE: Which perform better JSON or convert JSON to parquet format ?
Parquet is faster, especially if you are only looking for a subset of JSON objects. Every JSON key / array is treated as a column.

With that said, creating Parquet from JSON is not bulletproof if you have really complex JSON which may have NULL values or many optional keys (Drill can't figure out what data type a NULL JSON value is, and has trouble merging optional keys after sampling the first 20,000? records).

If you are creating Parquet you should be using the parquet libraries to define a consistent schema first. I've pretty much given up trying to create Parquet from JSON, which always ends in index-out-of-bound (server crashing) errors when trying to query the Parquet.

On Mon, Jun 11, 2018, at 4:47 AM, Ted Dunning wrote:
> Yes. Drill is good at JSON. But Parquet will be faster during a scan.
Re: Which perform better JSON or convert JSON to parquet format ?
Yes. Drill is good at JSON.

But Parquet will be faster during a scan.

Faster may be better. Or other things may be more important. You have to decide what is important to you. The great virtue of Drill is that you have the choice.

On Mon, Jun 11, 2018 at 11:06 AM Divya Gehlot wrote:
> Thanks to all for your opinions!
Re: Which perform better JSON or convert JSON to parquet format ?
Thanks to all for your opinions! As Drill has been popularized as a complex-JSON reader compared to other tools in the space, I was wondering whether Drill works better with JSON rather than Parquet.
Re: Which perform better JSON or convert JSON to parquet format ?
I am going to play the contrarian here. Parquet is not *always* faster than JSON.

The (almost unique) case where it is better to leave data as JSON (or whatever) is when the average number of times that a file is read is equal to or less than roughly 1. The point is that to read the files n times in Parquet format, you have to read the JSON once, write the Parquet, and then read the Parquet n times. The cost of reading the JSON n times is simply n times the cost of reading the JSON (neglecting caches and such). As such, if n <= 1 + epsilon, JSON wins.

This isn't as strange a case as it might seem. For security logs, it is common that the files are never read until you need them. That means that n is nearly zero on average, and n << 1 in any case. For incoming data, it is common that there is an immediate transformation into an alternative form. That might be pruning data, or elaborating, or aggregating. The point is that the original data need never be rewritten into Parquet format, since it is only ever read once; transforming the format would waste time and space.

The other case of importance is where the read time is near zero for JSON. Transforming to any other format will take near-zero time, and reading from any other format will also take near-zero time, so the win for transforming will be near zero as well.

Having said all that, I agree that reading from Parquet will almost certainly be faster, and combining a bunch of small JSON files together into a larger Parquet file will be a real boon for frequently read data. It's just that faster isn't always better if there is a fixed cost.

On Mon, Jun 11, 2018 at 6:42 AM Padma Penumarthy wrote:
> Yes, parquet is always better for multiple reasons.
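Ted's break-even argument can be written down directly. A toy cost model, with all costs as illustrative placeholders in arbitrary time units:

```python
def total_cost_json(n_reads, json_read_cost):
    """Keep JSON: pay the full JSON read cost on every read."""
    return n_reads * json_read_cost


def total_cost_parquet(n_reads, json_read_cost, parquet_write_cost, parquet_read_cost):
    """Convert: read the JSON once, write Parquet once, then do cheap Parquet reads."""
    return json_read_cost + parquet_write_cost + n_reads * parquet_read_cost


# Security-log pattern: files are almost never read back (n ~ 0), so
# conversion is pure fixed cost.
assert total_cost_json(0, 10) < total_cost_parquet(0, 10, 5, 1)

# Frequently read data: the one-time conversion cost amortizes quickly.
assert total_cost_json(100, 10) > total_cost_parquet(100, 10, 5, 1)
```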
Re: Which perform better JSON or convert JSON to parquet format ?
Yes, Parquet is always better, for multiple reasons. With JSON, we have to read the whole file from a single reader thread and have to parse it to read individual columns. Parquet compresses and encodes data on disk, so we read much less data from disk. Drill can read individual columns within each rowgroup in parallel. Also, we can leverage features like filter pushdown, partition pruning, and the metadata cache for better query performance.

Thanks,
Padma

On Jun 10, 2018, at 8:22 PM, Abhishek Girish wrote:
> I would suggest converting the JSON files to parquet for better performance.
Re: Which perform better JSON or convert JSON to parquet format ?
I would suggest converting the JSON files to parquet for better performance. JSON supports a more free-form data model, so that's a trade-off you need to consider, in my opinion.

On Sun, Jun 10, 2018 at 8:08 PM Divya Gehlot wrote:
> Hi,
> I am looking for advice regarding the performance of the options below:
> 1. Keep the JSON as is
> 2. Convert the JSON files to parquet files
>
> My JSON data is not in a fixed format, and file size varies from 10 KB
> to 1 MB.
>
> I'd appreciate the community's advice on the above!
>
> Thanks,
> Divya