Good job. Any new feedback is welcome.
> On Jul 11, 2022, at 22:53, Prabhakar Bhosale <[email protected]> wrote:
>
> Hi Luoc,
> Thanks for your reply. I have tried the same data in Parquet format, and
> the results are really surprising. The query on that huge data, which took
> 48 minutes on JSON files, returned the results in 11 seconds in Parquet
> format. I am trying a few more scenarios. Thanks.
>
> regards
> Prabhakar
>
>> On Mon, Jul 11, 2022 at 3:47 PM luoc <[email protected]> wrote:
>>
>> Hello Prabhakar,
>>
>> For the first question, I recommend trying a performance tool such as
>> "nmon" to check the machine's CPU usage.
>>
>> You can work out how many nodes you need by scaling the number of
>> drillbits to 2 or 3 and watching how the query speed changes.
>>
>> You can try reducing the file size to 512 MB. We can't be sure it will
>> have a big impact, but it can reduce the cost of the JVM's GC.
>>
>> Parquet is good for analytical queries but not for full-row retrieval,
>> because Parquet is a columnar storage format.
>>
>> If your typical query is `select * from table1`, then Parquet is not a
>> good choice.
>>
>> If your typical query is `select max(f1), min(f2) from table1`, then
>> Parquet is a great solution.
>>
>> Next, you will have to weigh the cost of a CTAS conversion against the
>> cost of querying the JSON directly.
>>
>>
>> - luoc
>>
>>> On Jul 10, 2022, at 13:22, Prabhakar Bhosale <[email protected]> wrote:
>>>
>>> Dear Luoc,
>>> Thanks for the insights. This is just a week's data; production will
>>> have 15 times more. In line with that, I have the following questions:
>>>
>>> 1. Is there any template or calculator which will help me size the
>>> production server (CPU, memory and IO) based on the size of the data?
>>> 2. For such a huge volume of data, what are the best practices to
>>> follow for storing and retrieving it?
>>> 3. What should be the optimal size of a file? Currently the
>>> uncompressed size of each file is 2 GB. How do we balance the number
>>> of files against the file size?
>>> 4. Do you think the Parquet format will perform better than JSON?
>>> 5. Is there any way in Drill to detect a "File Create" event and then
>>> convert JSON to Parquet using CTAS?
>>>
>>> Thanks and regards,
>>> Prabhakar
>>>
>>> On Sat, Jul 9, 2022 at 8:41 PM luoc <[email protected]> wrote:
>>>
>>> Hello Prabhakar,
>>>
>>> I will walk through my check process and hope it gives you some useful
>>> advice:
>>>
>>> 1. I imported the file in your attachment using the `View` button on
>>> the right side of the `Profile` page.
>>>
>>> 2. The fragment profile records that the major fragment (02-xx-xx)
>>> took about 45+ minutes.
>>>
>>> 3. The 02-xx-xx phase ran with a parallelism of 3, and the json-scan
>>> (JSON Reader) took most of the time.
>>>
>>> 4. Each minor fragment read nearly 0.12 billion records. Killer!
>>>
>>> As a result, three JSON readers read a total of 338,398,798 records.
>>>
>>> In addition, your JSON files are GZ-compressed, 297 files in total,
>>> which means Drill needs a lot of CPU to decompress them.
>>>
>>> Put simply, your hardware resources are the bottleneck for querying
>>> records at this scale; I recommend scaling out the nodes and running
>>> a distributed cluster.
>>>
>>> - luoc
>>>
>>>> On Jul 9, 2022, at 01:01, Prabhakar Bhosale <[email protected]> wrote:
>>>>
>>>> the
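
A minimal Drill SQL sketch of the CTAS conversion discussed above. The
`dfs.tmp` workspace is Drill's writable temporary workspace; the table name
`events_parquet` and the source path `/data/json/events` are hypothetical
stand-ins for the gzipped JSON directory:

    -- Parquet is Drill's default CTAS output format; set it explicitly.
    ALTER SESSION SET `store.format` = 'parquet';

    -- Read the (gzipped) JSON directory and write it back as a Parquet
    -- table. Both paths are illustrative examples only.
    CREATE TABLE dfs.tmp.`events_parquet` AS
    SELECT *
    FROM dfs.`/data/json/events`;

After the conversion, the same aggregate query can be pointed at
`dfs.tmp.events_parquet`, and only the referenced columns are scanned.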
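On the 512 MB file-size suggestion: when Drill writes Parquet, the row-group
size is controlled by the `store.parquet.block-size` option, whose usual
default is 512 MB (536870912 bytes). A sketch, assuming a reasonably recent
Drill version:

    -- Target row-group size, in bytes, for Parquet files written by CTAS
    -- (512 MB shown here).
    ALTER SYSTEM SET `store.parquet.block-size` = 536870912;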
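On the parallelism seen in the profile (3 minor fragments per scan): the
per-node fragment width is bounded by the `planner.width.max_per_node`
option. The value 8 below is only an illustrative example, not a
recommendation for this workload:

    -- Inspect the current planner width settings.
    SELECT * FROM sys.options WHERE name LIKE 'planner.width%';

    -- Raise or cap the per-node minor-fragment parallelism (example value).
    ALTER SYSTEM SET `planner.width.max_per_node` = 8;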
