Hi Luoc,
Thanks for your reply. I have tried the same data in Parquet format and the
results are really surprising. The query on that huge dataset, which took 48
minutes on the JSON files, returned results in 11 seconds on Parquet.
I am trying a few more scenarios. Thanks.
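For reference, this is roughly the kind of conversion and query I mean (a
minimal sketch only; the workspace, directory and column names below are
placeholders, not my actual paths):

  -- write CTAS output as Parquet
  ALTER SESSION SET `store.format` = 'parquet';

  -- one-time conversion of the gzipped JSON directory into a Parquet table
  CREATE TABLE dfs.tmp.`events_parquet` AS
  SELECT * FROM dfs.`/data/json/week1`;

  -- the same aggregation, now against the Parquet copy
  SELECT max(f1), min(f2) FROM dfs.tmp.`events_parquet`;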

regards
Prabhakar

On Mon, Jul 11, 2022 at 3:47 PM luoc <[email protected]> wrote:

> Hello Prabhakar,
>
> For the first question, I recommend trying a performance tool such as
> "nmon" to check the machine's CPU usage.
>
> I think you can work out how many nodes you need once you scale the
> number of drillbits to 2 or 3, because you can observe how the query speed changes.
>
> You can try reducing the file size to 512 MB. We can't be sure it will
> have a big impact, but it should reduce the cost of the JVM's GC.
>
> Parquet is good for data analysis but not for whole-row retrieval, because
> Parquet is a columnar storage format.
>
> If you like using the "select * from table1", then the Parquet is not a
> good idea.
>
> If you mostly run queries like `select max(f1), min(f2) from table1`, then
> Parquet is a great solution.
>
> Next, you may have to weigh the cost of a CTAS conversion against the cost
> of querying the JSON directly.
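>
> For example, a minimal sketch of the two options (the paths and table
> names below are placeholders only):
>
>   -- option 1: query the gzipped JSON directly, paying the scan and
>   -- decompression cost on every query
>   SELECT max(f1), min(f2) FROM dfs.`/data/json/week1`;
>
>   -- option 2: pay the CTAS cost once, then query the Parquet copy
>   ALTER SESSION SET `store.format` = 'parquet';
>   CREATE TABLE dfs.tmp.`week1_parquet` AS
>   SELECT * FROM dfs.`/data/json/week1`;
>
>   SELECT max(f1), min(f2) FROM dfs.tmp.`week1_parquet`;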
>
>
> - luoc
>
> > On Jul 10, 2022, at 1:22 PM, Prabhakar Bhosale <[email protected]> wrote:
> >
> > Dear Luoc,
> > Thanks for the insights. This is just a week's data. Production will
> > have 15 times more data. So in line with that, I have the following queries:
> >
> > 1. Is there any template or calculator that will help me size the
> > production server (CPU, memory and IO) based on the size of the data?
> > 2. For such a huge volume of data, what are the best practices for
> > storing and retrieving it?
> > 3. What should be the optimal file size? Currently the uncompressed
> > size of a file is 2 GB. How do we balance the number of files against
> > the file size?
> > 4. Do you think the Parquet format will perform better than JSON?
> > 5. Is there any way in Drill to detect a "File Create" event and then
> > convert JSON to Parquet using CTAS?
> >
> > Thanks And Regards
> > Prabhakar
> >
> > On Sat, Jul 9, 2022 at 8:41 PM luoc <[email protected]> wrote:
> >
> > Hello Prabhakar,
> >
> > I will walk through my checking process and hope it gives you some advice:
> >
> > 1. I imported the file in your attachment using the `View` button on the
> > right side of the `Profile` page.
> >
> > 2. The fragment profile records that the major fragment (02-xx-xx) cost
> > about 45+ minutes.
> >
> > 3. The 02-xx-xx phase used 3 parallel minor fragments, and the json-scan
> > (JSON Reader) cost most of the time.
> >
> > 4. Each minor fragment reads nearly 0.12 billion records. Killer!
> >
> > As a result, three JSON readers read a total of 338,398,798 records.
> >
> > Also, your JSON files are in GZ compression format, 297 files in total,
> > meaning Drill needs a lot of CPU to decompress them.
> >
> > Simply put, your hardware resources are the bottleneck and cannot query
> > such large-scale records any faster, so I recommend scaling out to more
> > nodes and using a distributed cluster.
> >
> > - luoc
> >
> >
> >
> >
> >
> >> On Jul 9, 2022, at 1:01 AM, Prabhakar Bhosale <[email protected]> wrote:
> >>
> >> the
> >
>
>
