Hi Luoc,

Thanks for your reply. I have tried the same data in Parquet format, and the results are really surprising: the query on that huge data set, which took 48 minutes on the JSON files, returned the results in 11 seconds in Parquet format. I am trying a few more scenarios.

Thanks and regards,
Prabhakar
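For reference, a minimal sketch of this kind of JSON-to-Parquet conversion with Drill's CTAS; the source directory and target table name are hypothetical, and it assumes the default writable `dfs.tmp` workspace is available:

  -- Ensure CTAS writes Parquet (this is also Drill's default output format).
  ALTER SESSION SET `store.format` = 'parquet';

  -- Hypothetical names: /data/events/json_week1 holds the gzipped JSON files,
  -- and events_parquet is the converted table to query afterwards.
  CREATE TABLE dfs.tmp.`events_parquet` AS
  SELECT *
  FROM dfs.`/data/events/json_week1`;

After such a conversion, queries would run against `dfs.tmp.events_parquet` instead of the JSON directory, which is consistent with the speed-up reported above.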
On Mon, Jul 11, 2022 at 3:47 PM luoc <[email protected]> wrote:

> Hello Prabhakar,
>
> For the first question, I recommend you try a performance tool like
> "nmon" to check the machine's CPU cost.
>
> I think you can work out how many nodes you need once you scale the
> number of Drillbits to 2 or 3, because you can observe how the speed
> changes.
>
> You can try to reduce the file size to 512 MB. We can't be sure it will
> have a big impact, but it can reduce the cost of the JVM's GC.
>
> Parquet is good for data analysis but not for row-oriented queries,
> because Parquet is a columnar storage format.
>
> If you like using `select * from table1`, then Parquet is not a good
> idea.
>
> If you like using a query like `select max(f1), min(f2) from table1`,
> then Parquet is a great solution.
>
> Next, you may have to weigh the cost of a CTAS conversion against the
> cost of querying the data directly.
>
> - luoc
>
> On Jul 10, 2022, at 1:22 PM, Prabhakar Bhosale <[email protected]> wrote:
> >
> > Dear Luoc,
> > Thanks for the insights. This is just a week's data. Production will
> > have 15 times more data. In line with that, I have the following
> > queries:
> >
> > 1. Is there any template or calculator which will help me size the
> > production server (CPU, memory and IO) based on the size of the data?
> > 2. For such a huge volume of data, what are the best practices for
> > storing and retrieving it?
> > 3. What should be the optimal file size? Currently the uncompressed
> > size of each file is 2 GB, so how do we balance the number of files
> > against the file size?
> > 4. Do you think the Parquet format will perform better than JSON?
> > 5. Is there any way in Drill to detect the "File Create" event and
> > then convert JSON to Parquet using CTAS?
> >
> > Thanks and regards,
> > Prabhakar
> >
> > On Sat, Jul 9, 2022 at 8:41 PM luoc <[email protected]> wrote:
> >
> > Hello Prabhakar,
> >
> > I will walk through my check process and hope it gives you some
> > advice:
> >
> > 1. I imported the file in your attachment using the `View` button on
> > the right side of the `Profile` page.
> >
> > 2. The fragment profile records that the major fragment (02-xx-xx)
> > cost about 45+ minutes.
> >
> > 3. The 02-xx-xx phase used 3 parallel minor fragments, and the
> > json-scan (JSON Reader) cost most of the time.
> >
> > 4. Each minor fragment reads nearly 0.12 billion records. Killer!
> >
> > As a result, the three JSON readers read a total of 338,398,798
> > records.
> >
> > Also, your JSON files are in GZ compression format, 297 files in
> > total, meaning Drill needs a lot of CPU to decompress them.
> >
> > Put simply, your hardware resources are the bottleneck and cannot
> > query these large-scale records any faster, so I recommend scaling
> > out the nodes and using a distributed cluster.
> >
> > - luoc
> >
> >> On Jul 9, 2022, at 1:01 AM, Prabhakar Bhosale <[email protected]> wrote:
> >>
> >> the
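Relating to the optimal-file-size question and the aggregate-style queries described in the quoted thread, a hedged sketch of how the CTAS output size and a Parquet-friendly query might look; the option name follows Drill's documentation, the value is only illustrative, and `events_parquet`, `f1`, and `f2` are hypothetical names carried over from the sketch above:

  -- Target roughly 512 MB Parquet row groups for tables written by CTAS
  -- (illustrative value; tune against your own query profiles).
  ALTER SESSION SET `store.parquet.block-size` = 536870912;

  -- A column-pruned aggregation of the kind that benefits from columnar
  -- storage, run against the hypothetical converted table.
  SELECT MAX(f1) AS max_f1, MIN(f2) AS min_f2
  FROM dfs.tmp.`events_parquet`;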
