Dear Luoc,

Thanks for the insights. This is only a week's worth of data; production will have 15 times more. So, in line with that, I have the following queries:
1. Is there a template or calculator that can help me size the production server (CPU, memory, and I/O) based on the data volume?
2. For data of this size, what are the best practices for storing and retrieving it?
3. What is the optimal file size? Currently the uncompressed size of a file is 2 GB, so how do we balance the number of files against the file size?
4. Do you think the Parquet format will perform better than JSON?
5. Is there any way in Drill to detect a "file create" event and then convert the JSON to Parquet using CTAS? (A rough CTAS sketch follows the quoted message below.)

Thanks and regards,
Prabhakar

On Sat, Jul 9, 2022 at 8:41 PM luoc <[email protected]> wrote:
>
> Hello Prabhakar,
>
> I will present my check process and hope it gives you some useful advice:
>
> 1. I imported the file from your attachment using the `View` button on the
> right side of the `Profile` page.
>
> 2. The fragment profile shows that the major fragment (02-xx-xx) took
> about 45+ minutes.
>
> 3. The 02-xx-xx phase used 3 parallel minor fragments, and the json-scan
> (JSON reader) accounted for most of that time.
>
> 4. Each minor fragment reads nearly 0.12 billion records. Killer!
>
> As a result, the three JSON readers read a total of 338,398,798 records.
>
> Also, your JSON files are GZ-compressed, 297 files in total, which means
> Drill needs a lot of CPU just to decompress them.
>
> Put simply, your hardware resources are the bottleneck for querying
> records at this scale. I recommend scaling out and running a distributed
> cluster.
>
> - luoc
>
> > On Jul 9, 2022, at 1:01 AM, Prabhakar Bhosale <[email protected]> wrote:
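P.S. Regarding point 5, below is a minimal sketch of the conversion step I have in mind. It only covers the CTAS part; the workspace names and input path are placeholders, not our real setup, and the detection of new files would have to happen outside Drill.

-- Write subsequent CTAS output as Parquet (Drill session option)
ALTER SESSION SET `store.format` = 'parquet';

-- Target Parquet row-group size (~512 MB); one knob for balancing
-- file size against the number of files
ALTER SESSION SET `store.parquet.block-size` = 536870912;

-- Convert a batch of JSON (gzipped or plain) into a Parquet table;
-- `dfs.tmp.events_parquet` and the input path are placeholders
CREATE TABLE dfs.tmp.`events_parquet` AS
SELECT * FROM dfs.`/data/events/2022-07-08`;

As far as I know, Drill itself does not watch directories for new files, so the "file create" trigger would need to come from something external (for example a cron job or filesystem watcher) that then submits a CTAS like the above through JDBC or the REST API.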
