Good job. Any new feedback is welcome.
> On Jul 11, 2022, at 22:53, Prabhakar Bhosale <[email protected]> wrote:
>
> Hi Luoc,
> Thanks for your reply. I have tried the same data in Parquet format, and
> the results are really surprising. The query on that huge data, which took
> 48 minutes on JSON files, returned the results in 11 seconds in Parquet
> format. I am trying a few more scenarios. Thanks.
>
> regards
> Prabhakar
>
>> On Mon, Jul 11, 2022 at 3:47 PM luoc <[email protected]> wrote:
>>
>> Hello Prabhakar,
>>
>> For the first question, I recommend trying a performance tool such as
>> "nmon" to check the machine's CPU usage.
>>
>> You can work out how many nodes you need by scaling the number of
>> drillbits to 2 or 3 and watching how the query speed changes.
>>
>> You can try reducing the file size to 512 MB. We can't be sure it will
>> have a big impact, but it can reduce the cost of the JVM's GC.
>>
>> Parquet is good for analytical queries but not for full-row retrieval,
>> because Parquet is a columnar storage format.
>>
>> If your typical query is `select * from table1`, then Parquet is not a
>> good choice.
>>
>> If your typical query is `select max(f1), min(f2) from table1`, then
>> Parquet is a great solution.
>>
>> Next, you will have to weigh the cost of a CTAS conversion against the
>> cost of querying the JSON directly.
>>
>>
>> - luoc
>>
>>> On Jul 10, 2022, at 13:22, Prabhakar Bhosale <[email protected]> wrote:
>>>
>>> Dear Luoc,
>>> Thanks for the insights. This is just a week's data; production will
>>> have 15 times more. In line with that, I have the following questions:
>>>
>>> 1. Is there any template or calculator which will help me size the
>>> production server (CPU, memory and IO) based on the size of the data?
>>> 2. For such a huge volume of data, what are the best practices to
>>> follow for storing and retrieving it?
>>> 3. What should be the optimal size of a file? Currently the
>>> uncompressed size of each file is 2 GB. How do we balance the number
>>> of files against the file size?
>>> 4. Do you think the Parquet format will perform better than JSON?
>>> 5. Is there any way in Drill to detect a "File Create" event and then
>>> convert JSON to Parquet using CTAS?
>>>
>>> Thanks and regards,
>>> Prabhakar
>>>
>>> On Sat, Jul 9, 2022 at 8:41 PM luoc <[email protected]> wrote:
>>>
>>> Hello Prabhakar,
>>>
>>> I will walk through my check process and hope it gives you some useful
>>> advice:
>>>
>>> 1. I imported the file in your attachment using the `View` button on
>>> the right side of the `Profile` page.
>>>
>>> 2. The fragment profile records that the major fragment (02-xx-xx)
>>> took about 45+ minutes.
>>>
>>> 3. The 02-xx-xx phase ran with a parallelism of 3, and the json-scan
>>> (JSON Reader) took most of the time.
>>>
>>> 4. Each minor fragment read nearly 0.12 billion records. Killer!
>>>
>>> As a result, three JSON readers read a total of 338,398,798 records.
>>>
>>> In addition, your JSON files are GZ-compressed, 297 files in total,
>>> which means Drill needs a lot of CPU to decompress them.
>>>
>>> Put simply, your hardware resources are the bottleneck for querying
>>> records at this scale; I recommend scaling out the nodes and running
>>> a distributed cluster.
>>>
>>> - luoc
>>>
>>>> On Jul 9, 2022, at 01:01, Prabhakar Bhosale <[email protected]> wrote:
>>>>
>>>> the
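
A minimal Drill SQL sketch of the CTAS conversion discussed above. The
`dfs.tmp` workspace is Drill's writable temporary workspace; the table name
`events_parquet` and the source path `/data/json/events` are hypothetical
stand-ins for the gzipped JSON directory:

    -- Parquet is Drill's default CTAS output format; set it explicitly.
    ALTER SESSION SET `store.format` = 'parquet';

    -- Read the (gzipped) JSON directory and write it back as a Parquet
    -- table. Both paths are illustrative examples only.
    CREATE TABLE dfs.tmp.`events_parquet` AS
    SELECT *
    FROM dfs.`/data/json/events`;

After the conversion, the same aggregate query can be pointed at
`dfs.tmp.events_parquet`, and only the referenced columns are scanned.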
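On the 512 MB file-size suggestion: when Drill writes Parquet, the row-group
size is controlled by the `store.parquet.block-size` option, whose usual
default is 512 MB (536870912 bytes). A sketch, assuming a reasonably recent
Drill version:

    -- Target row-group size, in bytes, for Parquet files written by CTAS
    -- (512 MB shown here).
    ALTER SYSTEM SET `store.parquet.block-size` = 536870912;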
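On the parallelism seen in the profile (3 minor fragments per scan): the
per-node fragment width is bounded by the `planner.width.max_per_node`
option. The value 8 below is only an illustrative example, not a
recommendation for this workload:

    -- Inspect the current planner width settings.
    SELECT * FROM sys.options WHERE name LIKE 'planner.width%';

    -- Raise or cap the per-node minor-fragment parallelism (example value).
    ALTER SYSTEM SET `planner.width.max_per_node` = 8;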
