Hi Vishwajeet,

Welcome to the Drill community. As it turns out, our mailing list does not forward images, but fortunately the moderation software preserved them, so I was able to find them. Let me tackle your questions one by one.
Like all generalizations, saying "Drill needs lots of memory" is relative. The statement applies to a production system running against large files with many concurrent users. It probably does not apply to your local machine running a few sample queries.

What drives memory usage? It is not just file size; it is the buffered size. If you scan 1 TB of data with a simple query that has only a WHERE clause, Drill will use very little memory. But if you sort that 1 TB, Drill will obviously need lots of memory to perform the sort. For sort (and several other operations), if there is not enough memory, Drill will spill to disk, which is slow: at least three I/Os for each block of data instead of just one.

Second, the variable you used to set memory:

    JAVA_TOOL_OPTIONS=-Xmx8192m

is not the documented way to set memory; see [1] for the preferred approach. Your approach appears to work, but probably only because you are running an embedded-mode Drillbit. Just to emphasize this: Drill works fine as an embedded desktop tool, but it is designed to run well on clusters, with distributed storage and multiple machines all working away on large queries. To assign memory, consider your use case.

Your second image is a screen shot of one line of the Drill web console showing the Drillbit using 0.2 GB of 8 GB of heap, 0 GB of direct memory, and essentially 0% CPU. You did not say whether this was during a query or between queries; I assume it was between queries.

You mention that you want to "reduce file generation time", but you did not state the kind of file you are reading, or the expected sizes of the input and output files. (The message title does state that the output is Parquet.) I'll guess that both files reside on your local machine. Depending on disk type, you can expect roughly 50 MB/s (HDD) to 200 MB/s (SSD) of I/O throughput. If you want to process a 1 GB file, you will need to do 2 GB of I/O.
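For reference, the approach documented in [1] is to set heap and direct memory in Drill's conf/drill-env.sh rather than through JAVA_TOOL_OPTIONS. A minimal sketch; the values are illustrative, not recommendations:

```shell
# conf/drill-env.sh -- the documented knobs for Drillbit memory (see [1]).
# Illustrative values only; tune for your machine and workload.
export DRILL_HEAP="4G"                # Java heap: planning, metadata, web console
export DRILL_MAX_DIRECT_MEMORY="8G"   # direct memory: record batches, sorts, joins
```

Note that Drill holds record batches in direct memory, not heap, so for large scans and sorts the direct memory setting usually matters more than -Xmx.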
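To make that arithmetic concrete, here is a back-of-envelope sketch (decimal units; the 100 MB/s rate is an assumed figure between the HDD and SSD numbers above):

```shell
# Converting a 1 GB input into a ~1 GB output means ~2 GB of raw I/O.
DATA_MB=2000       # 1 GB read + ~1 GB written, in MB
RATE_MB_S=100      # assumed single-disk throughput, MB/s
echo "$(( DATA_MB / RATE_MB_S )) seconds of raw I/O"   # prints "20 seconds of raw I/O"
```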
At 100 MB/s, it will take 20 seconds just for the I/O, and maybe more if an HDD starts seeking between the input and output files. This is why a production Drillbit runs on multiple servers: to spread out the I/O.

Another issue might be that your input is all one big file. In that case, Drill will run in a single thread, with no parallelism. Drill works better if your input is divided into multiple files (or multiple blocks in HDFS or S3). On a local system, create a directory that contains your file split into four, eight, or more chunks. That way, Drill can put all your CPUs to work on CPU-intensive tasks such as filtering, computing values, and so on.

At times like this, the query profile is your friend. The amount of information can be overwhelming, so start with the total run time, then look at the time spent in the various operators. Which ones take time? Only the scan and the root (the root writes your output file)? Or do you have a join, sort, or other complex operation? How much parallelism are you getting? Ideally, you keep all your CPUs busy.

These are a few hints to help you get started. Please feel free to report back your findings, and perhaps give us a bit more of a description of what you are trying to accomplish.

Thanks,
- Paul

[1] https://drill.apache.org/docs/configuring-drill-memory/

On Wednesday, March 11, 2020, 5:42:39 AM PDT, Vishwajeet Anantvilas SONUNE <vishwajeet.anantvilas.son...@hsbc.co.in> wrote:

Hi Team,

I learned about Apache Drill that "Drill is memory intensive and therefore requires sufficient memory to run optimally. You can modify how much memory that you want allocated to Drill. Drill typically performs better with as much memory as possible." With this in mind, I tried allocating as much memory as I could for Drill. I'm running Drill on my local machine, so I configured JAVA_TOOL_OPTIONS to 8 GB as an environment variable, which in turn increased the heap memory.
While running a query that generates a Parquet file from a SQL Server table with millions of records, Drill uses just 3-4% of heap memory, and there is no increase in performance (no reduction in file generation time). Can you please let us know if there is a way to reduce the file generation time? Please let me know if any further details are required. Looking forward to your reply.

Thanks,
Vishwajeet Sonune