Hi Vishwajeet,

Welcome to the Drill community. As it turns out, our mailing list does not 
forward images. But, fortunately, the moderation software preserved the images 
so I was able to find them. Let me tackle your questions one by one.

Like all generalizations, saying "Drill needs lots of memory" is relative. The 
statement applies to a production system, running against large files, with 
many concurrent users. It probably does not apply to your local machine running 
a few sample queries.

What drives memory usage? It is not just file size; it is the amount of data
Drill must buffer. If you scan 1 TB of data with a simple query that has only
a WHERE clause, Drill will use very little memory: rows stream through and
are discarded. But, if you sort that 1 TB, Drill will obviously need lots of
memory to perform the sort. For sort (and several other operators), if there
is not enough memory, Drill will spill to disk, which is slow: at least three
I/Os for each block of data (read the input, write the spill file, read it
back) instead of just one.
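
To make that concrete, here is a sketch in Drill SQL; the file path and
column are made up:

  -- Streams: the scan and filter pass each batch along and discard it,
  -- so memory use stays small no matter how big the file is.
  SELECT * FROM dfs.`/data/big.csv` WHERE columns[0] = 'foo';

  -- Buffers: the sort must hold (or spill) the entire input before it
  -- can emit the first row in order.
  SELECT * FROM dfs.`/data/big.csv` ORDER BY columns[0];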

Second, the variable you used to set memory:

JAVA_TOOL_OPTIONS=-Xmx8192m

is not the documented way to set memory; see [1] for the preferred approach.
Your setting appears to work, but probably only because you are running an
embedded-mode Drillbit.
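
For reference, the documented route is to edit conf/drill-env.sh and set the
heap and direct memory limits there; the sizes below are only examples, so
tune them to your machine:

  export DRILL_HEAP=${DRILL_HEAP:-"4G"}
  export DRILL_MAX_DIRECT_MEMORY=${DRILL_MAX_DIRECT_MEMORY:-"8G"}

Also note that Drill holds its data buffers (for scans, sorts, joins, and so
on) in direct memory, not heap, which is why a big -Xmx by itself does little
for query performance.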


Just to emphasize this: Drill works fine as an embedded desktop tool. But, it 
is designed to run well on clusters, with distributed storage and multiple 
machines all working away on large queries.


To assign memory, consider your use case. Your second image is a screenshot
of one line of the Drill web console showing the Drillbit using 0.2 GB of its
8 GB of heap, 0 GB of direct memory, and essentially 0% CPU. You did not say
whether this was taken during a query or between queries; I assume it was
between queries.


You mention that you want to "reduce file generation time", but you did not
state the kind of file you are reading, or the expected sizes of the input
and output files. (The message title does say the output is Parquet.) I'll
guess that both files reside on your local machine. So, depending on disk
type, you can expect maybe 50 MB/s (HDD) to 200 MB/s (SSD) of I/O throughput.
If you process a 1 GB file, you will need to do roughly 2 GB of I/O: read the
input once, write the output once. At 100 MB/s, that is 20 seconds just for
the I/O, maybe more if an HDD starts seeking back and forth between the input
and output files. This is why a production Drillbit runs on multiple servers:
to spread out the I/O.
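
Since the title says you are writing Parquet from SQL Server, I'll guess your
query is shaped like the sketch below. The storage-plugin name (sqlserver)
and the table name are invented; substitute your own:

  USE dfs.tmp;
  ALTER SESSION SET `store.format` = 'parquet';

  -- CTAS: read from the JDBC source, write the result as Parquet
  CREATE TABLE my_output AS
  SELECT * FROM sqlserver.dbo.big_table;

If so, keep in mind that a JDBC scan like this typically runs in a single
thread, so the read from SQL Server, not Drill's memory, is often the
bottleneck.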

Another issue might be that your input is all one big file. In that case,
Drill will run the scan in a single thread, with no parallelism. Drill works
better if your input is divided into multiple files (or multiple blocks in
HDFS or S3). On the local system, create a directory that contains your file
split into four, eight, or more chunks. That way, Drill can put all your CPUs
to work on CPU-intensive tasks such as filtering, computing values, and so
on.
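
For example (the paths are hypothetical): split big.csv into part0.csv
through part7.csv inside /data/input, then query the directory as if it were
a single table; Drill will scan the parts in parallel:

  -- /data/input contains part0.csv ... part7.csv
  SELECT COUNT(*) FROM dfs.`/data/input`;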


At times like this, the query profile (under "Profiles" in the Drill web
console) is your friend. The amount of information can be overwhelming. Start
with the total run time, then look at the time spent in each operator. Which
ones take the time? Only the scan and the root (the root writes your output
file)? Or do you have a join, sort, or other complex operation? How much
parallelism are you getting? Ideally, you keep all your CPUs busy.

These are a few hints to help you get started. Please feel free to report back 
your findings and perhaps give us a bit more of a description of what you are 
trying to accomplish.


Thanks,
- Paul

[1] https://drill.apache.org/docs/configuring-drill-memory/



 

    On Wednesday, March 11, 2020, 5:42:39 AM PDT, Vishwajeet Anantvilas SONUNE
<vishwajeet.anantvilas.son...@hsbc.co.in> wrote:
Hi Team,

I learned this about Apache Drill: "Drill is memory intensive and therefore
requires sufficient memory to run optimally. You can modify how much memory
that you want allocated to Drill. Drill typically performs better with as
much memory as possible."

With this in mind, I tried allocating as much memory as I could for Drill.
Since I am running Drill on my local machine, I set the JAVA_TOOL_OPTIONS
environment variable to 8 GB, which in turn increased the heap memory.

[Image 1: screenshot, not forwarded by the mailing list]
While running a query that generates a Parquet file from SQL Server with
millions of records, Drill uses just 3-4% of heap memory, and there is no
increase in performance (no reduction in file generation time).

Can you please let us know if there is a way to reduce the file generation
time?

[Image 2: one line of the Drill web console showing the Drillbit using
0.2 GB of 8 GB heap, 0 GB direct memory, and roughly 0% CPU; not forwarded
by the mailing list]
Please let me know if any further details are required.

Looking forward to your reply.

Thanks,
Vishwajeet Sonune
  
 





