Ajantha Bhat created CARBONDATA-3118:
----------------------------------------

             Summary: Parallelize block pruning of default datamap in driver  
for filter query processing
                 Key: CARBONDATA-3118
                 URL: https://issues.apache.org/jira/browse/CARBONDATA-3118
             Project: CarbonData
          Issue Type: Improvement
            Reporter: Ajantha Bhat
            Assignee: Ajantha Bhat


*"Parallelize block pruning of default datamap in driver 
for filter query processing"* 

*Background:* 
Block pruning for filter queries is done on the driver side. 
In real-world big data scenarios, a single carbon table can have millions 
of carbon files. 
Block pruning for 1 million carbon files is currently observed to take 
around 5 seconds, because each carbon file takes around 0.005 ms to prune 
(with only one filter column set in the 'column_meta_cache' table property), 
and 1,000,000 x 0.005 ms = 5 s. 
With more files, block pruning can take even longer. 
Also, the Spark job is not launched until block pruning completes, so the 
user does not know what is happening during that time or why the Spark job 
has not launched. 
Block pruning currently takes this long because the segments are processed 
sequentially. Parallelizing the processing can reduce the time. 


*Solution:* 
Keep the default number of threads for block pruning at 4. 
Users can reduce this number by setting the carbon property 
"carbon.max.driver.threads.for.pruning" to a value between 1 and 4. 
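
As a rough illustration of the 1 to 4 range described above, the thread count could be resolved along these lines. This is only a sketch: the class PruningThreadConfig and its resolve() method are hypothetical names, not CarbonData code, and falling back to the default of 4 for a missing, non-numeric, or out-of-range value is an assumption that this issue does not specify.

```java
import java.util.Properties;

// Hypothetical helper, not part of CarbonData: resolves the pruning thread count.
final class PruningThreadConfig {
    static final String KEY = "carbon.max.driver.threads.for.pruning";
    static final int DEFAULT_THREADS = 4;

    // Assumption: a missing, non-numeric, or out-of-range value falls back to the default.
    static int resolve(Properties props) {
        String raw = props.getProperty(KEY);
        if (raw == null) {
            return DEFAULT_THREADS;
        }
        try {
            int n = Integer.parseInt(raw.trim());
            // Only values from 1 up to the default (4) are accepted.
            return (n >= 1 && n <= DEFAULT_THREADS) ? n : DEFAULT_THREADS;
        } catch (NumberFormatException e) {
            return DEFAULT_THREADS;
        }
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty(KEY, "2");
        System.out.println("pruning threads = " + resolve(props));
    }
}
```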

In TableDataMap.prune(): 

Group the segments per thread so that each thread receives an approximately 
equal number of carbon files. 
Launch one thread per group of segments to handle block pruning. 
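
The grouping and launching steps above could look roughly like the sketch below. It is not the actual TableDataMap.prune() change: the Segment class, the group() and prune() methods, and the string file names are hypothetical stand-ins, and the greedy "largest segment into the least-loaded group" balancing is just one assumed way to give each thread a similar number of carbon files.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Predicate;

final class ParallelPruneSketch {
    // Hypothetical stand-in for a segment: an id plus the carbon files it holds.
    static final class Segment {
        final String id;
        final List<String> files;
        Segment(String id, List<String> files) { this.id = id; this.files = files; }
    }

    // Greedy balancing (an assumption): largest segment first, into the group with the fewest files.
    static List<List<Segment>> group(List<Segment> segments, int threads) {
        List<List<Segment>> groups = new ArrayList<>();
        int[] load = new int[threads];
        for (int i = 0; i < threads; i++) groups.add(new ArrayList<>());
        List<Segment> sorted = new ArrayList<>(segments);
        sorted.sort((a, b) -> Integer.compare(b.files.size(), a.files.size()));
        for (Segment s : sorted) {
            int min = 0;
            for (int i = 1; i < threads; i++) if (load[i] < load[min]) min = i;
            groups.get(min).add(s);
            load[min] += s.files.size();
        }
        return groups;
    }

    // Prunes each group on its own thread and keeps the files accepted by the filter.
    static List<String> prune(List<Segment> segments, int threads, Predicate<String> filter) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<List<String>>> futures = new ArrayList<>();
            for (List<Segment> g : group(segments, threads)) {
                Callable<List<String>> task = () -> {
                    List<String> kept = new ArrayList<>();
                    for (Segment s : g) for (String f : s.files) if (filter.test(f)) kept.add(f);
                    return kept;
                };
                futures.add(pool.submit(task));
            }
            // Collect per-group results in group order once every task has finished.
            List<String> result = new ArrayList<>();
            for (Future<List<String>> f : futures) result.addAll(f.get());
            return result;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

Grouping whole segments keeps the sketch short, but it balances poorly when one segment holds most of the files; splitting a large segment's files across groups would be the natural refinement.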



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
