[ https://issues.apache.org/jira/browse/DRILL-7602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

benj updated DRILL-7602:
------------------------
    Description: 
It would be nice and useful in certain situations to be able to repartition data, as in Spark ([https://spark.apache.org/docs/latest/rdd-programming-guide.html]): either automatic repartitioning within certain limits, the possibility to specify the desired number of partitions, or both.

The only way (that I know of today) to do that with Drill is to change 
_store.parquet.block-size_, regenerate the input file, and then run the query 
(but it would be nice to be able to do this on read).
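
For illustration only, such an on-read option could be exposed either as a 
session option or as a parameter of Drill's table function syntax. Nothing 
below exists in Drill today; both the option name and the _repartition_ 
parameter are invented for this sketch:
{code:sql}
-- Hypothetical: a session option fixing a target number of scan fragments
ALTER SESSION SET `planner.force_repartition_count` = 16;

-- Hypothetical: a per-table repartition parameter, reusing the existing
-- table function syntax (only `type` is a real parameter today)
SELECT *
FROM table(dfs.test.`file1`(type => 'parquet', repartition => 16));
{code}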

Illustration: with 2 Parquet files, _file1_ of 50 MB (1 million rows) and 
_file2_ of 1 MB (5000 rows):
{code:sql}
CREATE TABLE dfs.test.`result_from_1_parquet` AS 
(SELECT * FROM dfs.test.`file2` INNER JOIN dfs.test.`file1` ON ...)
=> ~ 50min

-- Today we have to change the Parquet block size to force multiple Parquet files
ALTER SESSION SET `store.parquet.block-size` = 1048576;
-- Repartition data
CREATE TABLE dfs.test.`file1_bis` AS (SELECT * FROM dfs.test.`file1`);
-- then Launch the request
CREATE TABLE dfs.test.`result_from_1_parquet` AS 
(SELECT * FROM dfs.test.`file2` INNER JOIN dfs.test.`file1_bis` ON ...)
=> ~ 1min
{code}
So it's possible to save a lot of time (depending on the cluster configuration) 
simply by forcing more input files. It would be useful not to have to 
regenerate the files with the ideal fragmentation before each query.
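
Note that the workaround leaves _store.parquet.block-size_ lowered for the 
rest of the session, so it should be restored before creating other tables, 
e.g.:
{code:sql}
-- Restore the default Parquet block size (512 MB) for later CTAS statements
ALTER SESSION RESET `store.parquet.block-size`;
{code}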

This situation easily appears when performing an inequality JOIN (for example, 
to look up an IP within IP ranges) on a not-so-big dataset:
{code:sql}
ALTER SESSION SET `planner.enable_nljoin_for_scalar_only` = false;
SELECT *
FROM dfs.test.`a_pqt` AS a
INNER JOIN dfs.test.`b_pqt` AS b
ON inet_aton(b.ip) >= inet_aton(a.ip_first)
AND inet_aton(b.ip) <= inet_aton(a.ip_last);
{code}
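
Until an on-read repartition exists, the same block-size workaround applies 
here as well: rewrite the bigger table into many small Parquet files before 
the join (_a_pqt_split_ is just an illustrative name):
{code:sql}
ALTER SESSION SET `store.parquet.block-size` = 1048576;
-- Rewrite the IP-range table into many small Parquet files so that the
-- scan (and therefore the nested loop join) is split across more fragments
CREATE TABLE dfs.test.`a_pqt_split` AS (SELECT * FROM dfs.test.`a_pqt`);

SELECT *
FROM dfs.test.`a_pqt_split` AS a
INNER JOIN dfs.test.`b_pqt` AS b
ON inet_aton(b.ip) >= inet_aton(a.ip_first)
AND inet_aton(b.ip) <= inet_aton(a.ip_last);
{code}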


> Possibility to force repartition on read/select
> -----------------------------------------------
>
>                 Key: DRILL-7602
>                 URL: https://issues.apache.org/jira/browse/DRILL-7602
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Execution - Flow, Query Planning & Optimization
>    Affects Versions: 1.17.0
>            Reporter: benj
>            Priority: Major
>



