[jira] [Resolved] (HIVE-17102) Example For Vectorized Execution in Hive in Cwiki not Seems to Work
[ https://issues.apache.org/jira/browse/HIVE-17102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

anubhav tarar resolved HIVE-17102.
----------------------------------
    Resolution: Fixed

> Example For Vectorized Execution in Hive in Cwiki not Seems to Work
> -------------------------------------------------------------------
>
>                 Key: HIVE-17102
>                 URL: https://issues.apache.org/jira/browse/HIVE-17102
>             Project: Hive
>          Issue Type: Bug
>          Components: Documentation
>    Affects Versions: 1.2.0
>            Reporter: anubhav tarar
>            Assignee: anubhav tarar
>
> I tried the vectorized execution example from the Hive cwiki, but it does not seem to work.
>
> Step 1: create an ORC table.
> hive> create table Addresses (
>     >   name string,
>     >   street string,
>     >   city string,
>     >   state string,
>     >   zip int
>     > ) stored as orc tblproperties ("orc.compress"="NONE");
>
> Step 2: insert a row into the table.
> hive> insert into Addresses values('anubhav','ggn','ggn','haryana','122001');
> Query ID = hduser_20170716093152_14774003-d2c4-4620-b773-ca17cafd902b
> Total jobs = 1
> Launching Job 1 out of 1
> Number of reduce tasks is set to 0 since there's no reduce operator
> Listening for transport dt_socket at address: 5005
> Job running in-process (local Hadoop)
> 2017-07-16 09:31:59,689 Stage-1 map = 100%, reduce = 0%
> Ended Job = job_local1858411694_0004
> Stage-4 is selected by condition resolver.
> Stage-3 is filtered out by condition resolver.
> Stage-5 is filtered out by condition resolver.
> Moving data to: hdfs://localhost:54310/user/hive/warehouse/addresses/.hive-staging_hive_2017-07-16_09-31-52_428_7861150459629073282-1/-ext-1
> Loading data to table default.addresses
> Table default.addresses stats: [numFiles=1, numRows=1, totalSize=713, rawDataSize=360]
> MapReduce Jobs Launched:
> Stage-Stage-1: HDFS Read: 778 HDFS Write: 818 SUCCESS
> Total MapReduce CPU Time Spent: 0 msec
>
> Step 3: query the table with the explain command.
> hive> set hive.vectorized.execution.enabled = true;
> hive> explain select name from Addresses where zip>1;
> OK
> STAGE DEPENDENCIES:
>   Stage-0 is a root stage
> STAGE PLANS:
>   Stage: Stage-0
>     Fetch Operator
>       limit: -1
>       Processor Tree:
>         TableScan
>           alias: addresses
>           Statistics: Num rows: 1 Data size: 360 Basic stats: COMPLETE Column stats: NONE
>           Filter Operator
>             predicate: (zip > 1) (type: boolean)
>             Statistics: Num rows: 1 Data size: 360 Basic stats: COMPLETE Column stats: NONE
>             Select Operator
>               expressions: name (type: string)
>               outputColumnNames: _col0
>               Statistics: Num rows: 1 Data size: 360 Basic stats: COMPLETE Column stats: NONE
>               ListSink
> Time taken: 0.081 seconds, Fetched: 20 row(s)
>
> Note: the explain output shows no vectorized reader being applied.
> The reason for the failure is that when a Fetch Operator is used in the plan instead of a Map task, the query does not vectorize.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
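The explanation at the end of the report suggests a workaround that can be shown as a short Hive session. This is a sketch, not text from the cwiki: setting hive.fetch.task.conversion=none disables the fetch-task shortcut (the "Fetch Operator" in the plan above) and forces a Map task, which is what lets the vectorized reader apply.

```sql
-- Sketch of the workaround: force a Map task so vectorization can apply.
set hive.vectorized.execution.enabled = true;
set hive.fetch.task.conversion = none;
explain select name from Addresses where zip > 1;
-- With a Map task in the plan, the EXPLAIN output should indicate a
-- vectorized execution mode (exact wording varies by Hive version and
-- execution engine).
```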
[jira] [Updated] (HIVE-17102) Example For Vectorized Execution in Hive in Cwiki not Seems to Work
[ https://issues.apache.org/jira/browse/HIVE-17102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

anubhav tarar updated HIVE-17102:
---------------------------------
    Description: 
I tried the vectorized execution example from the Hive cwiki, but it does not seem to work.

Step 1: create an ORC table.
hive> create table Addresses (
    >   name string,
    >   street string,
    >   city string,
    >   state string,
    >   zip int
    > ) stored as orc tblproperties ("orc.compress"="NONE");

Step 2: insert a row into the table.
hive> insert into Addresses values('anubhav','ggn','ggn','haryana','122001');
Query ID = hduser_20170716093152_14774003-d2c4-4620-b773-ca17cafd902b
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Listening for transport dt_socket at address: 5005
Job running in-process (local Hadoop)
2017-07-16 09:31:59,689 Stage-1 map = 100%, reduce = 0%
Ended Job = job_local1858411694_0004
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://localhost:54310/user/hive/warehouse/addresses/.hive-staging_hive_2017-07-16_09-31-52_428_7861150459629073282-1/-ext-1
Loading data to table default.addresses
Table default.addresses stats: [numFiles=1, numRows=1, totalSize=713, rawDataSize=360]
MapReduce Jobs Launched:
Stage-Stage-1: HDFS Read: 778 HDFS Write: 818 SUCCESS
Total MapReduce CPU Time Spent: 0 msec

Step 3: query the table with the explain command.
hive> set hive.vectorized.execution.enabled = true;
hive> explain select name from Addresses where zip>1;
OK
STAGE DEPENDENCIES:
  Stage-0 is a root stage
STAGE PLANS:
  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        TableScan
          alias: addresses
          Statistics: Num rows: 1 Data size: 360 Basic stats: COMPLETE Column stats: NONE
          Filter Operator
            predicate: (zip > 1) (type: boolean)
            Statistics: Num rows: 1 Data size: 360 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: name (type: string)
              outputColumnNames: _col0
              Statistics: Num rows: 1 Data size: 360 Basic stats: COMPLETE Column stats: NONE
              ListSink
Time taken: 0.081 seconds, Fetched: 20 row(s)

Note: the explain output shows no vectorized reader being applied. The reason for the failure is that when a Fetch Operator is used in the plan instead of a Map task, the query does not vectorize.

  was:
I tried the vectorized execution example from the Hive cwiki, but it does not seem to work.

Step 1: create an ORC table.
hive> create table Addresses (
    >   name string,
    >   street string,
    >   city string,
    >   state string,
    >   zip int
    > ) stored as orc tblproperties ("orc.compress"="NONE");

Step 2: insert a row into the table.
hive> insert into Addresses values('anubhav','ggn','ggn','haryana','122001');
Query ID = hduser_20170716093152_14774003-d2c4-4620-b773-ca17cafd902b
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Listening for transport dt_socket at address: 5005
Job running in-process (local Hadoop)
2017-07-16 09:31:59,689 Stage-1 map = 100%, reduce = 0%
Ended Job = job_local1858411694_0004
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://localhost:54310/user/hive/warehouse/addresses/.hive-staging_hive_2017-07-16_09-31-52_428_7861150459629073282-1/-ext-1
Loading data to table default.addresses
Table default.addresses stats: [numFiles=1, numRows=1, totalSize=713, rawDataSize=360]
MapReduce Jobs Launched:
Stage-Stage-1: HDFS Read: 778 HDFS Write: 818 SUCCESS
Total MapReduce CPU Time Spent: 0 msec

Step 3: query the table with the explain command.
hive> set hive.vectorized.execution.enabled = true;
hive> explain select name from Addresses where zip>1;
OK
STAGE DEPENDENCIES:
  Stage-0 is a root stage
STAGE PLANS:
  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        TableScan
          alias: addresses
          Statistics: Num rows: 1 Data size: 360 Basic stats: COMPLETE Column stats: NONE
          Filter Operator
            predicate: (zip > 1) (type: boolean)
            Statistics: Num rows: 1 Data size: 360 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: name (type: string)
              outputColumnNames: _col0
              Statistics: Num rows: 1 Data size: 360 Basic stats: COMPLETE Column stats: NONE
              ListSink
Time taken: 0.081 seconds, Fetched: 20 row(s)

Note: the explain output shows no vectorized reader being applied.
I updated the Hive cwiki for the same: https://cwiki.apache.org/confluence/display/Hive/Vectorized+Query+Execution


> Example For Vectorized Execution in
[jira] [Assigned] (HIVE-17102) Example For Vectorized Execution in Hive in Cwiki not Seems to Work
[ https://issues.apache.org/jira/browse/HIVE-17102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

anubhav tarar reassigned HIVE-17102:
------------------------------------

> Example For Vectorized Execution in Hive in Cwiki not Seems to Work
> -------------------------------------------------------------------
>
>                 Key: HIVE-17102
>                 URL: https://issues.apache.org/jira/browse/HIVE-17102
>             Project: Hive
>          Issue Type: Bug
>          Components: Documentation
>    Affects Versions: 1.2.0
>            Reporter: anubhav tarar
>            Assignee: anubhav tarar

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
[jira] [Commented] (HIVE-16793) Scalar sub-query: sq_count_check not required if gby keys are constant
[ https://issues.apache.org/jira/browse/HIVE-16793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16088801#comment-16088801 ] Lefty Leverenz commented on HIVE-16793: --- Thanks for the doc, [~vgarg] -- it looks good. Question: Should sq_count_check be documented along with the other UDFs, or is it for internal use only? HIVE-15544 introduced it, so if sq_count_check needs to be documented we should update the doc note there. > Scalar sub-query: sq_count_check not required if gby keys are constant > -- > > Key: HIVE-16793 > URL: https://issues.apache.org/jira/browse/HIVE-16793 > Project: Hive > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Gopal V >Assignee: Vineet Garg > Fix For: 3.0.0 > > Attachments: HIVE-16793.1.patch, HIVE-16793.2.patch, > HIVE-16793.3.patch, HIVE-16793.4.patch, HIVE-16793.5.patch, HIVE-16793.6.patch > > > This query has an sq_count check, though is useless on a constant key. > {code} > hive> explain select * from part where p_size > (select max(p_size) from part > where p_type = '1' group by p_type); > Warning: Map Join MAPJOIN[37][bigTable=?] in task 'Map 1' is a cross product > Warning: Map Join MAPJOIN[36][bigTable=?] in task 'Map 1' is a cross product > OK > Plan optimized by CBO. 
> Vertex dependency in root stage > Map 1 <- Reducer 4 (BROADCAST_EDGE), Reducer 6 (BROADCAST_EDGE) > Reducer 3 <- Map 2 (SIMPLE_EDGE) > Reducer 4 <- Reducer 3 (CUSTOM_SIMPLE_EDGE) > Reducer 6 <- Map 5 (SIMPLE_EDGE) > Stage-0 > Fetch Operator > limit:-1 > Stage-1 > Map 1 vectorized, llap > File Output Operator [FS_64] > Select Operator [SEL_63] (rows= width=621) > > Output:["_col0","_col1","_col2","_col3","_col4","_col5","_col6","_col7","_col8"] > Filter Operator [FIL_62] (rows= width=625) > predicate:(_col5 > _col10) > Map Join Operator [MAPJOIN_61] (rows=2 width=625) > > Conds:(Inner),Output:["_col0","_col1","_col2","_col3","_col4","_col5","_col6","_col7","_col8","_col10"] > <-Reducer 6 [BROADCAST_EDGE] vectorized, llap > BROADCAST [RS_58] > Select Operator [SEL_57] (rows=1 width=4) > Output:["_col0"] > Group By Operator [GBY_56] (rows=1 width=89) > > Output:["_col0","_col1"],aggregations:["max(VALUE._col0)"],keys:KEY._col0 > <-Map 5 [SIMPLE_EDGE] vectorized, llap > SHUFFLE [RS_55] > PartitionCols:_col0 > Group By Operator [GBY_54] (rows=86 width=89) > > Output:["_col0","_col1"],aggregations:["max(_col1)"],keys:'1' > Select Operator [SEL_53] (rows=1212121 width=109) > Output:["_col1"] > Filter Operator [FIL_52] (rows=1212121 width=109) > predicate:(p_type = '1') > TableScan [TS_17] (rows=2 width=109) > > tpch_flat_orc_1000@part,part,Tbl:COMPLETE,Col:COMPLETE,Output:["p_type","p_size"] > <-Map Join Operator [MAPJOIN_60] (rows=2 width=621) > > Conds:(Inner),Output:["_col0","_col1","_col2","_col3","_col4","_col5","_col6","_col7","_col8"] > <-Reducer 4 [BROADCAST_EDGE] vectorized, llap > BROADCAST [RS_51] > Select Operator [SEL_50] (rows=1 width=8) > Filter Operator [FIL_49] (rows=1 width=8) > predicate:(sq_count_check(_col0) <= 1) > Group By Operator [GBY_48] (rows=1 width=8) > Output:["_col0"],aggregations:["count(VALUE._col0)"] > <-Reducer 3 [CUSTOM_SIMPLE_EDGE] vectorized, llap > PARTITION_ONLY_SHUFFLE [RS_47] > Group By Operator [GBY_46] (rows=1 width=8) > 
Output:["_col0"],aggregations:["count()"] > Select Operator [SEL_45] (rows=1 width=85) > Group By Operator [GBY_44] (rows=1 width=85) > Output:["_col0"],keys:KEY._col0 > <-Map 2 [SIMPLE_EDGE] vectorized, llap > SHUFFLE [RS_43] > PartitionCols:_col0 > Group By Operator [GBY_42] (rows=83 > width=85) > Output:["_col0"],keys:'1' > Select Operator [SEL_41] (rows=1212121 > width=105) > Filter Operator
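The guard under discussion can be described concretely. As a hedged sketch (hypothetical names, not Hive's actual UDF code), sq_count_check enforces that a scalar sub-query yields at most one row; with a constant GROUP BY key (keys:'1' in the plan above) every input row falls into the same single group, so the guard can never fire and the extra plan branch is wasted work.

```python
# Illustrative semantics of the sq_count_check guard, not Hive's code.
def sq_count_check(row_count):
    """Runtime guard: a scalar sub-query may produce at most one row."""
    if row_count > 1:
        raise ValueError("Scalar sub-query returned more than 1 row")
    return row_count

# A constant GROUP BY key collapses all rows into one group, so the
# count fed to the guard is always 0 or 1.
rows = [("1", 5), ("1", 9), ("1", 3)]
groups = {key for key, _ in rows}
assert sq_count_check(len(groups)) == 1
```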
[jira] [Commented] (HIVE-4577) hive CLI can't handle hadoop dfs command with space and quotes.
[ https://issues.apache.org/jira/browse/HIVE-4577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16088785#comment-16088785 ]

Lefty Leverenz commented on HIVE-4577:
--------------------------------------
Doc note: This should be documented in the two wikidocs that describe the dfs command:
* [Commands | https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Commands]
* [CLI -- Hive Interactive Shell Commands | https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Cli#LanguageManualCli-HiveInteractiveShellCommands]

Added a TODOC3.0 label. (A TODOC2.2.0 label should also be added if the patch gets committed to branch-2.2.)

> hive CLI can't handle hadoop dfs command with space and quotes.
> ---------------------------------------------------------------
>
>                 Key: HIVE-4577
>                 URL: https://issues.apache.org/jira/browse/HIVE-4577
>             Project: Hive
>          Issue Type: Bug
>          Components: CLI
>    Affects Versions: 0.9.0, 0.10.0, 0.14.0, 0.13.1, 1.2.0, 1.1.0
>            Reporter: Bing Li
>            Assignee: Bing Li
>              Labels: TODOC3.0
>             Fix For: 2.2.0, 3.0.0
>
>         Attachments: HIVE-4577.1.patch, HIVE-4577.2.patch, HIVE-4577.3.patch.txt, HIVE-4577.4.patch, HIVE-4577.5.patch, HIVE-4577.6.patch
>
> As designed, Hive supports hadoop dfs commands in the hive shell, like:
> hive> dfs -mkdir /user/biadmin/mydir;
> but it behaves differently from hadoop when the path contains spaces or quotes:
> hive> dfs -mkdir "hello";
> drwxr-xr-x - biadmin supergroup 0 2013-04-23 09:40 /user/biadmin/"hello"
> hive> dfs -mkdir 'world';
> drwxr-xr-x - biadmin supergroup 0 2013-04-23 09:43 /user/biadmin/'world'
> hive> dfs -mkdir "bei jing";
> drwxr-xr-x - biadmin supergroup 0 2013-04-23 09:44 /user/biadmin/"bei
> drwxr-xr-x - biadmin supergroup 0 2013-04-23 09:44 /user/biadmin/jing"

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
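The bug in the quoted description comes down to tokenization: the CLI splits the dfs command on whitespace without honoring shell-style quoting, so "bei jing" becomes two arguments with literal quote characters. A hedged sketch of the intended behavior using Python's shlex (this is an illustration, not the actual patch, which lives in Hive's Java CLI code):

```python
import shlex

def split_dfs_command(cmd):
    # Quote-aware splitting: quoted groups stay together and the
    # quote characters are stripped, as a real shell would do.
    return shlex.split(cmd)

naive = 'dfs -mkdir "bei jing"'.split()          # buggy: quotes kept, path split in two
fixed = split_dfs_command('dfs -mkdir "bei jing"')
print(naive)  # ['dfs', '-mkdir', '"bei', 'jing"']
print(fixed)  # ['dfs', '-mkdir', 'bei jing']
```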
[jira] [Updated] (HIVE-4577) hive CLI can't handle hadoop dfs command with space and quotes.
[ https://issues.apache.org/jira/browse/HIVE-4577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lefty Leverenz updated HIVE-4577:
---------------------------------
    Labels: TODOC3.0  (was: )

> hive CLI can't handle hadoop dfs command with space and quotes.
> ---------------------------------------------------------------
>
>                 Key: HIVE-4577
>                 URL: https://issues.apache.org/jira/browse/HIVE-4577
>             Project: Hive
>          Issue Type: Bug
>          Components: CLI
>    Affects Versions: 0.9.0, 0.10.0, 0.14.0, 0.13.1, 1.2.0, 1.1.0
>            Reporter: Bing Li
>            Assignee: Bing Li
>              Labels: TODOC3.0
>             Fix For: 2.2.0, 3.0.0

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
[jira] [Commented] (HIVE-4577) hive CLI can't handle hadoop dfs command with space and quotes.
[ https://issues.apache.org/jira/browse/HIVE-4577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16088781#comment-16088781 ]

Lefty Leverenz commented on HIVE-4577:
--------------------------------------
[~vgumashta], is this also going to be committed to branch-2.2? If not, the fix version should only show 3.0.0. Thanks.

> hive CLI can't handle hadoop dfs command with space and quotes.
> ---------------------------------------------------------------
>
>                 Key: HIVE-4577
>                 URL: https://issues.apache.org/jira/browse/HIVE-4577
>             Project: Hive
>          Issue Type: Bug
>          Components: CLI
>    Affects Versions: 0.9.0, 0.10.0, 0.14.0, 0.13.1, 1.2.0, 1.1.0
>            Reporter: Bing Li
>            Assignee: Bing Li
>             Fix For: 2.2.0, 3.0.0

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
[jira] [Commented] (HIVE-16996) Add HLL as an alternative to FM sketch to compute stats
[ https://issues.apache.org/jira/browse/HIVE-16996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16088775#comment-16088775 ] Lefty Leverenz commented on HIVE-16996: --- Doc note: This adds *hive.stats.ndv.algo* to HiveConf.java, so it needs to be documented in the Statistics section of Configuration Properties. * [Configuration Properties -- Statistics | https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties#ConfigurationProperties-Statistics] Added a TODOC3.0 label. > Add HLL as an alternative to FM sketch to compute stats > --- > > Key: HIVE-16996 > URL: https://issues.apache.org/jira/browse/HIVE-16996 > Project: Hive > Issue Type: Sub-task >Reporter: Pengcheng Xiong >Assignee: Pengcheng Xiong > Labels: TODOC3.0 > Fix For: 3.0.0 > > Attachments: Accuracy and performance comparison between HyperLogLog > and FM Sketch.docx, HIVE-16966.01.patch, HIVE-16966.02.patch, > HIVE-16966.03.patch, HIVE-16966.04.patch, HIVE-16966.05.patch, > HIVE-16966.06.patch, HIVE-16966.07.patch > > -- This message was sent by Atlassian JIRA (v6.4.14#64029)
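For context on the setting being documented: *hive.stats.ndv.algo* chooses between an FM sketch and HyperLogLog for number-of-distinct-values (NDV) estimation. A minimal, self-contained HLL estimator is sketched below; it is illustrative only, not Hive's implementation, and the register count and correction thresholds are standard textbook choices rather than Hive's.

```python
import hashlib
import math

def hll_estimate(values, p=14):
    """Illustrative HyperLogLog NDV estimate using 2^p registers."""
    m = 1 << p
    registers = [0] * m
    for v in values:
        # 64-bit hash: low p bits pick a register, the rest feed the rank.
        h = int.from_bytes(hashlib.md5(str(v).encode()).digest()[:8], "big")
        idx = h & (m - 1)
        rest = h >> p
        rank = (64 - p) - rest.bit_length() + 1  # leading zeros + 1
        registers[idx] = max(registers[idx], rank)
    alpha = 0.7213 / (1 + 1.079 / m)
    raw = alpha * m * m / sum(2.0 ** -r for r in registers)
    zeros = registers.count(0)
    if raw <= 2.5 * m and zeros:
        # Small-range correction: fall back to linear counting.
        return m * math.log(m / zeros)
    return raw
```

With p=14 the standard error is roughly 1.04/sqrt(2^14), i.e. under one percent, which is the accuracy-for-space trade-off that motivates HLL as an alternative to the FM sketch.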
[jira] [Updated] (HIVE-16996) Add HLL as an alternative to FM sketch to compute stats
[ https://issues.apache.org/jira/browse/HIVE-16996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lefty Leverenz updated HIVE-16996: -- Labels: TODOC3.0 (was: ) > Add HLL as an alternative to FM sketch to compute stats > --- > > Key: HIVE-16996 > URL: https://issues.apache.org/jira/browse/HIVE-16996 > Project: Hive > Issue Type: Sub-task >Reporter: Pengcheng Xiong >Assignee: Pengcheng Xiong > Labels: TODOC3.0 > Fix For: 3.0.0 > > Attachments: Accuracy and performance comparison between HyperLogLog > and FM Sketch.docx, HIVE-16966.01.patch, HIVE-16966.02.patch, > HIVE-16966.03.patch, HIVE-16966.04.patch, HIVE-16966.05.patch, > HIVE-16966.06.patch, HIVE-16966.07.patch > > -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (HIVE-16997) Extend object store to store bit vectors
[ https://issues.apache.org/jira/browse/HIVE-16997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16088762#comment-16088762 ]

Pengcheng Xiong commented on HIVE-16997:
----------------------------------------
After converting each bit vector in the FM sketch to 4 bytes, we need 1024*4 = 4096 bytes for the FM sketch.

> Extend object store to store bit vectors
> ----------------------------------------
>
>                 Key: HIVE-16997
>                 URL: https://issues.apache.org/jira/browse/HIVE-16997
>             Project: Hive
>          Issue Type: Sub-task
>            Reporter: Pengcheng Xiong
>            Assignee: Pengcheng Xiong

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
[jira] [Commented] (HIVE-17018) Small table is converted to map join even the total size of small tables exceeds the threshold(hive.auto.convert.join.noconditionaltask.size)
[ https://issues.apache.org/jira/browse/HIVE-17018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16088721#comment-16088721 ]

Carter Shanklin commented on HIVE-17018:
----------------------------------------
My inputs:
* That particular variable can't be renamed to something Spark-specific, since all engines use it.
* Adding a net-new variable for Spark would increase confusion rather than decrease it.
* It would be good to have some sort of descriptive name that applies to both Tez and MR. As pointed out, there is no relation between what that variable used to do and what it does today, and the implication of changing that parameter is difficult to guess.
* Maybe a new variable like hive.auto.convert.join.max.hashtable.size could be introduced. Both engines would switch to that variable at some point; usage of the old variable could then be deprecated and removed.

Just my inputs. /cc [~ashutoshc]

> Small table is converted to map join even the total size of small tables exceeds the threshold (hive.auto.convert.join.noconditionaltask.size)
> ----------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-17018
>                 URL: https://issues.apache.org/jira/browse/HIVE-17018
>             Project: Hive
>          Issue Type: Bug
>            Reporter: liyunzhang_intel
>            Assignee: liyunzhang_intel
>         Attachments: HIVE-17018_data_init.q, HIVE-17018.q, t3.txt
>
> We use "hive.auto.convert.join.noconditionaltask.size" as the threshold: if the sum of the sizes of n-1 of the tables/partitions in an n-way join is smaller than this value, the join is converted to a map join. For example, in A join B join C join D join E, the big table is A (100M) and the small tables are B (10M), C (10M), D (10M), and E (10M). If we set hive.auto.convert.join.noconditionaltask.size=20M, the current code converts E, D, and B to map joins, but not C. In my understanding, because hive.auto.convert.join.noconditionaltask.size can only accommodate E and D, neither C nor B should be converted to a map join.
> Let's explain in more detail why E can be converted to a map join.
> in current code, > [SparkMapJoinOptimizer#getConnectedMapJoinSize|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SparkMapJoinOptimizer.java#L364] > calculates all the mapjoins in the parent path and child path. The search > stops when encountering [UnionOperator or > ReduceOperator|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SparkMapJoinOptimizer.java#L381]. > Because C is not converted to map join because {{connectedMapJoinSize + > totalSize) > maxSize}} [see > code|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SparkMapJoinOptimizer.java#L330].The > RS before the join of C remains. When calculating whether B will be > converted to map join, {{getConnectedMapJoinSize}} returns 0 as encountering > [RS > |https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SparkMapJoinOptimizer.java#409] > and causes {{connectedMapJoinSize + totalSize) < maxSize}} matches. > [~xuefuz] or [~jxiang]: can you help see whether this is a bug or not as you > are more familiar with SparkJoinOptimizer. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
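The accounting the report expects can be sketched in a few lines. This is an illustrative model of the threshold check, with hypothetical names rather than the SparkMapJoinOptimizer code: a small table is admitted to a map join only while the running total of already-converted tables plus the candidate stays within hive.auto.convert.join.noconditionaltask.size.

```python
def select_map_join_tables(small_tables, threshold):
    """small_tables: (name, size) candidates in evaluation order.
    Returns the names that fit under the threshold cumulatively."""
    converted, total = [], 0
    for name, size in small_tables:
        # Mirrors the (connectedMapJoinSize + totalSize) <= maxSize check.
        if total + size <= threshold:
            converted.append(name)
            total += size
    return converted

# With a 20M threshold and four 10M tables, only two should convert;
# the bug report says Hive converted three (E, D, B) and skipped only C.
print(select_map_join_tables([("E", 10), ("D", 10), ("C", 10), ("B", 10)], 20))
# -> ['E', 'D']
```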
[jira] [Commented] (HIVE-16996) Add HLL as an alternative to FM sketch to compute stats
[ https://issues.apache.org/jira/browse/HIVE-16996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16088685#comment-16088685 ] Pengcheng Xiong commented on HIVE-16996: updated related q file changes, pushed to master. thanks [~ashutoshc] and [~prasanth_j] for the review. > Add HLL as an alternative to FM sketch to compute stats > --- > > Key: HIVE-16996 > URL: https://issues.apache.org/jira/browse/HIVE-16996 > Project: Hive > Issue Type: Sub-task >Reporter: Pengcheng Xiong >Assignee: Pengcheng Xiong > Fix For: 3.0.0 > > Attachments: Accuracy and performance comparison between HyperLogLog > and FM Sketch.docx, HIVE-16966.01.patch, HIVE-16966.02.patch, > HIVE-16966.03.patch, HIVE-16966.04.patch, HIVE-16966.05.patch, > HIVE-16966.06.patch, HIVE-16966.07.patch > > -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (HIVE-16996) Add HLL as an alternative to FM sketch to compute stats
[ https://issues.apache.org/jira/browse/HIVE-16996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pengcheng Xiong updated HIVE-16996: --- Fix Version/s: 3.0.0 > Add HLL as an alternative to FM sketch to compute stats > --- > > Key: HIVE-16996 > URL: https://issues.apache.org/jira/browse/HIVE-16996 > Project: Hive > Issue Type: Sub-task >Reporter: Pengcheng Xiong >Assignee: Pengcheng Xiong > Fix For: 3.0.0 > > Attachments: Accuracy and performance comparison between HyperLogLog > and FM Sketch.docx, HIVE-16966.01.patch, HIVE-16966.02.patch, > HIVE-16966.03.patch, HIVE-16966.04.patch, HIVE-16966.05.patch, > HIVE-16966.06.patch, HIVE-16966.07.patch > > -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (HIVE-16996) Add HLL as an alternative to FM sketch to compute stats
[ https://issues.apache.org/jira/browse/HIVE-16996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pengcheng Xiong updated HIVE-16996: --- Resolution: Fixed Status: Resolved (was: Patch Available) > Add HLL as an alternative to FM sketch to compute stats > --- > > Key: HIVE-16996 > URL: https://issues.apache.org/jira/browse/HIVE-16996 > Project: Hive > Issue Type: Sub-task >Reporter: Pengcheng Xiong >Assignee: Pengcheng Xiong > Fix For: 3.0.0 > > Attachments: Accuracy and performance comparison between HyperLogLog > and FM Sketch.docx, HIVE-16966.01.patch, HIVE-16966.02.patch, > HIVE-16966.03.patch, HIVE-16966.04.patch, HIVE-16966.05.patch, > HIVE-16966.06.patch, HIVE-16966.07.patch > > -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (HIVE-12631) LLAP: support ORC ACID tables
[ https://issues.apache.org/jira/browse/HIVE-12631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16088596#comment-16088596 ] Teddy Choi commented on HIVE-12631: --- It looks like there's no failing test that is caused by this issue anymore. > LLAP: support ORC ACID tables > - > > Key: HIVE-12631 > URL: https://issues.apache.org/jira/browse/HIVE-12631 > Project: Hive > Issue Type: Bug > Components: llap, Transactions >Reporter: Sergey Shelukhin >Assignee: Teddy Choi > Attachments: HIVE-12631.10.patch, HIVE-12631.10.patch, > HIVE-12631.11.patch, HIVE-12631.11.patch, HIVE-12631.12.patch, > HIVE-12631.13.patch, HIVE-12631.15.patch, HIVE-12631.16.patch, > HIVE-12631.17.patch, HIVE-12631.18.patch, HIVE-12631.19.patch, > HIVE-12631.1.patch, HIVE-12631.20.patch, HIVE-12631.21.patch, > HIVE-12631.22.patch, HIVE-12631.23.patch, HIVE-12631.2.patch, > HIVE-12631.3.patch, HIVE-12631.4.patch, HIVE-12631.5.patch, > HIVE-12631.6.patch, HIVE-12631.7.patch, HIVE-12631.8.patch, > HIVE-12631.8.patch, HIVE-12631.9.patch > > > LLAP uses a completely separate read path in ORC to allow for caching and > parallelization of reads and processing. This path does not support ACID. As > far as I remember ACID logic is embedded inside ORC format; we need to > refactor it to be on top of some interface, if practical; or just port it to > LLAP read path. > Another consideration is how the logic will work with cache. The cache is > currently low-level (CB-level in ORC), so we could just use it to read bases > and deltas (deltas should be cached with higher priority) and merge as usual. > We could also cache merged representation in future. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (HIVE-16990) REPL LOAD should update last repl ID only after successful copy of data files.
[ https://issues.apache.org/jira/browse/HIVE-16990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16088570#comment-16088570 ] Hive QA commented on HIVE-16990: Here are the results of testing the latest attachment: https://issues.apache.org/jira/secure/attachment/12877445/HIVE-16990.02.patch {color:green}SUCCESS:{color} +1 due to 1 test(s) being added or modified. {color:red}ERROR:{color} -1 due to 11 failed/errored test(s), 11067 tests executed *Failed tests:* {noformat} org.apache.hadoop.hive.cli.TestMiniLlapCliDriver.testCliDriver[llap_smb] (batchId=143) org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver[spark_dynamic_partition_pruning] (batchId=167) org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver[spark_dynamic_partition_pruning_2] (batchId=169) org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver[spark_explainuser_1] (batchId=168) org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver[spark_use_op_stats] (batchId=167) org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver[spark_use_ts_stats_for_mapjoin] (batchId=168) org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver[spark_vectorized_dynamic_partition_pruning] (batchId=167) org.apache.hive.hcatalog.api.TestHCatClient.testPartitionRegistrationWithCustomSchema (batchId=178) org.apache.hive.hcatalog.api.TestHCatClient.testPartitionSpecRegistrationWithCustomSchema (batchId=178) org.apache.hive.hcatalog.api.TestHCatClient.testTableSchemaPropagation (batchId=178) org.apache.hive.jdbc.TestJdbcWithMiniHS2.testHttpRetryOnServerIdleTimeout (batchId=227) {noformat} Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/6050/testReport Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/6050/console Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-6050/ Messages: {noformat} Executing org.apache.hive.ptest.execution.TestCheckPhase Executing 
org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 11 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12877445 - PreCommit-HIVE-Build

> REPL LOAD should update last repl ID only after successful copy of data files.
> ------------------------------------------------------------------------------
>
>                 Key: HIVE-16990
>                 URL: https://issues.apache.org/jira/browse/HIVE-16990
>             Project: Hive
>          Issue Type: Sub-task
>          Components: Hive, repl
>    Affects Versions: 2.1.0
>            Reporter: Sankar Hariappan
>            Assignee: Sankar Hariappan
>              Labels: DR, replication
>             Fix For: 3.0.0
>
>         Attachments: HIVE-16990.01.patch, HIVE-16990.02.patch
>
> REPL LOAD operations that include both metadata and data changes should follow the rule below:
> 1. Copy the metadata, excluding the last repl ID.
> 2. Copy the data files.
> 3. If steps 1 and 2 are successful, update the last repl ID of the object.
> This rule allows failed events to be re-applied by REPL LOAD and ensures no data loss due to failures.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
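The three-step rule in the HIVE-16990 description is essentially a commit-ordering invariant: the replication watermark may advance only after both metadata and data have landed. A hedged sketch of that ordering, with hypothetical names rather than Hive's replication code:

```python
class ReplObject:
    """Stand-in for a replicated table/partition (illustrative only)."""
    def __init__(self):
        self.last_repl_id = None
        self.metadata = None
        self.files = []

def repl_load(obj, event):
    # Step 1: copy the metadata, excluding the last repl ID.
    obj.metadata = event["metadata"]
    # Step 2: copy the data files; any failure raises before step 3 runs.
    for f in event["files"]:
        obj.files.append(f)
    # Step 3: only now advance the watermark, so a failed event remains
    # re-appliable without data loss.
    obj.last_repl_id = event["repl_id"]

obj = ReplObject()
repl_load(obj, {"metadata": "m1", "files": ["f1"], "repl_id": 10})
assert obj.last_repl_id == 10
```

If step 2 throws, last_repl_id is untouched and a retry of the same event is safe, which is exactly the property the rule is meant to guarantee.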
[jira] [Updated] (HIVE-17097) Fix SemiJoinHint parsing in SemanticAnalyzer
[ https://issues.apache.org/jira/browse/HIVE-17097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rajesh Balamohan updated HIVE-17097:
------------------------------------
       Resolution: Fixed
     Hadoop Flags: Reviewed
    Fix Version/s: 3.0.0
           Status: Resolved  (was: Patch Available)

Thanks [~gopalv], [~djaiswal]. Committed to master.

> Fix SemiJoinHint parsing in SemanticAnalyzer
> --------------------------------------------
>
>                 Key: HIVE-17097
>                 URL: https://issues.apache.org/jira/browse/HIVE-17097
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Rajesh Balamohan
>            Assignee: Rajesh Balamohan
>            Priority: Minor
>             Fix For: 3.0.0
>
>      Attachments: HIVE-17097.1.patch, HIVE-17097.2.patch
[jira] [Commented] (HIVE-12631) LLAP: support ORC ACID tables
[ https://issues.apache.org/jira/browse/HIVE-12631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16088551#comment-16088551 ]

Hive QA commented on HIVE-12631:
--------------------------------

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12877443/HIVE-12631.23.patch

{color:green}SUCCESS:{color} +1 due to 1 test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 15 failed/errored test(s), 11055 tests executed

*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestBeeLineDriver.testCliDriver[insert_overwrite_local_directory_1] (batchId=238)
org.apache.hadoop.hive.cli.TestMiniLlapCliDriver.testCliDriver[llap_smb] (batchId=143)
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver[spark_dynamic_partition_pruning] (batchId=167)
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver[spark_dynamic_partition_pruning_2] (batchId=169)
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver[spark_explainuser_1] (batchId=168)
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver[spark_use_op_stats] (batchId=167)
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver[spark_use_ts_stats_for_mapjoin] (batchId=168)
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver[spark_vectorized_dynamic_partition_pruning] (batchId=167)
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver[explainanalyze_2] (batchId=99)
org.apache.hadoop.hive.cli.TestPerfCliDriver.testCliDriver[query14] (batchId=233)
org.apache.hadoop.hive.cli.TestPerfCliDriver.testCliDriver[query23] (batchId=233)
org.apache.hadoop.hive.cli.TestSparkCliDriver.org.apache.hadoop.hive.cli.TestSparkCliDriver (batchId=108)
org.apache.hive.hcatalog.api.TestHCatClient.testPartitionRegistrationWithCustomSchema (batchId=178)
org.apache.hive.hcatalog.api.TestHCatClient.testPartitionSpecRegistrationWithCustomSchema (batchId=178)
org.apache.hive.hcatalog.api.TestHCatClient.testTableSchemaPropagation (batchId=178)
{noformat}

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/6049/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/6049/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-6049/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 15 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12877443 - PreCommit-HIVE-Build

> LLAP: support ORC ACID tables
> -----------------------------
>
>                 Key: HIVE-12631
>                 URL: https://issues.apache.org/jira/browse/HIVE-12631
>             Project: Hive
>          Issue Type: Bug
>          Components: llap, Transactions
>            Reporter: Sergey Shelukhin
>            Assignee: Teddy Choi
>         Attachments: HIVE-12631.10.patch, HIVE-12631.10.patch,
> HIVE-12631.11.patch, HIVE-12631.11.patch, HIVE-12631.12.patch,
> HIVE-12631.13.patch, HIVE-12631.15.patch, HIVE-12631.16.patch,
> HIVE-12631.17.patch, HIVE-12631.18.patch, HIVE-12631.19.patch,
> HIVE-12631.1.patch, HIVE-12631.20.patch, HIVE-12631.21.patch,
> HIVE-12631.22.patch, HIVE-12631.23.patch, HIVE-12631.2.patch,
> HIVE-12631.3.patch, HIVE-12631.4.patch, HIVE-12631.5.patch,
> HIVE-12631.6.patch, HIVE-12631.7.patch, HIVE-12631.8.patch,
> HIVE-12631.8.patch, HIVE-12631.9.patch
>
>
> LLAP uses a completely separate read path in ORC to allow for caching and
> parallelization of reads and processing. This path does not support ACID.
> As far as I remember, ACID logic is embedded inside the ORC format; we need
> to refactor it to sit on top of some interface, if practical, or just port
> it to the LLAP read path.
> Another consideration is how the logic will work with the cache. The cache
> is currently low-level (CB-level in ORC), so we could just use it to read
> bases and deltas (deltas should be cached with higher priority) and merge
> as usual. We could also cache the merged representation in the future.
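The base-plus-delta merge that the description above proposes can be sketched roughly as follows. The record layout and function names are simplified, hypothetical stand-ins: real ORC ACID records are keyed by (originalTransaction, bucket, rowId), not a bare row id, and real deltas are sorted files, not in-memory lists.

```python
# Illustrative sketch of merging an ORC ACID base with delta edits, in the
# spirit of the discussion above. All names and the record layout are
# simplified, hypothetical stand-ins for the real ORC ACID structures.

def merge_base_and_deltas(base, deltas):
    """base: {row_id: row}; deltas: list of (row_id, op, row) in commit order.

    Applies delete/insert/update events over the base snapshot and returns
    the merged view a reader would see.
    """
    merged = dict(base)
    for row_id, op, row in deltas:
        if op == "delete":
            merged.pop(row_id, None)
        else:
            # insert and update both materialize the newest row version
            merged[row_id] = row
    return merged

base = {1: "a", 2: "b"}
deltas = [(2, "update", "b2"), (3, "insert", "c"), (1, "delete", None)]
print(merge_base_and_deltas(base, deltas))  # {2: 'b2', 3: 'c'}
```

Under this framing, the caching question in the comment is whether to cache only the inputs to this merge (base and delta blocks, with deltas at higher priority) or also the merged output itself.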
[jira] [Updated] (HIVE-16990) REPL LOAD should update last repl ID only after successful copy of data files.
[ https://issues.apache.org/jira/browse/HIVE-16990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sankar Hariappan updated HIVE-16990:
------------------------------------
    Attachment: HIVE-16990.02.patch

Added 02.patch with a fix for the pre-commit test failures.

> REPL LOAD should update last repl ID only after successful copy of data files.
> ------------------------------------------------------------------------------
>
>                 Key: HIVE-16990
>                 URL: https://issues.apache.org/jira/browse/HIVE-16990
>             Project: Hive
>          Issue Type: Sub-task
>          Components: Hive, repl
>    Affects Versions: 2.1.0
>            Reporter: Sankar Hariappan
>            Assignee: Sankar Hariappan
>              Labels: DR, replication
>             Fix For: 3.0.0
>
>         Attachments: HIVE-16990.01.patch, HIVE-16990.02.patch
>
>
> REPL LOAD operations that include both metadata and data changes should
> follow the rule below:
> 1. Copy the metadata, excluding the last repl ID.
> 2. Copy the data files.
> 3. If steps 1 and 2 are successful, update the last repl ID of the object.
> This rule allows failed events to be re-applied by REPL LOAD and ensures
> no data loss due to failures.
[jira] [Updated] (HIVE-16990) REPL LOAD should update last repl ID only after successful copy of data files.
[ https://issues.apache.org/jira/browse/HIVE-16990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sankar Hariappan updated HIVE-16990:
------------------------------------
    Status: Patch Available  (was: Open)

> REPL LOAD should update last repl ID only after successful copy of data files.
> ------------------------------------------------------------------------------
>
>                 Key: HIVE-16990
>                 URL: https://issues.apache.org/jira/browse/HIVE-16990
>             Project: Hive
>          Issue Type: Sub-task
>          Components: Hive, repl
>    Affects Versions: 2.1.0
>            Reporter: Sankar Hariappan
>            Assignee: Sankar Hariappan
>              Labels: DR, replication
>             Fix For: 3.0.0
>
>         Attachments: HIVE-16990.01.patch, HIVE-16990.02.patch
>
>
> REPL LOAD operations that include both metadata and data changes should
> follow the rule below:
> 1. Copy the metadata, excluding the last repl ID.
> 2. Copy the data files.
> 3. If steps 1 and 2 are successful, update the last repl ID of the object.
> This rule allows failed events to be re-applied by REPL LOAD and ensures
> no data loss due to failures.
[jira] [Updated] (HIVE-16990) REPL LOAD should update last repl ID only after successful copy of data files.
[ https://issues.apache.org/jira/browse/HIVE-16990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sankar Hariappan updated HIVE-16990:
------------------------------------
    Status: Open  (was: Patch Available)

> REPL LOAD should update last repl ID only after successful copy of data files.
> ------------------------------------------------------------------------------
>
>                 Key: HIVE-16990
>                 URL: https://issues.apache.org/jira/browse/HIVE-16990
>             Project: Hive
>          Issue Type: Sub-task
>          Components: Hive, repl
>    Affects Versions: 2.1.0
>            Reporter: Sankar Hariappan
>            Assignee: Sankar Hariappan
>              Labels: DR, replication
>             Fix For: 3.0.0
>
>         Attachments: HIVE-16990.01.patch
>
>
> REPL LOAD operations that include both metadata and data changes should
> follow the rule below:
> 1. Copy the metadata, excluding the last repl ID.
> 2. Copy the data files.
> 3. If steps 1 and 2 are successful, update the last repl ID of the object.
> This rule allows failed events to be re-applied by REPL LOAD and ensures
> no data loss due to failures.
[jira] [Updated] (HIVE-12631) LLAP: support ORC ACID tables
[ https://issues.apache.org/jira/browse/HIVE-12631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Teddy Choi updated HIVE-12631:
------------------------------
    Attachment: HIVE-12631.23.patch

> LLAP: support ORC ACID tables
> -----------------------------
>
>                 Key: HIVE-12631
>                 URL: https://issues.apache.org/jira/browse/HIVE-12631
>             Project: Hive
>          Issue Type: Bug
>          Components: llap, Transactions
>            Reporter: Sergey Shelukhin
>            Assignee: Teddy Choi
>         Attachments: HIVE-12631.10.patch, HIVE-12631.10.patch,
> HIVE-12631.11.patch, HIVE-12631.11.patch, HIVE-12631.12.patch,
> HIVE-12631.13.patch, HIVE-12631.15.patch, HIVE-12631.16.patch,
> HIVE-12631.17.patch, HIVE-12631.18.patch, HIVE-12631.19.patch,
> HIVE-12631.1.patch, HIVE-12631.20.patch, HIVE-12631.21.patch,
> HIVE-12631.22.patch, HIVE-12631.23.patch, HIVE-12631.2.patch,
> HIVE-12631.3.patch, HIVE-12631.4.patch, HIVE-12631.5.patch,
> HIVE-12631.6.patch, HIVE-12631.7.patch, HIVE-12631.8.patch,
> HIVE-12631.8.patch, HIVE-12631.9.patch
>
>
> LLAP uses a completely separate read path in ORC to allow for caching and
> parallelization of reads and processing. This path does not support ACID.
> As far as I remember, ACID logic is embedded inside the ORC format; we need
> to refactor it to sit on top of some interface, if practical, or just port
> it to the LLAP read path.
> Another consideration is how the logic will work with the cache. The cache
> is currently low-level (CB-level in ORC), so we could just use it to read
> bases and deltas (deltas should be cached with higher priority) and merge
> as usual. We could also cache the merged representation in the future.