[jira] [Created] (HIVE-23489) dynamic partition failed use insert overwrite
shining created HIVE-23489:
-------------------------------

             Summary: dynamic partition failed use insert overwrite
                 Key: HIVE-23489
                 URL: https://issues.apache.org/jira/browse/HIVE-23489
             Project: Hive
          Issue Type: Bug
          Components: Metastore
    Affects Versions: 2.1.1
            Reporter: shining

SQL: insert overwrite table dzdz_fpxx_dzfp partition(nian) select * from test.dzdz_fpxx_dzfp

Table definitions:
{noformat}
create table dzdz_fpxx_dzfp (
  FPDM string,
  FPHM string,
  KPRQ timestamp,
  FPLY string)
partitioned by (nian string)
stored as parquet;

create table test.dzdz_fpxx_dzfp (
  FPDM string,
  FPHM string,
  KPRQ timestamp,
  FPLY string,
  nian string)
stored as textfile
location "/origin/data/dzfp_origin/";
{noformat}

Executing the insert SQL fails:
{noformat}
INFO : Compiling command(queryId=hive_20200519100412_12f2b39c-45f4-4f4c-9261-32ca86fa28db): insert overwrite table dzdz_fpxx_dzfp partition(nian) select * from test.dzdz_fpxx_dzfp
INFO : Semantic Analysis Completed
INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:dzdz_fpxx_dzfp.fpdm, type:string, comment:null), FieldSchema(name:dzdz_fpxx_dzfp.fphm, type:string, comment:null), FieldSchema(name:dzdz_fpxx_dzfp.kprq, type:timestamp, comment:null), FieldSchema(name:dzdz_fpxx_dzfp.tspz_dm, type:string, comment:null), FieldSchema(name:dzdz_fpxx_dzfp.fpzt_bz, type:string, comment:null), FieldSchema(name:dzdz_fpxx_dzfp.fply, type:string, comment:null), FieldSchema(name:dzdz_fpxx_dzfp.nian, type:string, comment:null)], properties:null)
INFO : Completed compiling command(queryId=hive_20200519100412_12f2b39c-45f4-4f4c-9261-32ca86fa28db); Time taken: 1.719 seconds
INFO : Executing command(queryId=hive_20200519100412_12f2b39c-45f4-4f4c-9261-32ca86fa28db): insert overwrite table dzdz_fpxx_dzfp partition(nian) select * from test.dzdz_fpxx_dzfp
WARN :
INFO : Query ID = hive_20200519100412_12f2b39c-45f4-4f4c-9261-32ca86fa28db
INFO : Total jobs = 3
INFO : Launching Job 1 out of 3
INFO : Starting task [Stage-1:MAPRED] in serial mode
INFO : Number of reduce tasks is set to 0 since there's no reduce operator
INFO : number of splits:1
INFO : Submitting tokens for job: job_1589451904439_0049
INFO : Executing with tokens: []
INFO : The url to track the job: http://qcj37.hde.com:8088/proxy/application_1589451904439_0049/
INFO : Starting Job = job_1589451904439_0049, Tracking URL = http://qcj37.hde.com:8088/proxy/application_1589451904439_0049/
INFO : Kill Command = /usr/hdp/current/hadoop-client/bin/hadoop job -kill job_1589451904439_0049
INFO : Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
INFO : 2020-05-19 10:06:56,823 Stage-1 map = 0%, reduce = 0%
INFO : 2020-05-19 10:07:06,317 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 4.3 sec
INFO : MapReduce Total cumulative CPU time: 4 seconds 300 msec
INFO : Ended Job = job_1589451904439_0049
INFO : Starting task [Stage-7:CONDITIONAL] in serial mode
INFO : Stage-4 is selected by condition resolver.
INFO : Stage-3 is filtered out by condition resolver.
INFO : Stage-5 is filtered out by condition resolver.
INFO : Starting task [Stage-4:MOVE] in serial mode
INFO : Moving data to directory hdfs://mycluster/warehouse/tablespace/managed/hive/dzdz_fpxx_dzfp/.hive-staging_hive_2020-05-19_10-04-12_468_7695595367555279265-3/-ext-1 from hdfs://mycluster/warehouse/tablespace/managed/hive/dzdz_fpxx_dzfp/.hive-staging_hive_2020-05-19_10-04-12_468_7695595367555279265-3/-ext-10002
INFO : Starting task [Stage-0:MOVE] in serial mode
INFO : Loading data to table default.dzdz_fpxx_dzfp partition (nian=null) from hdfs://mycluster/warehouse/tablespace/managed/hive/dzdz_fpxx_dzfp/.hive-staging_hive_2020-05-19_10-04-12_468_7695595367555279265-3/-ext-1
INFO :
ERROR : FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.MoveTask. Exception when loading 2 in table dzdz_fpxx_dzfp with loadPath=hdfs://mycluster/warehouse/tablespace/managed/hive/dzdz_fpxx_dzfp/.hive-staging_hive_2020-05-19_10-04-12_468_7695595367555279265-3/-ext-1
INFO : MapReduce Jobs Launched:
INFO : Stage-Stage-1: Map: 1 Cumulative CPU: 4.3 sec HDFS Read: 8015 HDFS Write: 2511 SUCCESS
INFO : Total MapReduce CPU Time Spent: 4 seconds 300 msec
INFO : Completed executing command(queryId=hive_20200519100412_12f2b39c-45f4-4f4c-9261-32ca86fa28db); Time taken: 297.35 seconds
2020-05-19 10:11:54,635 DEBUG transport.TSaslTransport (TSaslTransport.java:flush(496)) - writing data length: 117
2020-05-19 10:11:54,638 DEBUG transport.TSaslTransport (TSaslTransport.java:readFrame(457)) -
{noformat}
Review Request 72522: HIVE-23488: Optimise PartitionManagementTask::Msck::repair
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/72522/
-----------------------------------------------------------

Review request for hive, Ashutosh Chauhan and prasanthj.

Repository: hive-git

Description
-------

Ends up fetching table information twice. Patch fixes this.

Diffs
-----

  ql/src/test/org/apache/hadoop/hive/ql/metadata/TestHiveMetaStoreChecker.java 520eb1b5c0
  standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/HiveMetaStoreChecker.java 6f4400a8ef
  standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/Msck.java f4e109d1b0

Diff: https://reviews.apache.org/r/72522/diff/1/

Testing
-------

Thanks,

Rajesh Balamohan
Review Request 72521: HIVE-23487: Optimise PartitionManagementTask
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/72521/
-----------------------------------------------------------

Review request for hive, Ashutosh Chauhan and prasanthj.

Repository: hive-git

Description
-------

Msck.init for every table takes more CPU time than the actual table repair. This was observed on a system which had lots of DBs and tables.

Diffs
-----

  ql/src/java/org/apache/hadoop/hive/ql/ddl/misc/msck/MsckOperation.java c05d699bd8
  ql/src/test/org/apache/hadoop/hive/ql/exec/TestMsckCreatePartitionsInBatches.java 7821f40a82
  ql/src/test/org/apache/hadoop/hive/ql/exec/TestMsckDropPartitionsInBatches.java 8be31128a1
  standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/Msck.java f4e109d1b0
  standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/PartitionManagementTask.java e4488f4709

Diff: https://reviews.apache.org/r/72521/diff/1/

Testing
-------

Thanks,

Rajesh Balamohan
[jira] [Created] (HIVE-23488) Optimise PartitionManagementTask::Msck::repair
Rajesh Balamohan created HIVE-23488:
---------------------------------------

             Summary: Optimise PartitionManagementTask::Msck::repair
                 Key: HIVE-23488
                 URL: https://issues.apache.org/jira/browse/HIVE-23488
             Project: Hive
          Issue Type: Improvement
            Reporter: Rajesh Balamohan
         Attachments: Screenshot 2020-05-18 at 5.06.15 AM.png

Ends up fetching table information twice.

!Screenshot 2020-05-18 at 5.06.15 AM.png|width=1084,height=754!

[https://github.com/apache/hive/blob/master/standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/Msck.java#L113]
[https://github.com/apache/hive/blob/master/standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/HiveMetaStoreChecker.java#L234]

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
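The duplicated fetch flagged in the two links above (Msck.java and HiveMetaStoreChecker.java) can be sketched as follows. This is an illustrative Python sketch with hypothetical names, not Hive's actual Java code: the repair path fetches the table, then a checker it delegates to fetches the same table again; passing the already-fetched object down removes the second metastore round trip.

```python
class FakeMetastore:
    """Stand-in for a metastore client; counts getTable calls."""
    def __init__(self):
        self.get_table_calls = 0

    def get_table(self, db, name):
        self.get_table_calls += 1
        return {"db": db, "name": name}

def check_metastore(ms, db, name, table=None):
    # In the "before" shape the checker re-fetches the table itself.
    if table is None:
        table = ms.get_table(db, name)  # redundant second fetch
    # ... partition comparison would happen here ...
    return table

def repair_before(ms, db, name):
    ms.get_table(db, name)           # fetch #1 in repair
    check_metastore(ms, db, name)    # fetch #2 inside the checker

def repair_after(ms, db, name):
    table = ms.get_table(db, name)             # single fetch
    check_metastore(ms, db, name, table=table) # reuse it

before = FakeMetastore()
repair_before(before, "default", "t")
after = FakeMetastore()
repair_after(after, "default", "t")
print(before.get_table_calls, after.get_table_calls)  # 2 1
```

The design point is just the classic one of threading an already-materialized object through a call chain instead of re-resolving it at each layer.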
[jira] [Created] (HIVE-23487) Optimise PartitionManagementTask
Rajesh Balamohan created HIVE-23487:
---------------------------------------

             Summary: Optimise PartitionManagementTask
                 Key: HIVE-23487
                 URL: https://issues.apache.org/jira/browse/HIVE-23487
             Project: Hive
          Issue Type: Improvement
          Components: Metastore
            Reporter: Rajesh Balamohan
         Attachments: Screenshot 2020-05-18 at 4.19.48 AM.png

Msck.init for every table takes more time than the actual table repair. This was observed on a system which had lots of DBs and tables.

!Screenshot 2020-05-18 at 4.19.48 AM.png|width=1014,height=732!
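The shape of this optimisation can be sketched in Python (hypothetical names; the real code is Java in PartitionManagementTask/Msck): if the expensive init depends only on configuration, it can be hoisted out of the per-table loop and the instance reused.

```python
class Msck:
    """Toy stand-in for Hive's Msck; init() is the expensive part."""
    init_calls = 0  # class-level counter, for illustration only

    def init(self, conf):
        Msck.init_calls += 1  # pretend this burns CPU
        self.conf = conf

    def repair(self, table):
        return f"repaired {table}"

tables = ["db1.t1", "db1.t2", "db2.t3"]
conf = {"some.setting": True}

# Before: a fresh Msck is initialised for every table.
Msck.init_calls = 0
for t in tables:
    m = Msck()
    m.init(conf)
    m.repair(t)
per_table_inits = Msck.init_calls

# After: initialise once, reuse across the loop.
Msck.init_calls = 0
m = Msck()
m.init(conf)
for t in tables:
    m.repair(t)
single_init = Msck.init_calls

print(per_table_inits, single_init)  # 3 1
```

With N tables the before/after init counts are N versus 1, which matches the report that init dominated on systems with many DBs and tables.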
[jira] [Created] (HIVE-23486) Compiler support for row-level filtering on transactional deletes
Panagiotis Garefalakis created HIVE-23486:
---------------------------------------------

             Summary: Compiler support for row-level filtering on transactional deletes
                 Key: HIVE-23486
                 URL: https://issues.apache.org/jira/browse/HIVE-23486
             Project: Hive
          Issue Type: Improvement
            Reporter: Panagiotis Garefalakis
            Assignee: Panagiotis Garefalakis
Re: [DISCUSS] Replace ptest with hive-test-kube
Hello all!

The proposed system has become more stable lately - and I think I've solved a few sources of flakiness.

To be really usable I also wanted to add a way to dynamically enable/disable a set of tests (for example the replication tests take ~7 hours to execute out of the total of 24 hours - and they are also a bit unstable, so not running them when not necessary would be beneficial in multiple ways) - but the best way to do this would be to bring in junit5; unfortunately the current ptest installation uses maven 3.0.5, which doesn't handle these kinds of things - so instead of hacking a fix for that I've removed it from the dev branch for now.

I would like to propose starting an evaluation phase of the new test procedures (INFRA-20269). The process would look something like this:
* someone opens a PR - the tests will be run on the changes
* on every active branch the tests will run from time to time
* this will produce a bunch of test runs on the master branch as well; which will show how well the tests behave on the master branch without any patches
* runs on branches (PRs or active development branches (eg: master)) will be rate limited to 5 builds/day
* at most ~4 builds at a time - to maximize resource usage
* turnaround time for a build is right now 2 hours - which feels like a balanced choice between speed/response time

Possible future benefits:
* toggle features using github tags
* optional testgroups (metastore/replication tests)
* ability to run the metastore verification tests
* possibility to add smoke tests

To enable this I will have to finish the HIVE-22942 ticket - beyond the new Jenkinsfile which defines the full logic; although I've sunk a lot of time into fixing all kinds of flaky tests, I would like to disable around ~25 tests.

I also would like to propose a method to verify the stability of a single test: run it 100 times in series at the same place where the precommit tests are running. This will put the bar high enough that only totally stable tests can satisfy it (a 99% stable test has a 36% chance to pass this without being caught :D)

Once this is in service it could be used to validate that an existing test is unstable (before disabling it) - and then used again to prove that it got fixed before re-enabling it.

Please let me know what you think!

cheers,
Zoltan

On 4/29/20 4:28 PM, Zoltan Haindrich wrote:

Hey All!

I was planning to replace the ptest stuff with something less complex for a while now - I see that we struggle a lot because ptest is more complicated than it should be... It would be much better if it were constructed from well-made existing CI pieces - because of that I've started working on [1] a few months ago. It has its pros and cons... but it's not the same as the existing ptest stuff. I've collected some info about how it compares against the existing one - but it became too long so I've moved it into a google docs document at [3].

It's not yet ready... I still have some remaining problems/concerns/etc:
* what do you think about changing to a github PR based workflow?
* it will not support things like "isolation" at all - so we will have to make our tests work with each other without bending the rules...
* I've tried to overcommit the cpu resources, which creates a noisier environment for the actual tests - this squeezes out some new problems which should be fixed before this could be enabled.
* for every PR the first run is somewhat sub-optimal... there are some reasons for this - the actually used resources are the same, but the overall execution time is not optimal; I can accept this as a compromise because right now I wait >24 hours for a precommit run.
It's deployed at [2] and anyone can start a test run on it:
* merge my HIVE-22942-ptest-alt branch from [4] into your branch
* open a PR against my hive repo on github [5]

cheers,
Zoltan

[1] https://issues.apache.org/jira/browse/HIVE-22942
[2] http://34.66.156.144:8080/job/hive-precommit
[3] https://docs.google.com/document/d/1dhL5B-eBvYNKEsNV3kE6RrkV5w-LtDgw5CtHV5pdoX4/edit?usp=sharing
[4] https://github.com/kgyrtkirk/hive/tree/HIVE-22942-ptest-alt
[5] https://github.com/kgyrtkirk/hive/
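The "36% chance to pass without being caught" figure in the thread above is simply the probability that a 99%-stable test passes 100 independent runs in a row, i.e. 0.99^100:

```python
# Probability that a test which passes 99% of the time survives
# 100 independent runs without a single failure.
p_single = 0.99
p_all_100 = p_single ** 100
print(round(p_all_100, 3))  # 0.366 -- ~36% of such flaky tests slip through
```

So the 100-run bar catches a 99%-stable test about 64% of the time per attempt; repeating the check shrinks the escape probability geometrically.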
[jira] [Created] (HIVE-23485) Bound GroupByOperator stats using largest NDV among columns
Stamatis Zampetakis created HIVE-23485:
------------------------------------------

             Summary: Bound GroupByOperator stats using largest NDV among columns
                 Key: HIVE-23485
                 URL: https://issues.apache.org/jira/browse/HIVE-23485
             Project: Hive
          Issue Type: Improvement
            Reporter: Stamatis Zampetakis
            Assignee: Stamatis Zampetakis

Consider the following SQL query:
{code:sql}
select id, name from person group by id, name;
{code}
and assume that the person table contains the following tuples:
{code:sql}
insert into person values (0, 'A');
insert into person values (1, 'A');
insert into person values (2, 'B');
insert into person values (3, 'B');
insert into person values (4, 'B');
insert into person values (5, 'C');
{code}

If we know the number of distinct values (NDV) for all columns in the group by clause then we can infer a lower bound for the total number of rows by taking the maximum NDV of the involved columns.

Currently the query in the scenario above has the following plan:
{noformat}
Vertex dependency in root stage
Reducer 2 <- Map 1 (SIMPLE_EDGE)

Stage-0
  Fetch Operator
    limit:-1
    Stage-1
      Reducer 2 vectorized
      File Output Operator [FS_11]
        Group By Operator [GBY_10] (rows=3 width=92)
          Output:["_col0","_col1"],keys:KEY._col0, KEY._col1
        <-Map 1 [SIMPLE_EDGE] vectorized
          SHUFFLE [RS_9]
            PartitionCols:_col0, _col1
            Group By Operator [GBY_8] (rows=3 width=92)
              Output:["_col0","_col1"],keys:id, name
              Select Operator [SEL_7] (rows=6 width=92)
                Output:["id","name"]
                TableScan [TS_0] (rows=6 width=92)
                  default@person,person,Tbl:COMPLETE,Col:COMPLETE,Output:["id","name"]
{noformat}

Observe that the stats for the group by report 3 rows, but given that the id attribute is part of the aggregation, the result cannot have fewer than 6 rows.
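The proposed bound can be illustrated with a small Python sketch (hypothetical helper names, not Hive's stats code): a GROUP BY over several columns can never produce fewer rows than the largest per-column NDV among the grouping keys, since each distinct value of any single key must appear in at least one output group.

```python
# The person rows from the example above, as (id, name) tuples.
rows = [(0, 'A'), (1, 'A'), (2, 'B'), (3, 'B'), (4, 'B'), (5, 'C')]

def ndv(rows, col):
    """Number of distinct values in one column (by tuple index)."""
    return len({r[col] for r in rows})

def group_by_row_lower_bound(rows, cols):
    """Lower bound on GROUP BY output rows: the largest NDV among the keys."""
    return max(ndv(rows, c) for c in cols)

actual_groups = len({(r[0], r[1]) for r in rows})  # true GROUP BY id, name count
bound = group_by_row_lower_bound(rows, [0, 1])

# NDV(id)=6 dominates NDV(name)=3, so the bound is 6, matching the
# actual 6 groups -- versus the rows=3 estimate in the plan above.
print(bound, actual_groups)  # 6 6
```

Here the bound happens to be tight because id is a key; in general it is only a floor, but it would already lift the estimate in this plan from 3 to 6.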