[jira] [Created] (HIVE-23489) dynamic partition insert fails when using insert overwrite

2020-05-17 Thread shining (Jira)
shining created HIVE-23489:
--

 Summary: dynamic partition insert fails when using insert overwrite
 Key: HIVE-23489
 URL: https://issues.apache.org/jira/browse/HIVE-23489
 Project: Hive
  Issue Type: Bug
  Components: Metastore
Affects Versions: 2.1.1
Reporter: shining


SQL: insert overwrite table dzdz_fpxx_dzfp partition(nian) select * from test.dzdz_fpxx_dzfp

{noformat}
create table dzdz_fpxx_dzfp  (
FPDM string,
FPHM string,
KPRQ timestamp,
FPLY string) 
partitioned by (nian string) 
stored as parquet;


create table test.dzdz_fpxx_dzfp (
FPDM string,
FPHM string,
KPRQ timestamp,
FPLY string,
nian string) 
stored as textfile
location "/origin/data/dzfp_origin/"
{noformat}
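For reference, a fully dynamic partition insert like this requires dynamic partitioning to be enabled in nonstrict mode, and Hive maps the dynamic partition column positionally from the last column of the select list. A minimal sketch of the intended repro (the SET values and the explicit column list are illustrative, not taken from this report):

{code:sql}
-- assumed session settings for a fully dynamic partition insert
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

-- the dynamic partition column (nian) must come last in the select list;
-- select * only works when the source column order matches the target exactly
insert overwrite table dzdz_fpxx_dzfp partition(nian)
select fpdm, fphm, kprq, fply, nian from test.dzdz_fpxx_dzfp;
{code}

Note that the schema in the log below lists two extra columns (tspz_dm, fpzt_bz) that the DDL above does not, and the partition resolves to (nian=null); both are consistent with a positional mismatch between select * and the target table.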

Then execute the insert SQL:
insert overwrite table dzdz_fpxx_dzfp partition(nian) select * from test.dzdz_fpxx_dzfp


{noformat}
INFO  : Compiling 
command(queryId=hive_20200519100412_12f2b39c-45f4-4f4c-9261-32ca86fa28db): 
insert overwrite table dzdz_fpxx_dzfp partition(nian) select * from 
test.dzdz_fpxx_dzfp
INFO  : Semantic Analysis Completed
INFO  : Returning Hive schema: 
Schema(fieldSchemas:[FieldSchema(name:dzdz_fpxx_dzfp.fpdm, type:string, 
comment:null), FieldSchema(name:dzdz_fpxx_dzfp.fphm, type:string, 
comment:null), FieldSchema(name:dzdz_fpxx_dzfp.kprq, type:timestamp, 
comment:null), FieldSchema(name:dzdz_fpxx_dzfp.tspz_dm, type:string, 
comment:null), FieldSchema(name:dzdz_fpxx_dzfp.fpzt_bz, type:string, 
comment:null), FieldSchema(name:dzdz_fpxx_dzfp.fply, type:string, 
comment:null), FieldSchema(name:dzdz_fpxx_dzfp.nian, type:string, 
comment:null)], properties:null)
INFO  : Completed compiling 
command(queryId=hive_20200519100412_12f2b39c-45f4-4f4c-9261-32ca86fa28db); Time 
taken: 1.719 seconds
INFO  : Executing 
command(queryId=hive_20200519100412_12f2b39c-45f4-4f4c-9261-32ca86fa28db): 
insert overwrite table dzdz_fpxx_dzfp partition(nian) select * from 
test.dzdz_fpxx_dzfp
WARN  :
INFO  : Query ID = hive_20200519100412_12f2b39c-45f4-4f4c-9261-32ca86fa28db
INFO  : Total jobs = 3
INFO  : Launching Job 1 out of 3
INFO  : Starting task [Stage-1:MAPRED] in serial mode
INFO  : Number of reduce tasks is set to 0 since there's no reduce operator
INFO  : number of splits:1
INFO  : Submitting tokens for job: job_1589451904439_0049
INFO  : Executing with tokens: []
INFO  : The url to track the job: 
http://qcj37.hde.com:8088/proxy/application_1589451904439_0049/
INFO  : Starting Job = job_1589451904439_0049, Tracking URL = 
http://qcj37.hde.com:8088/proxy/application_1589451904439_0049/
INFO  : Kill Command = /usr/hdp/current/hadoop-client/bin/hadoop job  -kill 
job_1589451904439_0049
INFO  : Hadoop job information for Stage-1: number of mappers: 1; number of 
reducers: 0
INFO  : 2020-05-19 10:06:56,823 Stage-1 map = 0%,  reduce = 0%
INFO  : 2020-05-19 10:07:06,317 Stage-1 map = 100%,  reduce = 0%, Cumulative 
CPU 4.3 sec
INFO  : MapReduce Total cumulative CPU time: 4 seconds 300 msec
INFO  : Ended Job = job_1589451904439_0049
INFO  : Starting task [Stage-7:CONDITIONAL] in serial mode
INFO  : Stage-4 is selected by condition resolver.
INFO  : Stage-3 is filtered out by condition resolver.
INFO  : Stage-5 is filtered out by condition resolver.
INFO  : Starting task [Stage-4:MOVE] in serial mode
INFO  : Moving data to directory 
hdfs://mycluster/warehouse/tablespace/managed/hive/dzdz_fpxx_dzfp/.hive-staging_hive_2020-05-19_10-04-12_468_7695595367555279265-3/-ext-1
 from 
hdfs://mycluster/warehouse/tablespace/managed/hive/dzdz_fpxx_dzfp/.hive-staging_hive_2020-05-19_10-04-12_468_7695595367555279265-3/-ext-10002
INFO  : Starting task [Stage-0:MOVE] in serial mode
INFO  : Loading data to table default.dzdz_fpxx_dzfp partition (nian=null) from 
hdfs://mycluster/warehouse/tablespace/managed/hive/dzdz_fpxx_dzfp/.hive-staging_hive_2020-05-19_10-04-12_468_7695595367555279265-3/-ext-1
INFO  :

ERROR : FAILED: Execution Error, return code 1 from 
org.apache.hadoop.hive.ql.exec.MoveTask. Exception when loading 2 in table 
dzdz_fpxx_dzfp with 
loadPath=hdfs://mycluster/warehouse/tablespace/managed/hive/dzdz_fpxx_dzfp/.hive-staging_hive_2020-05-19_10-04-12_468_7695595367555279265-3/-ext-1
INFO  : MapReduce Jobs Launched:
INFO  : Stage-Stage-1: Map: 1   Cumulative CPU: 4.3 sec   HDFS Read: 8015 HDFS 
Write: 2511 SUCCESS
INFO  : Total MapReduce CPU Time Spent: 4 seconds 300 msec
INFO  : Completed executing 
command(queryId=hive_20200519100412_12f2b39c-45f4-4f4c-9261-32ca86fa28db); Time 
taken: 297.35 seconds
2020-05-19 10:11:54,635 DEBUG transport.TSaslTransport 
(TSaslTransport.java:flush(496)) - writing data length: 117
2020-05-19 10:11:54,638 DEBUG transport.TSaslTransport 
(TSaslTransport.java:readFrame(457)) - 
{noformat}

Review Request 72522: HIVE-23488: Optimise PartitionManagementTask::Msck::repair

2020-05-17 Thread Rajesh Balamohan

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/72522/
---

Review request for hive, Ashutosh Chauhan and prasanthj.


Repository: hive-git


Description
---

Msck::repair ends up fetching table information twice. The patch fixes this.


Diffs
-

  ql/src/test/org/apache/hadoop/hive/ql/metadata/TestHiveMetaStoreChecker.java 
520eb1b5c0 
  
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/HiveMetaStoreChecker.java
 6f4400a8ef 
  
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/Msck.java
 f4e109d1b0 


Diff: https://reviews.apache.org/r/72522/diff/1/


Testing
---


Thanks,

Rajesh Balamohan



Review Request 72521: HIVE-23487: Optimise PartitionManagementTask

2020-05-17 Thread Rajesh Balamohan

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/72521/
---

Review request for hive, Ashutosh Chauhan and prasanthj.


Repository: hive-git


Description
---

Msck.init for every table takes more CPU time than the actual table repair. 
This was observed on a system with a large number of databases and tables.


Diffs
-

  ql/src/java/org/apache/hadoop/hive/ql/ddl/misc/msck/MsckOperation.java 
c05d699bd8 
  
ql/src/test/org/apache/hadoop/hive/ql/exec/TestMsckCreatePartitionsInBatches.java
 7821f40a82 
  
ql/src/test/org/apache/hadoop/hive/ql/exec/TestMsckDropPartitionsInBatches.java 
8be31128a1 
  
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/Msck.java
 f4e109d1b0 
  
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/PartitionManagementTask.java
 e4488f4709 


Diff: https://reviews.apache.org/r/72521/diff/1/


Testing
---


Thanks,

Rajesh Balamohan



[jira] [Created] (HIVE-23488) Optimise PartitionManagementTask::Msck::repair

2020-05-17 Thread Rajesh Balamohan (Jira)
Rajesh Balamohan created HIVE-23488:
---

 Summary: Optimise PartitionManagementTask::Msck::repair
 Key: HIVE-23488
 URL: https://issues.apache.org/jira/browse/HIVE-23488
 Project: Hive
  Issue Type: Improvement
Reporter: Rajesh Balamohan
 Attachments: Screenshot 2020-05-18 at 5.06.15 AM.png

Msck::repair ends up fetching table information twice.

!Screenshot 2020-05-18 at 5.06.15 AM.png|width=1084,height=754!

 

[https://github.com/apache/hive/blob/master/standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/Msck.java#L113]

[https://github.com/apache/hive/blob/master/standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/HiveMetaStoreChecker.java#L234]

 





[jira] [Created] (HIVE-23487) Optimise PartitionManagementTask

2020-05-17 Thread Rajesh Balamohan (Jira)
Rajesh Balamohan created HIVE-23487:
---

 Summary: Optimise PartitionManagementTask
 Key: HIVE-23487
 URL: https://issues.apache.org/jira/browse/HIVE-23487
 Project: Hive
  Issue Type: Improvement
  Components: Metastore
Reporter: Rajesh Balamohan
 Attachments: Screenshot 2020-05-18 at 4.19.48 AM.png

Msck.init for every table takes more time than the actual table repair. This 
was observed on a system with a large number of databases and tables.

 

  !Screenshot 2020-05-18 at 4.19.48 AM.png|width=1014,height=732!
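For context, PartitionManagementTask periodically runs the equivalent of an MSCK repair for each table that has partition discovery enabled; a rough illustration of the per-table work (the table name is hypothetical):

{code:sql}
-- opt a (hypothetical) table into background partition discovery
alter table db1.sales set tblproperties ('discover.partitions'='true');

-- roughly the per-table work the background task performs
msck repair table db1.sales sync partitions;
{code}

With many opted-in tables, the fixed per-table Msck.init cost dominates the useful repair work, which matches the profile above.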





[jira] [Created] (HIVE-23486) Compiler support for row-level filtering on transactional deletes

2020-05-17 Thread Panagiotis Garefalakis (Jira)
Panagiotis Garefalakis created HIVE-23486:
-

 Summary: Compiler support for row-level filtering on transactional 
deletes
 Key: HIVE-23486
 URL: https://issues.apache.org/jira/browse/HIVE-23486
 Project: Hive
  Issue Type: Improvement
Reporter: Panagiotis Garefalakis
Assignee: Panagiotis Garefalakis








Re: [DISCUSS] Replace ptest with hive-test-kube

2020-05-17 Thread Zoltan Haindrich

Hello all!

The proposed system has become more stable lately - and I think I've solved a 
few sources of flakiness.
To be really usable I also wanted to add a way to dynamically enable/disable sets of tests (for example, the replication tests take ~7 hours out of the total of 24 
hours, and they are also a bit unstable, so not running them when not necessary would be beneficial in multiple ways). The best way to do this would be to bring in 
JUnit 5; unfortunately the current ptest installation uses Maven 3.0.5, which doesn't handle that well - so instead of hacking a fix for it I've removed it 
from the dev branch for now.


I would like to propose starting an evaluation phase of the new test 
procedures (INFRA-20269).
The process would look something like this:
* someone opens a PR - the tests will be run on the changes
* on every active branch the tests will run from time to time
  * this will also produce a bunch of test runs on the master branch, which 
will show how well the tests behave on master without any patches
* runs on branches (PRs or active development branches, e.g. master) will be rate 
limited to 5 builds/day
* at most ~4 builds at a time - to maximize resource usage
* turnaround time for a build is right now 2 hours - which feels like a 
balanced trade-off between throughput and response time

Possible future benefits:
* toggle features using github tags
* optional testgroups (metastore/replication) tests
* ability to run the metastore verification tests
* possibility to add smoke tests

To enable this I will have to finish the HIVE-22942 ticket - beyond the new 
Jenkinsfile, which defines the full logic;
although I've sunk a lot of time into fixing all kinds of flaky tests, I would 
still like to disable around ~25 tests.

I would also like to propose a method to verify the stability of a single test: 
run it 100 times in series in the same place where the precommit tests are 
running.
This puts the bar high enough that only totally stable tests can satisfy 
it (a 99%-stable test has only a 0.99^100 ≈ 36% chance of passing without being caught :D).
Once this is in service it could be used to validate that an existing test is unstable (before disabling it), and then used again to prove that it got fixed before 
re-enabling it.


Please let me know what you think!

cheers,
Zoltan



On 4/29/20 4:28 PM, Zoltan Haindrich wrote:

Hey All!

I've been planning to replace the ptest stuff with something less complex for a 
while now - I see that we struggle a lot because ptest is more complicated 
than it should be...
It would be much better if it were constructed from well-made existing CI 
pieces - because of that I started working on [1] a few months ago.

It has its pros and cons... but it's not the same as the existing ptest stuff.
I've collected some info about how it compares against the existing one - but 
it became too long, so I've moved it into a Google Docs document at [3].

It's not yet ready... I still have some remaining problems/concerns/etc.:
* what do you think about changing to a GitHub PR based workflow?
* it will not support things like "isolation" at all - so we will have to make 
our tests work with each other without bending the rules...
* I've tried to overcommit the CPU resources, which creates a noisier environment for the actual tests - this squeezes out some new problems which should be fixed before 
this could be enabled.
* for every PR the first run is somewhat sub-optimal... there are some reasons for this - the actual resources used are the same, but the overall execution time is not 
optimal; I can accept this as a compromise because right now I wait >24 hours for a precommit run.


It's deployed at [2] and anyone can start a test run on it:
* merge my HIVE-22942-ptest-alt branch from [4] into your branch
* open a PR against my hive repo on github [5]

cheers,
Zoltan


[1] https://issues.apache.org/jira/browse/HIVE-22942
[2] http://34.66.156.144:8080/job/hive-precommit
[3] 
https://docs.google.com/document/d/1dhL5B-eBvYNKEsNV3kE6RrkV5w-LtDgw5CtHV5pdoX4/edit?usp=sharing
[4] https://github.com/kgyrtkirk/hive/tree/HIVE-22942-ptest-alt
[5] https://github.com/kgyrtkirk/hive/


[jira] [Created] (HIVE-23485) Bound GroupByOperator stats using largest NDV among columns

2020-05-17 Thread Stamatis Zampetakis (Jira)
Stamatis Zampetakis created HIVE-23485:
--

 Summary: Bound GroupByOperator stats using largest NDV among 
columns
 Key: HIVE-23485
 URL: https://issues.apache.org/jira/browse/HIVE-23485
 Project: Hive
  Issue Type: Improvement
Reporter: Stamatis Zampetakis
Assignee: Stamatis Zampetakis


Consider the following SQL query:

{code:sql}
select id, name from person group by id, name;
{code}

and assume that the person table contains the following tuples:

{code:sql}
insert into person values (0, 'A') ;
insert into person values (1, 'A') ;
insert into person values (2, 'B') ;
insert into person values (3, 'B') ;
insert into person values (4, 'B') ;
insert into person values (5, 'C') ;
{code}

If we know the number of distinct values (NDV) for all columns in the group by 
clause, then we can infer a lower bound on the number of output rows by taking 
the maximum NDV among the involved columns.
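Concretely, for the sample data above NDV(id) = 6 and NDV(name) = 3, so the 
aggregation must emit at least max(6, 3) = 6 rows. The NDVs can be checked directly:

{code:sql}
-- for the sample tuples: ndv_id = 6, ndv_name = 3, so the group by on
-- (id, name) must produce at least max(6, 3) = 6 rows
select count(distinct id) as ndv_id, count(distinct name) as ndv_name
from person;
{code}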

Currently the query in the scenario above has the following plan:

{noformat}
Vertex dependency in root stage
Reducer 2 <- Map 1 (SIMPLE_EDGE)

Stage-0
  Fetch Operator
limit:-1
Stage-1
  Reducer 2 vectorized
  File Output Operator [FS_11]
Group By Operator [GBY_10] (rows=3 width=92)
  Output:["_col0","_col1"],keys:KEY._col0, KEY._col1
<-Map 1 [SIMPLE_EDGE] vectorized
  SHUFFLE [RS_9]
PartitionCols:_col0, _col1
Group By Operator [GBY_8] (rows=3 width=92)
  Output:["_col0","_col1"],keys:id, name
  Select Operator [SEL_7] (rows=6 width=92)
Output:["id","name"]
TableScan [TS_0] (rows=6 width=92)
  
default@person,person,Tbl:COMPLETE,Col:COMPLETE,Output:["id","name"]{noformat}

Observe that the stats for the group by report 3 rows, but given that the id 
attribute is part of the aggregation, the result cannot have fewer than 6 rows.



