[jira] [Updated] (HIVE-22148) S3A delegation tokens are not added in the job config of the Compactor.

2019-08-29 Thread anishek (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-22148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

anishek updated HIVE-22148:
---
Resolution: Fixed
Status: Resolved  (was: Patch Available)

Thanks for the patch, [~harishjp]. Committed to master!

> S3A delegation tokens are not added in the job config of the Compactor.
> ---
>
> Key: HIVE-22148
> URL: https://issues.apache.org/jira/browse/HIVE-22148
> Project: Hive
>  Issue Type: Bug
>  Components: HiveServer2
>Reporter: Harish Jaiprakash
>Assignee: Harish Jaiprakash
>Priority: Major
> Fix For: 4.0.0
>
> Attachments: HIVE-22148.01.patch
>
>
> The Compactor job does not have the S3 delegation tokens required to contact 
> S3, which causes the job to fail.
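> 
> A minimal sketch (an illustration, not the committed patch) of attaching 
> filesystem delegation tokens, including S3A's when that filesystem issues 
> them, to the job's credentials before submission; the helper name 
> addFsTokens and the renewer argument are assumptions:
> {code}
> import java.io.IOException;
> 
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
> import org.apache.hadoop.mapred.JobConf;
> import org.apache.hadoop.security.Credentials;
> 
> public class CompactorTokenHelper {
>   // Hypothetical helper: collect the filesystem's delegation tokens into the
>   // job's credentials so the submitted compaction job can authenticate.
>   static void addFsTokens(JobConf job, Path tableDir, String renewer) throws IOException {
>     Credentials creds = job.getCredentials();
>     FileSystem fs = tableDir.getFileSystem(job);
>     fs.addDelegationTokens(renewer, creds); // no-op for filesystems that issue no tokens
>   }
> }
> {code}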



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (HIVE-22148) S3A delegation tokens are not added in the job config of the Compactor.

2019-08-27 Thread anishek (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-22148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16917439#comment-16917439
 ] 

anishek commented on HIVE-22148:


+1 pending tests

> S3A delegation tokens are not added in the job config of the Compactor.
> ---
>
> Key: HIVE-22148
> URL: https://issues.apache.org/jira/browse/HIVE-22148
> Project: Hive
>  Issue Type: Bug
>  Components: HiveServer2
>Reporter: Harish Jaiprakash
>Assignee: Harish Jaiprakash
>Priority: Major
> Fix For: 4.0.0
>
> Attachments: HIVE-22148.01.patch
>
>
> The Compactor job does not have the S3 delegation tokens required to contact 
> S3, which causes the job to fail.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (HIVE-22148) S3A delegation tokens are not added in the job config of the Compactor.

2019-08-27 Thread anishek (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-22148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

anishek updated HIVE-22148:
---
Fix Version/s: 4.0.0

> S3A delegation tokens are not added in the job config of the Compactor.
> ---
>
> Key: HIVE-22148
> URL: https://issues.apache.org/jira/browse/HIVE-22148
> Project: Hive
>  Issue Type: Bug
>  Components: HiveServer2
>Reporter: Harish Jaiprakash
>Assignee: Harish Jaiprakash
>Priority: Major
> Fix For: 4.0.0
>
> Attachments: HIVE-22148.01.patch
>
>
> The Compactor job does not have the S3 delegation tokens required to contact 
> S3, which causes the job to fail.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (HIVE-21992) REPL DUMP throws NPE when dumping Create Function event.

2019-07-15 Thread anishek (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-21992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16885021#comment-16885021
 ] 

anishek commented on HIVE-21992:


[~sankarh] do you have a sample CREATE FUNCTION statement that leads to the above?

> REPL DUMP throws NPE when dumping Create Function event.
> 
>
> Key: HIVE-21992
> URL: https://issues.apache.org/jira/browse/HIVE-21992
> Project: Hive
>  Issue Type: Bug
>  Components: repl
>Affects Versions: 4.0.0, 3.2.0
>Reporter: Sankar Hariappan
>Assignee: Sankar Hariappan
>Priority: Major
>  Labels: DR, Replication
>
> REPL DUMP throws an NPE while dumping a Create Function event. It seems a 
> null check is missing for function.getResourceUris().
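> 
> A hedged sketch of the kind of null guard implied above (illustrative; the 
> actual fix may differ), treating a missing resource-uri list as empty before 
> the serializer iterates it; the stack trace below shows the failing spot:
> {code}
> import java.util.Collections;
> import java.util.List;
> 
> import org.apache.hadoop.hive.metastore.api.Function;
> import org.apache.hadoop.hive.metastore.api.ResourceUri;
> 
> public class FunctionDumpGuard {
>   // Hedged sketch: function.getResourceUris() can legitimately be null, so
>   // normalize it to an empty list instead of dereferencing it directly.
>   static List<ResourceUri> safeResourceUris(Function function) {
>     List<ResourceUri> uris = function.getResourceUris();
>     return uris == null ? Collections.emptyList() : uris;
>   }
> }
> {code}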
> {code}
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hive.ql.parse.repl.dump.io.FunctionSerializer.writeTo(FunctionSerializer.java:54)
> at 
> org.apache.hadoop.hive.ql.parse.repl.dump.events.CreateFunctionHandler.handle(CreateFunctionHandler.java:48)
> at 
> org.apache.hadoop.hive.ql.exec.repl.ReplDumpTask.dumpEvent(ReplDumpTask.java:304)
> at 
> org.apache.hadoop.hive.ql.exec.repl.ReplDumpTask.incrementalDump(ReplDumpTask.java:231)
> at 
> org.apache.hadoop.hive.ql.exec.repl.ReplDumpTask.execute(ReplDumpTask.java:121)
> at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:212)
> at 
> org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:103)
> at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:2727)
> at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:2394)
> at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:2066)
> at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1764)
> at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1758)
> at 
> org.apache.hadoop.hive.ql.reexec.ReExecDriver.run(ReExecDriver.java:157)
> at 
> org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:226)
> at 
> org.apache.hive.service.cli.operation.SQLOperation.access$700(SQLOperation.java:87)
> at 
> org.apache.hive.service.cli.operation.SQLOperation$BackgroundWork$1.run(SQLOperation.java:324)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
> at 
> org.apache.hive.service.cli.operation.SQLOperation$BackgroundWork.run(SQLOperation.java:342)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> FAILED: Execution Error, return code 4 from 
> org.apache.hadoop.hive.ql.exec.repl.ReplDumpTask. 
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hive.ql.parse.repl.dump.io.FunctionSerializer.writeTo(FunctionSerializer.java:54)
> at 
> org.apache.hadoop.hive.ql.parse.repl.dump.events.CreateFunctionHandler.handle(CreateFunctionHandler.java:48)
> at 
> org.apache.hadoop.hive.ql.exec.repl.ReplDumpTask.dumpEvent(ReplDumpTask.java:304)
> at 
> org.apache.hadoop.hive.ql.exec.repl.ReplDumpTask.incrementalDump(ReplDumpTask.java:231)
> at 
> org.apache.hadoop.hive.ql.exec.repl.ReplDumpTask.execute(ReplDumpTask.java:121)
> at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:212)
> at 
> org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:103)
> at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:2727)
> at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:2394)
> at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:2066)
> at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1764)
> at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1758)
> at 
> org.apache.hadoop.hive.ql.reexec.ReExecDriver.run(ReExecDriver.java:157)
> at 
> org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:226)
> at 
> org.apache.hive.service.cli.operation.SQLOperation.access$700(SQLOperation.java:87)
> at 
> org.apache.hive.service.cli.operation.SQLOperation$BackgroundWork$1.run(SQLOperation.java:324)
> at 

[jira] [Commented] (HIVE-21932) IndexOutOfRangeExeption in FileChksumIterator

2019-06-28 Thread anishek (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-21932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16874732#comment-16874732
 ] 

anishek commented on HIVE-21932:


+1

> IndexOutOfRangeExeption in FileChksumIterator
> -
>
> Key: HIVE-21932
> URL: https://issues.apache.org/jira/browse/HIVE-21932
> Project: Hive
>  Issue Type: Bug
>Reporter: Vihang Karajgaonkar
>Assignee: Vihang Karajgaonkar
>Priority: Major
> Attachments: HIVE-21932.01.patch
>
>
> According to the definition of {{InsertEventRequestData}} in 
> {{hive_metastore.thrift}}, {{filesAddedChecksum}} is an optional field. But 
> the FileChksumIterator does not handle it correctly when a client fires an 
> insert event which does not have file checksums. The issue is that the 
> {{InsertEvent}} class initializes the fileChecksums list to an empty 
> ArrayList, so the following check will never come into play:
> {noformat}
> result = ReplChangeManager.encodeFileUri(files.get(i), chksums != null ? 
> chksums.get(i) : null,
> subDirs != null ? subDirs.get(i) : null);
> {noformat}
> The chksums check above should also include a {{!chksums.isEmpty()}} check.
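> 
> In other words, a sketch of the suggested condition (directly following the 
> description above, not a tested patch):
> {noformat}
> result = ReplChangeManager.encodeFileUri(files.get(i),
>     (chksums != null && !chksums.isEmpty()) ? chksums.get(i) : null,
>     subDirs != null ? subDirs.get(i) : null);
> {noformat}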



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-21651) Move protobuf serde into hive-exec.

2019-04-28 Thread anishek (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-21651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16828901#comment-16828901
 ] 

anishek commented on HIVE-21651:


+1. Patch pushed to master. Thanks, [~harishjp]!

> Move protobuf serde into hive-exec.
> ---
>
> Key: HIVE-21651
> URL: https://issues.apache.org/jira/browse/HIVE-21651
> Project: Hive
>  Issue Type: Bug
>  Components: HiveServer2
>Reporter: Harish Jaiprakash
>Assignee: Harish Jaiprakash
>Priority: Major
> Attachments: HIVE-21651.01.patch, HIVE-21651.02.patch
>
>
> The serde and input format are not accessible without doing an add jar or 
> modifying the hive aux libs. Moving them to hive-exec will let us use the serde.
>  
> We can't move the serde to hive/serde since it depends on 
> ProtobufMessageWriter, which is in hive-exec.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-21651) Move protobuf serde into hive-exec.

2019-04-28 Thread anishek (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

anishek updated HIVE-21651:
---
Resolution: Fixed
Status: Resolved  (was: Patch Available)

> Move protobuf serde into hive-exec.
> ---
>
> Key: HIVE-21651
> URL: https://issues.apache.org/jira/browse/HIVE-21651
> Project: Hive
>  Issue Type: Bug
>  Components: HiveServer2
>Reporter: Harish Jaiprakash
>Assignee: Harish Jaiprakash
>Priority: Major
> Attachments: HIVE-21651.01.patch, HIVE-21651.02.patch
>
>
> The serde and input format are not accessible without doing an add jar or 
> modifying the hive aux libs. Moving them to hive-exec will let us use the serde.
>  
> We can't move the serde to hive/serde since it depends on 
> ProtobufMessageWriter, which is in hive-exec.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-21651) Move protobuf serde into hive-exec.

2019-04-28 Thread anishek (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

anishek updated HIVE-21651:
---
Fix Version/s: 4.0.0

> Move protobuf serde into hive-exec.
> ---
>
> Key: HIVE-21651
> URL: https://issues.apache.org/jira/browse/HIVE-21651
> Project: Hive
>  Issue Type: Bug
>  Components: HiveServer2
>Reporter: Harish Jaiprakash
>Assignee: Harish Jaiprakash
>Priority: Major
> Fix For: 4.0.0
>
> Attachments: HIVE-21651.01.patch, HIVE-21651.02.patch
>
>
> The serde and input format are not accessible without doing an add jar or 
> modifying the hive aux libs. Moving them to hive-exec will let us use the serde.
>  
> We can't move the serde to hive/serde since it depends on 
> ProtobufMessageWriter, which is in hive-exec.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-21500) Disable conversion of managed table to external and vice versa at source.

2019-04-16 Thread anishek (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-21500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16818859#comment-16818859
 ] 

anishek commented on HIVE-21500:


+1

> Disable conversion of managed table to external and vice versa at source.
> -
>
> Key: HIVE-21500
> URL: https://issues.apache.org/jira/browse/HIVE-21500
> Project: Hive
>  Issue Type: Sub-task
>Affects Versions: 4.0.0
>Reporter: Sankar Hariappan
>Assignee: Sankar Hariappan
>Priority: Major
>  Labels: DR, Replication, pull-request-available
> Attachments: HIVE-21500.01.patch, HIVE-21500.02.patch, 
> HIVE-21500.03.patch
>
>  Time Spent: 6h 20m
>  Remaining Estimate: 0h
>
> A couple of scenarios for Hive2 to Hive3 (strict managed tables enabled) 
> replication where a managed table is converted to external at source. 
> *Scenario-1: (ACID/MM table converted to external at target)*
> 1. Create a non-ACID ORC format table.
> 2. Insert some rows.
> 3. Replicate this create event, which creates an ACID table at target (due to 
> the migration rule). Each insert event adds transactional metadata in HMS 
> corresponding to the current table.
> 4. Convert the table to an external table using the ALTER command at source.
> *Scenario-2: (External table at target changes table location)*
> 1. Create a non-ACID avro format table.
> 2. Insert some rows.
> 3. Replicate this create event, which creates an external table at target (due 
> to the migration rule). The data path is chosen under the default external 
> warehouse directory.
> 4. Convert the table to an external table using the ALTER command at source.
> It is not possible to convert an ACID table to an external table at target. 
> Also, it is hard to detect what the table type at target would be when 
> performing this ALTER TABLE operation at source.
> So, it was decided to disable conversion of a managed table at source (Hive2) 
> to EXTERNAL, or vice versa, if the DB is enabled for replication and strict 
> managed is disabled.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-21602) Dropping an external table created by migration case should delete the data directory.

2019-04-12 Thread anishek (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-21602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16816010#comment-16816010
 ] 

anishek commented on HIVE-21602:


+1 

> Dropping an external table created by migration case should delete the data 
> directory.
> --
>
> Key: HIVE-21602
> URL: https://issues.apache.org/jira/browse/HIVE-21602
> Project: Hive
>  Issue Type: Test
>  Components: repl, Test
>Affects Versions: 4.0.0
>Reporter: Sankar Hariappan
>Assignee: Sankar Hariappan
>Priority: Minor
>  Labels: DR, Replication, Test, pull-request-available
> Attachments: HIVE-21602.01.patch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> For an external table, the location is not removed when the table is dropped. 
> But if the source table is managed and the table is converted to external at 
> target, then the table location should be removed when the table is dropped.
> The replication flow should set the additional parameter 
> "external.table.purge"="true" when migrating to an external table.
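> 
> A minimal sketch of what that could look like on the table metadata (the 
> helper name and the Table handle are illustrative):
> {code}
> import org.apache.hadoop.hive.metastore.api.Table;
> 
> public class MigrationPurgeFlag {
>   // Hedged sketch: flag the migrated replica table so that dropping it also
>   // removes the data directory, per the parameter named in the description.
>   static void markPurgeOnDrop(Table table) {
>     table.putToParameters("external.table.purge", "true");
>   }
> }
> {code}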



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-21362) Add an input format and serde to read from protobuf files.

2019-03-15 Thread anishek (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

anishek updated HIVE-21362:
---
Resolution: Fixed
Status: Resolved  (was: Patch Available)

Patch committed to master. Thanks, [~harishjp] for the patch and [~jdere] for 
the review.

> Add an input format and serde to read from protobuf files.
> --
>
> Key: HIVE-21362
> URL: https://issues.apache.org/jira/browse/HIVE-21362
> Project: Hive
>  Issue Type: Task
>  Components: HiveServer2
>Reporter: Harish Jaiprakash
>Assignee: Harish Jaiprakash
>Priority: Critical
> Attachments: HIVE-21362.01.patch, HIVE-21362.02.patch, 
> HIVE-21362.03.patch, HIVE-21362.04.patch, HIVE-21362.05.patch
>
>
> Logs are being generated using the HiveProtoLoggingHook and tez 
> ProtoHistoryLoggingService. These are sequence files written using 
> ProtobufMessageWritable.
> Implement a SerDe and input format to be able to create tables using these 
> files.
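> 
> For context, a hedged sketch of peeking into one of these sequence files with 
> the plain Hadoop API (the file path is an assumption):
> {code}
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.Path;
> import org.apache.hadoop.io.SequenceFile;
> 
> public class ProtoLogPeek {
>   public static void main(String[] args) throws Exception {
>     Configuration conf = new Configuration();
>     // Assumed location of a file written by HiveProtoLoggingHook.
>     Path file = new Path("/warehouse/sys.db/query_data/part-0");
>     try (SequenceFile.Reader reader =
>         new SequenceFile.Reader(conf, SequenceFile.Reader.file(file))) {
>       // The value class should be the protobuf message writable the serde reads.
>       System.out.println("key class:   " + reader.getKeyClassName());
>       System.out.println("value class: " + reader.getValueClassName());
>     }
>   }
> }
> {code}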



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-21296) Dropping varchar partition throw exception

2019-02-21 Thread anishek (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-21296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16773945#comment-16773945
 ] 

anishek commented on HIVE-21296:


+1 

> Dropping varchar partition throw exception
> --
>
> Key: HIVE-21296
> URL: https://issues.apache.org/jira/browse/HIVE-21296
> Project: Hive
>  Issue Type: Bug
>Reporter: Daniel Dai
>Assignee: Daniel Dai
>Priority: Major
> Attachments: HIVE-21296.1.patch, HIVE-21296.2.patch
>
>
> Drop partition fails if the partition column is varchar. For example:
> {code:java}
> create external table BS_TAB_0_211494(c_date_SAD_29630 date) PARTITIONED BY 
> (part_varchar_37229 varchar(56)) STORED AS orc;
> INSERT INTO BS_TAB_0_211494 values('4740-04-04','BrNTRsv3c');
> ALTER TABLE BS_TAB_0_211494 DROP PARTITION 
> (part_varchar_37229='BrNTRsv3c');{code}
> Exception:
> {code}
> 2019-02-19T22:12:55,843  WARN [HiveServer2-Handler-Pool: Thread-42] 
> thrift.ThriftCLIService: Error executing statement: 
> org.apache.hive.service.cli.HiveSQLException: Error while compiling 
> statement: FAILED: SemanticException [Error 10006]: Partition not found 
> (part_varchar_37229 = 'BrNTRsv3c')
>   at 
> org.apache.hive.service.cli.operation.Operation.toSQLException(Operation.java:356)
>  ~[hive-service-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>   at 
> org.apache.hive.service.cli.operation.SQLOperation.prepare(SQLOperation.java:206)
>  ~[hive-service-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>   at 
> org.apache.hive.service.cli.operation.SQLOperation.runInternal(SQLOperation.java:269)
>  ~[hive-service-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>   at 
> org.apache.hive.service.cli.operation.Operation.run(Operation.java:268) 
> ~[hive-service-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>   at 
> org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementInternal(HiveSessionImpl.java:576)
>  ~[hive-service-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>   at 
> org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementAsync(HiveSessionImpl.java:561)
>  ~[hive-service-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
> ~[?:1.8.0_202]
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
> ~[?:1.8.0_202]
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  ~[?:1.8.0_202]
>   at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_202]
>   at 
> org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:78)
>  ~[hive-service-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>   at 
> org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:36)
>  ~[hive-service-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>   at 
> org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:63)
>  ~[hive-service-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>   at java.security.AccessController.doPrivileged(Native Method) 
> ~[?:1.8.0_202]
>   at javax.security.auth.Subject.doAs(Subject.java:422) ~[?:1.8.0_202]
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1682)
>  ~[hadoop-common-3.1.0.jar:?]
>   at 
> org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:59)
>  ~[hive-service-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>   at com.sun.proxy.$Proxy43.executeStatementAsync(Unknown Source) ~[?:?]
>   at 
> org.apache.hive.service.cli.CLIService.executeStatementAsync(CLIService.java:315)
>  ~[hive-service-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>   at 
> org.apache.hive.service.cli.thrift.ThriftCLIService.ExecuteStatement(ThriftCLIService.java:568)
>  ~[hive-service-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>   at 
> org.apache.hive.service.rpc.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1557)
>  ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>   at 
> org.apache.hive.service.rpc.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1542)
>  ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>   at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39) 
> ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>   at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39) 
> ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>   at 
> org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:56)
>  ~[hive-service-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>   at 
> org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:286)
>  ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  ~[?:1.8.0_202]
>   at 

[jira] [Commented] (HIVE-21268) REPL: Repl dump can output - Database, Table, Dir, last_repl_id

2019-02-13 Thread anishek (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-21268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16767898#comment-16767898
 ] 

anishek commented on HIVE-21268:


Yeah, there is no table-level dump support for now, so if only a db-level repl 
is provided then the dump_dir will correspond to the state of the db, and hence 
there is no table info. The full db,table,dir,last_repl_id output is good only 
if there is a single table in that dump.


> REPL: Repl dump can output - Database, Table, Dir, last_repl_id
> ---
>
> Key: HIVE-21268
> URL: https://issues.apache.org/jira/browse/HIVE-21268
> Project: Hive
>  Issue Type: Improvement
>Reporter: Gopal V
>Priority: Major
>
> {code}
> INFO  : Completed executing 
> command(queryId=root_20190214061031_639e3a52-5c62-40be-a3cd-3e0b18b7b41d); 
> Time taken: 0.374 seconds
> INFO  : OK
> +------------------------------------------------------+---------------+
> | dump_dir                                             | last_repl_id  |
> +------------------------------------------------------+---------------+
> | /user/root/repl/a74389d0-7cde-4cf4-aa40-3079a98b80a8 | 1104594       |
> +------------------------------------------------------+---------------+
> 1 row selected (0.445 seconds)
> {code}
> is somewhat hard to associate back to the table name.
> The logs a couple of lines above actually print the operation detail.
> {code}
> INFO  : REPL::TABLE_DUMP: 
> {"dbName":"tpcds_bin_partitioned_orc_1000","tableName":"item","tableType":"MANAGED_TABLE","tablesDumpProgress":"1/38","dumpTime":1550124632}
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (HIVE-21259) HiveMetaStoreCilent.getNextNotification throws exception when no new events found

2019-02-13 Thread anishek (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

anishek resolved HIVE-21259.

Resolution: Incomplete

The code is working correctly; it looks like there is duplicate id generation 
for events, which needs more investigation, so closing this for now.

> HiveMetaStoreCilent.getNextNotification throws exception when no new events 
> found
> -
>
> Key: HIVE-21259
> URL: https://issues.apache.org/jira/browse/HIVE-21259
> Project: Hive
>  Issue Type: Bug
>  Components: HiveServer2
>Reporter: anishek
>Assignee: anishek
>Priority: Major
> Fix For: 4.0.0
>
>
> HiveMetastoreClient can be used to get the next notifications from 
> hiveserver2. If the user has received all notifications and no new 
> notifications were generated on the server, then the user should get an empty 
> response; however, the client currently throws an IllegalStateException.
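> 
> A hedged sketch of the expected behavior (the client variable and event-batch 
> size are illustrative; getNextNotification is the IMetaStoreClient call in 
> question):
> {code}
> import org.apache.hadoop.hive.metastore.IMetaStoreClient;
> import org.apache.hadoop.hive.metastore.api.NotificationEventResponse;
> 
> public class NotificationPoll {
>   // Hedged sketch: when the server has no events past lastEventId, this should
>   // see an empty response rather than an IllegalStateException from the client.
>   static boolean hasNewEvents(IMetaStoreClient client, long lastEventId) throws Exception {
>     NotificationEventResponse resp = client.getNextNotification(lastEventId, 100, null);
>     return resp.getEvents() != null && !resp.getEvents().isEmpty();
>   }
> }
> {code}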



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-21250) NPE in HiveProtoLoggingHook for eventPerFile mode.

2019-02-12 Thread anishek (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

anishek updated HIVE-21250:
---
Resolution: Fixed
Status: Resolved  (was: Patch Available)

Patch committed to master. Thanks, [~harishjp]/[~prasanth_j]!

> NPE in HiveProtoLoggingHook for eventPerFile mode.
> --
>
> Key: HIVE-21250
> URL: https://issues.apache.org/jira/browse/HIVE-21250
> Project: Hive
>  Issue Type: Bug
>  Components: HiveServer2
>Reporter: Harish Jaiprakash
>Assignee: Harish Jaiprakash
>Priority: Critical
> Attachments: HIVE-21250.01.patch
>
>
> When eventPerFile is enabled, the writer is set to null after the first 
> event, which causes an NPE on the next write path until handleTick comes back.
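> 
> An illustrative guard for the pattern described above (not the actual patch; 
> writer, createWriter(), and the write call are stand-ins for the hook's 
> internals):
> {code}
> // Hedged sketch: in eventPerFile mode the writer is nulled after each event,
> // so re-create it lazily before writing instead of dereferencing the stale
> // null reference while waiting for handleTick.
> if (writer == null) {
>   writer = createWriter(); // stand-in for the file-per-event writer setup
> }
> writer.append(event);      // stand-in for the actual proto write call
> {code}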



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-21244) NPE in Hive Proto Logger

2019-02-12 Thread anishek (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

anishek updated HIVE-21244:
---
Resolution: Duplicate
Status: Resolved  (was: Patch Available)

> NPE in Hive Proto Logger
> 
>
> Key: HIVE-21244
> URL: https://issues.apache.org/jira/browse/HIVE-21244
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 4.0.0
>Reporter: Prasanth Jayachandran
>Assignee: Prasanth Jayachandran
>Priority: Major
> Attachments: HIVE-21244.1.patch, HIVE-21244.2.patch
>
>
> [https://github.com/apache/hive/blob/4ddc9de90b6de032d77709c9631ab787cef225d5/ql/src/java/org/apache/hadoop/hive/ql/hooks/HiveProtoLoggingHook.java#L308]
>  can cause NPE. There is no uncaught exception handler for this thread. This 
> NPE can silently fail and drop the event.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (HIVE-21259) HiveMetaStoreCilent.getNextNotification throws exception when no new events found

2019-02-12 Thread anishek (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

anishek reassigned HIVE-21259:
--


> HiveMetaStoreCilent.getNextNotification throws exception when no new events 
> found
> -
>
> Key: HIVE-21259
> URL: https://issues.apache.org/jira/browse/HIVE-21259
> Project: Hive
>  Issue Type: Bug
>  Components: HiveServer2
>Reporter: anishek
>Assignee: anishek
>Priority: Major
> Fix For: 4.0.0
>
>
> HiveMetastoreClient can be used to get the next notifications from 
> hiveserver2. If the user has received all notifications and no new 
> notifications were generated on the server, then the user should get an empty 
> response; however, the client currently throws an IllegalStateException.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-21250) NPE in HiveProtoLoggingHook for eventPerFile mode.

2019-02-12 Thread anishek (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

anishek updated HIVE-21250:
---
Status: Patch Available  (was: Reopened)

> NPE in HiveProtoLoggingHook for eventPerFile mode.
> --
>
> Key: HIVE-21250
> URL: https://issues.apache.org/jira/browse/HIVE-21250
> Project: Hive
>  Issue Type: Bug
>  Components: HiveServer2
>Reporter: Harish Jaiprakash
>Assignee: Harish Jaiprakash
>Priority: Critical
> Attachments: HIVE-21250.01.patch
>
>
> When eventPerFile is enabled, the writer is set to null after the first 
> event, which causes an NPE on the next write path until handleTick comes back.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-21250) NPE in HiveProtoLoggingHook for eventPerFile mode.

2019-02-12 Thread anishek (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-21250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16765985#comment-16765985
 ] 

anishek commented on HIVE-21250:


+1, looks good to me, pending tests

> NPE in HiveProtoLoggingHook for eventPerFile mode.
> --
>
> Key: HIVE-21250
> URL: https://issues.apache.org/jira/browse/HIVE-21250
> Project: Hive
>  Issue Type: Bug
>  Components: HiveServer2
>Reporter: Harish Jaiprakash
>Assignee: Harish Jaiprakash
>Priority: Critical
> Attachments: HIVE-21250.01.patch
>
>
> When eventPerFile is enabled, the writer is set to null after the first 
> event, which causes an NPE on the next write path until handleTick comes back.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Reopened] (HIVE-21250) NPE in HiveProtoLoggingHook for eventPerFile mode.

2019-02-12 Thread anishek (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

anishek reopened HIVE-21250:


Reopening this since the change here is small.


> NPE in HiveProtoLoggingHook for eventPerFile mode.
> --
>
> Key: HIVE-21250
> URL: https://issues.apache.org/jira/browse/HIVE-21250
> Project: Hive
>  Issue Type: Bug
>  Components: HiveServer2
>Reporter: Harish Jaiprakash
>Assignee: Harish Jaiprakash
>Priority: Critical
> Attachments: HIVE-21250.01.patch
>
>
> When eventPerFile is enabled, the writer is set to null after the first 
> event, which causes an NPE on the next write path until handleTick comes back.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-20911) External Table Replication for Hive

2019-01-08 Thread anishek (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16736871#comment-16736871
 ] 

anishek commented on HIVE-20911:


Patch committed to master. Thanks for the review, [~sankarh]/[~ashutosh.bapat].

> External Table Replication for Hive
> ---
>
> Key: HIVE-20911
> URL: https://issues.apache.org/jira/browse/HIVE-20911
> Project: Hive
>  Issue Type: Bug
>  Components: HiveServer2
>Affects Versions: 4.0.0
>Reporter: anishek
>Assignee: anishek
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: HIVE-20911.01.patch, HIVE-20911.02.patch, 
> HIVE-20911.03.patch, HIVE-20911.04.patch, HIVE-20911.05.patch, 
> HIVE-20911.06.patch, HIVE-20911.07.patch, HIVE-20911.07.patch, 
> HIVE-20911.08.patch, HIVE-20911.08.patch, HIVE-20911.09.patch, 
> HIVE-20911.10.patch, HIVE-20911.11.patch, HIVE-20911.12.patch, 
> HIVE-20911.12.patch
>
>
> External tables are not replicated currently as part of hive replication. As 
> part of this jira we want to enable that.
> Approach:
> * Target cluster will have a top level base directory config that will be 
> used to copy all data relevant to external tables. This will be provided via 
> the *with* clause in the *repl load* command. This base path will be prefixed 
> to the path of the same external table on source cluster. This can be 
> provided using the following configuration:
> {code}
> hive.repl.replica.external.table.base.dir=/
> {code}
> * Since changes to directories of an external table can happen without hive 
> knowing about them, we cannot capture the relevant events whenever new data is 
> added or removed; we will have to copy the data from the source path to the 
> target path for external tables every time we run incremental replication.
> ** This will require incremental *repl dump* to now create an additional 
> file *\_external\_tables\_info* with data in the following form:
> {code}
> tableName,base64Encoded(tableDataLocation)
> {code}
> In case different partitions in the table point to different locations, there 
> will be multiple entries in the file for the same table name, with the 
> location pointing to the different partition locations. Partitions created in 
> a table without the _set location_ command will be within the same table data 
> location, and hence there will not be separate entries in the file above.
> ** *repl load* will read the *\_external\_tables\_info* file to identify which 
> locations are to be copied from source to target, and create corresponding 
> tasks for them.
> * New external tables will be created with metadata only, with no data copied, 
> as part of the regular tasks during incremental/bootstrap load.
> * Bootstrap dump will also create *\_external\_tables\_info*, which will be 
> used to copy data from source to target as part of bootstrap load.
> * Bootstrap load will create a DAG that can use parallelism in the execution 
> phase; the hdfs copy related tasks are created once the bootstrap phase is 
> complete.
> * Since incremental load results in a DAG with only sequential execution 
> (events applied in sequence), to effectively use the parallelism capability in 
> execution mode we create tasks for hdfs copy along with the incremental DAG. 
> This requires a few basic calculations to approximately meet the configured 
> value in "hive.repl.approx.max.load.tasks".
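> 
> A minimal sketch of producing one line of the *\_external\_tables\_info* file 
> in the format described above (the class and method names are illustrative):
> {code}
> import java.nio.charset.StandardCharsets;
> import java.util.Base64;
> 
> public class ExternalTablesInfoLine {
>   // Hedged sketch: one entry per table, or per distinct partition location.
>   static String line(String tableName, String tableDataLocation) {
>     String encoded = Base64.getEncoder()
>         .encodeToString(tableDataLocation.getBytes(StandardCharsets.UTF_8));
>     return tableName + "," + encoded;
>   }
> }
> {code}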



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-20911) External Table Replication for Hive

2019-01-08 Thread anishek (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

anishek updated HIVE-20911:
---
Resolution: Fixed
Status: Resolved  (was: Patch Available)

> External Table Replication for Hive
> ---
>
> Key: HIVE-20911
> URL: https://issues.apache.org/jira/browse/HIVE-20911
> Project: Hive
>  Issue Type: Bug
>  Components: HiveServer2
>Affects Versions: 4.0.0
>Reporter: anishek
>Assignee: anishek
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: HIVE-20911.01.patch, HIVE-20911.02.patch, 
> HIVE-20911.03.patch, HIVE-20911.04.patch, HIVE-20911.05.patch, 
> HIVE-20911.06.patch, HIVE-20911.07.patch, HIVE-20911.07.patch, 
> HIVE-20911.08.patch, HIVE-20911.08.patch, HIVE-20911.09.patch, 
> HIVE-20911.10.patch, HIVE-20911.11.patch, HIVE-20911.12.patch, 
> HIVE-20911.12.patch
>
>
> External tables are not replicated currently as part of hive replication. As 
> part of this jira we want to enable that.
> Approach:
> * Target cluster will have a top level base directory config that will be 
> used to copy all data relevant to external tables. This will be provided via 
> the *with* clause in the *repl load* command. This base path will be prefixed 
> to the path of the same external table on source cluster. This can be 
> provided using the following configuration:
> {code}
> hive.repl.replica.external.table.base.dir=/
> {code}
> * Since changes to directories of an external table can happen without hive 
> knowing about them, we cannot capture the relevant events whenever new data is 
> added or removed; we will have to copy the data from the source path to the 
> target path for external tables every time we run incremental replication.
> ** This will require incremental *repl dump* to now create an additional 
> file *\_external\_tables\_info* with data in the following form:
> {code}
> tableName,base64Encoded(tableDataLocation)
> {code}
> In case different partitions in the table point to different locations, there 
> will be multiple entries in the file for the same table name, with the 
> location pointing to the different partition locations. Partitions created in 
> a table without the _set location_ command will be within the same table data 
> location, and hence there will not be separate entries in the file above.
> ** *repl load* will read the *\_external\_tables\_info* file to identify which 
> locations are to be copied from source to target, and create corresponding 
> tasks for them.
> * New external tables will be created with metadata only, with no data copied, 
> as part of the regular tasks during incremental/bootstrap load.
> * Bootstrap dump will also create *\_external\_tables\_info*, which will be 
> used to copy data from source to target as part of bootstrap load.
> * Bootstrap load will create a DAG that can use parallelism in the execution 
> phase; the hdfs copy related tasks are created once the bootstrap phase is 
> complete.
> * Since incremental load results in a DAG with only sequential execution 
> (events applied in sequence), to effectively use the parallelism capability in 
> execution mode we create tasks for hdfs copy along with the incremental DAG. 
> This requires a few basic calculations to approximately meet the configured 
> value in "hive.repl.approx.max.load.tasks".



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-20911) External Table Replication for Hive

2019-01-07 Thread anishek (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

anishek updated HIVE-20911:
---
Attachment: HIVE-20911.12.patch

> External Table Replication for Hive
> ---
>
> Key: HIVE-20911
> URL: https://issues.apache.org/jira/browse/HIVE-20911
> Project: Hive
>  Issue Type: Bug
>  Components: HiveServer2
>Affects Versions: 4.0.0
>Reporter: anishek
>Assignee: anishek
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: HIVE-20911.01.patch, HIVE-20911.02.patch, 
> HIVE-20911.03.patch, HIVE-20911.04.patch, HIVE-20911.05.patch, 
> HIVE-20911.06.patch, HIVE-20911.07.patch, HIVE-20911.07.patch, 
> HIVE-20911.08.patch, HIVE-20911.08.patch, HIVE-20911.09.patch, 
> HIVE-20911.10.patch, HIVE-20911.11.patch, HIVE-20911.12.patch, 
> HIVE-20911.12.patch
>
>
> External tables are not replicated currently as part of hive replication. As 
> part of this jira we want to enable that.
> Approach:
> * Target cluster will have a top level base directory config that will be 
> used to copy all data relevant to external tables. This will be provided via 
> the *with* clause in the *repl load* command. This base path will be prefixed 
> to the path of the same external table on source cluster. This can be 
> provided using the following configuration:
> {code}
> hive.repl.replica.external.table.base.dir=/
> {code}
> * Since changes to directories of an external table can happen without hive 
> knowing about them, we cannot capture the relevant events whenever new data is 
> added or removed; we will have to copy the data from the source path to the 
> target path for external tables every time we run incremental replication.
> ** This will require incremental *repl dump* to now create an additional 
> file *\_external\_tables\_info* with data in the following form:
> {code}
> tableName,base64Encoded(tableDataLocation)
> {code}
> In case different partitions in the table point to different locations, there 
> will be multiple entries in the file for the same table name, with the 
> location pointing to the different partition locations. Partitions created in 
> a table without the _set location_ command will be within the same table data 
> location, and hence there will not be separate entries in the file above.
> ** *repl load* will read the *\_external\_tables\_info* file to identify which 
> locations are to be copied from source to target, and create corresponding 
> tasks for them.
> * New external tables will be created with metadata only, with no data copied, 
> as part of the regular tasks during incremental/bootstrap load.
> * Bootstrap dump will also create *\_external\_tables\_info*, which will be 
> used to copy data from source to target as part of bootstrap load.
> * Bootstrap load will create a DAG that can use parallelism in the execution 
> phase; the hdfs copy related tasks are created once the bootstrap phase is 
> complete.
> * Since incremental load results in a DAG with only sequential execution 
> (events applied in sequence), to effectively use the parallelism capability in 
> execution mode we create tasks for hdfs copy along with the incremental DAG. 
> This requires a few basic calculations to approximately meet the configured 
> value in "hive.repl.approx.max.load.tasks".



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-20911) External Table Replication for Hive

2019-01-07 Thread anishek (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16735566#comment-16735566
 ] 

anishek commented on HIVE-20911:


[~vihangk1] Thanks for the exclusion. It looks like there is another test, 
"TestReplTableMigrationWithJsonFormat", which is causing the report publish 
timeout; can you please exclude this one from batching as well?

> External Table Replication for Hive
> ---
>
> Key: HIVE-20911
> URL: https://issues.apache.org/jira/browse/HIVE-20911
> Project: Hive
>  Issue Type: Bug
>  Components: HiveServer2
>Affects Versions: 4.0.0
>Reporter: anishek
>Assignee: anishek
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: HIVE-20911.01.patch, HIVE-20911.02.patch, 
> HIVE-20911.03.patch, HIVE-20911.04.patch, HIVE-20911.05.patch, 
> HIVE-20911.06.patch, HIVE-20911.07.patch, HIVE-20911.07.patch, 
> HIVE-20911.08.patch, HIVE-20911.08.patch, HIVE-20911.09.patch, 
> HIVE-20911.10.patch, HIVE-20911.11.patch, HIVE-20911.12.patch
>
>
> External tables are not replicated currently as part of hive replication. As 
> part of this jira we want to enable that.
> Approach:
> * Target cluster will have a top level base directory config that will be 
> used to copy all data relevant to external tables. This will be provided via 
> the *with* clause in the *repl load* command. This base path will be prefixed 
> to the path of the same external table on source cluster. This can be 
> provided using the following configuration:
> {code}
> hive.repl.replica.external.table.base.dir=/
> {code}
> * Since changes to directories of an external table can happen without hive 
> knowing about them, we cannot capture the relevant events whenever new data is 
> added or removed; we will have to copy the data from the source path to the 
> target path for external tables every time we run incremental replication.
> ** This will require incremental *repl dump* to now create an additional 
> file *\_external\_tables\_info* with data in the following form:
> {code}
> tableName,base64Encoded(tableDataLocation)
> {code}
> In case different partitions in the table point to different locations, there 
> will be multiple entries in the file for the same table name, with the 
> location pointing to the different partition locations. Partitions created in 
> a table without the _set location_ command will be within the same table data 
> location, and hence there will not be separate entries in the file above.
> ** *repl load* will read the *\_external\_tables\_info* file to identify which 
> locations are to be copied from source to target, and create corresponding 
> tasks for them.
> * New external tables will be created with metadata only, with no data copied, 
> as part of the regular tasks during incremental/bootstrap load.
> * Bootstrap dump will also create *\_external\_tables\_info*, which will be 
> used to copy data from source to target as part of bootstrap load.
> * Bootstrap load will create a DAG that can use parallelism in the execution 
> phase; the hdfs copy related tasks are created once the bootstrap phase is 
> complete.
> * Since incremental load results in a DAG with only sequential execution 
> (events applied in sequence), to effectively use the parallelism capability in 
> execution mode we create tasks for hdfs copy along with the incremental DAG. 
> This requires a few basic calculations to approximately meet the configured 
> value in "hive.repl.approx.max.load.tasks".



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-20911) External Table Replication for Hive

2019-01-06 Thread anishek (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

anishek updated HIVE-20911:
---
Attachment: (was: HIVE-20911.12.patch)

> External Table Replication for Hive
> ---
>
> Key: HIVE-20911
> URL: https://issues.apache.org/jira/browse/HIVE-20911
> Project: Hive
>  Issue Type: Bug
>  Components: HiveServer2
>Affects Versions: 4.0.0
>Reporter: anishek
>Assignee: anishek
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: HIVE-20911.01.patch, HIVE-20911.02.patch, 
> HIVE-20911.03.patch, HIVE-20911.04.patch, HIVE-20911.05.patch, 
> HIVE-20911.06.patch, HIVE-20911.07.patch, HIVE-20911.07.patch, 
> HIVE-20911.08.patch, HIVE-20911.08.patch, HIVE-20911.09.patch, 
> HIVE-20911.10.patch, HIVE-20911.11.patch, HIVE-20911.12.patch
>
>
> External tables are not replicated currently as part of hive replication. As 
> part of this jira we want to enable that.
> Approach:
> * Target cluster will have a top level base directory config that will be 
> used to copy all data relevant to external tables. This will be provided via 
> the *with* clause in the *repl load* command. This base path will be prefixed 
> to the path of the same external table on source cluster. This can be 
> provided using the following configuration:
> {code}
> hive.repl.replica.external.table.base.dir=/
> {code}
> * Since changes to directories of an external table can happen without hive 
> knowing about them, we cannot capture the relevant events whenever new data is 
> added or removed; we will have to copy the data from the source path to the 
> target path for external tables every time we run incremental replication.
> ** This will require incremental *repl dump* to now create an additional 
> file *\_external\_tables\_info* with data in the following form:
> {code}
> tableName,base64Encoded(tableDataLocation)
> {code}
> In case different partitions in the table point to different locations, there 
> will be multiple entries in the file for the same table name, with the 
> location pointing to the different partition locations. Partitions created in 
> a table without the _set location_ command will be within the same table data 
> location, and hence there will not be separate entries in the file above.
> ** *repl load* will read the *\_external\_tables\_info* file to identify which 
> locations are to be copied from source to target, and create corresponding 
> tasks for them.
> * New external tables will be created with metadata only, with no data copied, 
> as part of the regular tasks during incremental/bootstrap load.
> * Bootstrap dump will also create *\_external\_tables\_info*, which will be 
> used to copy data from source to target as part of bootstrap load.
> * Bootstrap load will create a DAG that can use parallelism in the execution 
> phase; the hdfs copy related tasks are created once the bootstrap phase is 
> complete.
> * Since incremental load results in a DAG with only sequential execution 
> (events applied in sequence), to effectively use the parallelism capability in 
> execution mode we create tasks for hdfs copy along with the incremental DAG. 
> This requires a few basic calculations to approximately meet the configured 
> value in "hive.repl.approx.max.load.tasks".



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-20911) External Table Replication for Hive

2019-01-06 Thread anishek (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

anishek updated HIVE-20911:
---
Attachment: (was: HIVE-20911.12.patch)

> External Table Replication for Hive
> ---
>
> Key: HIVE-20911
> URL: https://issues.apache.org/jira/browse/HIVE-20911
> Project: Hive
>  Issue Type: Bug
>  Components: HiveServer2
>Affects Versions: 4.0.0
>Reporter: anishek
>Assignee: anishek
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: HIVE-20911.01.patch, HIVE-20911.02.patch, 
> HIVE-20911.03.patch, HIVE-20911.04.patch, HIVE-20911.05.patch, 
> HIVE-20911.06.patch, HIVE-20911.07.patch, HIVE-20911.07.patch, 
> HIVE-20911.08.patch, HIVE-20911.08.patch, HIVE-20911.09.patch, 
> HIVE-20911.10.patch, HIVE-20911.11.patch, HIVE-20911.12.patch
>
>
> External tables are not replicated currently as part of hive replication. As 
> part of this jira we want to enable that.
> Approach:
> * Target cluster will have a top level base directory config that will be 
> used to copy all data relevant to external tables. This will be provided via 
> the *with* clause in the *repl load* command. This base path will be prefixed 
> to the path of the same external table on source cluster. This can be 
> provided using the following configuration:
> {code}
> hive.repl.replica.external.table.base.dir=/
> {code}
> * Since changes to directories of an external table can happen without hive 
> knowing about them, we cannot capture the relevant events whenever new data is 
> added or removed; we will have to copy the data from the source path to the 
> target path for external tables every time we run incremental replication.
> ** This will require incremental *repl dump* to now create an additional 
> file *\_external\_tables\_info* with data in the following form:
> {code}
> tableName,base64Encoded(tableDataLocation)
> {code}
> In case different partitions in the table point to different locations, there 
> will be multiple entries in the file for the same table name, with the 
> location pointing to the different partition locations. Partitions created in 
> a table without the _set location_ command will be within the same table data 
> location, and hence there will not be separate entries in the file above.
> ** *repl load* will read the *\_external\_tables\_info* file to identify which 
> locations are to be copied from source to target, and create corresponding 
> tasks for them.
> * New external tables will be created with metadata only, with no data copied, 
> as part of the regular tasks during incremental/bootstrap load.
> * Bootstrap dump will also create *\_external\_tables\_info*, which will be 
> used to copy data from source to target as part of bootstrap load.
> * Bootstrap load will create a DAG that can use parallelism in the execution 
> phase; the hdfs copy related tasks are created once the bootstrap phase is 
> complete.
> * Since incremental load results in a DAG with only sequential execution 
> (events applied in sequence), to effectively use the parallelism capability in 
> execution mode we create tasks for hdfs copy along with the incremental DAG. 
> This requires a few basic calculations to approximately meet the configured 
> value in "hive.repl.approx.max.load.tasks".



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-20911) External Table Replication for Hive

2019-01-06 Thread anishek (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

anishek updated HIVE-20911:
---
Attachment: HIVE-20911.12.patch

> External Table Replication for Hive
> ---
>
> Key: HIVE-20911
> URL: https://issues.apache.org/jira/browse/HIVE-20911
> Project: Hive
>  Issue Type: Bug
>  Components: HiveServer2
>Affects Versions: 4.0.0
>Reporter: anishek
>Assignee: anishek
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: HIVE-20911.01.patch, HIVE-20911.02.patch, 
> HIVE-20911.03.patch, HIVE-20911.04.patch, HIVE-20911.05.patch, 
> HIVE-20911.06.patch, HIVE-20911.07.patch, HIVE-20911.07.patch, 
> HIVE-20911.08.patch, HIVE-20911.08.patch, HIVE-20911.09.patch, 
> HIVE-20911.10.patch, HIVE-20911.11.patch, HIVE-20911.12.patch
>
>
> External tables are not replicated currently as part of hive replication. As 
> part of this jira we want to enable that.
> Approach:
> * Target cluster will have a top level base directory config that will be 
> used to copy all data relevant to external tables. This will be provided via 
> the *with* clause in the *repl load* command. This base path will be prefixed 
> to the path of the same external table on source cluster. This can be 
> provided using the following configuration:
> {code}
> hive.repl.replica.external.table.base.dir=/
> {code}
> * Since changes to directories of an external table can happen without hive 
> knowing about them, we cannot capture the relevant events whenever new data is 
> added or removed; we will have to copy the data from the source path to the 
> target path for external tables every time we run incremental replication.
> ** This will require incremental *repl dump* to now create an additional 
> file *\_external\_tables\_info* with data in the following form:
> {code}
> tableName,base64Encoded(tableDataLocation)
> {code}
> In case different partitions in the table point to different locations, there 
> will be multiple entries in the file for the same table name, with the 
> location pointing to the different partition locations. Partitions created in 
> a table without the _set location_ command will be within the same table data 
> location, and hence there will not be separate entries in the file above.
> ** *repl load* will read the *\_external\_tables\_info* file to identify which 
> locations are to be copied from source to target, and create corresponding 
> tasks for them.
> * New external tables will be created with metadata only, with no data copied, 
> as part of the regular tasks during incremental/bootstrap load.
> * Bootstrap dump will also create *\_external\_tables\_info*, which will be 
> used to copy data from source to target as part of bootstrap load.
> * Bootstrap load will create a DAG that can use parallelism in the execution 
> phase; the hdfs copy related tasks are created once the bootstrap phase is 
> complete.
> * Since incremental load results in a DAG with only sequential execution 
> (events applied in sequence), to effectively use the parallelism capability in 
> execution mode we create tasks for hdfs copy along with the incremental DAG. 
> This requires a few basic calculations to approximately meet the configured 
> value in "hive.repl.approx.max.load.tasks".



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-20911) External Table Replication for Hive

2019-01-06 Thread anishek (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

anishek updated HIVE-20911:
---
Attachment: HIVE-20911.12.patch

> External Table Replication for Hive
> ---
>
> Key: HIVE-20911
> URL: https://issues.apache.org/jira/browse/HIVE-20911
> Project: Hive
>  Issue Type: Bug
>  Components: HiveServer2
>Affects Versions: 4.0.0
>Reporter: anishek
>Assignee: anishek
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: HIVE-20911.01.patch, HIVE-20911.02.patch, 
> HIVE-20911.03.patch, HIVE-20911.04.patch, HIVE-20911.05.patch, 
> HIVE-20911.06.patch, HIVE-20911.07.patch, HIVE-20911.07.patch, 
> HIVE-20911.08.patch, HIVE-20911.08.patch, HIVE-20911.09.patch, 
> HIVE-20911.10.patch, HIVE-20911.11.patch, HIVE-20911.12.patch, 
> HIVE-20911.12.patch
>
>
> External tables are currently not replicated as part of Hive replication. As
> part of this jira we want to enable that.
> Approach:
> * The target cluster will have a top-level base directory config that will be
> used to copy all data relevant to external tables. This will be provided via
> the *with* clause in the *repl load* command. This base path will be prefixed
> to the path of the same external table on the source cluster. It can be set
> using the following configuration (a sketch of the prefixing follows this
> description):
> {code}
> hive.repl.replica.external.table.base.dir=/
> {code}
> * Since changes to the directories of an external table can happen without
> Hive knowing about them, we cannot capture the relevant events whenever new
> data is added or removed. We therefore have to copy the data from the source
> path to the target path for external tables every time we run incremental
> replication.
> ** This requires incremental *repl dump* to create an additional file
> *\_external\_tables\_info* with data in the following form (a parsing sketch
> also follows):
> {code}
> tableName,base64Encoded(tableDataLocation)
> {code}
> If different partitions of a table point to different locations, there will be
> multiple entries in the file for the same table name, each pointing to a
> different partition location. Partitions created without an explicit _set
> location_ stay within the table's data location and hence do not get separate
> entries in the file above.
> ** *repl load* will read *\_external\_tables\_info* to identify which
> locations are to be copied from source to target and create the corresponding
> tasks for them.
> * New external tables will be created with metadata only, with no data copied,
> as part of the regular tasks during incremental and bootstrap load.
> * Bootstrap dump will also create *\_external\_tables\_info*, which will be
> used to copy data from source to target as part of bootstrap load.
> * Bootstrap load will create a DAG that can use parallelism in the execution
> phase; the HDFS copy tasks are created once the bootstrap phase is complete.
> * Since incremental load results in a DAG with only sequential execution
> (events applied in sequence), to effectively use the parallelism capability in
> execution mode, we create the tasks for the HDFS copy along with the
> incremental DAG. This requires a few basic calculations to approximately meet
> the configured value of "hive.repl.approx.max.load.tasks".
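To make the base-directory prefixing above concrete, here is a minimal,
hypothetical sketch of the idea in plain Java (invented names; not the actual
Hive code): the path component of the source table location is appended to the
configured value of hive.repl.replica.external.table.base.dir.
{code}
import java.net.URI;

// Hypothetical sketch: rebase a source external-table location under the
// target base dir (hive.repl.replica.external.table.base.dir).
public final class ExternalTablePathRebase {

  static String rebase(String sourceLocation, String targetBaseDir) {
    // Keep only the path component of the source location; the target
    // cluster resolves the result against its own filesystem.
    URI src = URI.create(sourceLocation);
    String base = targetBaseDir.endsWith("/")
        ? targetBaseDir.substring(0, targetBaseDir.length() - 1)
        : targetBaseDir;
    return base + src.getPath();
  }

  public static void main(String[] args) {
    System.out.println(rebase("hdfs://source-nn:8020/warehouse/ext/sales",
        "/replica/external"));
    // prints /replica/external/warehouse/ext/sales
  }
}
{code}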
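The *\_external\_tables\_info* entries are plain CSV lines whose second field
is base64-encoded, so commas or unusual characters in a location cannot break
the format. A small illustrative reader/writer (hypothetical sketch assuming
UTF-8; field order taken from the description above, not from the Hive source):
{code}
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Illustrative sketch of _external_tables_info lines of the form
//   tableName,base64Encoded(tableDataLocation)
public final class ExternalTablesInfoLine {

  static String write(String tableName, String dataLocation) {
    String encoded = Base64.getEncoder()
        .encodeToString(dataLocation.getBytes(StandardCharsets.UTF_8));
    return tableName + "," + encoded;
  }

  static String[] read(String line) {
    // Split on the first comma only: the base64 alphabet never contains one.
    int idx = line.indexOf(',');
    String location = new String(
        Base64.getDecoder().decode(line.substring(idx + 1)),
        StandardCharsets.UTF_8);
    return new String[] {line.substring(0, idx), location};
  }

  public static void main(String[] args) {
    String line = write("sales", "hdfs://source-nn:8020/warehouse/ext/sales");
    System.out.println(line);
    System.out.println(String.join(" -> ", read(line)));
  }
}
{code}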



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-20911) External Table Replication for Hive

2019-01-04 Thread anishek (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16733959#comment-16733959
 ] 

anishek commented on HIVE-20911:


[~vihangk1]/[~janulatha] can you please exclude TestReplWithJsonMessageFormat 
on the Apache build servers from being batched? I am not able to get a green 
build for my patch.



> External Table Replication for Hive
> ---
>
> Key: HIVE-20911
> URL: https://issues.apache.org/jira/browse/HIVE-20911
> Project: Hive
>  Issue Type: Bug
>  Components: HiveServer2
>Affects Versions: 4.0.0
>Reporter: anishek
>Assignee: anishek
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: HIVE-20911.01.patch, HIVE-20911.02.patch, 
> HIVE-20911.03.patch, HIVE-20911.04.patch, HIVE-20911.05.patch, 
> HIVE-20911.06.patch, HIVE-20911.07.patch, HIVE-20911.07.patch, 
> HIVE-20911.08.patch, HIVE-20911.08.patch, HIVE-20911.09.patch, 
> HIVE-20911.10.patch, HIVE-20911.11.patch, HIVE-20911.12.patch
>
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-20911) External Table Replication for Hive

2019-01-03 Thread anishek (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

anishek updated HIVE-20911:
---
Attachment: HIVE-20911.12.patch

> External Table Replication for Hive
> ---
>
> Key: HIVE-20911
> URL: https://issues.apache.org/jira/browse/HIVE-20911
> Project: Hive
>  Issue Type: Bug
>  Components: HiveServer2
>Affects Versions: 4.0.0
>Reporter: anishek
>Assignee: anishek
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: HIVE-20911.01.patch, HIVE-20911.02.patch, 
> HIVE-20911.03.patch, HIVE-20911.04.patch, HIVE-20911.05.patch, 
> HIVE-20911.06.patch, HIVE-20911.07.patch, HIVE-20911.07.patch, 
> HIVE-20911.08.patch, HIVE-20911.08.patch, HIVE-20911.09.patch, 
> HIVE-20911.10.patch, HIVE-20911.11.patch, HIVE-20911.12.patch
>
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-20911) External Table Replication for Hive

2019-01-03 Thread anishek (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

anishek updated HIVE-20911:
---
Attachment: HIVE-20911.11.patch

> External Table Replication for Hive
> ---
>
> Key: HIVE-20911
> URL: https://issues.apache.org/jira/browse/HIVE-20911
> Project: Hive
>  Issue Type: Bug
>  Components: HiveServer2
>Affects Versions: 4.0.0
>Reporter: anishek
>Assignee: anishek
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: HIVE-20911.01.patch, HIVE-20911.02.patch, 
> HIVE-20911.03.patch, HIVE-20911.04.patch, HIVE-20911.05.patch, 
> HIVE-20911.06.patch, HIVE-20911.07.patch, HIVE-20911.07.patch, 
> HIVE-20911.08.patch, HIVE-20911.08.patch, HIVE-20911.09.patch, 
> HIVE-20911.10.patch, HIVE-20911.11.patch
>
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-20911) External Table Replication for Hive

2019-01-03 Thread anishek (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

anishek updated HIVE-20911:
---
Attachment: HIVE-20911.10.patch

> External Table Replication for Hive
> ---
>
> Key: HIVE-20911
> URL: https://issues.apache.org/jira/browse/HIVE-20911
> Project: Hive
>  Issue Type: Bug
>  Components: HiveServer2
>Affects Versions: 4.0.0
>Reporter: anishek
>Assignee: anishek
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: HIVE-20911.01.patch, HIVE-20911.02.patch, 
> HIVE-20911.03.patch, HIVE-20911.04.patch, HIVE-20911.05.patch, 
> HIVE-20911.06.patch, HIVE-20911.07.patch, HIVE-20911.07.patch, 
> HIVE-20911.08.patch, HIVE-20911.08.patch, HIVE-20911.09.patch, 
> HIVE-20911.10.patch
>
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-20911) External Table Replication for Hive

2019-01-03 Thread anishek (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

anishek updated HIVE-20911:
---
Attachment: HIVE-20911.09.patch

> External Table Replication for Hive
> ---
>
> Key: HIVE-20911
> URL: https://issues.apache.org/jira/browse/HIVE-20911
> Project: Hive
>  Issue Type: Bug
>  Components: HiveServer2
>Affects Versions: 4.0.0
>Reporter: anishek
>Assignee: anishek
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: HIVE-20911.01.patch, HIVE-20911.02.patch, 
> HIVE-20911.03.patch, HIVE-20911.04.patch, HIVE-20911.05.patch, 
> HIVE-20911.06.patch, HIVE-20911.07.patch, HIVE-20911.07.patch, 
> HIVE-20911.08.patch, HIVE-20911.08.patch, HIVE-20911.09.patch
>
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-20911) External Table Replication for Hive

2019-01-02 Thread anishek (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

anishek updated HIVE-20911:
---
Attachment: HIVE-20911.08.patch

> External Table Replication for Hive
> ---
>
> Key: HIVE-20911
> URL: https://issues.apache.org/jira/browse/HIVE-20911
> Project: Hive
>  Issue Type: Bug
>  Components: HiveServer2
>Affects Versions: 4.0.0
>Reporter: anishek
>Assignee: anishek
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: HIVE-20911.01.patch, HIVE-20911.02.patch, 
> HIVE-20911.03.patch, HIVE-20911.04.patch, HIVE-20911.05.patch, 
> HIVE-20911.06.patch, HIVE-20911.07.patch, HIVE-20911.07.patch, 
> HIVE-20911.08.patch, HIVE-20911.08.patch
>
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-20911) External Table Replication for Hive

2019-01-02 Thread anishek (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

anishek updated HIVE-20911:
---
Attachment: HIVE-20911.08.patch

> External Table Replication for Hive
> ---
>
> Key: HIVE-20911
> URL: https://issues.apache.org/jira/browse/HIVE-20911
> Project: Hive
>  Issue Type: Bug
>  Components: HiveServer2
>Affects Versions: 4.0.0
>Reporter: anishek
>Assignee: anishek
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: HIVE-20911.01.patch, HIVE-20911.02.patch, 
> HIVE-20911.03.patch, HIVE-20911.04.patch, HIVE-20911.05.patch, 
> HIVE-20911.06.patch, HIVE-20911.07.patch, HIVE-20911.07.patch, 
> HIVE-20911.08.patch
>
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-20989) JDBC - The GetOperationStatus + log can block query progress via sleep()

2018-12-21 Thread anishek (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16726550#comment-16726550
 ] 

anishek commented on HIVE-20989:


+1

> JDBC - The GetOperationStatus + log can block query progress via sleep()
> 
>
> Key: HIVE-20989
> URL: https://issues.apache.org/jira/browse/HIVE-20989
> Project: Hive
>  Issue Type: Bug
>  Components: HiveServer2
>Affects Versions: 4.0.0
>Reporter: Gopal V
>Assignee: Sankar Hariappan
>Priority: Major
> Attachments: HIVE-20989.01.patch
>
>
> There is an exponential sleep operation inside the CLIService which can end 
> up adding tens of seconds to a query that has already completed.
> {code}
> "HiveServer2-Handler-Pool: Thread-9373" #9373 prio=5 os_prio=0 
> tid=0x7f4d5e72d800 nid=0xb634a waiting on condition [0x7f28d06a5000]
> java.lang.Thread.State: TIMED_WAITING (sleeping)
> at java.lang.Thread.sleep(Native Method)
> at 
> org.apache.hive.service.cli.CLIService.progressUpdateLog(CLIService.java:506)
> at 
> org.apache.hive.service.cli.CLIService.getOperationStatus(CLIService.java:480)
> at 
> org.apache.hive.service.cli.thrift.ThriftCLIService.GetOperationStatus(ThriftCLIService.java:695)
> at 
> org.apache.hive.service.rpc.thrift.TCLIService$Processor$GetOperationStatus.getResult(TCLIService.java:1757)
> at 
> org.apache.hive.service.rpc.thrift.TCLIService$Processor$GetOperationStatus.getResult(TCLIService.java:1742)
> at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
> at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
> at 
> org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:56)
> at 
> org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:286)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> The sleep loop is on the server side.
> {code}
> private static final long PROGRESS_MAX_WAIT_NS = 30 * 1000 * 1000 * 1000l; // 30s budget
> private JobProgressUpdate progressUpdateLog(boolean isProgressLogRequested,
>     Operation operation, HiveConf conf) {
>   ...
>   long startTime = System.nanoTime();
>   int timeOutMs = 8;
>   try {
>     while (sessionState.getProgressMonitor() == null && !operation.isDone()) {
>       // remaining wait budget, converted from nanoseconds to milliseconds
>       long remainingMs =
>           (PROGRESS_MAX_WAIT_NS - (System.nanoTime() - startTime)) / 1000000l;
>       if (remainingMs <= 0) {
>         LOG.debug("timed out and hence returning progress log as NULL");
>         return new JobProgressUpdate(ProgressMonitor.NULL);
>       }
>       Thread.sleep(Math.min(remainingMs, timeOutMs));
>       timeOutMs <<= 1; // exponential backoff: the sleep doubles every iteration
>     }
> {code}
> After about 16 seconds of query execution, timeOutMs has grown to 16384 ms, 
> which means the next sleep cycle lasts min(30 - 17, 16) = 13 seconds.
> If the query finishes on the 17th second, the JDBC server will only respond 
> after the 30th second when it will check for operation.isDone() and return.
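The effect is easy to reproduce outside Hive. Below is a standalone,
hypothetical demo of the same loop shape with the per-iteration sleep capped,
so completion is observed within the cap instead of after an ever-doubling
interval (an illustration of the problem and one possible mitigation, not the
committed patch):
{code}
import java.util.concurrent.atomic.AtomicBoolean;

// Standalone demo: exponential backoff with a cap. Unlike the loop quoted
// above, the sleep never exceeds maxSleepMs, so a completion at t=1.7s is
// noticed within ~0.5s rather than after the next doubled interval.
public final class CappedBackoffDemo {
  public static void main(String[] args) throws InterruptedException {
    AtomicBoolean done = new AtomicBoolean(false);
    new Thread(() -> {
      try { Thread.sleep(1700); } catch (InterruptedException ignored) { }
      done.set(true); // simulates the query finishing at t = 1.7s
    }).start();

    long deadlineNs = System.nanoTime() + 30_000_000_000L; // 30s budget
    long sleepMs = 8;
    final long maxSleepMs = 500; // cap: re-check completion at least every 500ms
    while (!done.get()) {
      long remainingMs = (deadlineNs - System.nanoTime()) / 1_000_000L;
      if (remainingMs <= 0) {
        break; // budget exhausted, give up waiting
      }
      Thread.sleep(Math.min(Math.min(remainingMs, sleepMs), maxSleepMs));
      sleepMs <<= 1; // still back off, but never beyond the cap
    }
    System.out.println("observed completion, done = " + done.get());
  }
}
{code}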



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-20911) External Table Replication for Hive

2018-12-20 Thread anishek (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16726455#comment-16726455
 ] 

anishek commented on HIVE-20911:


[~thejas]/[~ashutoshc] can you please help get the replication test marked to 
skip batching on the build servers?

> External Table Replication for Hive
> ---
>
> Key: HIVE-20911
> URL: https://issues.apache.org/jira/browse/HIVE-20911
> Project: Hive
>  Issue Type: Bug
>  Components: HiveServer2
>Affects Versions: 4.0.0
>Reporter: anishek
>Assignee: anishek
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: HIVE-20911.01.patch, HIVE-20911.02.patch, 
> HIVE-20911.03.patch, HIVE-20911.04.patch, HIVE-20911.05.patch, 
> HIVE-20911.06.patch, HIVE-20911.07.patch, HIVE-20911.07.patch
>
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-20911) External Table Replication for Hive

2018-12-20 Thread anishek (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16725648#comment-16725648
 ] 

anishek commented on HIVE-20911:


[~vihangk1] can you please help me get another test into skip-batching on the 
replication side? The test name is "TestReplWithJsonMessageFormat".



> External Table Replication for Hive
> ---
>
> Key: HIVE-20911
> URL: https://issues.apache.org/jira/browse/HIVE-20911
> Project: Hive
>  Issue Type: Bug
>  Components: HiveServer2
>Affects Versions: 4.0.0
>Reporter: anishek
>Assignee: anishek
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: HIVE-20911.01.patch, HIVE-20911.02.patch, 
> HIVE-20911.03.patch, HIVE-20911.04.patch, HIVE-20911.05.patch, 
> HIVE-20911.06.patch, HIVE-20911.07.patch, HIVE-20911.07.patch
>
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-20911) External Table Replication for Hive

2018-12-19 Thread anishek (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

anishek updated HIVE-20911:
---
Attachment: HIVE-20911.07.patch

> External Table Replication for Hive
> ---
>
> Key: HIVE-20911
> URL: https://issues.apache.org/jira/browse/HIVE-20911
> Project: Hive
>  Issue Type: Bug
>  Components: HiveServer2
>Affects Versions: 4.0.0
>Reporter: anishek
>Assignee: anishek
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: HIVE-20911.01.patch, HIVE-20911.02.patch, 
> HIVE-20911.03.patch, HIVE-20911.04.patch, HIVE-20911.05.patch, 
> HIVE-20911.06.patch, HIVE-20911.07.patch, HIVE-20911.07.patch
>
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-20911) External Table Replication for Hive

2018-12-19 Thread anishek (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

anishek updated HIVE-20911:
---
Attachment: (was: HIVE-20911.07.patch)

> External Table Replication for Hive
> ---
>
> Key: HIVE-20911
> URL: https://issues.apache.org/jira/browse/HIVE-20911
> Project: Hive
>  Issue Type: Bug
>  Components: HiveServer2
>Affects Versions: 4.0.0
>Reporter: anishek
>Assignee: anishek
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: HIVE-20911.01.patch, HIVE-20911.02.patch, 
> HIVE-20911.03.patch, HIVE-20911.04.patch, HIVE-20911.05.patch, 
> HIVE-20911.06.patch, HIVE-20911.07.patch
>
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-20911) External Table Replication for Hive

2018-12-19 Thread anishek (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

anishek updated HIVE-20911:
---
Attachment: HIVE-20911.07.patch

> External Table Replication for Hive
> ---
>
> Key: HIVE-20911
> URL: https://issues.apache.org/jira/browse/HIVE-20911
> Project: Hive
>  Issue Type: Bug
>  Components: HiveServer2
>Affects Versions: 4.0.0
>Reporter: anishek
>Assignee: anishek
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: HIVE-20911.01.patch, HIVE-20911.02.patch, 
> HIVE-20911.03.patch, HIVE-20911.04.patch, HIVE-20911.05.patch, 
> HIVE-20911.06.patch, HIVE-20911.07.patch
>
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-20911) External Table Replication for Hive

2018-12-19 Thread anishek (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

anishek updated HIVE-20911:
---
Attachment: HIVE-20911.07.patch

> External Table Replication for Hive
> ---
>
> Key: HIVE-20911
> URL: https://issues.apache.org/jira/browse/HIVE-20911
> Project: Hive
>  Issue Type: Bug
>  Components: HiveServer2
>Affects Versions: 4.0.0
>Reporter: anishek
>Assignee: anishek
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: HIVE-20911.01.patch, HIVE-20911.02.patch, 
> HIVE-20911.03.patch, HIVE-20911.04.patch, HIVE-20911.05.patch, 
> HIVE-20911.06.patch, HIVE-20911.07.patch
>
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-20911) External Table Replication for Hive

2018-12-18 Thread anishek (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

anishek updated HIVE-20911:
---
Description: 
External tables are currently not replicated as part of Hive replication. As 
part of this jira we want to enable that.

Approach:
* The target cluster will have a top-level base directory config that will be 
used to copy all data relevant to external tables. This will be provided via 
the *with* clause in the *repl load* command. This base path will be prefixed 
to the path of the same external table on the source cluster. It can be set 
using the following configuration:
{code}
hive.repl.replica.external.table.base.dir=/
{code}
* Since changes to the directories of an external table can happen without 
Hive knowing about them, we cannot capture the relevant events whenever new 
data is added or removed. We therefore have to copy the data from the source 
path to the target path for external tables every time we run incremental 
replication.
** This requires incremental *repl dump* to create an additional file 
*\_external\_tables\_info* with data in the following form:
{code}
tableName,base64Encoded(tableDataLocation)
{code}
If different partitions of a table point to different locations, there will be 
multiple entries in the file for the same table name, each pointing to a 
different partition location. Partitions created without an explicit _set 
location_ stay within the table's data location and hence do not get separate 
entries in the file above.
** *repl load* will read *\_external\_tables\_info* to identify which 
locations are to be copied from source to target and create the corresponding 
tasks for them.
* New external tables will be created with metadata only, with no data copied, 
as part of the regular tasks during incremental and bootstrap load.
* Bootstrap dump will also create *\_external\_tables\_info*, which will be 
used to copy data from source to target as part of bootstrap load.
* Bootstrap load will create a DAG that can use parallelism in the execution 
phase; the HDFS copy tasks are created once the bootstrap phase is complete.
* Since incremental load results in a DAG with only sequential execution 
(events applied in sequence), to effectively use the parallelism capability in 
execution mode, we create the tasks for the HDFS copy along with the 
incremental DAG. This requires a few basic calculations to approximately meet 
the configured value of "hive.repl.approx.max.load.tasks".

  was:
External tables are currently not replicated as part of Hive replication. As 
part of this jira we want to enable that.

Approach:
* The target cluster will have a top-level base directory config that will be 
used to copy all data relevant to external tables. This will be provided via 
the *with* clause in the *repl load* command. This base path will be prefixed 
to the path of the same external table on the source cluster.
* Since changes to the directories of an external table can happen without 
Hive knowing about them, we cannot capture the relevant events whenever new 
data is added or removed. We therefore have to copy the data from the source 
path to the target path for external tables every time we run incremental 
replication.
** This requires incremental *repl dump* to create an additional file 
*\_external\_tables\_info* with data in the following form:
{code}
tableName,base64Encoded(tableDataLocation)
{code}
If different partitions of a table point to different locations, there will be 
multiple entries in the file for the same table name, each pointing to a 
different partition location. Partitions created without an explicit _set 
location_ stay within the table's data location and hence do not get separate 
entries in the file above.
** *repl load* will read *\_external\_tables\_info* to identify which 
locations are to be copied from source to target and create the corresponding 
tasks for them.
* New external tables will be created with metadata only, with no data copied, 
as part of the regular tasks during incremental and bootstrap load.
* Bootstrap dump will also create *\_external\_tables\_info*, which will be 
used to copy data from source to target as part of bootstrap load.
* Bootstrap load will create a DAG that can use parallelism in the execution 
phase; the HDFS copy tasks are created once the bootstrap phase is complete.
* Since incremental load results in a DAG with only sequential execution 
(events applied in sequence), to effectively use the parallelism capability in 
execution mode, we create the tasks for the HDFS copy along with the 
incremental DAG. This requires a few basic calculations to approximately meet 
the configured value of "hive.repl.approx.max.load.tasks".


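For what the last bullet's "few basic calculations" amount to: the number of
tasks emitted by one REPL LOAD cycle is bounded so that it approximately meets
hive.repl.approx.max.load.tasks, and the remaining work resumes in the next
cycle. A rough, hypothetical illustration of that cap (invented names; Hive's
actual accounting walks the generated DAG):
{code}
import java.util.ArrayList;
import java.util.List;

// Rough sketch: stop emitting tasks once the approximate cap is reached;
// whatever is left over is picked up by the next REPL LOAD cycle.
public final class ApproxTaskCapDemo {
  public static void main(String[] args) {
    int approxMaxLoadTasks = 4; // assumed value of hive.repl.approx.max.load.tasks
    List<String> pending =
        List.of("event1", "event2", "hdfsCopy1", "hdfsCopy2", "event3");

    List<String> thisCycle = new ArrayList<>();
    for (String task : pending) {
      if (thisCycle.size() >= approxMaxLoadTasks) {
        break; // cap reached; defer the rest to the next cycle
      }
      thisCycle.add(task);
    }
    System.out.println("scheduling " + thisCycle + "; "
        + (pending.size() - thisCycle.size()) + " task(s) deferred");
  }
}
{code}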

[jira] [Commented] (HIVE-20911) External Table Replication for Hive

2018-12-18 Thread anishek (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16723831#comment-16723831
 ] 

anishek commented on HIVE-20911:


Attaching the patch once again, since these tests pass on the local machine but 
fail in the Apache builds. Trying to provide the fully qualified path for the 
Avro schema file.

> External Table Replication for Hive
> ---
>
> Key: HIVE-20911
> URL: https://issues.apache.org/jira/browse/HIVE-20911
> Project: Hive
>  Issue Type: Bug
>  Components: HiveServer2
>Affects Versions: 4.0.0
>Reporter: anishek
>Assignee: anishek
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: HIVE-20911.01.patch, HIVE-20911.02.patch, 
> HIVE-20911.03.patch, HIVE-20911.04.patch, HIVE-20911.05.patch, 
> HIVE-20911.06.patch
>
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-20911) External Table Replication for Hive

2018-12-18 Thread anishek (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

anishek updated HIVE-20911:
---
Attachment: HIVE-20911.06.patch

> External Table Replication for Hive
> ---
>
> Key: HIVE-20911
> URL: https://issues.apache.org/jira/browse/HIVE-20911
> Project: Hive
>  Issue Type: Bug
>  Components: HiveServer2
>Affects Versions: 4.0.0
>Reporter: anishek
>Assignee: anishek
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: HIVE-20911.01.patch, HIVE-20911.02.patch, 
> HIVE-20911.03.patch, HIVE-20911.04.patch, HIVE-20911.05.patch, 
> HIVE-20911.06.patch
>
>
> External tables are not replicated currently as part of hive replication. As 
> part of this jira we want to enable that.
> Approach:
> * Target cluster will have a top level base directory config that will be 
> used to copy all data relevant to external tables. This will be provided via 
> the *with* clause in the *repl load* command. This base path will be prefixed 
> to the path of the same external table on source cluster.
> * Since changes to the directories of an external table can happen without 
> Hive knowing about them, we cannot capture the relevant events whenever new 
> data is added or removed; instead, we have to copy the data from the source 
> path to the target path for external tables every time we run incremental 
> replication.
> ** this will require incremental *repl dump*  to now create an additional 
> file *\_external\_tables\_info* with data in the following form 
> {code}
> tableName,base64Encoded(tableDataLocation)
> {code}
> If partitions of the table point to different locations, there will be 
> multiple entries in the file for the same table name, each pointing to a 
> different partition location. Partitions created in a table without the _set 
> location_ clause reside within the table's data location, and hence do not 
> get separate entries in the file above.
> ** *repl load* will read  *\_external\_tables\_info* to identify which 
> locations are to be copied from source to target, and create the 
> corresponding tasks for them.
> * New external tables will be created as metadata only, with no data copied, 
> as part of the regular tasks during incremental load/bootstrap load.
> * Bootstrap dump will also create  *\_external\_tables\_info*, which will be 
> used to copy data from source to target as part of bootstrap load.
> * Bootstrap load will create a DAG that can use parallelism in the execution 
> phase; the hdfs copy related tasks are created once the bootstrap phase is 
> complete.
> * Since incremental load results in a DAG with only sequential execution 
> (events applied in sequence), we create the tasks for hdfs copy along with 
> the incremental DAG to effectively use the parallelism capability of the 
> execution mode. This requires a few basic calculations to approximately meet 
> the configured value of "hive.repl.approx.max.load.tasks".



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-20911) External Table Replication for Hive

2018-12-17 Thread anishek (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

anishek updated HIVE-20911:
---
Attachment: HIVE-20911.05.patch

> External Table Replication for Hive
> ---
>
> Key: HIVE-20911
> URL: https://issues.apache.org/jira/browse/HIVE-20911
> Project: Hive
>  Issue Type: Bug
>  Components: HiveServer2
>Affects Versions: 4.0.0
>Reporter: anishek
>Assignee: anishek
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: HIVE-20911.01.patch, HIVE-20911.02.patch, 
> HIVE-20911.03.patch, HIVE-20911.04.patch, HIVE-20911.05.patch
>
>
> External tables are not replicated currently as part of hive replication. As 
> part of this jira we want to enable that.
> Approach:
> * Target cluster will have a top level base directory config that will be 
> used to copy all data relevant to external tables. This will be provided via 
> the *with* clause in the *repl load* command. This base path will be prefixed 
> to the path of the same external table on source cluster.
> * Since changes to the directories of an external table can happen without 
> Hive knowing about them, we cannot capture the relevant events whenever new 
> data is added or removed; instead, we have to copy the data from the source 
> path to the target path for external tables every time we run incremental 
> replication.
> ** this will require incremental *repl dump*  to now create an additional 
> file *\_external\_tables\_info* with data in the following form 
> {code}
> tableName,base64Encoded(tableDataLocation)
> {code}
> If partitions of the table point to different locations, there will be 
> multiple entries in the file for the same table name, each pointing to a 
> different partition location. Partitions created in a table without the _set 
> location_ clause reside within the table's data location, and hence do not 
> get separate entries in the file above.
> ** *repl load* will read  *\_external\_tables\_info* to identify which 
> locations are to be copied from source to target, and create the 
> corresponding tasks for them.
> * New external tables will be created as metadata only, with no data copied, 
> as part of the regular tasks during incremental load/bootstrap load.
> * Bootstrap dump will also create  *\_external\_tables\_info*, which will be 
> used to copy data from source to target as part of bootstrap load.
> * Bootstrap load will create a DAG that can use parallelism in the execution 
> phase; the hdfs copy related tasks are created once the bootstrap phase is 
> complete.
> * Since incremental load results in a DAG with only sequential execution 
> (events applied in sequence), we create the tasks for hdfs copy along with 
> the incremental DAG to effectively use the parallelism capability of the 
> execution mode. This requires a few basic calculations to approximately meet 
> the configured value of "hive.repl.approx.max.load.tasks".



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-20911) External Table Replication for Hive

2018-12-17 Thread anishek (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

anishek updated HIVE-20911:
---
Attachment: HIVE-20911.04.patch

> External Table Replication for Hive
> ---
>
> Key: HIVE-20911
> URL: https://issues.apache.org/jira/browse/HIVE-20911
> Project: Hive
>  Issue Type: Bug
>  Components: HiveServer2
>Affects Versions: 4.0.0
>Reporter: anishek
>Assignee: anishek
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: HIVE-20911.01.patch, HIVE-20911.02.patch, 
> HIVE-20911.03.patch, HIVE-20911.04.patch
>
>
> External tables are not replicated currently as part of hive replication. As 
> part of this jira we want to enable that.
> Approach:
> * Target cluster will have a top level base directory config that will be 
> used to copy all data relevant to external tables. This will be provided via 
> the *with* clause in the *repl load* command. This base path will be prefixed 
> to the path of the same external table on source cluster.
> * Since changes to the directories of an external table can happen without 
> Hive knowing about them, we cannot capture the relevant events whenever new 
> data is added or removed; instead, we have to copy the data from the source 
> path to the target path for external tables every time we run incremental 
> replication.
> ** this will require incremental *repl dump*  to now create an additional 
> file *\_external\_tables\_info* with data in the following form 
> {code}
> tableName,base64Encoded(tableDataLocation)
> {code}
> If partitions of the table point to different locations, there will be 
> multiple entries in the file for the same table name, each pointing to a 
> different partition location. Partitions created in a table without the _set 
> location_ clause reside within the table's data location, and hence do not 
> get separate entries in the file above.
> ** *repl load* will read  *\_external\_tables\_info* to identify which 
> locations are to be copied from source to target, and create the 
> corresponding tasks for them.
> * New external tables will be created as metadata only, with no data copied, 
> as part of the regular tasks during incremental load/bootstrap load.
> * Bootstrap dump will also create  *\_external\_tables\_info*, which will be 
> used to copy data from source to target as part of bootstrap load.
> * Bootstrap load will create a DAG that can use parallelism in the execution 
> phase; the hdfs copy related tasks are created once the bootstrap phase is 
> complete.
> * Since incremental load results in a DAG with only sequential execution 
> (events applied in sequence), we create the tasks for hdfs copy along with 
> the incremental DAG to effectively use the parallelism capability of the 
> execution mode. This requires a few basic calculations to approximately meet 
> the configured value of "hive.repl.approx.max.load.tasks".



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-20911) External Table Replication for Hive

2018-12-14 Thread anishek (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

anishek updated HIVE-20911:
---
Attachment: HIVE-20911.03.patch

> External Table Replication for Hive
> ---
>
> Key: HIVE-20911
> URL: https://issues.apache.org/jira/browse/HIVE-20911
> Project: Hive
>  Issue Type: Bug
>  Components: HiveServer2
>Affects Versions: 4.0.0
>Reporter: anishek
>Assignee: anishek
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: HIVE-20911.01.patch, HIVE-20911.02.patch, 
> HIVE-20911.03.patch
>
>
> External tables are not replicated currently as part of hive replication. As 
> part of this jira we want to enable that.
> Approach:
> * Target cluster will have a top level base directory config that will be 
> used to copy all data relevant to external tables. This will be provided via 
> the *with* clause in the *repl load* command. This base path will be prefixed 
> to the path of the same external table on source cluster.
> * Since changes to the directories of an external table can happen without 
> Hive knowing about them, we cannot capture the relevant events whenever new 
> data is added or removed; instead, we have to copy the data from the source 
> path to the target path for external tables every time we run incremental 
> replication.
> ** this will require incremental *repl dump*  to now create an additional 
> file *\_external\_tables\_info* with data in the following form 
> {code}
> tableName,base64Encoded(tableDataLocation)
> {code}
> If partitions of the table point to different locations, there will be 
> multiple entries in the file for the same table name, each pointing to a 
> different partition location. Partitions created in a table without the _set 
> location_ clause reside within the table's data location, and hence do not 
> get separate entries in the file above.
> ** *repl load* will read  *\_external\_tables\_info* to identify which 
> locations are to be copied from source to target, and create the 
> corresponding tasks for them.
> * New external tables will be created as metadata only, with no data copied, 
> as part of the regular tasks during incremental load/bootstrap load.
> * Bootstrap dump will also create  *\_external\_tables\_info*, which will be 
> used to copy data from source to target as part of bootstrap load.
> * Bootstrap load will create a DAG that can use parallelism in the execution 
> phase; the hdfs copy related tasks are created once the bootstrap phase is 
> complete.
> * Since incremental load results in a DAG with only sequential execution 
> (events applied in sequence), we create the tasks for hdfs copy along with 
> the incremental DAG to effectively use the parallelism capability of the 
> execution mode. This requires a few basic calculations to approximately meet 
> the configured value of "hive.repl.approx.max.load.tasks".



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-20911) External Table Replication for Hive

2018-12-13 Thread anishek (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

anishek updated HIVE-20911:
---
Attachment: HIVE-20911.02.patch

> External Table Replication for Hive
> ---
>
> Key: HIVE-20911
> URL: https://issues.apache.org/jira/browse/HIVE-20911
> Project: Hive
>  Issue Type: Bug
>  Components: HiveServer2
>Affects Versions: 4.0.0
>Reporter: anishek
>Assignee: anishek
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: HIVE-20911.01.patch, HIVE-20911.02.patch
>
>
> External tables are not replicated currently as part of hive replication. As 
> part of this jira we want to enable that.
> Approach:
> * Target cluster will have a top level base directory config that will be 
> used to copy all data relevant to external tables. This will be provided via 
> the *with* clause in the *repl load* command. This base path will be prefixed 
> to the path of the same external table on source cluster.
> * Since changes to the directories of an external table can happen without 
> Hive knowing about them, we cannot capture the relevant events whenever new 
> data is added or removed; instead, we have to copy the data from the source 
> path to the target path for external tables every time we run incremental 
> replication.
> ** this will require incremental *repl dump*  to now create an additional 
> file *\_external\_tables\_info* with data in the following form 
> {code}
> tableName,base64Encoded(tableDataLocation)
> {code}
> If partitions of the table point to different locations, there will be 
> multiple entries in the file for the same table name, each pointing to a 
> different partition location. Partitions created in a table without the _set 
> location_ clause reside within the table's data location, and hence do not 
> get separate entries in the file above.
> ** *repl load* will read  *\_external\_tables\_info* to identify which 
> locations are to be copied from source to target, and create the 
> corresponding tasks for them.
> * New external tables will be created as metadata only, with no data copied, 
> as part of the regular tasks during incremental load/bootstrap load.
> * Bootstrap dump will also create  *\_external\_tables\_info*, which will be 
> used to copy data from source to target as part of bootstrap load.
> * Bootstrap load will create a DAG that can use parallelism in the execution 
> phase; the hdfs copy related tasks are created once the bootstrap phase is 
> complete.
> * Since incremental load results in a DAG with only sequential execution 
> (events applied in sequence), we create the tasks for hdfs copy along with 
> the incremental DAG to effectively use the parallelism capability of the 
> execution mode. This requires a few basic calculations to approximately meet 
> the configured value of "hive.repl.approx.max.load.tasks".



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-21022) Fix remote metastore tests which use ZooKeeper

2018-12-13 Thread anishek (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-21022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16719964#comment-16719964
 ] 

anishek commented on HIVE-21022:


+1. Patch committed to master. Thanks, [~ashutosh.bapat]!

> Fix remote metastore tests which use ZooKeeper
> --
>
> Key: HIVE-21022
> URL: https://issues.apache.org/jira/browse/HIVE-21022
> Project: Hive
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 4.0.0
>Reporter: Ashutosh Bapat
>Assignee: Ashutosh Bapat
>Priority: Major
> Fix For: 4.0.0
>
> Attachments: HIVE-21022.01, HIVE-21022.01, HIVE-21022.01, 
> HIVE-21022.02, HIVE-21022.02.patch, HIVE-21022.03, HIVE-21022.03, 
> HIVE-21022.04, HIVE-21022.05, HIVE-21022.05
>
>
> Per [~vgarg]'s comment on HIVE-20794 at 
> https://issues.apache.org/jira/browse/HIVE-20794?focusedCommentId=16714093=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16714093,
>  the remote metastore tests using ZooKeeper are flaky. They are failing with 
> error "Got exception: org.apache.zookeeper.KeeperException$NoNodeException 
> KeeperErrorCode = NoNode for /hs2mszktest".
> Both of these tests are using the same root namespace and hence the reason 
> for this failure could be that the root namespace becomes unavailable to one 
> test when the other drops it. The drop seems to be happening automatically 
> through TestingServer code.
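As a rough sketch of the isolation fix implied here, assuming Curator's TestingServer (class and namespace names are illustrative, not the actual test code), each test can create its own root namespace instead of both tests sharing /hs2mszktest:

{code}
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;
import org.apache.curator.test.TestingServer;

public class ZkNamespaceIsolationSketch {
  public static void main(String[] args) throws Exception {
    // Each test gets its own embedded server and its own root znode,
    // so dropping one namespace cannot affect the other test.
    try (TestingServer zk = new TestingServer()) {
      String rootNamespace = "/hs2mszktest-" + System.nanoTime();
      CuratorFramework client = CuratorFrameworkFactory.newClient(
          zk.getConnectString(), new ExponentialBackoffRetry(1000, 3));
      client.start();
      client.create().creatingParentsIfNeeded().forPath(rootNamespace);
      System.out.println("created " + rootNamespace);
      client.close();
    } // TestingServer.close() tears down the server and its data
  }
}
{code}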



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-21022) Fix remote metastore tests which use ZooKeeper

2018-12-13 Thread anishek (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

anishek updated HIVE-21022:
---
Resolution: Fixed
Status: Resolved  (was: Patch Available)

> Fix remote metastore tests which use ZooKeeper
> --
>
> Key: HIVE-21022
> URL: https://issues.apache.org/jira/browse/HIVE-21022
> Project: Hive
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 4.0.0
>Reporter: Ashutosh Bapat
>Assignee: Ashutosh Bapat
>Priority: Major
> Fix For: 4.0.0
>
> Attachments: HIVE-21022.01, HIVE-21022.01, HIVE-21022.01, 
> HIVE-21022.02, HIVE-21022.02.patch, HIVE-21022.03, HIVE-21022.03, 
> HIVE-21022.04, HIVE-21022.05, HIVE-21022.05
>
>
> Per [~vgarg]'s comment on HIVE-20794 at 
> https://issues.apache.org/jira/browse/HIVE-20794?focusedCommentId=16714093=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16714093,
>  the remote metastore tests using ZooKeeper are flaky. They are failing with 
> error "Got exception: org.apache.zookeeper.KeeperException$NoNodeException 
> KeeperErrorCode = NoNode for /hs2mszktest".
> Both of these tests are using the same root namespace and hence the reason 
> for this failure could be that the root namespace becomes unavailable to one 
> test when the other drops it. The drop seems to be happening automatically 
> through TestingServer code.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-20911) External Table Replication for Hive

2018-12-11 Thread anishek (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718483#comment-16718483
 ] 

anishek commented on HIVE-20911:


Submitting the initial patch for tests.

[~maheshk114]/[~sankarh]/[~ashutosh.bapat] please review!

> External Table Replication for Hive
> ---
>
> Key: HIVE-20911
> URL: https://issues.apache.org/jira/browse/HIVE-20911
> Project: Hive
>  Issue Type: Bug
>  Components: HiveServer2
>Affects Versions: 4.0.0
>Reporter: anishek
>Assignee: anishek
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: HIVE-20911.01.patch
>
>
> External tables are not replicated currently as part of hive replication. As 
> part of this jira we want to enable that.
> Approach:
> * Target cluster will have a top level base directory config that will be 
> used to copy all data relevant to external tables. This will be provided via 
> the *with* clause in the *repl load* command. This base path will be prefixed 
> to the path of the same external table on source cluster.
> * Since changes to the directories of an external table can happen without 
> Hive knowing about them, we cannot capture the relevant events whenever new 
> data is added or removed; instead, we have to copy the data from the source 
> path to the target path for external tables every time we run incremental 
> replication.
> ** this will require incremental *repl dump*  to now create an additional 
> file *\_external\_tables\_info* with data in the following form 
> {code}
> tableName,base64Encoded(tableDataLocation)
> {code}
> If partitions of the table point to different locations, there will be 
> multiple entries in the file for the same table name, each pointing to a 
> different partition location. Partitions created in a table without the _set 
> location_ clause reside within the table's data location, and hence do not 
> get separate entries in the file above.
> ** *repl load* will read  *\_external\_tables\_info* to identify which 
> locations are to be copied from source to target, and create the 
> corresponding tasks for them.
> * New external tables will be created as metadata only, with no data copied, 
> as part of the regular tasks during incremental load/bootstrap load.
> * Bootstrap dump will also create  *\_external\_tables\_info*, which will be 
> used to copy data from source to target as part of bootstrap load.
> * Bootstrap load will create a DAG that can use parallelism in the execution 
> phase; the hdfs copy related tasks are created once the bootstrap phase is 
> complete.
> * Since incremental load results in a DAG with only sequential execution 
> (events applied in sequence), we create the tasks for hdfs copy along with 
> the incremental DAG to effectively use the parallelism capability of the 
> execution mode. This requires a few basic calculations to approximately meet 
> the configured value of "hive.repl.approx.max.load.tasks".



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-20911) External Table Replication for Hive

2018-12-11 Thread anishek (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

anishek updated HIVE-20911:
---
Status: Patch Available  (was: In Progress)

> External Table Replication for Hive
> ---
>
> Key: HIVE-20911
> URL: https://issues.apache.org/jira/browse/HIVE-20911
> Project: Hive
>  Issue Type: Bug
>  Components: HiveServer2
>Affects Versions: 4.0.0
>Reporter: anishek
>Assignee: anishek
>Priority: Critical
> Fix For: 4.0.0
>
> Attachments: HIVE-20911.01.patch
>
>
> External tables are not replicated currently as part of hive replication. As 
> part of this jira we want to enable that.
> Approach:
> * Target cluster will have a top level base directory config that will be 
> used to copy all data relevant to external tables. This will be provided via 
> the *with* clause in the *repl load* command. This base path will be prefixed 
> to the path of the same external table on source cluster.
> * Since changes to the directories of an external table can happen without 
> Hive knowing about them, we cannot capture the relevant events whenever new 
> data is added or removed; instead, we have to copy the data from the source 
> path to the target path for external tables every time we run incremental 
> replication.
> ** this will require incremental *repl dump*  to now create an additional 
> file *\_external\_tables\_info* with data in the following form 
> {code}
> tableName,base64Encoded(tableDataLocation)
> {code}
> If partitions of the table point to different locations, there will be 
> multiple entries in the file for the same table name, each pointing to a 
> different partition location. Partitions created in a table without the _set 
> location_ clause reside within the table's data location, and hence do not 
> get separate entries in the file above.
> ** *repl load* will read  *\_external\_tables\_info* to identify which 
> locations are to be copied from source to target, and create the 
> corresponding tasks for them.
> * New external tables will be created as metadata only, with no data copied, 
> as part of the regular tasks during incremental load/bootstrap load.
> * Bootstrap dump will also create  *\_external\_tables\_info*, which will be 
> used to copy data from source to target as part of bootstrap load.
> * Bootstrap load will create a DAG that can use parallelism in the execution 
> phase; the hdfs copy related tasks are created once the bootstrap phase is 
> complete.
> * Since incremental load results in a DAG with only sequential execution 
> (events applied in sequence), we create the tasks for hdfs copy along with 
> the incremental DAG to effectively use the parallelism capability of the 
> execution mode. This requires a few basic calculations to approximately meet 
> the configured value of "hive.repl.approx.max.load.tasks".



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Work started] (HIVE-20911) External Table Replication for Hive

2018-12-11 Thread anishek (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on HIVE-20911 started by anishek.
--
> External Table Replication for Hive
> ---
>
> Key: HIVE-20911
> URL: https://issues.apache.org/jira/browse/HIVE-20911
> Project: Hive
>  Issue Type: Bug
>  Components: HiveServer2
>Affects Versions: 4.0.0
>Reporter: anishek
>Assignee: anishek
>Priority: Critical
> Fix For: 4.0.0
>
> Attachments: HIVE-20911.01.patch
>
>
> External tables are not replicated currently as part of hive replication. As 
> part of this jira we want to enable that.
> Approach:
> * Target cluster will have a top level base directory config that will be 
> used to copy all data relevant to external tables. This will be provided via 
> the *with* clause in the *repl load* command. This base path will be prefixed 
> to the path of the same external table on source cluster.
> * Since changes to the directories of an external table can happen without 
> Hive knowing about them, we cannot capture the relevant events whenever new 
> data is added or removed; instead, we have to copy the data from the source 
> path to the target path for external tables every time we run incremental 
> replication.
> ** this will require incremental *repl dump*  to now create an additional 
> file *\_external\_tables\_info* with data in the following form 
> {code}
> tableName,base64Encoded(tableDataLocation)
> {code}
> If partitions of the table point to different locations, there will be 
> multiple entries in the file for the same table name, each pointing to a 
> different partition location. Partitions created in a table without the _set 
> location_ clause reside within the table's data location, and hence do not 
> get separate entries in the file above.
> ** *repl load* will read  *\_external\_tables\_info* to identify which 
> locations are to be copied from source to target, and create the 
> corresponding tasks for them.
> * New external tables will be created as metadata only, with no data copied, 
> as part of the regular tasks during incremental load/bootstrap load.
> * Bootstrap dump will also create  *\_external\_tables\_info*, which will be 
> used to copy data from source to target as part of bootstrap load.
> * Bootstrap load will create a DAG that can use parallelism in the execution 
> phase; the hdfs copy related tasks are created once the bootstrap phase is 
> complete.
> * Since incremental load results in a DAG with only sequential execution 
> (events applied in sequence), we create the tasks for hdfs copy along with 
> the incremental DAG to effectively use the parallelism capability of the 
> execution mode. This requires a few basic calculations to approximately meet 
> the configured value of "hive.repl.approx.max.load.tasks".



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-20911) External Table Replication for Hive

2018-12-11 Thread anishek (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

anishek updated HIVE-20911:
---
Attachment: HIVE-20911.01.patch

> External Table Replication for Hive
> ---
>
> Key: HIVE-20911
> URL: https://issues.apache.org/jira/browse/HIVE-20911
> Project: Hive
>  Issue Type: Bug
>  Components: HiveServer2
>Affects Versions: 4.0.0
>Reporter: anishek
>Assignee: anishek
>Priority: Critical
> Fix For: 4.0.0
>
> Attachments: HIVE-20911.01.patch
>
>
> External tables are not replicated currently as part of hive replication. As 
> part of this jira we want to enable that.
> Approach:
> * Target cluster will have a top level base directory config that will be 
> used to copy all data relevant to external tables. This will be provided via 
> the *with* clause in the *repl load* command. This base path will be prefixed 
> to the path of the same external table on source cluster.
> * Since changes to the directories of an external table can happen without 
> Hive knowing about them, we cannot capture the relevant events whenever new 
> data is added or removed; instead, we have to copy the data from the source 
> path to the target path for external tables every time we run incremental 
> replication.
> ** this will require incremental *repl dump*  to now create an additional 
> file *\_external\_tables\_info* with data in the following form 
> {code}
> tableName,base64Encoded(tableDataLocation)
> {code}
> If partitions of the table point to different locations, there will be 
> multiple entries in the file for the same table name, each pointing to a 
> different partition location. Partitions created in a table without the _set 
> location_ clause reside within the table's data location, and hence do not 
> get separate entries in the file above.
> ** *repl load* will read  *\_external\_tables\_info* to identify which 
> locations are to be copied from source to target, and create the 
> corresponding tasks for them.
> * New external tables will be created as metadata only, with no data copied, 
> as part of the regular tasks during incremental load/bootstrap load.
> * Bootstrap dump will also create  *\_external\_tables\_info*, which will be 
> used to copy data from source to target as part of bootstrap load.
> * Bootstrap load will create a DAG that can use parallelism in the execution 
> phase; the hdfs copy related tasks are created once the bootstrap phase is 
> complete.
> * Since incremental load results in a DAG with only sequential execution 
> (events applied in sequence), we create the tasks for hdfs copy along with 
> the incremental DAG to effectively use the parallelism capability of the 
> execution mode. This requires a few basic calculations to approximately meet 
> the configured value of "hive.repl.approx.max.load.tasks".



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (HIVE-21029) External table replication: for existing deployments running incremental replication

2018-12-11 Thread anishek (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

anishek reassigned HIVE-21029:
--

Assignee: anishek

> External table replication: for existing deployments running incremental 
> replication
> 
>
> Key: HIVE-21029
> URL: https://issues.apache.org/jira/browse/HIVE-21029
> Project: Hive
>  Issue Type: Bug
>  Components: HiveServer2
>Affects Versions: 3.0.0, 3.1.0, 3.1.1
>Reporter: anishek
>Assignee: anishek
>Priority: Critical
> Fix For: 4.0.0
>
>
> Existing deployments using hive replication do not get external tables 
> replicated. For such deployments to enable external table replication they 
> will have to provide a specific switch to first bootstrap external tables as 
> part of hive incremental replication, following which the incremental 
> replication will take care of further changes in external tables.
> The switch will be provided by an additional hive configuration (for ex: 
> hive.repl.bootstrap.external.tables) and is to be used in the 
> {code} WITH {code} clause of the 
> {code} REPL DUMP {code} command. 
> Additionally, the existing hive config _hive.repl.include.external.tables_ 
> will always have to be set to "true" in the above clause.
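For illustration, here is a minimal JDBC sketch of issuing the dump described above. The JDBC URL is a placeholder, and hive.repl.bootstrap.external.tables is the example switch name from this issue, not a confirmed final name:

{code}
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class ReplDumpExternalBootstrapSketch {
  public static void main(String[] args) throws Exception {
    // Placeholder URL; requires the Hive JDBC driver on the classpath.
    try (Connection conn = DriverManager.getConnection(
             "jdbc:hive2://source-hs2:10000/default");
         Statement stmt = conn.createStatement()) {
      // One-time bootstrap of external tables during an incremental dump,
      // using the example switch named in this issue.
      stmt.execute("REPL DUMP srcdb WITH ("
          + "'hive.repl.include.external.tables'='true',"
          + "'hive.repl.bootstrap.external.tables'='true')");
    }
  }
}
{code}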



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-21029) External table replication: for existing deployments running incremental replication

2018-12-10 Thread anishek (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

anishek updated HIVE-21029:
---
Description: 
Existing deployments using hive replication do not get external tables 
replicated. For such deployments to enable external table replication they will 
have to provide a specific switch to first bootstrap external tables as part of 
hive incremental replication, following which the incremental replication will 
take care of further changes in external tables.

The switch will be provided by an additional hive configuration (for ex: 
hive.repl.bootstrap.external.tables) and is to be used in the 
{code} WITH {code} clause of the 
{code} REPL DUMP {code} command. 

Additionally, the following hive config _hive.repl.include.external.tables_ 
will always have to be set to "true" in the above clause. 

  was:
Existing deployments using hive replication do not get external tables 
replicated. For such deployments to enable external table replication they will 
have to provide a specific switch to first bootstrap external tables as part of 
hive incremental replication, following which the incremental replication will 
take care of further changes in external tables.

The switch will be provided by a hive configuration and is to be used in the 
{code} WITH {code} clause of the 
{code} REPL DUMP {code} command. 

Additionally, the following hive config _hive.repl.include.external.tables_ 
will always have to be set to "true" in the above clause. 


> External table replication: for existing deployments running incremental 
> replication
> 
>
> Key: HIVE-21029
> URL: https://issues.apache.org/jira/browse/HIVE-21029
> Project: Hive
>  Issue Type: Bug
>  Components: HiveServer2
>Affects Versions: 3.0.0, 3.1.0, 3.1.1
>Reporter: anishek
>Priority: Critical
> Fix For: 4.0.0
>
>
> Existing deployments using hive replication do not get external tables 
> replicated. For such deployments to enable external table replication they 
> will have to provide a specific switch to first bootstrap external tables as 
> part of hive incremental replication, following which the incremental 
> replication will take care of further changes in external tables.
> The switch will be provided by an additional hive configuration (for ex: 
> hive.repl.bootstrap.external.tables) and is to be used in the 
> {code} WITH {code} clause of the 
> {code} REPL DUMP {code} command. 
> Additionally, the following hive config _hive.repl.include.external.tables_ 
> will always have to be set to "true" in the above clause. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-21029) External table replication: for existing deployments running incremental replication

2018-12-10 Thread anishek (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-21029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

anishek updated HIVE-21029:
---
Description: 
Existing deployments using hive replication do not get external tables 
replicated. For such deployments to enable external table replication they will 
have to provide a specific switch to first bootstrap external tables as part of 
hive incremental replication, following which the incremental replication will 
take care of further changes in external tables.

The switch will be provided by an additional hive configuration (for ex: 
hive.repl.bootstrap.external.tables) and is to be used in the 
{code} WITH {code} clause of the 
{code} REPL DUMP {code} command. 

Additionally, the existing hive config _hive.repl.include.external.tables_ 
will always have to be set to "true" in the above clause. 

  was:
Existing deployments using hive replication do not get external tables 
replicated. For such deployments to enable external table replication they will 
have to provide a specific switch to first bootstrap external tables as part of 
hive incremental replication, following which the incremental replication will 
take care of further changes in external tables.

The switch will be provided by an additional hive configuration (for ex: 
hive.repl.bootstrap.external.tables) and is to be used in the 
{code} WITH {code} clause of the 
{code} REPL DUMP {code} command. 

Additionally, the following hive config _hive.repl.include.external.tables_ 
will always have to be set to "true" in the above clause. 


> External table replication: for existing deployments running incremental 
> replication
> 
>
> Key: HIVE-21029
> URL: https://issues.apache.org/jira/browse/HIVE-21029
> Project: Hive
>  Issue Type: Bug
>  Components: HiveServer2
>Affects Versions: 3.0.0, 3.1.0, 3.1.1
>Reporter: anishek
>Priority: Critical
> Fix For: 4.0.0
>
>
> Existing deployments using hive replication do not get external tables 
> replicated. For such deployments to enable external table replication they 
> will have to provide a specific switch to first bootstrap external tables as 
> part of hive incremental replication, following which the incremental 
> replication will take care of further changes in external tables.
> The switch will be provided by an additional hive configuration (for ex: 
> hive.repl.bootstrap.external.tables) and is to be used in the 
> {code} WITH {code} clause of the 
> {code} REPL DUMP {code} command. 
> Additionally, the existing hive config _hive.repl.include.external.tables_ 
> will always have to be set to "true" in the above clause. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-20911) External Table Replication for Hive

2018-12-10 Thread anishek (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

anishek updated HIVE-20911:
---
Description: 
External tables are not replicated currently as part of hive replication. As 
part of this jira we want to enable that.

Approach:
* Target cluster will have a top level base directory config that will be used 
to copy all data relevant to external tables. This will be provided via the 
*with* clause in the *repl load* command. This base path will be prefixed to 
the path of the same external table on source cluster.
* Since changes to the directories of an external table can happen without 
Hive knowing about them, we cannot capture the relevant events whenever new 
data is added or removed; instead, we have to copy the data from the source 
path to the target path for external tables every time we run incremental 
replication.
** this will require incremental *repl dump*  to now create an additional file 
*\_external\_tables\_info* with data in the following form 
{code}
tableName,base64Encoded(tableDataLocation)
{code}
If partitions of the table point to different locations, there will be 
multiple entries in the file for the same table name, each pointing to a 
different partition location. Partitions created in a table without the _set 
location_ clause reside within the table's data location, and hence do not 
get separate entries in the file above.
** *repl load* will read  *\_external\_tables\_info* to identify which 
locations are to be copied from source to target, and create the corresponding 
tasks for them.
* New external tables will be created as metadata only, with no data copied, 
as part of the regular tasks during incremental load/bootstrap load.
* Bootstrap dump will also create  *\_external\_tables\_info*, which will be 
used to copy data from source to target as part of bootstrap load.
* Bootstrap load will create a DAG that can use parallelism in the execution 
phase; the hdfs copy related tasks are created once the bootstrap phase is 
complete.
* Since incremental load results in a DAG with only sequential execution 
(events applied in sequence), we create the tasks for hdfs copy along with the 
incremental DAG to effectively use the parallelism capability of the execution 
mode. This requires a few basic calculations to approximately meet the 
configured value of "hive.repl.approx.max.load.tasks".

  was:
External tables are not replicated currently as part of hive replication. As 
part of this jira we want to enable that.

Approach:
* Target cluster will have a top level base directory config that will be used 
to copy all data relevant to external tables. This will be provided via the 
*with* clause in the *repl load* command. This base path will be prefixed to 
the path of the same external table on source cluster.
* Since changes to the directories of an external table can happen without 
Hive knowing about them, we cannot capture the relevant events whenever new 
data is added or removed; instead, we have to copy the data from the source 
path to the target path for external tables every time we run incremental 
replication.
** this will require incremental *repl dump*  to now create an additional file 
*\_external\_tables\_info* with data in the following form 
{code}
tableName,base64Encoded(tableDataLocation)
{code}
If partitions of the table point to different locations, there will be 
multiple entries in the file for the same table name, each pointing to a 
different partition location. Partitions created in a table without the _set 
location_ clause reside within the table's data location, and hence do not 
get separate entries in the file above.
** *repl load* will read  *\_external\_tables\_info* to identify which 
locations are to be copied from source to target, and create the corresponding 
tasks for them.
* New external tables will be created as metadata only, with no data copied, 
as part of the regular tasks during incremental load/bootstrap load.
* Bootstrap dump will also create  *\_external\_tables\_info*, which will be 
used to copy data from source to target as part of bootstrap load.
* Since bootstrap load will most probably create a DAG that can use 
parallelism in the execution phase, the hdfs copy related tasks are created 
only once the bootstrap phase is complete.
* Since incremental load results in a DAG with only sequential execution 
(events applied in sequence), to effectively use the parallelism capability of 
the execution mode we create the tasks for hdfs copy along with the 
incremental DAG. This requires a few basic calculations to approximately meet 
the configured value of "hive.repl.approx.max.load.tasks".


> External Table Replication for Hive
> ---
>
> Key: HIVE-20911
> URL: 

[jira] [Updated] (HIVE-20708) Load (dumped) an external table as an external table on target with the same location as on the source

2018-12-10 Thread anishek (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

anishek updated HIVE-20708:
---
Resolution: Duplicate
Status: Resolved  (was: Patch Available)

Duplicate of HIVE-20911.

> Load (dumped) an external table as an external table on target with the same 
> location as on the source
> --
>
> Key: HIVE-20708
> URL: https://issues.apache.org/jira/browse/HIVE-20708
> Project: Hive
>  Issue Type: Improvement
>  Components: repl
>Reporter: Ashutosh Bapat
>Assignee: Ashutosh Bapat
>Priority: Major
>  Labels: pull-request-available
> Attachments: HIVE-20708.01, HIVE-20708.02, HIVE-20708.03
>
>
> External tables are currently mapped to managed tables on the target. A lot 
> of jobs in user environments depend on the locations specified in external 
> table definitions to run; hence, the path for external tables on the target 
> and on the source is expected to be the same. An external table being loaded 
> as a managed table makes it difficult for failover (Controlled Failover) / 
> failback, since there is no option of moving data from a managed to an 
> external table. So the external table replicated to the target cluster needs 
> to be kept as an external table with the same location as on the source.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-20911) External Table Replication for Hive

2018-12-10 Thread anishek (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

anishek updated HIVE-20911:
---
Description: 
External tables are not replicated currently as part of hive replication. As 
part of this jira we want to enable that.

Approach:
* Target cluster will have a top level base directory config that will be used 
to copy all data relevant to external tables. This will be provided via the 
*with* clause in the *repl load* command. This base path will be prefixed to 
the path of the same external table on source cluster.
* Since changes to the directories of an external table can happen without 
Hive knowing about them, we cannot capture the relevant events whenever new 
data is added or removed; instead, we have to copy the data from the source 
path to the target path for external tables every time we run incremental 
replication.
** this will require incremental *repl dump*  to now create an additional file 
*\_external\_tables\_info* with data in the following form 
{code}
tableName,base64Encoded(tableDataLocation)
{code}
If partitions of the table point to different locations, there will be 
multiple entries in the file for the same table name, each pointing to a 
different partition location. Partitions created in a table without the _set 
location_ clause reside within the table's data location, and hence do not 
get separate entries in the file above.
** *repl load* will read  *\_external\_tables\_info* to identify which 
locations are to be copied from source to target, and create the corresponding 
tasks for them.
* New external tables will be created as metadata only, with no data copied, 
as part of the regular tasks during incremental load/bootstrap load.
* Bootstrap dump will also create  *\_external\_tables\_info*, which will be 
used to copy data from source to target as part of bootstrap load.
* Since bootstrap load will most probably create a DAG that can use 
parallelism in the execution phase, the hdfs copy related tasks are created 
only once the bootstrap phase is complete.
* Since incremental load results in a DAG with only sequential execution 
(events applied in sequence), to effectively use the parallelism capability of 
the execution mode we create the tasks for hdfs copy along with the 
incremental DAG. This requires a few basic calculations to approximately meet 
the configured value of "hive.repl.approx.max.load.tasks".

  was:
External tables are not replicated currently as part of hive replication. As 
part of this jira we want to enable that.

Approach:
* Target cluster will have a top level base directory config that will be used 
to copy all data relevant to external tables. This will be provided via the 
*with* clause in the *repl load* command. This base path will be prefixed to 
the path of the same external table on source cluster.
* Since changes to the directories of an external table can happen without 
Hive knowing about them, we cannot capture the relevant events whenever new 
data is added or removed; instead, we have to copy the data from the source 
path to the target path for external tables every time we run incremental 
replication.
** this will require incremental *repl dump*  to now create an additional file 
*\_external\_tables\_info* with data in the following form 
{code}
tableName,base64Encoded(tableDataLocation)
{code}
If partitions of the table point to different locations, there will be 
multiple entries in the file for the same table name, each pointing to a 
different partition location. Partitions created in a table without the _set 
location_ clause reside within the table's data location, and hence do not 
get separate entries in the file above.

** *repl load* will read  *\_external\_tables\_info* to identify which 
locations are to be copied from source to target, and create the corresponding 
tasks for them.
* New external tables will be created as metadata only, with no data copied, 
as part of the regular tasks during incremental load/bootstrap load.
* Bootstrap dump will also create  *\_external\_tables\_info*, which will be 
used to copy data from source to target as part of bootstrap load.
* Since bootstrap load will most probably create a DAG that can use 
parallelism in the execution phase, the hdfs copy related tasks are created 
only once the bootstrap phase is complete.
* Since incremental load results in a DAG with only sequential execution 
(events applied in sequence), to effectively use the parallelism capability of 
the execution mode we create the tasks for hdfs copy along with the 
incremental DAG. This requires a few basic calculations to approximately meet 
the configured value of "hive.repl.approx.max.load.tasks".


> External Table Replication for Hive
> ---
>
> Key: HIVE-20911
>  

[jira] [Updated] (HIVE-20911) External Table Replication for Hive

2018-12-10 Thread anishek (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

anishek updated HIVE-20911:
---
Description: 
External tables are not replicated currently as part of hive replication. As 
part of this jira we want to enable that.

Approach:
* Target cluster will have a top level base directory config that will be used 
to copy all data relevant to external tables. This will be provided via the 
*with* clause in the *repl load* command. This base path will be prefixed to 
the path of the same external table on source cluster.
* Since changes to the directories of an external table can happen without 
Hive knowing about them, we cannot capture the relevant events whenever new 
data is added or removed; instead, we have to copy the data from the source 
path to the target path for external tables every time we run incremental 
replication.
** this will require incremental *repl dump*  to now create an additional file 
*\_external\_tables\_info* with data in the following form 
{code}
tableName,base64Encoded(tableDataLocation)
{code}
If partitions of the table point to different locations, there will be 
multiple entries in the file for the same table name, each pointing to a 
different partition location. Partitions created in a table without the _set 
location_ clause reside within the table's data location, and hence do not 
get separate entries in the file above.

** *repl load* will read  *\_external\_tables\_info* to identify which 
locations are to be copied from source to target, and create the corresponding 
tasks for them.
* New external tables will be created as metadata only, with no data copied, 
as part of the regular tasks during incremental load/bootstrap load.
* Bootstrap dump will also create  *\_external\_tables\_info*, which will be 
used to copy data from source to target as part of bootstrap load.
* Since bootstrap load will most probably create a DAG that can use 
parallelism in the execution phase, the hdfs copy related tasks are created 
only once the bootstrap phase is complete.
* Since incremental load results in a DAG with only sequential execution 
(events applied in sequence), to effectively use the parallelism capability of 
the execution mode we create the tasks for hdfs copy along with the 
incremental DAG. This requires a few basic calculations to approximately meet 
the configured value of "hive.repl.approx.max.load.tasks".

  was:
External tables are not replicated currently as part of hive replication. As 
part of this jira we want to enable that.

Approach:
* Target cluster will have a top level base directory config that will be used 
to copy all data relevant to external tables. This will be provided via the 
*with* clause in the *repl load* command. This base path will be prefixed to 
the path of the same external table on source cluster.
* Since changes to the directories of an external table can happen without 
Hive knowing about them, we cannot capture the relevant events whenever new 
data is added or removed; instead, we have to copy the data from the source 
path to the target path for external tables every time we run incremental 
replication.
** this will require incremental *repl dump*  to now create an additional file 
*\_external\_tables\_info* with data in the following form 
{code}
tableName,base64Encoded(tableDataLocation)
{code}
** *repl load* will read  *\_external\_tables\_info* to identify which 
locations are to be copied from source to target, and create the corresponding 
tasks for them.
* New external tables will be created as metadata only, with no data copied, 
as part of the regular tasks during incremental load/bootstrap load.
* Bootstrap dump will also create  *\_external\_tables\_info*, which will be 
used to copy data from source to target as part of bootstrap load.
* Since bootstrap load will most probably create a DAG that can use 
parallelism in the execution phase, the hdfs copy related tasks are created 
only once the bootstrap phase is complete.
* Since incremental load results in a DAG with only sequential execution 
(events applied in sequence), to effectively use the parallelism capability of 
the execution mode we create the tasks for hdfs copy along with the 
incremental DAG. This requires a few basic calculations to approximately meet 
the configured value of "hive.repl.approx.max.load.tasks".


> External Table Replication for Hive
> ---
>
> Key: HIVE-20911
> URL: https://issues.apache.org/jira/browse/HIVE-20911
> Project: Hive
>  Issue Type: Bug
>  Components: HiveServer2
>Affects Versions: 4.0.0
>Reporter: anishek
>Assignee: anishek
>Priority: Critical
> Fix For: 4.0.0
>
>
> External tables are not replicated currently as part of hive replication. As 
> part of this 

[jira] [Updated] (HIVE-20794) Use Zookeeper for metastore service discovery

2018-11-27 Thread anishek (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

anishek updated HIVE-20794:
---
Resolution: Fixed
Status: Resolved  (was: Patch Available)

> Use Zookeeper for metastore service discovery
> -
>
> Key: HIVE-20794
> URL: https://issues.apache.org/jira/browse/HIVE-20794
> Project: Hive
>  Issue Type: Improvement
>Reporter: Ashutosh Bapat
>Assignee: Ashutosh Bapat
>Priority: Major
>  Labels: pull-request-available
> Attachments: HIVE-20794.01, HIVE-20794.02, HIVE-20794.03, 
> HIVE-20794.03, HIVE-20794.04, HIVE-20794.05, HIVE-20794.06, HIVE-20794.07, 
> HIVE-20794.07, HIVE-20794.08, HIVE-20794.08
>
>
> Right now, multiple metastore services can be specified in the 
> hive.metastore.uris configuration, but that list is static and cannot be 
> modified dynamically. Use ZooKeeper for dynamic service discovery of the 
> metastore.
> h3. Improve ZooKeeperHiveHelper class (suggestions for name welcome)
> The ZooKeeper related code (for service discovery) accesses ZooKeeper 
> parameters directly from HiveConf. The class is changed so that it can be 
> used for both HiveServer2 and the Metastore server, and works with both 
> configurations. The following methods from HiveServer2 are now moved into 
> ZooKeeperHiveHelper:
> # startZookeeperClient
> # addServerInstanceToZooKeeper
> # removeServerInstanceFromZooKeeper
> h3. HiveMetaStore conf changes
>  # THRIFT_URIS (hive.metastore.uris) can also be used to specify the 
> ZooKeeper quorum. When THRIFT_SERVICE_DISCOVERY_MODE 
> (hive.metastore.service.discovery.mode) is set to "zookeeper", the URIs are 
> used as the ZooKeeper quorum; when it is empty, the URIs are used to 
> locate the metastore directly.
>  # Here's a list of HiveServer2's parameters and their proposed metastore 
> conf counterparts. It looks odd that the Metastore related configurations do 
> not have their macros start with METASTORE, but start with THRIFT; I have 
> just followed the naming convention used for other parameters.
>  ** HIVE_SERVER2_ZOOKEEPER_NAMESPACE - THRIFT_ZOOKEEPER_NAMESPACE 
> (hive.metastore.zookeeper.namespace)
>  ** HIVE_ZOOKEEPER_CLIENT_PORT - THRIFT_ZOOKEEPER_CLIENT_PORT 
> (hive.metastore.zookeeper.client.port)
>  ** HIVE_ZOOKEEPER_CONNECTION_TIMEOUT - THRIFT_ZOOKEEPER_CONNECTION_TIMEOUT - 
> (hive.metastore.zookeeper.connection.timeout)
>  ** HIVE_ZOOKEEPER_CONNECTION_MAX_RETRIES - 
> THRIFT_ZOOKEEPER_CONNECTION_MAX_RETRIES 
> (hive.metastore.zookeeper.connection.max.retries)
>  ** HIVE_ZOOKEEPER_CONNECTION_BASESLEEPTIME - 
> THRIFT_ZOOKEEPER_CONNECTION_BASESLEEPTIME 
> (hive.metastore.zookeeper.connection.basesleeptime)
>  # Additional configuration THRIFT_BIND_HOST is used to specify the host 
> address to bind Metastore service to. Right now Metastore binds to *, i.e all 
> addresses. Metastore doesn't then know which of those addresses it should add 
> to the ZooKeeper. THRIFT_BIND_HOST solves that problem. When this 
> configuration is specified the metastore server binds to that address and 
> also adds it to the ZooKeeper if dynamic service discovery mode is ZooKeeper.
> Following Hive ZK configurations seem to be related to managing locks and 
> seem irrelevant for MS ZK.
>  # HIVE_ZOOKEEPER_SESSION_TIMEOUT
>  # HIVE_ZOOKEEPER_CLEAN_EXTRA_NODES
> Since there is no configuration to be published, 
> HIVE_ZOOKEEPER_PUBLISH_CONFIGS does not have a THRIFT counterpart.
> h3. HiveMetaStore class changes
>  # startMetaStore should also register the instance with Zookeeper, when 
> configured.
>  # When shutting a metastore server down it should deregister itself from 
> Zookeeper, when configured.
>  # These changes use the refactored code described above.
> h3. HiveMetaStoreClient class changes
> When service discovery mode is zookeeper, we fetch the metatstore URIs from 
> the specified ZooKeeper and treat those as if they were specified in 
> THRIFT_URIS i.e. use the existing mechanisms to choose a metastore server to 
> connect to and establish a connection.
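
As an illustration of the client-side flow, a minimal sketch assuming a 
hypothetical /hivemetastore namespace and Apache Curator as the ZooKeeper 
client; the real HiveMetaStoreClient wiring differs:
{code}
// Illustrative only: discover metastore URIs registered under a ZooKeeper
// namespace. The namespace path and quorum string are hypothetical.
import java.util.List;
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class MetastoreDiscoverySketch {
  public static void main(String[] args) throws Exception {
    // hive.metastore.uris is interpreted as the ZooKeeper quorum when
    // hive.metastore.service.discovery.mode=zookeeper.
    String quorum = "zk1:2181,zk2:2181,zk3:2181";
    CuratorFramework zk = CuratorFrameworkFactory.newClient(
        quorum, new ExponentialBackoffRetry(1000 /* base sleep ms */, 3));
    zk.start();
    try {
      // Each live metastore registers an ephemeral znode under the namespace.
      List<String> instances = zk.getChildren().forPath("/hivemetastore");
      for (String instance : instances) {
        System.out.println("candidate metastore: " + instance);
      }
      // A client would then pick one URI (e.g. at random) and connect,
      // falling back to the next candidate on failure.
    } finally {
      zk.close();
    }
  }
}
{code}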



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-20794) Use Zookeeper for metastore service discovery

2018-11-27 Thread anishek (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

anishek updated HIVE-20794:
---
Fix Version/s: 4.0.0

> Use Zookeeper for metastore service discovery
> -
>
> Key: HIVE-20794
> URL: https://issues.apache.org/jira/browse/HIVE-20794
> Project: Hive
>  Issue Type: Improvement
>Reporter: Ashutosh Bapat
>Assignee: Ashutosh Bapat
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: HIVE-20794.01, HIVE-20794.02, HIVE-20794.03, 
> HIVE-20794.03, HIVE-20794.04, HIVE-20794.05, HIVE-20794.06, HIVE-20794.07, 
> HIVE-20794.07, HIVE-20794.08, HIVE-20794.08
>
>
> Right now, multiple metastore services can be specified in the 
> hive.metastore.uris configuration, but that list is static and cannot be 
> modified dynamically. Use ZooKeeper for dynamic service discovery of the 
> metastore.
> h3. Improve ZooKeeperHiveHelper class (suggestions for name welcome)
> The ZooKeeper-related code (for service discovery) accesses ZooKeeper 
> parameters directly from HiveConf. The class is changed so that it can be 
> used for both HiveServer2 and the Metastore server and works with both 
> configurations. The following methods from HiveServer2 are now moved into 
> ZooKeeperHiveHelper. # startZookeeperClient # addServerInstanceToZooKeeper # 
> removeServerInstanceFromZooKeeper
> h3. HiveMetaStore conf changes
>  # THRIFT_URIS (hive.metastore.uris) can also be used to specify the ZooKeeper 
> quorum. When THRIFT_SERVICE_DISCOVERY_MODE 
> (hive.metastore.service.discovery.mode) is set to "zookeeper", the URIs are 
> used as the ZooKeeper quorum. When it is left empty, the URIs are used to 
> locate the metastore directly.
>  # Here's a list of HiveServer2's parameters and their proposed metastore conf 
> counterparts. It looks odd that the Metastore-related configurations do not 
> have their macros start with METASTORE, but start with THRIFT. I have just 
> followed the naming convention used for other parameters.
>  ** HIVE_SERVER2_ZOOKEEPER_NAMESPACE - THRIFT_ZOOKEEPER_NAMESPACE 
> (hive.metastore.zookeeper.namespace)
>  ** HIVE_ZOOKEEPER_CLIENT_PORT - THRIFT_ZOOKEEPER_CLIENT_PORT 
> (hive.metastore.zookeeper.client.port)
>  ** HIVE_ZOOKEEPER_CONNECTION_TIMEOUT - THRIFT_ZOOKEEPER_CONNECTION_TIMEOUT 
> (hive.metastore.zookeeper.connection.timeout)
>  ** HIVE_ZOOKEEPER_CONNECTION_MAX_RETRIES - 
> THRIFT_ZOOKEEPER_CONNECTION_MAX_RETRIES 
> (hive.metastore.zookeeper.connection.max.retries)
>  ** HIVE_ZOOKEEPER_CONNECTION_BASESLEEPTIME - 
> THRIFT_ZOOKEEPER_CONNECTION_BASESLEEPTIME 
> (hive.metastore.zookeeper.connection.basesleeptime)
>  # An additional configuration, THRIFT_BIND_HOST, is used to specify the host 
> address to bind the Metastore service to. Right now the Metastore binds to *, 
> i.e. all addresses. The Metastore then doesn't know which of those addresses 
> it should add to ZooKeeper. THRIFT_BIND_HOST solves that problem. When this 
> configuration is specified, the metastore server binds to that address and 
> also adds it to ZooKeeper if the dynamic service discovery mode is ZooKeeper.
> The following Hive ZK configurations seem to be related to managing locks and 
> seem irrelevant for the MS ZK:
>  # HIVE_ZOOKEEPER_SESSION_TIMEOUT
>  # HIVE_ZOOKEEPER_CLEAN_EXTRA_NODES
> Since there is no configuration to be published, 
> HIVE_ZOOKEEPER_PUBLISH_CONFIGS does not have a THRIFT counterpart.
> h3. HiveMetaStore class changes
>  # startMetaStore should also register the instance with ZooKeeper, when 
> configured.
>  # When shutting a metastore server down, it should deregister itself from 
> ZooKeeper, when configured.
>  # These changes use the refactored code described above.
> h3. HiveMetaStoreClient class changes
> When the service discovery mode is zookeeper, we fetch the metastore URIs from 
> the specified ZooKeeper and treat those as if they were specified in 
> THRIFT_URIS, i.e. use the existing mechanisms to choose a metastore server to 
> connect to and establish a connection.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-20794) Use Zookeeper for metastore service discovery

2018-11-27 Thread anishek (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16701406#comment-16701406
 ] 

anishek commented on HIVE-20794:


+1 ,  Committed to master, Thanks [~ashutosh.bapat]

> Use Zookeeper for metastore service discovery
> -
>
> Key: HIVE-20794
> URL: https://issues.apache.org/jira/browse/HIVE-20794
> Project: Hive
>  Issue Type: Improvement
>Reporter: Ashutosh Bapat
>Assignee: Ashutosh Bapat
>Priority: Major
>  Labels: pull-request-available
> Attachments: HIVE-20794.01, HIVE-20794.02, HIVE-20794.03, 
> HIVE-20794.03, HIVE-20794.04, HIVE-20794.05, HIVE-20794.06, HIVE-20794.07, 
> HIVE-20794.07, HIVE-20794.08, HIVE-20794.08
>
>
> Right now, multiple metastore services can be specified in the 
> hive.metastore.uris configuration, but that list is static and cannot be 
> modified dynamically. Use ZooKeeper for dynamic service discovery of the 
> metastore.
> h3. Improve ZooKeeperHiveHelper class (suggestions for name welcome)
> The ZooKeeper-related code (for service discovery) accesses ZooKeeper 
> parameters directly from HiveConf. The class is changed so that it can be 
> used for both HiveServer2 and the Metastore server and works with both 
> configurations. The following methods from HiveServer2 are now moved into 
> ZooKeeperHiveHelper. # startZookeeperClient # addServerInstanceToZooKeeper # 
> removeServerInstanceFromZooKeeper
> h3. HiveMetaStore conf changes
>  # THRIFT_URIS (hive.metastore.uris) can also be used to specify the ZooKeeper 
> quorum. When THRIFT_SERVICE_DISCOVERY_MODE 
> (hive.metastore.service.discovery.mode) is set to "zookeeper", the URIs are 
> used as the ZooKeeper quorum. When it is left empty, the URIs are used to 
> locate the metastore directly.
>  # Here's a list of HiveServer2's parameters and their proposed metastore conf 
> counterparts. It looks odd that the Metastore-related configurations do not 
> have their macros start with METASTORE, but start with THRIFT. I have just 
> followed the naming convention used for other parameters.
>  ** HIVE_SERVER2_ZOOKEEPER_NAMESPACE - THRIFT_ZOOKEEPER_NAMESPACE 
> (hive.metastore.zookeeper.namespace)
>  ** HIVE_ZOOKEEPER_CLIENT_PORT - THRIFT_ZOOKEEPER_CLIENT_PORT 
> (hive.metastore.zookeeper.client.port)
>  ** HIVE_ZOOKEEPER_CONNECTION_TIMEOUT - THRIFT_ZOOKEEPER_CONNECTION_TIMEOUT 
> (hive.metastore.zookeeper.connection.timeout)
>  ** HIVE_ZOOKEEPER_CONNECTION_MAX_RETRIES - 
> THRIFT_ZOOKEEPER_CONNECTION_MAX_RETRIES 
> (hive.metastore.zookeeper.connection.max.retries)
>  ** HIVE_ZOOKEEPER_CONNECTION_BASESLEEPTIME - 
> THRIFT_ZOOKEEPER_CONNECTION_BASESLEEPTIME 
> (hive.metastore.zookeeper.connection.basesleeptime)
>  # An additional configuration, THRIFT_BIND_HOST, is used to specify the host 
> address to bind the Metastore service to. Right now the Metastore binds to *, 
> i.e. all addresses. The Metastore then doesn't know which of those addresses 
> it should add to ZooKeeper. THRIFT_BIND_HOST solves that problem. When this 
> configuration is specified, the metastore server binds to that address and 
> also adds it to ZooKeeper if the dynamic service discovery mode is ZooKeeper.
> The following Hive ZK configurations seem to be related to managing locks and 
> seem irrelevant for the MS ZK:
>  # HIVE_ZOOKEEPER_SESSION_TIMEOUT
>  # HIVE_ZOOKEEPER_CLEAN_EXTRA_NODES
> Since there is no configuration to be published, 
> HIVE_ZOOKEEPER_PUBLISH_CONFIGS does not have a THRIFT counterpart.
> h3. HiveMetaStore class changes
>  # startMetaStore should also register the instance with ZooKeeper, when 
> configured.
>  # When shutting a metastore server down, it should deregister itself from 
> ZooKeeper, when configured.
>  # These changes use the refactored code described above.
> h3. HiveMetaStoreClient class changes
> When the service discovery mode is zookeeper, we fetch the metastore URIs from 
> the specified ZooKeeper and treat those as if they were specified in 
> THRIFT_URIS, i.e. use the existing mechanisms to choose a metastore server to 
> connect to and establish a connection.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-20911) External Table Replication for Hive

2018-11-15 Thread anishek (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

anishek updated HIVE-20911:
---
Description: 
External tables are currently not replicated as part of Hive replication. As 
part of this jira we want to enable that.

Approach:
* Target cluster will have a top-level base directory config that will be used 
to copy all data relevant to external tables. This will be provided via the 
*with* clause in the *repl load* command. This base path will be prefixed to 
the path of the same external table on the source cluster.
* Since changes to directories of an external table can happen without Hive 
knowing it, we can't capture the relevant events whenever new data is added or 
removed; we will have to copy the data from the source path to the target path 
for external tables every time we run incremental replication.
** this will require incremental *repl dump* to now create an additional file 
*\_external\_tables\_info* with data in the following form (a small sketch of 
encoding and decoding these entries follows the description)
{code}
tableName,base64Encoded(tableDataLocation)
{code}
** *repl load* will read the *\_external\_tables\_info* to identify which 
locations are to be copied from source to target and create corresponding tasks 
for them.
* New external tables will be created with metadata only, with no data copied, 
as part of regular tasks during incremental/bootstrap load.
* Bootstrap dump will also create *\_external\_tables\_info*, which will be 
used to copy data from source to target as part of bootstrap load.
* Since bootstrap load will most probably create a DAG that can use 
parallelism in the execution phase, the HDFS copy-related tasks are created 
only once the bootstrap phase is complete.
* Since incremental load results in a DAG with only sequential execution 
(events applied in sequence), to effectively use the parallelism capability in 
execution mode we create the HDFS copy tasks along with the incremental-load 
DAG. This requires a few basic calculations to approximately meet the 
configured value of "hive.repl.approx.max.load.tasks".
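
A minimal sketch of encoding and decoding *\_external\_tables\_info* entries in 
the form above, using plain java.util.Base64; the class and method names are 
illustrative, not the actual dump/load code:
{code}
// Hypothetical sketch of writing and reading _external_tables_info lines.
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class ExternalTablesInfoSketch {
  static String encodeLine(String tableName, String tableDataLocation) {
    // tableName,base64Encoded(tableDataLocation)
    return tableName + "," + Base64.getEncoder().encodeToString(
        tableDataLocation.getBytes(StandardCharsets.UTF_8));
  }

  static String[] decodeLine(String line) {
    int comma = line.indexOf(',');
    String tableName = line.substring(0, comma);
    String location = new String(
        Base64.getDecoder().decode(line.substring(comma + 1)),
        StandardCharsets.UTF_8);
    return new String[] {tableName, location};
  }

  public static void main(String[] args) {
    String line = encodeLine("sales_ext", "hdfs://src/warehouse/sales_ext");
    System.out.println(line);
    String[] decoded = decodeLine(line);
    System.out.println(decoded[0] + " -> " + decoded[1]); // table and source path
  }
}
{code}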

  was:
External tables are currently not replicated as part of Hive replication. As 
part of this jira we want to enable that.

Approach:
* Target cluster will have a top-level base directory config that will be used 
to copy all data relevant to external tables. This will be provided via the 
*with* clause in the *repl load* command. This base path will be prefixed to 
the path of the same external table on the source cluster.
* Since changes to directories of an external table can happen without Hive 
knowing it, we can't capture the relevant events whenever new data is added or 
removed; we will have to copy the data from the source path to the target path 
for external tables every time we run incremental replication.
** this will require incremental *repl dump* to now create an additional file 
*\_external\_tables\_info* with data in the following form 
{code}
tableName,base64Encoded(tableDataLocation)
{code}
** *repl load* will read the *\_external\_tables\_info* to identify which 
locations are to be copied from source to target and create corresponding tasks 
for them.
* New external tables will be created with metadata only, with no data copied, 
as part of regular tasks during incremental/bootstrap load.
* Bootstrap dump will also create *\_external\_tables\_info*, which will be 
used to copy data from source to target as part of bootstrap load.

> External Table Replication for Hive
> ---
>
> Key: HIVE-20911
> URL: https://issues.apache.org/jira/browse/HIVE-20911
> Project: Hive
>  Issue Type: Bug
>  Components: HiveServer2
>Affects Versions: 4.0.0
>Reporter: anishek
>Assignee: anishek
>Priority: Critical
> Fix For: 4.0.0
>
>
> External tables are not replicated currently as part of hive replication. As 
> part of this jira we want to enable that.
> Approach:
> * Target cluster will have a top level base directory config that will be 
> used to copy all data relevant to external tables. This will be provided via 
> the *with* clause in the *repl load* command. This base path will be prefixed 
> to the path of the same external table on source cluster.
> * Since changes to directories of an external table can happen without Hive 
> knowing it, we can't capture the relevant events whenever new data is 
> added or removed; we will have to copy the data from the source path to the 
> target path for external tables every time we run incremental replication.
> ** this will require incremental *repl dump*  to now create an additional 
> file *\_external\_tables\_info* with data in the following form 
> {code}
> tableName,base64Encoded(tableDataLocation)
> {code}
> ** *repl load* will read the  *\_external\_tables\_info* to identify what 
> locations are to be 

[jira] [Updated] (HIVE-20911) External Table Replication for Hive

2018-11-14 Thread anishek (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

anishek updated HIVE-20911:
---
Description: 
External tables are currently not replicated as part of Hive replication. As 
part of this jira we want to enable that.

Approach:
* Target cluster will have a top-level base directory config that will be used 
to copy all data relevant to external tables. This will be provided via the 
*with* clause in the *repl load* command. This base path will be prefixed to 
the path of the same external table on the source cluster.
* Since changes to directories of an external table can happen without Hive 
knowing it, we can't capture the relevant events whenever new data is added or 
removed; we will have to copy the data from the source path to the target path 
for external tables every time we run incremental replication.
** this will require incremental *repl dump* to now create an additional file 
*\_external\_tables\_info* with data in the following form 
{code}
tableName,base64Encoded(tableDataLocation)
{code}
** *repl load* will read the *\_external\_tables\_info* to identify which 
locations are to be copied from source to target and create corresponding tasks 
for them.
* New external tables will be created with metadata only, with no data copied, 
as part of regular tasks during incremental/bootstrap load.
* Bootstrap dump will also create *\_external\_tables\_info*, which will be 
used to copy data from source to target as part of bootstrap load.

  was:
External tables are currently not replicated as part of Hive replication. As 
part of this jira we want to enable that.

Approach:
* Target cluster will have a top-level base directory config that will be used 
to copy all data relevant to external tables. This will be provided via the 
*with* clause in the *repl load* command. This base path will be prefixed to 
the path of the same external table on the source cluster.
* Since changes to directories of an external table can happen without Hive 
knowing it, we can't capture the relevant events whenever new data is added or 
removed; we will have to copy the data from the source path to the target path 
for external tables every time we run incremental replication.
** this will require incremental *repl dump* to now create an additional file 
*\_external\_tables\_info* with data in the following form 
{code}
OperationType,tableName,base64Encoded(tableDataLocation)
{code}
where OperationType can be one of (ADD, REMOVE)
** *repl load* will look up all the external tables on the target and remove 
tables listed with the REMOVE type in the above file.
** For the remaining tables it will create tasks for the corresponding paths 
from source to target along with the existing tasks for incremental load.
* New external tables will be created with data copied as part of regular tasks 
during incremental load, applying the base directory prefix
* Bootstrap will also create / copy these external tables as part of their 
regular workflow, applying the base directory prefix

> External Table Replication for Hive
> ---
>
> Key: HIVE-20911
> URL: https://issues.apache.org/jira/browse/HIVE-20911
> Project: Hive
>  Issue Type: Bug
>  Components: HiveServer2
>Affects Versions: 4.0.0
>Reporter: anishek
>Assignee: anishek
>Priority: Critical
> Fix For: 4.0.0
>
>
> External tables are not replicated currently as part of hive replication. As 
> part of this jira we want to enable that.
> Approach:
> * Target cluster will have a top level base directory config that will be 
> used to copy all data relevant to external tables. This will be provided via 
> the *with* clause in the *repl load* command. This base path will be prefixed 
> to the path of the same external table on source cluster.
> * Since changes to directories of an external table can happen without Hive 
> knowing it, we can't capture the relevant events whenever new data is 
> added or removed; we will have to copy the data from the source path to the 
> target path for external tables every time we run incremental replication.
> ** this will require incremental *repl dump* to now create an additional 
> file *\_external\_tables\_info* with data in the following form 
> {code}
> tableName,base64Encoded(tableDataLocation)
> {code}
> ** *repl load* will read the *\_external\_tables\_info* to identify which 
> locations are to be copied from source to target and create corresponding 
> tasks for them.
> * New external tables will be created with metadata only, with no data copied, 
> as part of regular tasks during incremental/bootstrap load.
> * Bootstrap dump will also create *\_external\_tables\_info*, which will be 
> used to copy data from source to target as part of bootstrap load.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Assigned] (HIVE-20911) External Table Replication for Hive

2018-11-13 Thread anishek (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

anishek reassigned HIVE-20911:
--

Assignee: anishek

> External Table Replication for Hive
> ---
>
> Key: HIVE-20911
> URL: https://issues.apache.org/jira/browse/HIVE-20911
> Project: Hive
>  Issue Type: Bug
>  Components: HiveServer2
>Affects Versions: 4.0.0
>Reporter: anishek
>Assignee: anishek
>Priority: Critical
> Fix For: 4.0.0
>
>
> External tables are not replicated currently as part of hive replication. As 
> part of this jira we want to enable that.
> Approach:
> * Target cluster will have a top level base directory config that will be 
> used to copy all data relevant to external tables. This will be provided via 
> the *with* clause in the *repl load* command. This base path will be prefixed 
> to the path of the same external table on source cluster.
> * Since changes to directories of an external table can happen without Hive 
> knowing it, we can't capture the relevant events whenever new data is 
> added or removed; we will have to copy the data from the source path to the 
> target path for external tables every time we run incremental replication.
> ** this will require incremental *repl dump* to now create an additional 
> file *\_external\_tables\_info* with data in the following form 
> {code}
> OperationType,tableName,base64Encoded(tableDataLocation)
> {code}
> where OperationType can be one of (ADD, REMOVE)
> ** *repl load* will look up all the external tables on the target and remove 
> tables listed with the REMOVE type in the above file.
> ** For the remaining tables it will create tasks for the corresponding paths 
> from source to target along with the existing tasks for incremental load.
> * New external tables will be created with data copied as part of regular 
> tasks during incremental load, applying the base directory prefix
> * Bootstrap will also create / copy these external tables as part of their 
> regular workflow, applying the base directory prefix



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-20911) External Table Replication for Hive

2018-11-13 Thread anishek (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

anishek updated HIVE-20911:
---
Description: 
External tables are currently not replicated as part of Hive replication. As 
part of this jira we want to enable that.

Approach:
* Target cluster will have a top-level base directory config that will be used 
to copy all data relevant to external tables. This will be provided via the 
*with* clause in the *repl load* command. This base path will be prefixed to 
the path of the same external table on the source cluster.
* Since changes to directories of an external table can happen without Hive 
knowing it, we can't capture the relevant events whenever new data is added or 
removed; we will have to copy the data from the source path to the target path 
for external tables every time we run incremental replication.
** this will require incremental *repl dump* to now create an additional file 
*\_external\_tables\_info* with data in the following form 
{code}
OperationType,tableName,base64Encoded(tableDataLocation)
{code}
where OperationType can be one of (ADD, REMOVE)
** *repl load* will look up all the external tables on the target and remove 
tables listed with the REMOVE type in the above file.
** For the remaining tables it will create tasks for the corresponding paths 
from source to target along with the existing tasks for incremental load.
* New external tables will be created with data copied as part of regular tasks 
during incremental load, applying the base directory prefix
* Bootstrap will also create / copy these external tables as part of their 
regular workflow, applying the base directory prefix

  was:
External tables are currently not replicated as part of Hive replication. As 
part of this jira we want to enable that.

Approach:
* Target cluster will have a top-level base directory config that will be used 
to copy all data relevant to external tables. This will be provided via the 
*with* clause in the *repl load* command. This base path will be prefixed to 
the path of the same external table on the source cluster.
* Since changes to directories of an external table can happen without Hive 
knowing it, we can't capture the relevant events whenever new data is added or 
removed; we will have to copy the data from the source path to the target path 
for external tables every time we run incremental replication.
** this will require incremental *repl dump* to now create an additional file 
*\_external\_tables\_info* with data in the following form 
{code}
OperationType,tableName,base64Encoded(tableDataLocation)
{code}
where OperationType can be one of (ADD, REMOVE)
** *repl load* will look up all the external tables on the target and remove 
tables listed with the REMOVE type in the above file.
** For the remaining tables it will create tasks for the corresponding paths 
from source to target along with the existing tasks for incremental load.
* New external tables will be created with data copied as part of regular tasks 
during incremental load, applying the base directory prefix
* Bootstrap will also create / copy these external tables as part of their 
regular workflow, applying the base directory prefix


> External Table Replication for Hive
> ---
>
> Key: HIVE-20911
> URL: https://issues.apache.org/jira/browse/HIVE-20911
> Project: Hive
>  Issue Type: Bug
>  Components: HiveServer2
>Affects Versions: 4.0.0
>Reporter: anishek
>Priority: Critical
> Fix For: 4.0.0
>
>
> External tables are not replicated currently as part of hive replication. As 
> part of this jira we want to enable that.
> Approach:
> * Target cluster will have a top level base directory config that will be 
> used to copy all data relevant to external tables. This will be provided via 
> the *with* clause in the *repl load* command. This base path will be prefixed 
> to the path of the same external table on source cluster.
> * Since changes to directories of an external table can happen without Hive 
> knowing it, we can't capture the relevant events whenever new data is 
> added or removed; we will have to copy the data from the source path to the 
> target path for external tables every time we run incremental replication.
> ** this will require incremental *repl dump* to now create an additional 
> file *\_external\_tables\_info* with data in the following form 
> {code}
> OperationType,tableName,base64Encoded(tableDataLocation)
> {code}
> where OperationType can be one of (ADD, REMOVE)
> ** *repl load* will look up all the external tables on target and remove 
> tables listed with REMOVE type in the above file.
> ** For the remaining tables it will create tasks for the corresponding paths 
> from source to target along with the existing tasks for incremental load.
> * New External tables will 

[jira] [Commented] (HIVE-20682) Async query execution can potentially fail if shared sessionHive is closed by master thread.

2018-11-13 Thread anishek (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16685014#comment-16685014
 ] 

anishek commented on HIVE-20682:


+1


> Async query execution can potentially fail if shared sessionHive is closed by 
> master thread.
> 
>
> Key: HIVE-20682
> URL: https://issues.apache.org/jira/browse/HIVE-20682
> Project: Hive
>  Issue Type: Bug
>  Components: HiveServer2
>Affects Versions: 3.1.0, 4.0.0
>Reporter: Sankar Hariappan
>Assignee: Sankar Hariappan
>Priority: Major
>  Labels: pull-request-available
> Attachments: HIVE-20682.01.patch, HIVE-20682.02.patch, 
> HIVE-20682.03.patch, HIVE-20682.04.patch, HIVE-20682.05.patch, 
> HIVE-20682.06.patch
>
>
> *Problem description:*
> The master thread initializes the *sessionHive* object in the *HiveSessionImpl* 
> class when we open a new session for a client connection, and by default all 
> queries from this connection share the same sessionHive object. 
> If the master thread executes a *synchronous* query, it closes the 
> sessionHive object (referred to via the thread-local hiveDb) if 
> {{Hive.isCompatible}} returns false and sets a new Hive object in the 
> thread-local hiveDb, but doesn't change the sessionHive object in the session. 
> Whereas *asynchronous* query execution via async threads never closes the 
> sessionHive object; it just creates a new one if needed and sets it as its 
> thread-local hiveDb.
> So, the problem can happen in the case where an *asynchronous* query being 
> executed by async threads refers to the sessionHive object and the master 
> thread receives a *synchronous* query that closes the same sessionHive object. 
> Also, each query execution overwrites the thread-local hiveDb object with the 
> sessionHive object, which potentially leaks a metastore connection if the 
> previous synchronous query execution re-created the Hive object.
> *Possible Fix:*
> The *sessionHive* object could be shared by multiple threads and so it 
> shouldn't be allowed to be closed by any query execution threads when they 
> re-create the Hive object due to changes in Hive configurations. But the Hive 
> objects created by query execution threads should be closed when the thread 
> exits.
> So, it is proposed to have an *isAllowClose* flag (default: *true*) in the Hive 
> object, which should be set to *false* for *sessionHive*; it would be 
> forcefully closed when the session is closed or released.
> Also, when we reset the *sessionHive* object with a new one due to changes in 
> *sessionConf*, the old one should be closed when no async thread is referring 
> to it. This can be done using the "*finalize*" method of the Hive object, where 
> we can close the HMS connection when the Hive object is garbage collected.
> cc [~pvary]
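
A minimal sketch of the proposed *isAllowClose* guard, with hypothetical names 
(SharedHiveSketch, forceClose); the real Hive object in ql has a much larger 
surface:
{code}
// Hypothetical sketch: a close() that the shared session object ignores.
public class SharedHiveSketch implements AutoCloseable {
  private final boolean allowClose;   // false for the shared sessionHive

  public SharedHiveSketch(boolean allowClose) {
    this.allowClose = allowClose;
  }

  @Override
  public void close() {
    // Query-execution threads call close() freely; the shared session
    // object ignores it so concurrent async queries keep a live handle.
    if (allowClose) {
      releaseMetastoreConnection();
    }
  }

  // Called only when the session itself is closed or released.
  public void forceClose() {
    releaseMetastoreConnection();
  }

  private void releaseMetastoreConnection() {
    // close the HMS connection, caches, etc.
  }
}
{code}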



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-20806) Add ASF license for files added in HIVE-20679

2018-10-25 Thread anishek (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

anishek updated HIVE-20806:
---
Resolution: Fixed
Status: Resolved  (was: Patch Available)

Committed to master.

> Add ASF license for files added in HIVE-20679
> -
>
> Key: HIVE-20806
> URL: https://issues.apache.org/jira/browse/HIVE-20806
> Project: Hive
>  Issue Type: Bug
>  Components: HiveServer2
>Affects Versions: 4.0.0
>Reporter: anishek
>Assignee: anishek
>Priority: Trivial
> Fix For: 4.0.0
>
> Attachments: HIVE-20806.1.patch
>
>
> HIVE-20679 added a couple of new files (Deserializer/Serializer) that need the 
> ASF license header.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-20806) Add ASF license for files added in HIVE-20679

2018-10-25 Thread anishek (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

anishek updated HIVE-20806:
---
Status: Patch Available  (was: Open)

> Add ASF license for files added in HIVE-20679
> -
>
> Key: HIVE-20806
> URL: https://issues.apache.org/jira/browse/HIVE-20806
> Project: Hive
>  Issue Type: Bug
>  Components: HiveServer2
>Affects Versions: 4.0.0
>Reporter: anishek
>Assignee: anishek
>Priority: Trivial
> Fix For: 4.0.0
>
> Attachments: HIVE-20806.1.patch
>
>
> HIVE-20679 added a couple of new files (Deserializer/Serializer) that need the 
> ASF license header.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-20806) Add ASF license for files added in HIVE-20679

2018-10-25 Thread anishek (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

anishek updated HIVE-20806:
---
Attachment: HIVE-20806.1.patch

> Add ASF license for files added in HIVE-20679
> -
>
> Key: HIVE-20806
> URL: https://issues.apache.org/jira/browse/HIVE-20806
> Project: Hive
>  Issue Type: Bug
>  Components: HiveServer2
>Affects Versions: 4.0.0
>Reporter: anishek
>Priority: Trivial
> Fix For: 4.0.0
>
> Attachments: HIVE-20806.1.patch
>
>
> HIVE-20679 added a couple of new files (Deserializer/Serializer) that need the 
> ASF license header.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-20679) DDL operations on hive might create large messages for DBNotification

2018-10-25 Thread anishek (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16663469#comment-16663469
 ] 

anishek commented on HIVE-20679:


Thanks [~sankarh], created HIVE-20806 for the same. Please review.

> DDL operations on hive might create large messages for DBNotification
> -
>
> Key: HIVE-20679
> URL: https://issues.apache.org/jira/browse/HIVE-20679
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 3.1.0
>Reporter: anishek
>Assignee: anishek
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: HIVE-20679.1.patch, HIVE-20679.2.patch, 
> HIVE-20679.3.patch, HIVE-20679.4.patch, HIVE-20679.5.patch, 
> HIVE-20679.6.patch, HIVE-20679.8.patch, HIVE-20679.9.patch, a.sql, b.sql
>
>
> Certain types of DDL operations might create large messages as part of 
> DBNotification; this might lead to the RDBMS throwing an error when storing 
> the message since its size is too large. It will also increase the footprint 
> of the RDBMS space usage. 
> We should try to store compressed messages to allow handling these situations. 
> Edit: For the notification_log table, the message column for all supported 
> databases can store messages from 2GB to 4GB
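
A minimal sketch of the kind of compression suggested above: gzip plus Base64 so 
the stored message remains a plain string. The helper name is hypothetical; the 
actual patch wires compression into the message encoder rather than a 
standalone utility:
{code}
// Hypothetical sketch: gzip + Base64 a notification message so the string
// stored in the NOTIFICATION_LOG message column stays small and portable.
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.zip.GZIPOutputStream;

public class NotificationMessageCompressor {
  static String compress(String jsonMessage) throws IOException {
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    try (GZIPOutputStream gzip = new GZIPOutputStream(buffer)) {
      gzip.write(jsonMessage.getBytes(StandardCharsets.UTF_8));
    }
    // Base64 keeps the column a plain string for every supported RDBMS.
    return Base64.getEncoder().encodeToString(buffer.toByteArray());
  }
}
{code}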



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-20746) HiveProtoHookLogger does not close file at end of day.

2018-10-24 Thread anishek (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

anishek updated HIVE-20746:
---
   Resolution: Fixed
Fix Version/s: 4.0.0
   Status: Resolved  (was: Patch Available)

Pushed to master, Thanks [~harishjp]

> HiveProtoHookLogger does not close file at end of day.
> --
>
> Key: HIVE-20746
> URL: https://issues.apache.org/jira/browse/HIVE-20746
> Project: Hive
>  Issue Type: Bug
>  Components: HiveServer2
>Reporter: Harish Jaiprakash
>Assignee: Harish Jaiprakash
>Priority: Major
> Fix For: 4.0.0
>
> Attachments: HIVE-20746.01.patch, HIVE-20746.02.patch
>
>
> Currently the file rotation is driven by events. If no queries are fired 
> for a long time, the file rotation does not happen and we do not 
> close the file. This causes clients to poll the file for an 
> indeterminate amount of time. If there are multiple HiveServers, there is no 
> way to tell which file will get more data. Fix this to close the file at the 
> end of the day.
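
A minimal sketch of closing at the day boundary with a scheduled executor; the 
names are hypothetical, and the real logger also has to swap in a new file for 
the next day:
{code}
// Hypothetical sketch: schedule a file close at the next local midnight,
// independent of whether any further query events arrive.
import java.time.Duration;
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class EndOfDayCloserSketch {
  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor();

  void scheduleEndOfDayClose(Runnable closeCurrentFile) {
    LocalDateTime nextMidnight = LocalDate.now().plusDays(1).atStartOfDay();
    long delayMs = Duration.between(LocalDateTime.now(), nextMidnight).toMillis();
    scheduler.schedule(() -> {
      closeCurrentFile.run();                  // readers can stop polling
      scheduleEndOfDayClose(closeCurrentFile); // re-arm for the next day
    }, delayMs, TimeUnit.MILLISECONDS);
  }
}
{code}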



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-20679) DDL operations on hive might create large messages for DBNotification

2018-10-22 Thread anishek (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

anishek updated HIVE-20679:
---
   Resolution: Fixed
Fix Version/s: 4.0.0
   Status: Resolved  (was: Patch Available)

Committed to master, Thanks [~sankarh] for the review !

> DDL operations on hive might create large messages for DBNotification
> -
>
> Key: HIVE-20679
> URL: https://issues.apache.org/jira/browse/HIVE-20679
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 3.1.0
>Reporter: anishek
>Assignee: anishek
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: HIVE-20679.1.patch, HIVE-20679.2.patch, 
> HIVE-20679.3.patch, HIVE-20679.4.patch, HIVE-20679.5.patch, 
> HIVE-20679.6.patch, HIVE-20679.8.patch, HIVE-20679.9.patch, a.sql, b.sql
>
>
> Certain types of DDL operations might create large messages as part of 
> DBNotification; this might lead to the RDBMS throwing an error when storing 
> the message since its size is too large. It will also increase the footprint 
> of the RDBMS space usage. 
> We should try to store compressed messages to allow handling these situations. 
> Edit: For the notification_log table, the message column for all supported 
> databases can store messages from 2GB to 4GB



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-20679) DDL operations on hive might create large messages for DBNotification

2018-10-21 Thread anishek (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

anishek updated HIVE-20679:
---
Attachment: HIVE-20679.9.patch

> DDL operations on hive might create large messages for DBNotification
> -
>
> Key: HIVE-20679
> URL: https://issues.apache.org/jira/browse/HIVE-20679
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 3.1.0
>Reporter: anishek
>Assignee: anishek
>Priority: Major
>  Labels: pull-request-available
> Attachments: HIVE-20679.1.patch, HIVE-20679.2.patch, 
> HIVE-20679.3.patch, HIVE-20679.4.patch, HIVE-20679.5.patch, 
> HIVE-20679.6.patch, HIVE-20679.8.patch, HIVE-20679.9.patch, a.sql, b.sql
>
>
> Certain types of DDL operations might create large messages as part of 
> DBNotification; this might lead to the RDBMS throwing an error when storing 
> the message since its size is too large. It will also increase the footprint 
> of the RDBMS space usage. 
> We should try to store compressed messages to allow handling these situations. 
> Edit: For the notification_log table, the message column for all supported 
> databases can store messages from 2GB to 4GB



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-20679) DDL operations on hive might create large messages for DBNotification

2018-10-18 Thread anishek (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

anishek updated HIVE-20679:
---
Attachment: HIVE-20679.8.patch

> DDL operations on hive might create large messages for DBNotification
> -
>
> Key: HIVE-20679
> URL: https://issues.apache.org/jira/browse/HIVE-20679
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 3.1.0
>Reporter: anishek
>Assignee: anishek
>Priority: Major
>  Labels: pull-request-available
> Attachments: HIVE-20679.1.patch, HIVE-20679.2.patch, 
> HIVE-20679.3.patch, HIVE-20679.4.patch, HIVE-20679.5.patch, 
> HIVE-20679.6.patch, HIVE-20679.8.patch, a.sql, b.sql
>
>
> Certain types of DDL operations might create large messages as part of 
> DBNotification; this might lead to the RDBMS throwing an error when storing 
> the message since its size is too large. It will also increase the footprint 
> of the RDBMS space usage. 
> We should try to store compressed messages to allow handling these situations. 
> Edit: For the notification_log table, the message column for all supported 
> databases can store messages from 2GB to 4GB



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-20679) DDL operations on hive might create large messages for DBNotification

2018-10-18 Thread anishek (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

anishek updated HIVE-20679:
---
Attachment: (was: HIVE-20679.7.patch)

> DDL operations on hive might create large messages for DBNotification
> -
>
> Key: HIVE-20679
> URL: https://issues.apache.org/jira/browse/HIVE-20679
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 3.1.0
>Reporter: anishek
>Assignee: anishek
>Priority: Major
>  Labels: pull-request-available
> Attachments: HIVE-20679.1.patch, HIVE-20679.2.patch, 
> HIVE-20679.3.patch, HIVE-20679.4.patch, HIVE-20679.5.patch, 
> HIVE-20679.6.patch, a.sql, b.sql
>
>
> Certain types of DDL operations might create large messages as part of 
> DBNotification; this might lead to the RDBMS throwing an error when storing 
> the message since its size is too large. It will also increase the footprint 
> of the RDBMS space usage. 
> We should try to store compressed messages to allow handling these situations. 
> Edit: For the notification_log table, the message column for all supported 
> databases can store messages from 2GB to 4GB



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-20679) DDL operations on hive might create large messages for DBNotification

2018-10-17 Thread anishek (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16654645#comment-16654645
 ] 

anishek commented on HIVE-20679:


[~sankarh] please review 

> DDL operations on hive might create large messages for DBNotification
> -
>
> Key: HIVE-20679
> URL: https://issues.apache.org/jira/browse/HIVE-20679
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 3.1.0
>Reporter: anishek
>Assignee: anishek
>Priority: Major
>  Labels: pull-request-available
> Attachments: HIVE-20679.1.patch, HIVE-20679.2.patch, 
> HIVE-20679.3.patch, HIVE-20679.4.patch, HIVE-20679.5.patch, 
> HIVE-20679.6.patch, HIVE-20679.7.patch, a.sql, b.sql
>
>
> Certain types of DDL operations might create large messages as part of 
> DBNotification; this might lead to the RDBMS throwing an error when storing 
> the message since its size is too large. It will also increase the footprint 
> of the RDBMS space usage. 
> We should try to store compressed messages to allow handling these situations. 
> Edit: For the notification_log table, the message column for all supported 
> databases can store messages from 2GB to 4GB



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-20679) DDL operations on hive might create large messages for DBNotification

2018-10-17 Thread anishek (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

anishek updated HIVE-20679:
---
Attachment: HIVE-20679.7.patch

> DDL operations on hive might create large messages for DBNotification
> -
>
> Key: HIVE-20679
> URL: https://issues.apache.org/jira/browse/HIVE-20679
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 3.1.0
>Reporter: anishek
>Assignee: anishek
>Priority: Major
> Attachments: HIVE-20679.1.patch, HIVE-20679.2.patch, 
> HIVE-20679.3.patch, HIVE-20679.4.patch, HIVE-20679.5.patch, 
> HIVE-20679.6.patch, HIVE-20679.7.patch, a.sql, b.sql
>
>
> Certain types of DDL operations might create large messages as part of 
> DBNotification; this might lead to the RDBMS throwing an error when storing 
> the message since its size is too large. It will also increase the footprint 
> of the RDBMS space usage. 
> We should try to store compressed messages to allow handling these situations. 
> Edit: For the notification_log table, the message column for all supported 
> databases can store messages from 2GB to 4GB



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-20697) Some replication tests are super slow and cause batch timeouts

2018-10-17 Thread anishek (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16654633#comment-16654633
 ] 

anishek commented on HIVE-20697:


Thanks [~vihangk1], does this mean skip-batching is available for all commits 
running/submitted? There is no patch that needs to be uploaded here for now 
for this, correct?

> Some replication tests are super slow and cause batch timeouts
> --
>
> Key: HIVE-20697
> URL: https://issues.apache.org/jira/browse/HIVE-20697
> Project: Hive
>  Issue Type: Test
>Reporter: Vihang Karajgaonkar
>Priority: Major
>
> Some of these tests are taking a long time and can cause test batch timeouts 
> given that we only give 40 min for a batch to complete. We should speed these 
> tests up.
> TestReplicationScenarios  20 min
> TestReplicationScenariosAcidTables11 min
> TestReplicationScenariosAcrossInstances   5 min 14 sec
> TestReplicationScenariosIncrementalLoadAcidTables 20 min



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-20679) DDL operations on hive might create large messages for DBNotification

2018-10-17 Thread anishek (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

anishek updated HIVE-20679:
---
Attachment: (was: HIVE-20679..6.patch)

> DDL operations on hive might create large messages for DBNotification
> -
>
> Key: HIVE-20679
> URL: https://issues.apache.org/jira/browse/HIVE-20679
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 3.1.0
>Reporter: anishek
>Assignee: anishek
>Priority: Major
> Attachments: HIVE-20679.1.patch, HIVE-20679.2.patch, 
> HIVE-20679.3.patch, HIVE-20679.4.patch, HIVE-20679.5.patch, 
> HIVE-20679.6.patch, a.sql, b.sql
>
>
> Certain types of DDL operations might create large messages as part of 
> DBNotification; this might lead to the RDBMS throwing an error when storing 
> the message since its size is too large. It will also increase the footprint 
> of the RDBMS space usage. 
> We should try to store compressed messages to allow handling these situations. 
> Edit: For the notification_log table, the message column for all supported 
> databases can store messages from 2GB to 4GB



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-20679) DDL operations on hive might create large messages for DBNotification

2018-10-17 Thread anishek (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

anishek updated HIVE-20679:
---
Attachment: HIVE-20679..6.patch

> DDL operations on hive might create large messages for DBNotification
> -
>
> Key: HIVE-20679
> URL: https://issues.apache.org/jira/browse/HIVE-20679
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 3.1.0
>Reporter: anishek
>Assignee: anishek
>Priority: Major
> Attachments: HIVE-20679.1.patch, HIVE-20679.2.patch, 
> HIVE-20679.3.patch, HIVE-20679.4.patch, HIVE-20679.5.patch, 
> HIVE-20679.6.patch, a.sql, b.sql
>
>
> Certain types of DDL operations might create large messages as part of 
> DBNotification; this might lead to the RDBMS throwing an error when storing 
> the message since its size is too large. It will also increase the footprint 
> of the RDBMS space usage. 
> We should try to store compressed messages to allow handling these situations. 
> Edit: For the notification_log table, the message column for all supported 
> databases can store messages from 2GB to 4GB



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-20679) DDL operations on hive might create large messages for DBNotification

2018-10-17 Thread anishek (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

anishek updated HIVE-20679:
---
Attachment: HIVE-20679.6.patch

> DDL operations on hive might create large messages for DBNotification
> -
>
> Key: HIVE-20679
> URL: https://issues.apache.org/jira/browse/HIVE-20679
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 3.1.0
>Reporter: anishek
>Assignee: anishek
>Priority: Major
> Attachments: HIVE-20679.1.patch, HIVE-20679.2.patch, 
> HIVE-20679.3.patch, HIVE-20679.4.patch, HIVE-20679.5.patch, 
> HIVE-20679.6.patch, a.sql, b.sql
>
>
> Certain types of DDL operations might create large messages as part of 
> DBNotification; this might lead to the RDBMS throwing an error when storing 
> the message since its size is too large. It will also increase the footprint 
> of the RDBMS space usage. 
> We should try to store compressed messages to allow handling these situations. 
> Edit: For the notification_log table, the message column for all supported 
> databases can store messages from 2GB to 4GB



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-20746) HiveProtoHookLogger does not close file at end of day.

2018-10-17 Thread anishek (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16653164#comment-16653164
 ] 

anishek commented on HIVE-20746:


+1 

> HiveProtoHookLogger does not close file at end of day.
> --
>
> Key: HIVE-20746
> URL: https://issues.apache.org/jira/browse/HIVE-20746
> Project: Hive
>  Issue Type: Bug
>  Components: HiveServer2
>Reporter: Harish Jaiprakash
>Assignee: Harish Jaiprakash
>Priority: Major
> Attachments: HIVE-20746.01.patch, HIVE-20746.02.patch
>
>
> Currently the file rotation is driven by events. If no queries are fired 
> for a long time, the file rotation does not happen and we do not 
> close the file. This causes clients to poll the file for an 
> indeterminate amount of time. If there are multiple HiveServers, there is no 
> way to tell which file will get more data. Fix this to close the file at the 
> end of the day.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-20697) Some replication tests are super slow and cause batch timeouts

2018-10-15 Thread anishek (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16649994#comment-16649994
 ] 

anishek commented on HIVE-20697:


I tried to skip batching of these tests via the master-mr2.properties, but that 
did not help; all of this is in the HIVE-20679 runs, however.
The reason the tests take longer is that various use cases are exercised with 
the test acting as an end user: there are commands that need processing in Hive, 
like insert/delete/create, followed by repl dump/load commands, which 
effectively do the same thing twice with some metadata work in between, hence 
the longer run times. 

> Some replication tests are super slow and cause batch timeouts
> --
>
> Key: HIVE-20697
> URL: https://issues.apache.org/jira/browse/HIVE-20697
> Project: Hive
>  Issue Type: Test
>Reporter: Vihang Karajgaonkar
>Priority: Major
>
> Some of these tests are taking a long time and can cause test batch timeouts 
> given that we only give 40 min for a batch to complete. We should speed these 
> tests up.
> TestReplicationScenarios  20 min
> TestReplicationScenariosAcidTables11 min
> TestReplicationScenariosAcrossInstances   5 min 14 sec
> TestReplicationScenariosIncrementalLoadAcidTables 20 min



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-20679) DDL operations on hive might create large messages for DBNotification

2018-10-15 Thread anishek (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

anishek updated HIVE-20679:
---
Attachment: HIVE-20679.5.patch

> DDL operations on hive might create large messages for DBNotification
> -
>
> Key: HIVE-20679
> URL: https://issues.apache.org/jira/browse/HIVE-20679
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 3.1.0
>Reporter: anishek
>Assignee: anishek
>Priority: Major
> Attachments: HIVE-20679.1.patch, HIVE-20679.2.patch, 
> HIVE-20679.3.patch, HIVE-20679.4.patch, HIVE-20679.5.patch, a.sql, b.sql
>
>
> Certain types of DDL operations might create large messages as part of 
> DBNotification; this might lead to the RDBMS throwing an error when storing 
> the message since its size is too large. It will also increase the footprint 
> of the RDBMS space usage. 
> We should try to store compressed messages to allow handling these situations. 
> Edit: For the notification_log table, the message column for all supported 
> databases can store messages from 2GB to 4GB



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-20679) DDL operations on hive might create large messages for DBNotification

2018-10-12 Thread anishek (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16647635#comment-16647635
 ] 

anishek commented on HIVE-20679:


reattached the patch to find out why skip batching is not working.

> DDL operations on hive might create large messages for DBNotification
> -
>
> Key: HIVE-20679
> URL: https://issues.apache.org/jira/browse/HIVE-20679
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 3.1.0
>Reporter: anishek
>Assignee: anishek
>Priority: Major
> Attachments: HIVE-20679.1.patch, HIVE-20679.2.patch, 
> HIVE-20679.3.patch, HIVE-20679.4.patch, a.sql, b.sql
>
>
> Certain types of DDL operations might create large messages as part of 
> DBNotification; this might lead to the RDBMS throwing an error when storing 
> the message since its size is too large. It will also increase the footprint 
> of the RDBMS space usage. 
> We should try to store compressed messages to allow handling these situations. 
> Edit: For the notification_log table, the message column for all supported 
> databases can store messages from 2GB to 4GB



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-20679) DDL operations on hive might create large messages for DBNotification

2018-10-12 Thread anishek (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

anishek updated HIVE-20679:
---
Attachment: HIVE-20679.4.patch

> DDL operations on hive might create large messages for DBNotification
> -
>
> Key: HIVE-20679
> URL: https://issues.apache.org/jira/browse/HIVE-20679
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 3.1.0
>Reporter: anishek
>Assignee: anishek
>Priority: Major
> Attachments: HIVE-20679.1.patch, HIVE-20679.2.patch, 
> HIVE-20679.3.patch, HIVE-20679.4.patch, a.sql, b.sql
>
>
> Certain types of DDL operations might create large messages as part of 
> DBNotification; this might lead to the RDBMS throwing an error when storing 
> the message since its size is too large. It will also increase the footprint 
> of the RDBMS space usage. 
> We should try to store compressed messages to allow handling these situations. 
> Edit: For the notification_log table, the message column for all supported 
> databases can store messages from 2GB to 4GB



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-20679) DDL operations on hive might create large messages for DBNotification

2018-10-12 Thread anishek (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

anishek updated HIVE-20679:
---
Attachment: (was: HIVE-20679.4.patch)

> DDL operations on hive might create large messages for DBNotification
> -
>
> Key: HIVE-20679
> URL: https://issues.apache.org/jira/browse/HIVE-20679
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 3.1.0
>Reporter: anishek
>Assignee: anishek
>Priority: Major
> Attachments: HIVE-20679.1.patch, HIVE-20679.2.patch, 
> HIVE-20679.3.patch, HIVE-20679.4.patch, a.sql, b.sql
>
>
> Certain types of DDL operations might create large messages as part of 
> DBNotification; this might lead to the RDBMS throwing an error when storing 
> the message since its size is too large. It will also increase the footprint 
> of the RDBMS space usage. 
> We should try to store compressed messages to allow handling these situations. 
> Edit: For the notification_log table, the message column for all supported 
> databases can store messages from 2GB to 4GB



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-20545) Ability to exclude potentially large parameters in HMS Notifications

2018-10-10 Thread anishek (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16644514#comment-16644514
 ] 

anishek commented on HIVE-20545:


Thanks [~vihangk1] for the details. I wanted to understand some more about the 
thrift object before/after serialization. Since DBNotificationListener is on 
the metastore, there should not be additional serialization of thrift objects 
there; however, as the JSON grows larger there is definitely additional overhead 
of space on the RDBMS + in memory to keep them, which I think would be good to 
reduce. 

So the serialization I assume you are mentioning is the JSON serialization of 
the java objects? There is another effort we are looking at via HIVE-20679 to 
see if we can have zipped messages in the db to reduce the network transfer and 
RDBMS overhead of these messages. 

Just out of curiosity, these stats I assume will help Impala better plan the 
query? If yes, how will the same work on the target warehouse after replication?

> Ability to exclude potentially large parameters in HMS Notifications
> 
>
> Key: HIVE-20545
> URL: https://issues.apache.org/jira/browse/HIVE-20545
> Project: Hive
>  Issue Type: Improvement
>  Components: Metastore
>Affects Versions: 3.1.0, 4.0.0
>Reporter: Bharathkrishna Guruvayoor Murali
>Assignee: Bharathkrishna Guruvayoor Murali
>Priority: Major
> Attachments: HIVE-20545.1.patch, HIVE-20545.2.patch, 
> HIVE-20545.3.branch-3.patch, HIVE-20545.3.patch, HIVE-20545.4.patch, 
> HIVE-20545.6.patch, HIVE-20545.7.patch
>
>
> Clients can add large-sized parameters to Table/Partition objects, so we need 
> to enable adding regex patterns through HiveConf to match parameters to be 
> filtered out of table and partition objects before serialization in HMS 
> notifications.
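A minimal sketch of the kind of filtering being proposed, assuming the exclude 
patterns have already been parsed from a HiveConf value (the helper name is 
illustrative, not the patch's actual API):

{code:java}
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Illustrative helper: drops any table/partition parameter whose key matches
// one of the configured exclude patterns before the object is serialized
// into an HMS notification message.
public final class NotificationParameterFilter {
  private NotificationParameterFilter() {}

  public static void removeMatching(Map<String, String> parameters,
                                    List<Pattern> excludePatterns) {
    if (parameters == null || excludePatterns == null) {
      return;
    }
    parameters.keySet().removeIf(key ->
        excludePatterns.stream().anyMatch(p -> p.matcher(key).matches()));
  }
}
{code}

Applied to a copy of the Table/Partition parameters just before JSON 
serialization, this shrinks only the notification payload while leaving the 
stored metadata untouched.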



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-20679) DDL operations on hive might create large messages for DBNotification

2018-10-09 Thread anishek (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16644457#comment-16644457
 ] 

anishek commented on HIVE-20679:


[~diser555] the number of records should not affect this, unless by record you 
mean partitions?

> DDL operations on hive might create large messages for DBNotification
> -
>
> Key: HIVE-20679
> URL: https://issues.apache.org/jira/browse/HIVE-20679
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 3.1.0
>Reporter: anishek
>Assignee: anishek
>Priority: Major
> Attachments: HIVE-20679.1.patch, HIVE-20679.2.patch, 
> HIVE-20679.3.patch, HIVE-20679.4.patch, a.sql, b.sql
>
>
> Certain types of DDL operations might create large messages as part of 
> DBNotification; this might lead to the RDBMS throwing an error when storing 
> the message since its size is too large. It will also increase the footprint 
> of the RDBMS space usage. 
> We should try to store compressed messages to handle these situations. 
> Edit: For the notification_log table, the message column in all supported 
> databases can store between 2GB and 4GB, depending on the database.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-20679) DDL operations on hive might create large messages for DBNotification

2018-10-09 Thread anishek (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

anishek updated HIVE-20679:
---
Description: 
Certain types of DDL operations might create large messages as part of 
DBNotification; this might lead to the RDBMS throwing an error when storing 
the message since its size is too large. It will also increase the footprint of 
the RDBMS space usage. 

We should try to store compressed messages to handle these situations. 

Edit: For the notification_log table, the message column in all supported 
databases can store between 2GB and 4GB, depending on the database.



  was:
Certain types of DDL operations might create large messages as part of 
DBNotification; this might lead to the RDBMS throwing an error when storing 
the message since its size is too large. It will also increase the footprint of 
the RDBMS space usage. 

We should try to store compressed messages to handle these situations. 




> DDL operations on hive might create large messages for DBNotification
> -
>
> Key: HIVE-20679
> URL: https://issues.apache.org/jira/browse/HIVE-20679
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 3.1.0
>Reporter: anishek
>Assignee: anishek
>Priority: Major
> Attachments: HIVE-20679.1.patch, HIVE-20679.2.patch, 
> HIVE-20679.3.patch, HIVE-20679.4.patch, a.sql, b.sql
>
>
> Certain types of DDL operations might create large messages as part of 
> DBNotification; this might lead to the RDBMS throwing an error when storing 
> the message since its size is too large. It will also increase the footprint 
> of the RDBMS space usage. 
> We should try to store compressed messages to handle these situations. 
> Edit: For the notification_log table, the message column in all supported 
> databases can store between 2GB and 4GB, depending on the database.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-20679) DDL operations on hive might create large messages for DBNotification

2018-10-09 Thread anishek (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

anishek updated HIVE-20679:
---
Attachment: HIVE-20679.4.patch

> DDL operations on hive might create large messages for DBNotification
> -
>
> Key: HIVE-20679
> URL: https://issues.apache.org/jira/browse/HIVE-20679
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 3.1.0
>Reporter: anishek
>Assignee: anishek
>Priority: Major
> Attachments: HIVE-20679.1.patch, HIVE-20679.2.patch, 
> HIVE-20679.3.patch, HIVE-20679.4.patch, a.sql, b.sql
>
>
> Certain types of DDL operations might create large messages as part of 
> DBNotification; this might lead to the RDBMS throwing an error when storing 
> the message since its size is too large. It will also increase the footprint 
> of the RDBMS space usage. 
> We should try to store compressed messages to handle these situations. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-20697) Some replication tests are super slow and cause batch timeouts

2018-10-09 Thread anishek (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16643028#comment-16643028
 ] 

anishek commented on HIVE-20697:


[~vihangk1] do you know how the batches are decided? Is there a way to force 
batches containing only a couple of the tests listed above?


> Some replication tests are super slow and cause batch timeouts
> --
>
> Key: HIVE-20697
> URL: https://issues.apache.org/jira/browse/HIVE-20697
> Project: Hive
>  Issue Type: Test
>Reporter: Vihang Karajgaonkar
>Priority: Major
>
> Some of these tests are taking a long time and can cause test batch timeouts 
> given that we only give 40 min for a batch to complete. We should speed these 
> tests up.
> TestReplicationScenarios: 20 min
> TestReplicationScenariosAcidTables: 11 min
> TestReplicationScenariosAcrossInstances: 5 min 14 sec
> TestReplicationScenariosIncrementalLoadAcidTables: 20 min



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-20679) DDL operations on hive might create large messages for DBNotification

2018-10-07 Thread anishek (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

anishek updated HIVE-20679:
---
Attachment: HIVE-20679.3.patch

> DDL operations on hive might create large messages for DBNotification
> -
>
> Key: HIVE-20679
> URL: https://issues.apache.org/jira/browse/HIVE-20679
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 3.1.0
>Reporter: anishek
>Assignee: anishek
>Priority: Major
> Attachments: HIVE-20679.1.patch, HIVE-20679.2.patch, 
> HIVE-20679.3.patch, a.sql, b.sql
>
>
> Certain types of DDL operations might create large messages as part of 
> DBNotification; this might lead to the RDBMS throwing an error when storing 
> the message since its size is too large. It will also increase the footprint 
> of the RDBMS space usage. 
> We should try to store compressed messages to handle these situations. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-20545) Ability to exclude potentially large parameters in HMS Notifications

2018-10-07 Thread anishek (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16641049#comment-16641049
 ] 

anishek commented on HIVE-20545:


[~bharos92] I can't access the link you sent; is there some way you can provide 
the size of these stats? Also, it would be great if you could give some details 
on where you see the performance hit.


> Ability to exclude potentially large parameters in HMS Notifications
> 
>
> Key: HIVE-20545
> URL: https://issues.apache.org/jira/browse/HIVE-20545
> Project: Hive
>  Issue Type: Improvement
>  Components: Metastore
>Affects Versions: 3.1.0, 4.0.0
>Reporter: Bharathkrishna Guruvayoor Murali
>Assignee: Bharathkrishna Guruvayoor Murali
>Priority: Major
> Attachments: HIVE-20545.1.patch, HIVE-20545.2.patch, 
> HIVE-20545.3.branch-3.patch, HIVE-20545.3.patch, HIVE-20545.4.patch, 
> HIVE-20545.6.patch, HIVE-20545.7.patch
>
>
> Clients can add large-sized parameters to Table/Partition objects, so we need 
> to enable adding regex patterns through HiveConf to match parameters to be 
> filtered out of table and partition objects before serialization in HMS 
> notifications.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HIVE-20679) DDL operations on hive might create large messages for DBNotification

2018-10-05 Thread anishek (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-20679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

anishek updated HIVE-20679:
---
Attachment: HIVE-20679.2.patch

> DDL operations on hive might create large messages for DBNotification
> -
>
> Key: HIVE-20679
> URL: https://issues.apache.org/jira/browse/HIVE-20679
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 3.1.0
>Reporter: anishek
>Assignee: anishek
>Priority: Major
> Attachments: HIVE-20679.1.patch, HIVE-20679.2.patch, a.sql, b.sql
>
>
> Certain types of DDL operations might create large messages as part of 
> DBNotification; this might lead to the RDBMS throwing an error when storing 
> the message since its size is too large. It will also increase the footprint 
> of the RDBMS space usage. 
> We should try to store compressed messages to handle these situations. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-20545) Ability to exclude potentially large parameters in HMS Notifications

2018-10-05 Thread anishek (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16639547#comment-16639547
 ] 

anishek commented on HIVE-20545:


Also, I would like to know the message size before and after the filter is 
applied in your use case.
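One way to answer that in a quick experiment is to measure the serialized 
parameter map before and after filtering (a sketch; the parameter key and 
pattern below are hypothetical examples, not keys the patch targets):

{code:java}
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Quick before/after comparison: serializes the parameter map to a rough
// string form and reports how many characters the filter removes.
public class FilterSizeCheck {
  public static void main(String[] args) {
    Map<String, String> params = new LinkedHashMap<>();
    params.put("comment", "small value");
    // Hypothetical large stats parameter key, for illustration only.
    params.put("impala_intermediate_stats_chunk0", bigValue(64 * 1024));

    Pattern exclude = Pattern.compile("impala_intermediate_stats.*");
    int before = params.toString().length();
    params.keySet().removeIf(key -> exclude.matcher(key).matches());
    int after = params.toString().length();
    System.out.printf("before=%d chars, after=%d chars%n", before, after);
  }

  private static String bigValue(int n) {
    StringBuilder sb = new StringBuilder(n);
    for (int i = 0; i < n; i++) {
      sb.append('x');
    }
    return sb.toString();
  }
}
{code}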

> Ability to exclude potentially large parameters in HMS Notifications
> 
>
> Key: HIVE-20545
> URL: https://issues.apache.org/jira/browse/HIVE-20545
> Project: Hive
>  Issue Type: Improvement
>  Components: Metastore
>Affects Versions: 3.1.0, 4.0.0
>Reporter: Bharathkrishna Guruvayoor Murali
>Assignee: Bharathkrishna Guruvayoor Murali
>Priority: Major
> Attachments: HIVE-20545.1.patch, HIVE-20545.2.patch, 
> HIVE-20545.3.branch-3.patch, HIVE-20545.3.patch, HIVE-20545.4.patch, 
> HIVE-20545.6.patch, HIVE-20545.7.patch
>
>
> Clients can add large-sized parameters to Table/Partition objects, so we need 
> to enable adding regex patterns through HiveConf to match parameters to be 
> filtered out of table and partition objects before serialization in HMS 
> notifications.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

