[jira] [Commented] (HIVE-13348) Add Event Nullification support for Replication

2016-09-22 Thread Sushanth Sowmyan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-13348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15513804#comment-15513804
 ] 

Sushanth Sowmyan commented on HIVE-13348:
-

Removing gsoc tag, as this was not proceeded on for gsoc.

> Add Event Nullification support for Replication
> ---
>
> Key: HIVE-13348
> URL: https://issues.apache.org/jira/browse/HIVE-13348
> Project: Hive
>  Issue Type: Sub-task
>  Components: Import/Export
>Reporter: Sushanth Sowmyan
>
> Replication, as implemented by HIVE-7973 works as follows:
> a) For every singly modification to the hive metastore, an event gets 
> triggered that logs a notification object.
> b) Replication tools such as falcon can consume these notification objects as 
> a HCatReplicationTaskIterator from 
> HCatClient.getReplicationTasks(lastEventId, maxEvents, dbName, tableName).
> c) For each event,  we generate statements and distcp requirements for falcon 
> to export, distcp and import to do the replication (along with requisite 
> changes to export and import that would allow state management).
> The big thing missing from this picture is that while it works, it is pretty 
> dumb about how it works in that it will exhaustively process every single 
> event generated, and will try to do the export-distcp-import cycle for all 
> modifications, irrespective of whether or not that will actually get used at 
> import time.
> We need to build some sort of filtering logic which can process a batch of 
> events to identify events that will result in effective no-ops, and to 
> nullify those events from the stream before passing them on. The goal is to 
> minimize the number of events that the tools like Falcon would actually have 
> to process.
> Examples of cases where event nullification would take place:
> a) CREATE-DROP cases: If an object is being created in event#34 that will 
> eventually get dropped in event#47, then there is no point in replicating 
> this along. We simply null out both these events, and also, any other event 
> that references this object between event#34 and event#47.
> b) APPEND-APPEND : Some objects are replicated wholesale, which means every 
> APPEND that occurs would cause a full export of the object in question. At 
> this point, the prior APPENDS would all be supplanted by the last APPEND. 
> Thus, we could nullify all the prior such events. 
> Additional such cases can be inferred by analysis of the Export-Import relay 
> protocol definition at 
> https://issues.apache.org/jira/secure/attachment/12725999/EXIMReplicationReplayProtocol.pdf
>  or by reasoning out various event processing orders possible.
> Replication, as implemented by HIVE-7973 is merely a first step for 
> functional support. This work is needed for replication to be efficient at 
> all, and thus, usable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-13348) Add Event Nullification support for Replication

2016-04-26 Thread Naveen Gangam (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-13348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15257932#comment-15257932
 ] 

Naveen Gangam commented on HIVE-13348:
--

Thanks for the clarification. I understood this feature to be some sort of 
queuing/batching of the events at the source that are periodically processed 
while removing negating events. It wasnt clear to me if this filtering was 
going to be done downstream or at the source. It now makes sense to me. Thanks 
again

> Add Event Nullification support for Replication
> ---
>
> Key: HIVE-13348
> URL: https://issues.apache.org/jira/browse/HIVE-13348
> Project: Hive
>  Issue Type: Sub-task
>  Components: Import/Export
>Reporter: Sushanth Sowmyan
>  Labels: gsoc2016
>
> Replication, as implemented by HIVE-7973 works as follows:
> a) For every singly modification to the hive metastore, an event gets 
> triggered that logs a notification object.
> b) Replication tools such as falcon can consume these notification objects as 
> a HCatReplicationTaskIterator from 
> HCatClient.getReplicationTasks(lastEventId, maxEvents, dbName, tableName).
> c) For each event,  we generate statements and distcp requirements for falcon 
> to export, distcp and import to do the replication (along with requisite 
> changes to export and import that would allow state management).
> The big thing missing from this picture is that while it works, it is pretty 
> dumb about how it works in that it will exhaustively process every single 
> event generated, and will try to do the export-distcp-import cycle for all 
> modifications, irrespective of whether or not that will actually get used at 
> import time.
> We need to build some sort of filtering logic which can process a batch of 
> events to identify events that will result in effective no-ops, and to 
> nullify those events from the stream before passing them on. The goal is to 
> minimize the number of events that the tools like Falcon would actually have 
> to process.
> Examples of cases where event nullification would take place:
> a) CREATE-DROP cases: If an object is being created in event#34 that will 
> eventually get dropped in event#47, then there is no point in replicating 
> this along. We simply null out both these events, and also, any other event 
> that references this object between event#34 and event#47.
> b) APPEND-APPEND : Some objects are replicated wholesale, which means every 
> APPEND that occurs would cause a full export of the object in question. At 
> this point, the prior APPENDS would all be supplanted by the last APPEND. 
> Thus, we could nullify all the prior such events. 
> Additional such cases can be inferred by analysis of the Export-Import relay 
> protocol definition at 
> https://issues.apache.org/jira/secure/attachment/12725999/EXIMReplicationReplayProtocol.pdf
>  or by reasoning out various event processing orders possible.
> Replication, as implemented by HIVE-7973 is merely a first step for 
> functional support. This work is needed for replication to be efficient at 
> all, and thus, usable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-13348) Add Event Nullification support for Replication

2016-04-25 Thread Sushanth Sowmyan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-13348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15256953#comment-15256953
 ] 

Sushanth Sowmyan commented on HIVE-13348:
-

Sorry, to clarify, the idea is not to nullify the events in the main eventlog 
itself - we will still maintain those, and they are under the purview of the 
metastore currently - the idea is that when a program calls 
HCatClient.getReplicationTasks which exposes an Iterator, 
where currently, there is a 1:1 map from Event to ReplicationTask, and we 
should ideally have a many-one.

Thus, this filtering would be downstream of the actual collection of events, it 
would be in-stream for the processing of replication events.

Or are you suggesting that even for replication, we should allow the capability 
to send along noop-replication-tasks as marker tasks for those events which 
were nullified, so we can have an audit on the destination? That could be done 
too, and would be performant as well.

> Add Event Nullification support for Replication
> ---
>
> Key: HIVE-13348
> URL: https://issues.apache.org/jira/browse/HIVE-13348
> Project: Hive
>  Issue Type: Sub-task
>  Components: Import/Export
>Reporter: Sushanth Sowmyan
>  Labels: gsoc2016
>
> Replication, as implemented by HIVE-7973 works as follows:
> a) For every singly modification to the hive metastore, an event gets 
> triggered that logs a notification object.
> b) Replication tools such as falcon can consume these notification objects as 
> a HCatReplicationTaskIterator from 
> HCatClient.getReplicationTasks(lastEventId, maxEvents, dbName, tableName).
> c) For each event,  we generate statements and distcp requirements for falcon 
> to export, distcp and import to do the replication (along with requisite 
> changes to export and import that would allow state management).
> The big thing missing from this picture is that while it works, it is pretty 
> dumb about how it works in that it will exhaustively process every single 
> event generated, and will try to do the export-distcp-import cycle for all 
> modifications, irrespective of whether or not that will actually get used at 
> import time.
> We need to build some sort of filtering logic which can process a batch of 
> events to identify events that will result in effective no-ops, and to 
> nullify those events from the stream before passing them on. The goal is to 
> minimize the number of events that the tools like Falcon would actually have 
> to process.
> Examples of cases where event nullification would take place:
> a) CREATE-DROP cases: If an object is being created in event#34 that will 
> eventually get dropped in event#47, then there is no point in replicating 
> this along. We simply null out both these events, and also, any other event 
> that references this object between event#34 and event#47.
> b) APPEND-APPEND : Some objects are replicated wholesale, which means every 
> APPEND that occurs would cause a full export of the object in question. At 
> this point, the prior APPENDS would all be supplanted by the last APPEND. 
> Thus, we could nullify all the prior such events. 
> Additional such cases can be inferred by analysis of the Export-Import relay 
> protocol definition at 
> https://issues.apache.org/jira/secure/attachment/12725999/EXIMReplicationReplayProtocol.pdf
>  or by reasoning out various event processing orders possible.
> Replication, as implemented by HIVE-7973 is merely a first step for 
> functional support. This work is needed for replication to be efficient at 
> all, and thus, usable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-13348) Add Event Nullification support for Replication

2016-04-25 Thread Naveen Gangam (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-13348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15256395#comment-15256395
 ] 

Naveen Gangam commented on HIVE-13348:
--

[~sushanth] While it doesnt make sense to have events that negate each other 
for the use case above, there are usecases when listeners will need full set of 
events, for example a simple listener that logs all events for AUDIT purposes. 
There are regulations for financial institutions to audit every 
add/delete/modify events. I am not sure if these regulations apply to hive 
specifically today, but we would have to assume they could be atleast in the 
future.
So any filtering or batching of these events may have to be listener specific 
or have an option to dial down or dial up the notifications. Hope this makes 
sense. Thanks

> Add Event Nullification support for Replication
> ---
>
> Key: HIVE-13348
> URL: https://issues.apache.org/jira/browse/HIVE-13348
> Project: Hive
>  Issue Type: Sub-task
>  Components: Import/Export
>Reporter: Sushanth Sowmyan
>  Labels: gsoc2016
>
> Replication, as implemented by HIVE-7973 works as follows:
> a) For every singly modification to the hive metastore, an event gets 
> triggered that logs a notification object.
> b) Replication tools such as falcon can consume these notification objects as 
> a HCatReplicationTaskIterator from 
> HCatClient.getReplicationTasks(lastEventId, maxEvents, dbName, tableName).
> c) For each event,  we generate statements and distcp requirements for falcon 
> to export, distcp and import to do the replication (along with requisite 
> changes to export and import that would allow state management).
> The big thing missing from this picture is that while it works, it is pretty 
> dumb about how it works in that it will exhaustively process every single 
> event generated, and will try to do the export-distcp-import cycle for all 
> modifications, irrespective of whether or not that will actually get used at 
> import time.
> We need to build some sort of filtering logic which can process a batch of 
> events to identify events that will result in effective no-ops, and to 
> nullify those events from the stream before passing them on. The goal is to 
> minimize the number of events that the tools like Falcon would actually have 
> to process.
> Examples of cases where event nullification would take place:
> a) CREATE-DROP cases: If an object is being created in event#34 that will 
> eventually get dropped in event#47, then there is no point in replicating 
> this along. We simply null out both these events, and also, any other event 
> that references this object between event#34 and event#47.
> b) APPEND-APPEND : Some objects are replicated wholesale, which means every 
> APPEND that occurs would cause a full export of the object in question. At 
> this point, the prior APPENDS would all be supplanted by the last APPEND. 
> Thus, we could nullify all the prior such events. 
> Additional such cases can be inferred by analysis of the Export-Import relay 
> protocol definition at 
> https://issues.apache.org/jira/secure/attachment/12725999/EXIMReplicationReplayProtocol.pdf
>  or by reasoning out various event processing orders possible.
> Replication, as implemented by HIVE-7973 is merely a first step for 
> functional support. This work is needed for replication to be efficient at 
> all, and thus, usable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)