[jira] [Comment Edited] (TEZ-1909) Remove need to copy over all events from attempt 1 to attempt 2 dir

2015-03-23 Thread Hitesh Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-1909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14377297#comment-14377297
 ] 

Hitesh Shah edited comment on TEZ-1909 at 3/24/15 5:07 AM:
---

Comments:

{code}
LOG.warn("Other recovery files will be skipped due to error in the previous recovery file");
{code}
  - please add the file name to this log line
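For illustration, a minimal sketch of the same warning carrying the file path (previousRecoveryFile is a hypothetical variable name, not the actual one in RecoveryParser):

{code}
// Illustrative only; previousRecoveryFile is an assumed variable name.
LOG.warn("Other recovery files will be skipped due to error in the previous"
    + " recovery file " + previousRecoveryFile);
{code}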

For TEZ_AM_RECOVERY_HANDLE_REMAINING_EVENT_WHEN_STOPPED, maybe change to 
TEZ_TEST_... and likewise change property value. No scope defined? 
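As an illustration of the suggested rename, a hypothetical test-only constant (the key name and default value below are made up for this sketch, and a configuration scope would still need to be declared, per the remark above):

{code}
// Hypothetical sketch only; not the actual TezConfiguration entries.
public final class TezTestRecoveryConfigSketch {
  private TezTestRecoveryConfigSketch() {}

  // Suggested TEZ_TEST_... naming, with a matching "tez.test." property key.
  public static final String TEZ_TEST_RECOVERY_HANDLE_REMAINING_EVENT_WHEN_STOPPED =
      "tez.test.recovery.handle-remaining-event-when-stopped";
  public static final boolean
      TEZ_TEST_RECOVERY_HANDLE_REMAINING_EVENT_WHEN_STOPPED_DEFAULT = false;
}
{code}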

It seems like the patch for this jira has been merged with fixes for a 
different jira? Can these be separated out? 





was (Author: hitesh):
Comments:

{code}
LOG.warn("Other recovery files will be skipped due to error in the previous recovery file");
{code}
  - please add the file name to this line as well as its length 

For TEZ_AM_RECOVERY_HANDLE_REMAINING_EVENT_WHEN_STOPPED, maybe change to 
TEZ_TEST_... and likewise change property value. No scope defined? 

It seems like the patch for this jira has been merged with fixes for a 
different jira? Can these be separated out? 




 Remove need to copy over all events from attempt 1 to attempt 2 dir
 ---

 Key: TEZ-1909
 URL: https://issues.apache.org/jira/browse/TEZ-1909
 Project: Apache Tez
  Issue Type: Sub-task
Reporter: Hitesh Shah
Assignee: Jeff Zhang
 Attachments: TEZ-1909-1.patch, TEZ-1909-2.patch, TEZ-1909-3.patch


 Use of file versions should prevent the need for copying over data into a 
 second attempt dir. Care needs to be taken to correctly handle a corrupt last 
 record. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (TEZ-1909) Remove need to copy over all events from attempt 1 to attempt 2 dir

2015-03-20 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-1909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14366825#comment-14366825
 ] 

Jeff Zhang edited comment on TEZ-1909 at 3/20/15 9:14 AM:
--

Attached the new patch to address the review comments. [~hitesh] Please help 
review.
Apart from the issues in the review comments, I also found one issue in 
RecoveryService. For the scenario of draining events before RecoveryService is 
stopped, I previously took the event queue's size being zero as an indication 
that all events had been consumed, but that is not true: even if the event 
queue is empty, an event may still be in the middle of being processed. I fixed 
this bug in the new patch in the same way AsyncDispatcher does.
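For context, a minimal sketch of the drain pattern described above, assuming a single consumer thread; the class and field names are illustrative, not the actual RecoveryService or AsyncDispatcher code:

{code}
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Sketch only: an empty queue does not by itself mean "drained", because the
// consumer may still be handling the event it has already dequeued.
public class DrainableEventQueue<T> {

  public interface EventHandler<E> {
    void handle(E event);
  }

  private final BlockingQueue<T> eventQueue = new LinkedBlockingQueue<T>();
  private final Object waitForDrained = new Object();
  private volatile boolean drained = true;

  public void enqueue(T event) throws InterruptedException {
    drained = false;
    eventQueue.put(event);
  }

  // Body of the single consumer thread.
  public void consumeLoop(EventHandler<T> handler) throws InterruptedException {
    while (!Thread.currentThread().isInterrupted()) {
      T event = eventQueue.take();
      handler.handle(event);
      // Only after the in-flight event is fully handled may "drained" become true.
      synchronized (waitForDrained) {
        drained = eventQueue.isEmpty();
        waitForDrained.notifyAll();
      }
    }
  }

  // Called while stopping: wait until the queue is empty AND nothing is in flight.
  public void waitUntilDrained() throws InterruptedException {
    synchronized (waitForDrained) {
      while (!drained) {
        waitForDrained.wait(100);
      }
    }
  }
}
{code}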

bq. the if (skipAllOtherEvents) { check is probably also needed at the top of 
the loop to prevent new files from being opened and read ( in addition to 
short-circuiting the read of all events in the given file ). Maybe just log a 
message that other files were present and skipped
Fixed. Also added a unit test in TestRecoveryParser.
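A minimal sketch of that check at the top of the file loop, as suggested; the loop structure and variable names here are illustrative, not the actual RecoveryParser code:

{code}
// Illustrative only; recoveryFiles, skipAllOtherEvents and LOG are assumed names.
boolean skipAllOtherEvents = false;
for (Path recoveryFile : recoveryFiles) {
  if (skipAllOtherEvents) {
    // Do not even open later files once a corrupt/partial record has been hit.
    LOG.warn("Recovery file " + recoveryFile
        + " is skipped because an earlier recovery file could not be read completely");
    continue;
  }
  // ... open recoveryFile and read its events; on a corrupt last record,
  // set skipAllOtherEvents = true ...
}
{code}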

bq. any reason why this is needed in the DAGAppMaster Set<String> getDagIDs()?
It was only for a unit test. But in the new patch, I removed it and initialize 
the Set in the setup method.

bq. also, we should add a test for adding corrupt data to the summary stream 
and ensuring that its processing fails
Done.

bq. I do not see TEZ_AM_RECOVERY_HANDLE_REMAINING_EVENT_WHEN_STOPPED being used 
anywhere apart from being set to true in one of the tests.
Fixed.

bq. please replace import com.sun.tools.javac.util.List; with java.lang.List
Fixed.

bq. testCorruptedLastRecord should also verify that the dag submitted event was 
seen.
Done. Verified that DAGAppMaster.createDAG is invoked.







was (Author: zjffdu):
Attached the new patch to address the review comments.
Apart from the issues in the review comments, I also found one issue in 
RecoveryService. For the scenario of draining events before RecoveryService is 
stopped, I previously took the event queue's size being zero as an indication 
that all events had been consumed, but that is not true: even if the event 
queue is empty, an event may still be in the middle of being processed. I fixed 
this bug in the new patch in the same way AsyncDispatcher does.

bq. the if (skipAllOtherEvents) { check is probably also needed at the top of 
the loop to prevent new files from being opened and read ( in addition to 
short-circuiting the read of all events in the given file ). Maybe just log a 
message that other files were present and skipped
Fixed. Also added a unit test in TestRecoveryParser.

bq. any reason why this is needed in the DAGAppMaster Set<String> getDagIDs()?
It was only for a unit test. But in the new patch, I removed it and initialize 
the Set in the setup method.

bq. also, we should add a test for adding corrupt data to the summary stream 
and ensuring that its processing fails
Done.

bq. I do not see TEZ_AM_RECOVERY_HANDLE_REMAINING_EVENT_WHEN_STOPPED being used 
anywhere apart from being set to true in one of the tests.
Fixed.

bq. please replace import com.sun.tools.javac.util.List; with java.lang.List
Fixed.

bq. testCorruptedLastRecord should also verify that the dag submitted event was 
seen.
Done. Verified that DAGAppMaster.createDAG is invoked.






 Remove need to copy over all events from attempt 1 to attempt 2 dir
 ---

 Key: TEZ-1909
 URL: https://issues.apache.org/jira/browse/TEZ-1909
 Project: Apache Tez
  Issue Type: Sub-task
Reporter: Hitesh Shah
Assignee: Jeff Zhang
 Attachments: TEZ-1909-1.patch, TEZ-1909-2.patch, TEZ-1909-3.patch


 Use of file versions should prevent the need for copying over data into a 
 second attempt dir. Care needs to be taken to correctly handle a corrupt last 
 record. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (TEZ-1909) Remove need to copy over all events from attempt 1 to attempt 2 dir

2015-03-18 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-1909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14366825#comment-14366825
 ] 

Jeff Zhang edited comment on TEZ-1909 at 3/18/15 8:37 AM:
--

Attached the new patch to address the review comments.
Apart from the issues in the review comments, I also found one issue in 
RecoveryService. For the scenario of draining events before RecoveryService is 
stopped, I previously took the event queue's size being zero as an indication 
that all events had been consumed, but that is not true: even if the event 
queue is empty, an event may still be in the middle of being processed. I fixed 
this bug in the new patch in the same way AsyncDispatcher does.

bq. the if (skipAllOtherEvents) { check is probably also needed at the top of 
the loop to prevent new files from being opened and read ( in addition to 
short-circuiting the read of all events in the given file ). Maybe just log a 
message that other files were present and skipped
Fixed. Also added a unit test in TestRecoveryParser.

bq. any reason why this is needed in the DAGAppMaster Set<String> getDagIDs()?
It was only for a unit test. But in the new patch, I removed it and initialize 
the Set in the setup method.

bq. also, we should add a test for adding corrupt data to the summary stream 
and ensuring that its processing fails
Done.

bq. I do not see TEZ_AM_RECOVERY_HANDLE_REMAINING_EVENT_WHEN_STOPPED being used 
anywhere apart from being set to true in one of the tests.
Fixed.

bq. please replace import com.sun.tools.javac.util.List; with java.lang.List
Fixed.

bq. testCorruptedLastRecord should also verify that the dag submitted event was 
seen.
Done. Verified that DAGAppMaster.createDAG is invoked.







was (Author: zjffdu):
Attached the new patch to address the review comments.
Apart from the issues in the review comments, I also found one issue in 
RecoveryService. For the scenario of draining events before RecoveryService is 
stopped, I previously took the event queue's size being zero as an indication 
that all events had been consumed, but that is not true: even if the event 
queue is empty, an event may still be in the middle of being processed. I fixed 
this bug in the new patch in the same way AsyncDispatcher does.

bq. the if (skipAllOtherEvents) { check is probably also needed at the top of 
the loop to prevent new files from being opened and read ( in addition to 
short-circuiting the read of all events in the given file ). Maybe just log a 
message that other files were present and skipped
Fixed. Also added a unit test in TestRecoveryParser.

bq. any reason why this is needed in the DAGAppMaster Set<String> getDagIDs()?
It was only for a unit test. But in the new patch, I removed it and initialize 
the Set in the setup method.

bq. also, we should add a test for adding corrupt data to the summary stream 
and ensuring that its processing fails
Done.

bq. I do not see TEZ_AM_RECOVERY_HANDLE_REMAINING_EVENT_WHEN_STOPPED being used 
anywhere apart from being set to true in one of the tests.
Fixed.

bq. please replace import com.sun.tools.javac.util.List; with java.lang.List
Fixed.

bq. testCorruptedLastRecord should also verify that the dag submitted event was 
seen.
Done. Verified that DAGAppMaster.createDAG is invoked.






 Remove need to copy over all events from attempt 1 to attempt 2 dir
 ---

 Key: TEZ-1909
 URL: https://issues.apache.org/jira/browse/TEZ-1909
 Project: Apache Tez
  Issue Type: Sub-task
Reporter: Hitesh Shah
Assignee: Jeff Zhang
 Attachments: TEZ-1909-1.patch, TEZ-1909-2.patch, TEZ-1909-3.patch


 Use of file versions should prevent the need for copying over data into a 
 second attempt dir. Care needs to be taken to correctly handle a corrupt last 
 record. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (TEZ-1909) Remove need to copy over all events from attempt 1 to attempt 2 dir

2015-03-11 Thread Hitesh Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-1909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14357592#comment-14357592
 ] 

Hitesh Shah edited comment on TEZ-1909 at 3/11/15 9:02 PM:
---

Minor comments: 

 - there can be cases where data is partially written, hence there might be an 
error when reading the last record. Maybe we should add a simulated test for 
this by writing invalid data to the end of an intermediate summary and dag file 
and seeing whether the code handles it correctly?
 - skipAllOtherEvents should probably be a flag across all files for a given 
dag. At the moment, it is considered only for a single dag file and then reset. 
 - the log line LOG.info("isSpeculationEnabled: " + isSpeculationEnabled); was 
removed - not sure why. 

{code}
for (int attemptNum = 1; attemptNum <= 3; ++attemptNum) {
  List<HistoryEvent> historyEvents = new ArrayList<HistoryEvent>();
  for (int i = 1; i <= attemptNum; ++i) {
    Path currentAttemptRecoveryDataDir =
        TezCommonUtils.getAttemptRecoveryPath(recoveryDataDir, i);
    Path recoveryFilePath = new Path(currentAttemptRecoveryDataDir,
        appId.toString().replace("application", "dag") + "_1" +
        TezConstants.DAG_RECOVERY_RECOVER_FILE_SUFFIX);
    historyEvents.addAll(RecoveryParser.parseDAGRecoveryFile(
        fs.open(recoveryFilePath)));
  }
{code}

The above code needs a bit of cleanup in TestDAGRecovery - not sure why we need 
2 loops for the 3 attempts' recovery data. 
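For what it's worth, a sketch of how the nested loops could collapse into a single accumulating loop; variable names follow the quoted test code, and this is only an illustration, not the actual patch:

{code}
// Sketch only: read each attempt's recovery file once and accumulate the events,
// instead of re-reading attempts 1..N for every value of attemptNum.
List<HistoryEvent> historyEvents = new ArrayList<HistoryEvent>();
for (int attemptNum = 1; attemptNum <= 3; ++attemptNum) {
  Path attemptRecoveryDataDir =
      TezCommonUtils.getAttemptRecoveryPath(recoveryDataDir, attemptNum);
  Path recoveryFilePath = new Path(attemptRecoveryDataDir,
      appId.toString().replace("application", "dag") + "_1" +
      TezConstants.DAG_RECOVERY_RECOVER_FILE_SUFFIX);
  historyEvents.addAll(RecoveryParser.parseDAGRecoveryFile(
      fs.open(recoveryFilePath)));
  // per-attempt assertions on historyEvents would go here
}
{code}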




was (Author: hitesh):
Minor comments: 

 - there can be cases where data is partially written, hence there might be an 
error when reading the last record. Maybe we should add a simulated test for 
this by writing invalid data to the end of an intermediate summary and dag file 
and seeing whether the code handles it correctly?
 - skipAllOtherEvents should probably be a flag across all files for a given 
dag. At the moment, it is considered only for a single dag file and then reset. 
 - the log line LOG.info("isSpeculationEnabled: " + isSpeculationEnabled); was 
removed - not sure why. 

{code}
for (int attemptNum = 1; attemptNum <= 3; ++attemptNum) {
  List<HistoryEvent> historyEvents = new ArrayList<HistoryEvent>();
  for (int i = 1; i <= attemptNum; ++i) {
    Path currentAttemptRecoveryDataDir =
        TezCommonUtils.getAttemptRecoveryPath(recoveryDataDir, i);
    Path recoveryFilePath = new Path(currentAttemptRecoveryDataDir,
        appId.toString().replace("application", "dag") + "_1" +
        TezConstants.DAG_RECOVERY_RECOVER_FILE_SUFFIX);
    historyEvents.addAll(RecoveryParser.parseDAGRecoveryFile(
        fs.open(recoveryFilePath)));
  }
{code}

The above code needs a bit of cleanup - not sure why we need 2 loops for the 3 
attempts' recovery data. 



 Remove need to copy over all events from attempt 1 to attempt 2 dir
 ---

 Key: TEZ-1909
 URL: https://issues.apache.org/jira/browse/TEZ-1909
 Project: Apache Tez
  Issue Type: Sub-task
Reporter: Hitesh Shah
Assignee: Jeff Zhang
 Attachments: TEZ-1909-1.patch


 Use of file versions should prevent the need for copying over data into a 
 second attempt dir. Care needs to be taken to correctly handle a corrupt last 
 record. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (TEZ-1909) Remove need to copy over all events from attempt 1 to attempt 2 dir

2014-12-30 Thread Hitesh Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-1909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14262006#comment-14262006
 ] 

Hitesh Shah edited comment on TEZ-1909 at 12/31/14 7:23 AM:


Actually both. 

Today, we end up copying over data from the previous attempt into the current 
attempt's directory (the attempt-specific directory already exists, hence this 
covers part 1 of your comment). It might be better to just have a chain of 
partial files to reduce the copy overhead.
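A rough sketch of what consuming such a chain of partial files could look like, reusing the helper names that appear in the review comments above (TezCommonUtils.getAttemptRecoveryPath, RecoveryParser.parseDAGRecoveryFile); currentAttemptNum, dagRecoveryFileName and fs are assumed names for this illustration only:

{code}
// Sketch only: instead of copying attempt N-1's recovery data into attempt N's
// directory, walk the chain of per-attempt partial files and parse each in order.
List<HistoryEvent> allEvents = new ArrayList<HistoryEvent>();
for (int attempt = 1; attempt <= currentAttemptNum; ++attempt) {
  Path attemptDir = TezCommonUtils.getAttemptRecoveryPath(recoveryDataDir, attempt);
  Path recoveryFile = new Path(attemptDir, dagRecoveryFileName);
  if (fs.exists(recoveryFile)) {
    // A corrupt/partial last record in an older attempt's file needs special
    // handling here rather than failing recovery outright.
    allEvents.addAll(RecoveryParser.parseDAGRecoveryFile(fs.open(recoveryFile)));
  }
}
{code}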


was (Author: hitesh):
Actually both. 

Today, we end up copying over data from the previous attempt into the current 
attempt's directory. It might be better to just have a chain of partial files 
to reduce the copy overhead.

 Remove need to copy over all events from attempt 1 to attempt 2 dir
 ---

 Key: TEZ-1909
 URL: https://issues.apache.org/jira/browse/TEZ-1909
 Project: Apache Tez
  Issue Type: Sub-task
Reporter: Hitesh Shah
Assignee: Jeff Zhang

 Use of file versions should prevent the need for copying over data into a 
 second attempt dir. Care needs to be taken to correctly handle a corrupt last 
 record. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)