[jira] [Commented] (HBASE-5210) HFiles are missing from an incremental load

2013-02-27 Thread Nick Dimiduk (JIRA)

[ https://issues.apache.org/jira/browse/HBASE-5210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13588702#comment-13588702 ]

Nick Dimiduk commented on HBASE-5210:
-

Can this issue be reproduced in a more modern HBase? Can we close this as
WON'T FIX as we sunset the 0.90 line?

 HFiles are missing from an incremental load
 ---

 Key: HBASE-5210
 URL: https://issues.apache.org/jira/browse/HBASE-5210
 Project: HBase
  Issue Type: Bug
  Components: mapreduce
Affects Versions: 0.90.2
 Environment: HBase 0.90.2 with Hadoop-0.20.2 (with durable sync).  
 RHEL 2.6.18-164.15.1.el5.  4 node cluster (1 master, 3 slaves)
Reporter: Lawrence Simpson
 Attachments: HBASE-5210-crazy-new-getRandomFilename.patch


 We run an overnight map/reduce job that loads data from an external source 
 and adds that data to an existing HBase table.  The input files have been 
 loaded into HDFS.  The map/reduce job uses the HFileOutputFormat (and the 
 TotalOrderPartitioner) to create HFiles, which are subsequently added to the 
 HBase table.  On at least two separate occasions (that we know of), a range 
 of output was missing for a given day.  The range of keys for the missing 
 values corresponded to those of a particular region.  This implied that a 
 complete HFile somehow went missing from the job.  Further investigation 
 revealed the following:
 Two different reducers (running in separate JVMs and thus separate class 
 loaders) on the same server can end up using the same file names for their 
 HFiles.  The scenario is as follows:
  1.  Both reducers start near the same time.
  2.  The first reducer reaches the point where it wants to write its first 
      file.
  3.  It uses the StoreFile class, which contains a static Random object 
      that is initialized by default using a timestamp.
  4.  The file name is generated using the random number generator.
  5.  The file name is checked against other existing files.
  6.  The file is written into temporary files in a directory named after 
      the reducer attempt.
  7.  The second reduce task reaches the same point, but its StoreFile class 
      (which is now in the file system's cache) gets loaded within the time 
      resolution of the OS and thus initializes its Random object with the 
      same seed as the first task.
  8.  The second task also checks for an existing file with the name 
      generated by the random number generator and finds no conflict, 
      because each task is writing files in its own temporary folder.
  9.  The first task finishes and gets its temporary files committed to the 
      real folder specified for output of the HFiles.
  10. The second task then reaches its own conclusion and commits its files 
      (moveTaskOutputs).  The released Hadoop code just overwrites any files 
      with the same name, with no warning messages or anything.  The first 
      task's HFiles just go missing.

  Note: The reducers here are NOT different attempts at the same reduce 
  task.  They are different reduce tasks, so data is really lost.

 I am currently testing a fix in which I have added code to the Hadoop 
 FileOutputCommitter.moveTaskOutputs method to check for a conflict with
 an existing file in the final output folder and to rename the HFile if
 needed.  This may not be appropriate for all uses of FileOutputFormat.
 So I have put this into a new class which is then used by a subclass of
 HFileOutputFormat.  Subclassing of FileOutputCommitter itself was a bit 
 more of a problem due to private declarations.
 I don't know if my approach is the best fix for the problem.  If someone
 more knowledgeable than myself deems that it is, I will be happy to share
 what I have done and by that time I may have some information on the
 results.
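
For illustration only, here is a minimal, self-contained Java sketch of the
failure mode described above.  It assumes nothing beyond what the report
states (a static, timestamp-seeded Random used to pick file names); the
randomName helper is hypothetical, not the actual StoreFile code.

    import java.util.Random;

    // Illustrative sketch only: randomName is a hypothetical stand-in for a
    // getRandomFilename-style helper, not the real StoreFile code.
    public class SeedCollisionDemo {
      static String randomName(Random rand) {
        return Long.toString(rand.nextLong() & Long.MAX_VALUE);
      }

      public static void main(String[] args) {
        long sameTick = System.currentTimeMillis();
        // Two reducer JVMs that initialize their static Random within the
        // same millisecond effectively do this:
        Random reducerA = new Random(sameTick);
        Random reducerB = new Random(sameTick);
        // Both print the same "random" file name, so both tasks would try
        // to commit an HFile under that name.
        System.out.println(randomName(reducerA));
        System.out.println(randomName(reducerB));
      }
    }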

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-5210) HFiles are missing from an incremental load

2012-01-27 Thread Todd Lipcon (Commented) (JIRA)

[ https://issues.apache.org/jira/browse/HBASE-5210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13195065#comment-13195065 ]

Todd Lipcon commented on HBASE-5210:


In Store.java, when we bulk-load the MR output, we rename to a randomly 
generated filename in the region directory, using a UUID, it looks like. So the 
names of the MR output files should be inconsequential.
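
For context, a rough sketch of the UUID-based renaming described here.  The
class and method names are hypothetical, not the actual Store.java code; the
point is simply that the name chosen by the MR task is discarded when the
HFile is renamed at bulk-load time.

    import java.util.UUID;

    import org.apache.hadoop.fs.Path;

    // Hypothetical helper, not the real Store.java API: give a bulk-loaded
    // HFile a fresh UUID-derived name inside the region/family directory,
    // so the name produced by the MR task no longer matters.
    final class BulkLoadNames {
      static Path uniqueStoreFilePath(Path regionDir, String family) {
        String name = UUID.randomUUID().toString().replaceAll("-", "");
        return new Path(new Path(regionDir, family), name);
      }
    }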

[jira] [Commented] (HBASE-5210) HFiles are missing from an incremental load

2012-01-24 Thread Lawrence Simpson (Commented) (JIRA)

[ https://issues.apache.org/jira/browse/HBASE-5210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13192217#comment-13192217 ]

Lawrence Simpson commented on HBASE-5210:
-

@Todd:
Two questions about your solution:
1. If we were to form a file name from just the numeric digits of the task 
attempt ID, that would be 23 digits.  As I look at the file names for HBase 
tables, they seem to be 18-19 digits long.  Do you know if there are any 
assumptions made in other HBase code about the length of file names for store 
files?
2. In the unlikely event that there was a name conflict with an HFile created 
by a reducer, what should happen then?  (The job number looks like it might 
roll over at some point - I don't know if anyone has gotten that far without 
restarting Map/Reduce.)

It still seems to me that the safest solution is a change to HFileOutputFormat 
to use a new output committer class that adds rename logic to 
moveTaskOutputs().  These changes could be implemented strictly in the HBase 
code tree without having to involve the underlying Hadoop implementation. 
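
For comparison, a sketch of what a task-attempt-derived name could look like,
assuming the non-digit characters of the attempt ID are simply stripped; the
class name is made up and this is not a proposed patch.  Whether other HBase
code assumes the current 18-19 digit length is exactly the question raised in
point 1 above.

    import org.apache.hadoop.mapreduce.TaskAttemptContext;

    // Hypothetical sketch of the attempt-ID naming idea, e.g.
    // "attempt_201201230000_0001_r_000003_0" -> "20120123000000010000030"
    // (23 digits), which is unique per reduce task within a job.
    final class AttemptIdNames {
      static String storeFileName(TaskAttemptContext context) {
        return context.getTaskAttemptID().toString().replaceAll("\\D+", "");
      }
    }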

[jira] [Commented] (HBASE-5210) HFiles are missing from an incremental load

2012-01-23 Thread Jimmy Xiang (Commented) (JIRA)

[ https://issues.apache.org/jira/browse/HBASE-5210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13191310#comment-13191310 ]

Jimmy Xiang commented on HBASE-5210:


Any fix in getRandomFilename will just reduce the chance of a file name 
collision.  Since this is a rare case, I think it may be better to just fail the 
task if it fails to commit the files in moveTaskOutputs(), rather than 
overwriting the existing files.  In HDFS 0.23, rename() takes an option not to 
overwrite.  With Hadoop 0.20, we can just do our best to check for conflicts 
before committing the files.
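
A rough sketch of that fail-fast behaviour on Hadoop 0.20, where rename() has
no "don't overwrite" option: check the destination before committing and fail
the task rather than silently replace an existing file.  Class and method
names are illustrative, and the exists()/rename() pair is only a best-effort
check, not an atomic one.

    import java.io.IOException;

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Illustrative only: refuse to overwrite an already-committed file and
    // fail the task instead, so no data is silently lost.
    final class ConflictChecks {
      static void commitFile(FileSystem fs, Path src, Path dst) throws IOException {
        if (fs.exists(dst)) {
          // Failing here is noisy but loses nothing; the task can be retried.
          throw new IOException("Refusing to overwrite existing output " + dst);
        }
        if (!fs.rename(src, dst)) {
          throw new IOException("Failed to rename " + src + " to " + dst);
        }
      }
    }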

[jira] [Commented] (HBASE-5210) HFiles are missing from an incremental load

2012-01-23 Thread Zhihong Yu (Commented) (JIRA)

[ https://issues.apache.org/jira/browse/HBASE-5210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13191314#comment-13191314 ]

Zhihong Yu commented on HBASE-5210:
---

I prefer Lawrence's approach.
The only consideration is that it would take relatively long for the proposed 
change in FileOutputCommitter.moveTaskOutputs() to be published, reviewed, and 
pushed upstream.

[jira] [Commented] (HBASE-5210) HFiles are missing from an incremental load

2012-01-23 Thread Todd Lipcon (Commented) (JIRA)

[ https://issues.apache.org/jira/browse/HBASE-5210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13191327#comment-13191327 ]

Todd Lipcon commented on HBASE-5210:


Why not change the output file name to be based on the task attempt ID? There 
is already a unique id for each task available...

[jira] [Commented] (HBASE-5210) HFiles are missing from an incremental load

2012-01-23 Thread Jimmy Xiang (Commented) (JIRA)

[ https://issues.apache.org/jira/browse/HBASE-5210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13191351#comment-13191351 ]

Jimmy Xiang commented on HBASE-5210:


I like this one.  It's really simple and clean.

[jira] [Commented] (HBASE-5210) HFiles are missing from an incremental load

2012-01-21 Thread Lawrence Simpson (Commented) (JIRA)

[ https://issues.apache.org/jira/browse/HBASE-5210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13190537#comment-13190537 ]

Lawrence Simpson commented on HBASE-5210:
-

@Andrew:
I looked at your suggested patch.  It may not work as well as you hope, since it 
depends on System.nanoTime() changing rapidly, which it may not do on all 
systems.  There are discussions in other blogs about this.  I believe that 
java.util.Random uses System.nanoTime() for default seeding, and it has not 
been terribly successful in my case.
@Zhihong:
I believe that you are correct in that this is the appropriate place to make a 
change.  However, I did not suggest this as a MAPREDUCE change because the 
existing behavior may be correct for applications other than generating HFiles. 
I can imagine situations in which one would want the latest version of a file 
produced by several reducers to be the only one left at the end of a map/reduce 
job.  However, it's definitely not appropriate when producing HFiles for an 
incremental load.  My own solution, which I am testing now, is to clone 
FileOutputCommitter and add logic to the moveTaskOutputs() method that creates 
a new name for any conflicting files.  FileOutputCommitter had too many 
components that were private to make it easy to subclass.  I subclassed 
FileOutputFormat to use the new output committer class, and then I used that 
subclass in my map/reduce job.  I included a logging statement in the new 
moveTaskOutputs() method so that I can tell when the rename logic is triggered. 
It may take a while to see if the logic is successful, since the two occurrences 
that I was able to track down happened two months apart in a job that runs 
nightly.
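
A condensed sketch of the rename-on-conflict idea described in this comment,
under hypothetical class and method names (this is not the actual patch):
while moving a task's output into the final directory, pick a new name
whenever the target already exists, and log when the rename logic fires.

    import java.io.IOException;

    import org.apache.commons.logging.Log;
    import org.apache.commons.logging.LogFactory;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Hypothetical helper showing the shape of the logic, not the patch
    // itself: rename conflicting outputs instead of overwriting them.
    final class RenamingCommitHelper {
      private static final Log LOG = LogFactory.getLog(RenamingCommitHelper.class);

      static void moveIntoPlace(FileSystem fs, Path src, Path finalDir)
          throws IOException {
        Path dst = new Path(finalDir, src.getName());
        int suffix = 0;
        // If another task already committed a file with this name, choose a
        // new name rather than silently overwriting it.
        while (fs.exists(dst)) {
          dst = new Path(finalDir, src.getName() + "_" + (++suffix));
          LOG.info("Output name conflict; renaming " + src.getName()
              + " to " + dst.getName());
        }
        if (!fs.rename(src, dst)) {
          throw new IOException("Failed to move " + src + " to " + dst);
        }
      }
    }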

[jira] [Commented] (HBASE-5210) HFiles are missing from an incremental load

2012-01-16 Thread Zhihong Yu (Commented) (JIRA)

[ https://issues.apache.org/jira/browse/HBASE-5210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13187463#comment-13187463 ]

Zhihong Yu commented on HBASE-5210:
---

@Larry:
Have you filed a MAPREDUCE- JIRA?
FileOutputCommitter.moveTaskOutputs() might be the right place for the change.
