[jira] [Commented] (HADOOP-17042) Hadoop distcp throws "ERROR: Tools helper ///usr/lib/hadoop/libexec/tools/hadoop-distcp.sh was not found"

2020-05-16 Thread Aki Tanaka (Jira)


[ 
https://issues.apache.org/jira/browse/HADOOP-17042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17109240#comment-17109240
 ] 

Aki Tanaka commented on HADOOP-17042:
-

Thank you, [~aajisaka]

>  In addition, can we move the script from libexec/hadoop-distcp.sh to 
>libexec/tools/hadoop-distcp.sh ? This change can be done in a separate jira.
 
I may not have a proper understanding of this, but I think we cannot move 
libexec/shellprofile.d/hadoop-distcp.sh to libexec/tools/hadoop-distcp.sh. 
Because it's used when calling hadoop_import_shellprofiles.
[https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/bin/hadoop-functions.sh#L725]
 
As far as I checked, the libexec/tools/hadoop-distcp.sh file (and other scripts 
in libexec/tools/) does not exist. We might not use this 
hadoop_add_to_classpath_tools function now. If so, I think we can remove this 
hadoop_add_to_classpath_tools method.

> Hadoop distcp throws "ERROR: Tools helper 
> ///usr/lib/hadoop/libexec/tools/hadoop-distcp.sh was not found"
> -
>
> Key: HADOOP-17042
> URL: https://issues.apache.org/jira/browse/HADOOP-17042
> Project: Hadoop Common
>  Issue Type: Improvement
>Affects Versions: 3.2.1, 3.1.3
>Reporter: Aki Tanaka
>Assignee: Aki Tanaka
>Priority: Minor
> Attachments: HADOOP-17042.patch
>
>
> On Hadoop 3.x, we see following "ERROR: Tools helper 
> ///usr/lib/hadoop/libexec/tools/hadoop-distcp.sh was not found." message on 
> the first line of the command output when running Hadoop DistCp.
> {code:java}
> $ hadoop distcp /path/to/src /user/hadoop/
> ERROR: Tools helper ///usr/lib/hadoop/libexec/tools/hadoop-distcp.sh was not 
> found.
> 2020-05-14 17:11:53,173 INFO tools.DistCp: Input Options: 
> DistCpOptions{atomicCommit=false, syncFolder=false, deleteMissing=false, 
> ignoreFailures=false, overwrite=false, append=false, useDiff=false, 
> useRdiff=false, fromSnapshot=null, toSnapshot=null, skipCRC=false, 
> blocking=true
> ..
> {code}
> This message was added by HADOOP-12857 and it would be an expected behavior.
>  DistCp calls 'hadoop_add_to_classpath_tools hadoop-distcp' when [it 
> starts|https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-distcp/src/main/shellprofile.d/hadoop-distcp.sh],
>  and the error is returned because the hadoop-distcp.sh does not exist in the 
> tools directory.
> However, that error message confuses us. Since this is not an user end 
> configuration issue, I would think it's better to change the log level to 
> debug (hadoop_debug).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-17042) Hadoop distcp throws "ERROR: Tools helper ///usr/lib/hadoop/libexec/tools/hadoop-distcp.sh was not found"

2020-05-14 Thread Aki Tanaka (Jira)


 [ 
https://issues.apache.org/jira/browse/HADOOP-17042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aki Tanaka updated HADOOP-17042:

Attachment: HADOOP-17042.patch

> Hadoop distcp throws "ERROR: Tools helper 
> ///usr/lib/hadoop/libexec/tools/hadoop-distcp.sh was not found"
> -
>
> Key: HADOOP-17042
> URL: https://issues.apache.org/jira/browse/HADOOP-17042
> Project: Hadoop Common
>  Issue Type: Improvement
>Affects Versions: 3.2.1, 3.1.3
>Reporter: Aki Tanaka
>Priority: Minor
> Attachments: HADOOP-17042.patch
>
>
> On Hadoop 3.x, we see following "ERROR: Tools helper 
> ///usr/lib/hadoop/libexec/tools/hadoop-distcp.sh was not found." message on 
> the first line of the command output when running Hadoop DistCp.
> {code:java}
> $ hadoop distcp /path/to/src /user/hadoop/
> ERROR: Tools helper ///usr/lib/hadoop/libexec/tools/hadoop-distcp.sh was not 
> found.
> 2020-05-14 17:11:53,173 INFO tools.DistCp: Input Options: 
> DistCpOptions{atomicCommit=false, syncFolder=false, deleteMissing=false, 
> ignoreFailures=false, overwrite=false, append=false, useDiff=false, 
> useRdiff=false, fromSnapshot=null, toSnapshot=null, skipCRC=false, 
> blocking=true
> ..
> {code}
> This message was added by HADOOP-12857 and it would be an expected behavior.
>  DistCp calls 'hadoop_add_to_classpath_tools hadoop-distcp' when [it 
> starts|https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-distcp/src/main/shellprofile.d/hadoop-distcp.sh],
>  and the error is returned because the hadoop-distcp.sh does not exist in the 
> tools directory.
> However, that error message confuses us. Since this is not an user end 
> configuration issue, I would think it's better to change the log level to 
> debug (hadoop_debug).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Created] (HADOOP-17042) Hadoop distcp throws "ERROR: Tools helper ///usr/lib/hadoop/libexec/tools/hadoop-distcp.sh was not found"

2020-05-14 Thread Aki Tanaka (Jira)
Aki Tanaka created HADOOP-17042:
---

 Summary: Hadoop distcp throws "ERROR: Tools helper 
///usr/lib/hadoop/libexec/tools/hadoop-distcp.sh was not found"
 Key: HADOOP-17042
 URL: https://issues.apache.org/jira/browse/HADOOP-17042
 Project: Hadoop Common
  Issue Type: Improvement
Affects Versions: 3.1.3, 3.2.1
Reporter: Aki Tanaka
 Attachments: HADOOP-17042.patch

On Hadoop 3.x, we see following "ERROR: Tools helper 
///usr/lib/hadoop/libexec/tools/hadoop-distcp.sh was not found." message on the 
first line of the command output when running Hadoop DistCp.
{code:java}
$ hadoop distcp /path/to/src /user/hadoop/
ERROR: Tools helper ///usr/lib/hadoop/libexec/tools/hadoop-distcp.sh was not 
found.
2020-05-14 17:11:53,173 INFO tools.DistCp: Input Options: 
DistCpOptions{atomicCommit=false, syncFolder=false, deleteMissing=false, 
ignoreFailures=false, overwrite=false, append=false, useDiff=false, 
useRdiff=false, fromSnapshot=null, toSnapshot=null, skipCRC=false, blocking=true
..
{code}
This message was added by HADOOP-12857 and it would be an expected behavior.
 DistCp calls 'hadoop_add_to_classpath_tools hadoop-distcp' when [it 
starts|https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-distcp/src/main/shellprofile.d/hadoop-distcp.sh],
 and the error is returned because the hadoop-distcp.sh does not exist in the 
tools directory.

However, that error message confuses us. Since this is not an user end 
configuration issue, I would think it's better to change the log level to debug 
(hadoop_debug).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Commented] (HADOOP-15206) BZip2 drops and duplicates records when input split size is small

2018-03-08 Thread Aki Tanaka (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-15206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16391568#comment-16391568
 ] 

Aki Tanaka commented on HADOOP-15206:
-

In my understanding, skipBytes-- will not be executed basically.

bufferedIn.read() is executed only when skipBytes is 0, this usually means that 
the file position is at the end of the split. However, InputStream.skip says 
"Skips over and discards {{n}} bytes of data from the input stream. The 
{{skip}} method may, for a variety of reasons, end up skipping over some 
smaller number of bytes, possibly {{0}}. The actual number of bytes skipped is 
returned." 
([https://docs.oracle.com/javase/7/docs/api/java/io/FilterInputStream.html).] 
So I thought InputStream.skip() might return 0 even if the position is not at 
the end of the split. 

 

Please let me know if my understanding is wrong. Thank you.

> BZip2 drops and duplicates records when input split size is small
> -
>
> Key: HADOOP-15206
> URL: https://issues.apache.org/jira/browse/HADOOP-15206
> Project: Hadoop Common
>  Issue Type: Bug
>Affects Versions: 2.8.3, 3.0.0
>Reporter: Aki Tanaka
>Assignee: Aki Tanaka
>Priority: Major
> Fix For: 3.1.0, 2.10.0, 2.9.1, 2.8.4, 2.7.6, 3.0.2
>
> Attachments: HADOOP-15206-test.patch, HADOOP-15206.001.patch, 
> HADOOP-15206.002.patch, HADOOP-15206.003.patch, HADOOP-15206.004.patch, 
> HADOOP-15206.005.patch, HADOOP-15206.006.patch, HADOOP-15206.007.patch, 
> HADOOP-15206.008.patch
>
>
> BZip2 can drop and duplicate record when input split file is small. I 
> confirmed that this issue happens when the input split size is between 1byte 
> and 4bytes.
> I am seeing the following 2 problem behaviors.
>  
> 1. Drop record:
> BZip2 skips the first record in the input file when the input split size is 
> small
>  
> Set the split size to 3 and tested to load 100 records (0, 1, 2..99)
> {code:java}
> 2018-02-01 10:52:33,502 INFO  [Thread-17] mapred.TestTextInputFormat 
> (TestTextInputFormat.java:verifyPartitions(317)) - 
> splits[1]=file:/work/count-mismatch2/hadoop/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/target/test-dir/TestTextInputFormat/test.bz2:3+3
>  count=99{code}
> > The input format read only 99 records but not 100 records
>  
> 2. Duplicate Record:
> 2 input splits has same BZip2 records when the input split size is small
>  
> Set the split size to 1 and tested to load 100 records (0, 1, 2..99)
>  
> {code:java}
> 2018-02-01 11:18:49,309 INFO [Thread-17] mapred.TestTextInputFormat 
> (TestTextInputFormat.java:verifyPartitions(318)) - splits[3]=file 
> /work/count-mismatch2/hadoop/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/target/test-dir/TestTextInputFormat/test.bz2:3+1
>  count=99
> 2018-02-01 11:18:49,310 WARN [Thread-17] mapred.TestTextInputFormat 
> (TestTextInputFormat.java:verifyPartitions(308)) - conflict with 1 in split 4 
> at position 8
> {code}
>  
> I experienced this error when I execute Spark (SparkSQL) job under the 
> following conditions:
> * The file size of the input files are small (around 1KB)
> * Hadoop cluster has many slave nodes (able to launch many executor tasks)
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-15206) BZip2 drops and duplicates records when input split size is small

2018-02-15 Thread Aki Tanaka (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-15206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aki Tanaka updated HADOOP-15206:

Attachment: HADOOP-15206.008.patch

> BZip2 drops and duplicates records when input split size is small
> -
>
> Key: HADOOP-15206
> URL: https://issues.apache.org/jira/browse/HADOOP-15206
> Project: Hadoop Common
>  Issue Type: Bug
>Affects Versions: 2.8.3, 3.0.0
>Reporter: Aki Tanaka
>Assignee: Aki Tanaka
>Priority: Major
> Attachments: HADOOP-15206-test.patch, HADOOP-15206.001.patch, 
> HADOOP-15206.002.patch, HADOOP-15206.003.patch, HADOOP-15206.004.patch, 
> HADOOP-15206.005.patch, HADOOP-15206.006.patch, HADOOP-15206.007.patch, 
> HADOOP-15206.008.patch
>
>
> BZip2 can drop and duplicate record when input split file is small. I 
> confirmed that this issue happens when the input split size is between 1byte 
> and 4bytes.
> I am seeing the following 2 problem behaviors.
>  
> 1. Drop record:
> BZip2 skips the first record in the input file when the input split size is 
> small
>  
> Set the split size to 3 and tested to load 100 records (0, 1, 2..99)
> {code:java}
> 2018-02-01 10:52:33,502 INFO  [Thread-17] mapred.TestTextInputFormat 
> (TestTextInputFormat.java:verifyPartitions(317)) - 
> splits[1]=file:/work/count-mismatch2/hadoop/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/target/test-dir/TestTextInputFormat/test.bz2:3+3
>  count=99{code}
> > The input format read only 99 records but not 100 records
>  
> 2. Duplicate Record:
> 2 input splits has same BZip2 records when the input split size is small
>  
> Set the split size to 1 and tested to load 100 records (0, 1, 2..99)
>  
> {code:java}
> 2018-02-01 11:18:49,309 INFO [Thread-17] mapred.TestTextInputFormat 
> (TestTextInputFormat.java:verifyPartitions(318)) - splits[3]=file 
> /work/count-mismatch2/hadoop/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/target/test-dir/TestTextInputFormat/test.bz2:3+1
>  count=99
> 2018-02-01 11:18:49,310 WARN [Thread-17] mapred.TestTextInputFormat 
> (TestTextInputFormat.java:verifyPartitions(308)) - conflict with 1 in split 4 
> at position 8
> {code}
>  
> I experienced this error when I execute Spark (SparkSQL) job under the 
> following conditions:
> * The file size of the input files are small (around 1KB)
> * Hadoop cluster has many slave nodes (able to launch many executor tasks)
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Commented] (HADOOP-15206) BZip2 drops and duplicates records when input split size is small

2018-02-15 Thread Aki Tanaka (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-15206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16366061#comment-16366061
 ] 

Aki Tanaka commented on HADOOP-15206:
-

Thank you. Since this is the first time to apply a patch to the community, I 
apologize for having bothered you.

Update the patch, I think I fixed the points you pointed out, please let me 
know if there is anything I should do.

> BZip2 drops and duplicates records when input split size is small
> -
>
> Key: HADOOP-15206
> URL: https://issues.apache.org/jira/browse/HADOOP-15206
> Project: Hadoop Common
>  Issue Type: Bug
>Affects Versions: 2.8.3, 3.0.0
>Reporter: Aki Tanaka
>Assignee: Aki Tanaka
>Priority: Major
> Attachments: HADOOP-15206-test.patch, HADOOP-15206.001.patch, 
> HADOOP-15206.002.patch, HADOOP-15206.003.patch, HADOOP-15206.004.patch, 
> HADOOP-15206.005.patch, HADOOP-15206.006.patch, HADOOP-15206.007.patch, 
> HADOOP-15206.008.patch
>
>
> BZip2 can drop and duplicate record when input split file is small. I 
> confirmed that this issue happens when the input split size is between 1byte 
> and 4bytes.
> I am seeing the following 2 problem behaviors.
>  
> 1. Drop record:
> BZip2 skips the first record in the input file when the input split size is 
> small
>  
> Set the split size to 3 and tested to load 100 records (0, 1, 2..99)
> {code:java}
> 2018-02-01 10:52:33,502 INFO  [Thread-17] mapred.TestTextInputFormat 
> (TestTextInputFormat.java:verifyPartitions(317)) - 
> splits[1]=file:/work/count-mismatch2/hadoop/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/target/test-dir/TestTextInputFormat/test.bz2:3+3
>  count=99{code}
> > The input format read only 99 records but not 100 records
>  
> 2. Duplicate Record:
> 2 input splits has same BZip2 records when the input split size is small
>  
> Set the split size to 1 and tested to load 100 records (0, 1, 2..99)
>  
> {code:java}
> 2018-02-01 11:18:49,309 INFO [Thread-17] mapred.TestTextInputFormat 
> (TestTextInputFormat.java:verifyPartitions(318)) - splits[3]=file 
> /work/count-mismatch2/hadoop/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/target/test-dir/TestTextInputFormat/test.bz2:3+1
>  count=99
> 2018-02-01 11:18:49,310 WARN [Thread-17] mapred.TestTextInputFormat 
> (TestTextInputFormat.java:verifyPartitions(308)) - conflict with 1 in split 4 
> at position 8
> {code}
>  
> I experienced this error when I execute Spark (SparkSQL) job under the 
> following conditions:
> * The file size of the input files are small (around 1KB)
> * Hadoop cluster has many slave nodes (able to launch many executor tasks)
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Commented] (HADOOP-15206) BZip2 drops and duplicates records when input split size is small

2018-02-14 Thread Aki Tanaka (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-15206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16365134#comment-16365134
 ] 

Aki Tanaka commented on HADOOP-15206:
-

Thank you!! I updated the patch.
 * Changed to try to skip the rest of the bytes in one call
 * Deleted tailing whitespaces
 * Deleted comments in the code
 * Changed comment "steam is on BZip2 header" to "a split is before the first 
BZip2 block"

 

The unit test for 006 failed but it looks that the failed test was not related 
to this patch. 

> BZip2 drops and duplicates records when input split size is small
> -
>
> Key: HADOOP-15206
> URL: https://issues.apache.org/jira/browse/HADOOP-15206
> Project: Hadoop Common
>  Issue Type: Bug
>Affects Versions: 2.8.3, 3.0.0
>Reporter: Aki Tanaka
>Assignee: Aki Tanaka
>Priority: Major
> Attachments: HADOOP-15206-test.patch, HADOOP-15206.001.patch, 
> HADOOP-15206.002.patch, HADOOP-15206.003.patch, HADOOP-15206.004.patch, 
> HADOOP-15206.005.patch, HADOOP-15206.006.patch, HADOOP-15206.007.patch
>
>
> BZip2 can drop and duplicate record when input split file is small. I 
> confirmed that this issue happens when the input split size is between 1byte 
> and 4bytes.
> I am seeing the following 2 problem behaviors.
>  
> 1. Drop record:
> BZip2 skips the first record in the input file when the input split size is 
> small
>  
> Set the split size to 3 and tested to load 100 records (0, 1, 2..99)
> {code:java}
> 2018-02-01 10:52:33,502 INFO  [Thread-17] mapred.TestTextInputFormat 
> (TestTextInputFormat.java:verifyPartitions(317)) - 
> splits[1]=file:/work/count-mismatch2/hadoop/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/target/test-dir/TestTextInputFormat/test.bz2:3+3
>  count=99{code}
> > The input format read only 99 records but not 100 records
>  
> 2. Duplicate Record:
> 2 input splits has same BZip2 records when the input split size is small
>  
> Set the split size to 1 and tested to load 100 records (0, 1, 2..99)
>  
> {code:java}
> 2018-02-01 11:18:49,309 INFO [Thread-17] mapred.TestTextInputFormat 
> (TestTextInputFormat.java:verifyPartitions(318)) - splits[3]=file 
> /work/count-mismatch2/hadoop/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/target/test-dir/TestTextInputFormat/test.bz2:3+1
>  count=99
> 2018-02-01 11:18:49,310 WARN [Thread-17] mapred.TestTextInputFormat 
> (TestTextInputFormat.java:verifyPartitions(308)) - conflict with 1 in split 4 
> at position 8
> {code}
>  
> I experienced this error when I execute Spark (SparkSQL) job under the 
> following conditions:
> * The file size of the input files are small (around 1KB)
> * Hadoop cluster has many slave nodes (able to launch many executor tasks)
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-15206) BZip2 drops and duplicates records when input split size is small

2018-02-14 Thread Aki Tanaka (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-15206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aki Tanaka updated HADOOP-15206:

Attachment: HADOOP-15206.007.patch

> BZip2 drops and duplicates records when input split size is small
> -
>
> Key: HADOOP-15206
> URL: https://issues.apache.org/jira/browse/HADOOP-15206
> Project: Hadoop Common
>  Issue Type: Bug
>Affects Versions: 2.8.3, 3.0.0
>Reporter: Aki Tanaka
>Assignee: Aki Tanaka
>Priority: Major
> Attachments: HADOOP-15206-test.patch, HADOOP-15206.001.patch, 
> HADOOP-15206.002.patch, HADOOP-15206.003.patch, HADOOP-15206.004.patch, 
> HADOOP-15206.005.patch, HADOOP-15206.006.patch, HADOOP-15206.007.patch
>
>
> BZip2 can drop and duplicate record when input split file is small. I 
> confirmed that this issue happens when the input split size is between 1byte 
> and 4bytes.
> I am seeing the following 2 problem behaviors.
>  
> 1. Drop record:
> BZip2 skips the first record in the input file when the input split size is 
> small
>  
> Set the split size to 3 and tested to load 100 records (0, 1, 2..99)
> {code:java}
> 2018-02-01 10:52:33,502 INFO  [Thread-17] mapred.TestTextInputFormat 
> (TestTextInputFormat.java:verifyPartitions(317)) - 
> splits[1]=file:/work/count-mismatch2/hadoop/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/target/test-dir/TestTextInputFormat/test.bz2:3+3
>  count=99{code}
> > The input format read only 99 records but not 100 records
>  
> 2. Duplicate Record:
> 2 input splits has same BZip2 records when the input split size is small
>  
> Set the split size to 1 and tested to load 100 records (0, 1, 2..99)
>  
> {code:java}
> 2018-02-01 11:18:49,309 INFO [Thread-17] mapred.TestTextInputFormat 
> (TestTextInputFormat.java:verifyPartitions(318)) - splits[3]=file 
> /work/count-mismatch2/hadoop/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/target/test-dir/TestTextInputFormat/test.bz2:3+1
>  count=99
> 2018-02-01 11:18:49,310 WARN [Thread-17] mapred.TestTextInputFormat 
> (TestTextInputFormat.java:verifyPartitions(308)) - conflict with 1 in split 4 
> at position 8
> {code}
>  
> I experienced this error when I execute Spark (SparkSQL) job under the 
> following conditions:
> * The file size of the input files are small (around 1KB)
> * Hadoop cluster has many slave nodes (able to launch many executor tasks)
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-15206) BZip2 drops and duplicates records when input split size is small

2018-02-14 Thread Aki Tanaka (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-15206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aki Tanaka updated HADOOP-15206:

Attachment: HADOOP-15206.006.patch

> BZip2 drops and duplicates records when input split size is small
> -
>
> Key: HADOOP-15206
> URL: https://issues.apache.org/jira/browse/HADOOP-15206
> Project: Hadoop Common
>  Issue Type: Bug
>Affects Versions: 2.8.3, 3.0.0
>Reporter: Aki Tanaka
>Priority: Major
> Attachments: HADOOP-15206-test.patch, HADOOP-15206.001.patch, 
> HADOOP-15206.002.patch, HADOOP-15206.003.patch, HADOOP-15206.004.patch, 
> HADOOP-15206.005.patch, HADOOP-15206.006.patch
>
>
> BZip2 can drop and duplicate record when input split file is small. I 
> confirmed that this issue happens when the input split size is between 1byte 
> and 4bytes.
> I am seeing the following 2 problem behaviors.
>  
> 1. Drop record:
> BZip2 skips the first record in the input file when the input split size is 
> small
>  
> Set the split size to 3 and tested to load 100 records (0, 1, 2..99)
> {code:java}
> 2018-02-01 10:52:33,502 INFO  [Thread-17] mapred.TestTextInputFormat 
> (TestTextInputFormat.java:verifyPartitions(317)) - 
> splits[1]=file:/work/count-mismatch2/hadoop/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/target/test-dir/TestTextInputFormat/test.bz2:3+3
>  count=99{code}
> > The input format read only 99 records but not 100 records
>  
> 2. Duplicate Record:
> 2 input splits has same BZip2 records when the input split size is small
>  
> Set the split size to 1 and tested to load 100 records (0, 1, 2..99)
>  
> {code:java}
> 2018-02-01 11:18:49,309 INFO [Thread-17] mapred.TestTextInputFormat 
> (TestTextInputFormat.java:verifyPartitions(318)) - splits[3]=file 
> /work/count-mismatch2/hadoop/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/target/test-dir/TestTextInputFormat/test.bz2:3+1
>  count=99
> 2018-02-01 11:18:49,310 WARN [Thread-17] mapred.TestTextInputFormat 
> (TestTextInputFormat.java:verifyPartitions(308)) - conflict with 1 in split 4 
> at position 8
> {code}
>  
> I experienced this error when I execute Spark (SparkSQL) job under the 
> following conditions:
> * The file size of the input files are small (around 1KB)
> * Hadoop cluster has many slave nodes (able to launch many executor tasks)
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Commented] (HADOOP-15206) BZip2 drops and duplicates records when input split size is small

2018-02-14 Thread Aki Tanaka (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-15206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16364611#comment-16364611
 ] 

Aki Tanaka commented on HADOOP-15206:
-

Thank you very much for the review!

I've added the updated patch to the ticket. Please let me know if you have any 
questions on this.

> BZip2 drops and duplicates records when input split size is small
> -
>
> Key: HADOOP-15206
> URL: https://issues.apache.org/jira/browse/HADOOP-15206
> Project: Hadoop Common
>  Issue Type: Bug
>Affects Versions: 2.8.3, 3.0.0
>Reporter: Aki Tanaka
>Priority: Major
> Attachments: HADOOP-15206-test.patch, HADOOP-15206.001.patch, 
> HADOOP-15206.002.patch, HADOOP-15206.003.patch, HADOOP-15206.004.patch, 
> HADOOP-15206.005.patch, HADOOP-15206.006.patch
>
>
> BZip2 can drop and duplicate record when input split file is small. I 
> confirmed that this issue happens when the input split size is between 1byte 
> and 4bytes.
> I am seeing the following 2 problem behaviors.
>  
> 1. Drop record:
> BZip2 skips the first record in the input file when the input split size is 
> small
>  
> Set the split size to 3 and tested to load 100 records (0, 1, 2..99)
> {code:java}
> 2018-02-01 10:52:33,502 INFO  [Thread-17] mapred.TestTextInputFormat 
> (TestTextInputFormat.java:verifyPartitions(317)) - 
> splits[1]=file:/work/count-mismatch2/hadoop/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/target/test-dir/TestTextInputFormat/test.bz2:3+3
>  count=99{code}
> > The input format read only 99 records but not 100 records
>  
> 2. Duplicate Record:
> 2 input splits has same BZip2 records when the input split size is small
>  
> Set the split size to 1 and tested to load 100 records (0, 1, 2..99)
>  
> {code:java}
> 2018-02-01 11:18:49,309 INFO [Thread-17] mapred.TestTextInputFormat 
> (TestTextInputFormat.java:verifyPartitions(318)) - splits[3]=file 
> /work/count-mismatch2/hadoop/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/target/test-dir/TestTextInputFormat/test.bz2:3+1
>  count=99
> 2018-02-01 11:18:49,310 WARN [Thread-17] mapred.TestTextInputFormat 
> (TestTextInputFormat.java:verifyPartitions(308)) - conflict with 1 in split 4 
> at position 8
> {code}
>  
> I experienced this error when I execute Spark (SparkSQL) job under the 
> following conditions:
> * The file size of the input files are small (around 1KB)
> * Hadoop cluster has many slave nodes (able to launch many executor tasks)
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Commented] (HADOOP-15206) BZip2 drops and duplicates records when input split size is small

2018-02-13 Thread Aki Tanaka (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-15206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16362627#comment-16362627
 ] 

Aki Tanaka commented on HADOOP-15206:
-

Thank you, I updated the patch.

 

Implemented the following behavior:
 * If we are in BLOCK mode we avoid advertising the byte position if start 
offset is zero 
 * In BLOCK mode we should also skip to file offset HEADER_LEN + SUB_HEADER_LEN 
+ 1 if the start position is >=0 and < HEADER_LEN + SUB_HEADER_LEN. That will 
put us one byte past the start of the first bz2 block, and BLOCK mode will 
automatically scan forward to the next block

One thing  I am still not sure is when resetState() is called.  We need to skip 
file reading position to 5 to avoid duplicated records f the start position is 
>=0 and < HEADER_LEN + SUB_HEADER_LEN.  I implemented this in 
readStreamHeader(). On the other hand, when InputStream is reset, 
readStreamHeader() is called. I changed the readStreamHeader() is executed only 
when this.compressedStreamPosition is 0 , but we might be okay to remove 
calling readStreamHeader() when InputStream is reset.

 

 

> BZip2 drops and duplicates records when input split size is small
> -
>
> Key: HADOOP-15206
> URL: https://issues.apache.org/jira/browse/HADOOP-15206
> Project: Hadoop Common
>  Issue Type: Bug
>Affects Versions: 2.8.3, 3.0.0
>Reporter: Aki Tanaka
>Priority: Major
> Attachments: HADOOP-15206-test.patch, HADOOP-15206.001.patch, 
> HADOOP-15206.002.patch, HADOOP-15206.003.patch, HADOOP-15206.004.patch, 
> HADOOP-15206.005.patch
>
>
> BZip2 can drop and duplicate record when input split file is small. I 
> confirmed that this issue happens when the input split size is between 1byte 
> and 4bytes.
> I am seeing the following 2 problem behaviors.
>  
> 1. Drop record:
> BZip2 skips the first record in the input file when the input split size is 
> small
>  
> Set the split size to 3 and tested to load 100 records (0, 1, 2..99)
> {code:java}
> 2018-02-01 10:52:33,502 INFO  [Thread-17] mapred.TestTextInputFormat 
> (TestTextInputFormat.java:verifyPartitions(317)) - 
> splits[1]=file:/work/count-mismatch2/hadoop/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/target/test-dir/TestTextInputFormat/test.bz2:3+3
>  count=99{code}
> > The input format read only 99 records but not 100 records
>  
> 2. Duplicate Record:
> 2 input splits has same BZip2 records when the input split size is small
>  
> Set the split size to 1 and tested to load 100 records (0, 1, 2..99)
>  
> {code:java}
> 2018-02-01 11:18:49,309 INFO [Thread-17] mapred.TestTextInputFormat 
> (TestTextInputFormat.java:verifyPartitions(318)) - splits[3]=file 
> /work/count-mismatch2/hadoop/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/target/test-dir/TestTextInputFormat/test.bz2:3+1
>  count=99
> 2018-02-01 11:18:49,310 WARN [Thread-17] mapred.TestTextInputFormat 
> (TestTextInputFormat.java:verifyPartitions(308)) - conflict with 1 in split 4 
> at position 8
> {code}
>  
> I experienced this error when I execute Spark (SparkSQL) job under the 
> following conditions:
> * The file size of the input files are small (around 1KB)
> * Hadoop cluster has many slave nodes (able to launch many executor tasks)
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-15206) BZip2 drops and duplicates records when input split size is small

2018-02-13 Thread Aki Tanaka (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-15206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aki Tanaka updated HADOOP-15206:

Attachment: HADOOP-15206.005.patch

> BZip2 drops and duplicates records when input split size is small
> -
>
> Key: HADOOP-15206
> URL: https://issues.apache.org/jira/browse/HADOOP-15206
> Project: Hadoop Common
>  Issue Type: Bug
>Affects Versions: 2.8.3, 3.0.0
>Reporter: Aki Tanaka
>Priority: Major
> Attachments: HADOOP-15206-test.patch, HADOOP-15206.001.patch, 
> HADOOP-15206.002.patch, HADOOP-15206.003.patch, HADOOP-15206.004.patch, 
> HADOOP-15206.005.patch
>
>
> BZip2 can drop and duplicate record when input split file is small. I 
> confirmed that this issue happens when the input split size is between 1byte 
> and 4bytes.
> I am seeing the following 2 problem behaviors.
>  
> 1. Drop record:
> BZip2 skips the first record in the input file when the input split size is 
> small
>  
> Set the split size to 3 and tested to load 100 records (0, 1, 2..99)
> {code:java}
> 2018-02-01 10:52:33,502 INFO  [Thread-17] mapred.TestTextInputFormat 
> (TestTextInputFormat.java:verifyPartitions(317)) - 
> splits[1]=file:/work/count-mismatch2/hadoop/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/target/test-dir/TestTextInputFormat/test.bz2:3+3
>  count=99{code}
> > The input format read only 99 records but not 100 records
>  
> 2. Duplicate Record:
> 2 input splits has same BZip2 records when the input split size is small
>  
> Set the split size to 1 and tested to load 100 records (0, 1, 2..99)
>  
> {code:java}
> 2018-02-01 11:18:49,309 INFO [Thread-17] mapred.TestTextInputFormat 
> (TestTextInputFormat.java:verifyPartitions(318)) - splits[3]=file 
> /work/count-mismatch2/hadoop/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/target/test-dir/TestTextInputFormat/test.bz2:3+1
>  count=99
> 2018-02-01 11:18:49,310 WARN [Thread-17] mapred.TestTextInputFormat 
> (TestTextInputFormat.java:verifyPartitions(308)) - conflict with 1 in split 4 
> at position 8
> {code}
>  
> I experienced this error when I execute Spark (SparkSQL) job under the 
> following conditions:
> * The file size of the input files are small (around 1KB)
> * Hadoop cluster has many slave nodes (able to launch many executor tasks)
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Commented] (HADOOP-15206) BZip2 drops and duplicates records when input split size is small

2018-02-10 Thread Aki Tanaka (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-15206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16359803#comment-16359803
 ] 

Aki Tanaka commented on HADOOP-15206:
-

Hi Jason,

I really appreciate your code review, I refactored readStreamHeader() , I 
changed the method's return value from buffered input to a length the buffered 
input read in the method. 
 * 
{quote}check for and read the full header if it is at starting position 0
{quote}

In the current implementation, read only "BZ" header when the read mode is 
CONTINUOUS. Do you think we should keep this?

> BZip2 drops and duplicates records when input split size is small
> -
>
> Key: HADOOP-15206
> URL: https://issues.apache.org/jira/browse/HADOOP-15206
> Project: Hadoop Common
>  Issue Type: Bug
>Affects Versions: 2.8.3, 3.0.0
>Reporter: Aki Tanaka
>Priority: Major
> Attachments: HADOOP-15206-test.patch, HADOOP-15206.001.patch, 
> HADOOP-15206.002.patch, HADOOP-15206.003.patch, HADOOP-15206.004.patch
>
>
> BZip2 can drop and duplicate record when input split file is small. I 
> confirmed that this issue happens when the input split size is between 1byte 
> and 4bytes.
> I am seeing the following 2 problem behaviors.
>  
> 1. Drop record:
> BZip2 skips the first record in the input file when the input split size is 
> small
>  
> Set the split size to 3 and tested to load 100 records (0, 1, 2..99)
> {code:java}
> 2018-02-01 10:52:33,502 INFO  [Thread-17] mapred.TestTextInputFormat 
> (TestTextInputFormat.java:verifyPartitions(317)) - 
> splits[1]=file:/work/count-mismatch2/hadoop/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/target/test-dir/TestTextInputFormat/test.bz2:3+3
>  count=99{code}
> > The input format read only 99 records but not 100 records
>  
> 2. Duplicate Record:
> 2 input splits has same BZip2 records when the input split size is small
>  
> Set the split size to 1 and tested to load 100 records (0, 1, 2..99)
>  
> {code:java}
> 2018-02-01 11:18:49,309 INFO [Thread-17] mapred.TestTextInputFormat 
> (TestTextInputFormat.java:verifyPartitions(318)) - splits[3]=file 
> /work/count-mismatch2/hadoop/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/target/test-dir/TestTextInputFormat/test.bz2:3+1
>  count=99
> 2018-02-01 11:18:49,310 WARN [Thread-17] mapred.TestTextInputFormat 
> (TestTextInputFormat.java:verifyPartitions(308)) - conflict with 1 in split 4 
> at position 8
> {code}
>  
> I experienced this error when I execute Spark (SparkSQL) job under the 
> following conditions:
> * The file size of the input files are small (around 1KB)
> * Hadoop cluster has many slave nodes (able to launch many executor tasks)
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-15206) BZip2 drops and duplicates records when input split size is small

2018-02-10 Thread Aki Tanaka (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-15206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aki Tanaka updated HADOOP-15206:

Attachment: HADOOP-15206.004.patch

> BZip2 drops and duplicates records when input split size is small
> -
>
> Key: HADOOP-15206
> URL: https://issues.apache.org/jira/browse/HADOOP-15206
> Project: Hadoop Common
>  Issue Type: Bug
>Affects Versions: 2.8.3, 3.0.0
>Reporter: Aki Tanaka
>Priority: Major
> Attachments: HADOOP-15206-test.patch, HADOOP-15206.001.patch, 
> HADOOP-15206.002.patch, HADOOP-15206.003.patch, HADOOP-15206.004.patch
>
>
> BZip2 can drop and duplicate record when input split file is small. I 
> confirmed that this issue happens when the input split size is between 1byte 
> and 4bytes.
> I am seeing the following 2 problem behaviors.
>  
> 1. Drop record:
> BZip2 skips the first record in the input file when the input split size is 
> small
>  
> Set the split size to 3 and tested to load 100 records (0, 1, 2..99)
> {code:java}
> 2018-02-01 10:52:33,502 INFO  [Thread-17] mapred.TestTextInputFormat 
> (TestTextInputFormat.java:verifyPartitions(317)) - 
> splits[1]=file:/work/count-mismatch2/hadoop/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/target/test-dir/TestTextInputFormat/test.bz2:3+3
>  count=99{code}
> > The input format read only 99 records but not 100 records
>  
> 2. Duplicate Record:
> 2 input splits has same BZip2 records when the input split size is small
>  
> Set the split size to 1 and tested to load 100 records (0, 1, 2..99)
>  
> {code:java}
> 2018-02-01 11:18:49,309 INFO [Thread-17] mapred.TestTextInputFormat 
> (TestTextInputFormat.java:verifyPartitions(318)) - splits[3]=file 
> /work/count-mismatch2/hadoop/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/target/test-dir/TestTextInputFormat/test.bz2:3+1
>  count=99
> 2018-02-01 11:18:49,310 WARN [Thread-17] mapred.TestTextInputFormat 
> (TestTextInputFormat.java:verifyPartitions(308)) - conflict with 1 in split 4 
> at position 8
> {code}
>  
> I experienced this error when I execute Spark (SparkSQL) job under the 
> following conditions:
> * The file size of the input files are small (around 1KB)
> * Hadoop cluster has many slave nodes (able to launch many executor tasks)
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Commented] (HADOOP-15206) BZip2 drops and duplicates records when input split size is small

2018-02-08 Thread Aki Tanaka (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-15206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16357510#comment-16357510
 ] 

Aki Tanaka commented on HADOOP-15206:
-

Added the updated patch. Please let me know if I still misunderstand something.

I made following changes to the original code
 * Not advertising a new byte position when reading from BZip2 Header (position 
0) 

 * Move reading position to right after the BZip2 header (position 5) when the 
position is between 1 and 4

This implementation moves the start position forcibly without checking whether 
the BZ2 file has a header or not. Because I could not determine whether the 
header exists when the start position is 4. However, I think it's safe to move 
the position even if the file does not have a bz2 header because we cannot put 
2 bz2 blocks in the first 4 bytes of the file.

> BZip2 drops and duplicates records when input split size is small
> -
>
> Key: HADOOP-15206
> URL: https://issues.apache.org/jira/browse/HADOOP-15206
> Project: Hadoop Common
>  Issue Type: Bug
>Affects Versions: 2.8.3, 3.0.0
>Reporter: Aki Tanaka
>Priority: Major
> Attachments: HADOOP-15206-test.patch, HADOOP-15206.001.patch, 
> HADOOP-15206.002.patch, HADOOP-15206.003.patch
>
>
> BZip2 can drop and duplicate record when input split file is small. I 
> confirmed that this issue happens when the input split size is between 1byte 
> and 4bytes.
> I am seeing the following 2 problem behaviors.
>  
> 1. Drop record:
> BZip2 skips the first record in the input file when the input split size is 
> small
>  
> Set the split size to 3 and tested to load 100 records (0, 1, 2..99)
> {code:java}
> 2018-02-01 10:52:33,502 INFO  [Thread-17] mapred.TestTextInputFormat 
> (TestTextInputFormat.java:verifyPartitions(317)) - 
> splits[1]=file:/work/count-mismatch2/hadoop/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/target/test-dir/TestTextInputFormat/test.bz2:3+3
>  count=99{code}
> > The input format read only 99 records but not 100 records
>  
> 2. Duplicate Record:
> 2 input splits has same BZip2 records when the input split size is small
>  
> Set the split size to 1 and tested to load 100 records (0, 1, 2..99)
>  
> {code:java}
> 2018-02-01 11:18:49,309 INFO [Thread-17] mapred.TestTextInputFormat 
> (TestTextInputFormat.java:verifyPartitions(318)) - splits[3]=file 
> /work/count-mismatch2/hadoop/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/target/test-dir/TestTextInputFormat/test.bz2:3+1
>  count=99
> 2018-02-01 11:18:49,310 WARN [Thread-17] mapred.TestTextInputFormat 
> (TestTextInputFormat.java:verifyPartitions(308)) - conflict with 1 in split 4 
> at position 8
> {code}
>  
> I experienced this error when I execute Spark (SparkSQL) job under the 
> following conditions:
> * The file size of the input files are small (around 1KB)
> * Hadoop cluster has many slave nodes (able to launch many executor tasks)
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-15206) BZip2 drops and duplicates records when input split size is small

2018-02-08 Thread Aki Tanaka (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-15206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aki Tanaka updated HADOOP-15206:

Attachment: HADOOP-15206.003.patch

> BZip2 drops and duplicates records when input split size is small
> -
>
> Key: HADOOP-15206
> URL: https://issues.apache.org/jira/browse/HADOOP-15206
> Project: Hadoop Common
>  Issue Type: Bug
>Affects Versions: 2.8.3, 3.0.0
>Reporter: Aki Tanaka
>Priority: Major
> Attachments: HADOOP-15206-test.patch, HADOOP-15206.001.patch, 
> HADOOP-15206.002.patch, HADOOP-15206.003.patch
>
>
> BZip2 can drop and duplicate record when input split file is small. I 
> confirmed that this issue happens when the input split size is between 1byte 
> and 4bytes.
> I am seeing the following 2 problem behaviors.
>  
> 1. Drop record:
> BZip2 skips the first record in the input file when the input split size is 
> small
>  
> Set the split size to 3 and tested to load 100 records (0, 1, 2..99)
> {code:java}
> 2018-02-01 10:52:33,502 INFO  [Thread-17] mapred.TestTextInputFormat 
> (TestTextInputFormat.java:verifyPartitions(317)) - 
> splits[1]=file:/work/count-mismatch2/hadoop/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/target/test-dir/TestTextInputFormat/test.bz2:3+3
>  count=99{code}
> > The input format read only 99 records but not 100 records
>  
> 2. Duplicate Record:
> 2 input splits has same BZip2 records when the input split size is small
>  
> Set the split size to 1 and tested to load 100 records (0, 1, 2..99)
>  
> {code:java}
> 2018-02-01 11:18:49,309 INFO [Thread-17] mapred.TestTextInputFormat 
> (TestTextInputFormat.java:verifyPartitions(318)) - splits[3]=file 
> /work/count-mismatch2/hadoop/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/target/test-dir/TestTextInputFormat/test.bz2:3+1
>  count=99
> 2018-02-01 11:18:49,310 WARN [Thread-17] mapred.TestTextInputFormat 
> (TestTextInputFormat.java:verifyPartitions(308)) - conflict with 1 in split 4 
> at position 8
> {code}
>  
> I experienced this error when I execute Spark (SparkSQL) job under the 
> following conditions:
> * The file size of the input files are small (around 1KB)
> * Hadoop cluster has many slave nodes (able to launch many executor tasks)
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Commented] (HADOOP-15206) BZip2 drops and duplicates records when input split size is small

2018-02-06 Thread Aki Tanaka (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-15206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16354819#comment-16354819
 ] 

Aki Tanaka commented on HADOOP-15206:
-

Thank you very much for the comments!
{quote}
This doesn't just drop the first bzip2 block, it drops the entire split. This 
goes back to my previous comment about the code assuming splits that start 
between bytes 1-4 are always tiny. Splits do not have to be equally sized, so 
theoretically there could be just two splits where the first split is a 
two-byte split starting at offset 0 and the other split is the rest of the file.
{quote}
Thank you for explaining the details. I understand the problem.

 
{quote}The logic regarding the header seems backwards. If the header is 
stripped then that means there was a header present, yet the logic is only 
adding up bytes for a header length if it was not stripped which is the case 
when the header is not there.
{quote}
That's right... Thank you for pointing this out.
After some tests, I noticed the following 2 points.

1. When reading from position 1-3 (on bzip2 header), 
isHeaderStripped/isSubHeaderStripped is always false. This is because the 
current readStreamHeader() works only when the start position is 0.

2. I set one byte beyond the start of the first bzip2 block (header_len + 1) to 
the InputStream's start position, but duplicated records issue still happened. 
When I set header_len + 5 (9), we can avoid the problem.

As far as I looked at the test bz2 file using binary editor, the first bz2 
marker starts from position 4 (right after bz2 header). 
Still trying to understand why we need to set header_len + 5.

> BZip2 drops and duplicates records when input split size is small
> -
>
> Key: HADOOP-15206
> URL: https://issues.apache.org/jira/browse/HADOOP-15206
> Project: Hadoop Common
>  Issue Type: Bug
>Affects Versions: 2.8.3, 3.0.0
>Reporter: Aki Tanaka
>Priority: Major
> Attachments: HADOOP-15206-test.patch, HADOOP-15206.001.patch, 
> HADOOP-15206.002.patch
>
>
> BZip2 can drop and duplicate record when input split file is small. I 
> confirmed that this issue happens when the input split size is between 1byte 
> and 4bytes.
> I am seeing the following 2 problem behaviors.
>  
> 1. Drop record:
> BZip2 skips the first record in the input file when the input split size is 
> small
>  
> Set the split size to 3 and tested to load 100 records (0, 1, 2..99)
> {code:java}
> 2018-02-01 10:52:33,502 INFO  [Thread-17] mapred.TestTextInputFormat 
> (TestTextInputFormat.java:verifyPartitions(317)) - 
> splits[1]=file:/work/count-mismatch2/hadoop/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/target/test-dir/TestTextInputFormat/test.bz2:3+3
>  count=99{code}
> > The input format read only 99 records but not 100 records
>  
> 2. Duplicate Record:
> 2 input splits has same BZip2 records when the input split size is small
>  
> Set the split size to 1 and tested to load 100 records (0, 1, 2..99)
>  
> {code:java}
> 2018-02-01 11:18:49,309 INFO [Thread-17] mapred.TestTextInputFormat 
> (TestTextInputFormat.java:verifyPartitions(318)) - splits[3]=file 
> /work/count-mismatch2/hadoop/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/target/test-dir/TestTextInputFormat/test.bz2:3+1
>  count=99
> 2018-02-01 11:18:49,310 WARN [Thread-17] mapred.TestTextInputFormat 
> (TestTextInputFormat.java:verifyPartitions(308)) - conflict with 1 in split 4 
> at position 8
> {code}
>  
> I experienced this error when I execute Spark (SparkSQL) job under the 
> following conditions:
> * The file size of the input files are small (around 1KB)
> * Hadoop cluster has many slave nodes (able to launch many executor tasks)
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Commented] (HADOOP-15206) BZip2 drops and duplicates records when input split size is small

2018-02-06 Thread Aki Tanaka (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-15206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16354317#comment-16354317
 ] 

Aki Tanaka commented on HADOOP-15206:
-

Thank you for the comment! I've updated the patch.

 
>The code above is making the following assumptions that I believe could not be 
>true in some cases:
>The bzip2 file header is always present at starting pos 0. I think it should 
>be checking isHeaderStripped/isSubHeaderStripped.
>If the split starts after byte 0 but before byte 5 then it must also end on or 
>before byte 5. (Splits are not required to be equally sized.)
 
Thank you for pointing these out. I've modified the patch code and uploaded to 
the Jira.

 
> If the bzip2 file header is four bytes, why is the condition {{<= 4}} instead 
> of {{< 4}}? Should this code leverage HEADER_LEN and SUB_HEADER_LEN here?

When we changed the condition to < 4, the first bz2 block will duplicate. 
Because 4 is a position of the first bz2 block marker, and an input stream will 
start reading the first bz2 block if the start position of the input stream is 
4. 

The concept of my patch is changing the position of the first BZ2 block marker 
from 4 (when the bzip2 file header is present) to 0 forcibly. So, if the input 
stream tries to read from position 1-4, it will drop the first BZ2 block even 
though the block marker position is 4. I am not sure whether this is the best 
solution but I could not come up with another idea.

> BZip2 drops and duplicates records when input split size is small
> -
>
> Key: HADOOP-15206
> URL: https://issues.apache.org/jira/browse/HADOOP-15206
> Project: Hadoop Common
>  Issue Type: Bug
>Affects Versions: 2.8.3, 3.0.0
>Reporter: Aki Tanaka
>Priority: Major
> Attachments: HADOOP-15206-test.patch, HADOOP-15206.001.patch, 
> HADOOP-15206.002.patch
>
>
> BZip2 can drop and duplicate record when input split file is small. I 
> confirmed that this issue happens when the input split size is between 1byte 
> and 4bytes.
> I am seeing the following 2 problem behaviors.
>  
> 1. Drop record:
> BZip2 skips the first record in the input file when the input split size is 
> small
>  
> Set the split size to 3 and tested to load 100 records (0, 1, 2..99)
> {code:java}
> 2018-02-01 10:52:33,502 INFO  [Thread-17] mapred.TestTextInputFormat 
> (TestTextInputFormat.java:verifyPartitions(317)) - 
> splits[1]=file:/work/count-mismatch2/hadoop/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/target/test-dir/TestTextInputFormat/test.bz2:3+3
>  count=99{code}
> > The input format read only 99 records but not 100 records
>  
> 2. Duplicate Record:
> 2 input splits has same BZip2 records when the input split size is small
>  
> Set the split size to 1 and tested to load 100 records (0, 1, 2..99)
>  
> {code:java}
> 2018-02-01 11:18:49,309 INFO [Thread-17] mapred.TestTextInputFormat 
> (TestTextInputFormat.java:verifyPartitions(318)) - splits[3]=file 
> /work/count-mismatch2/hadoop/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/target/test-dir/TestTextInputFormat/test.bz2:3+1
>  count=99
> 2018-02-01 11:18:49,310 WARN [Thread-17] mapred.TestTextInputFormat 
> (TestTextInputFormat.java:verifyPartitions(308)) - conflict with 1 in split 4 
> at position 8
> {code}
>  
> I experienced this error when I execute Spark (SparkSQL) job under the 
> following conditions:
> * The file size of the input files are small (around 1KB)
> * Hadoop cluster has many slave nodes (able to launch many executor tasks)
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-15206) BZip2 drops and duplicates records when input split size is small

2018-02-06 Thread Aki Tanaka (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-15206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aki Tanaka updated HADOOP-15206:

Attachment: HADOOP-15206.002.patch

> BZip2 drops and duplicates records when input split size is small
> -
>
> Key: HADOOP-15206
> URL: https://issues.apache.org/jira/browse/HADOOP-15206
> Project: Hadoop Common
>  Issue Type: Bug
>Affects Versions: 2.8.3, 3.0.0
>Reporter: Aki Tanaka
>Priority: Major
> Attachments: HADOOP-15206-test.patch, HADOOP-15206.001.patch, 
> HADOOP-15206.002.patch
>
>
> BZip2 can drop and duplicate record when input split file is small. I 
> confirmed that this issue happens when the input split size is between 1byte 
> and 4bytes.
> I am seeing the following 2 problem behaviors.
>  
> 1. Drop record:
> BZip2 skips the first record in the input file when the input split size is 
> small
>  
> Set the split size to 3 and tested to load 100 records (0, 1, 2..99)
> {code:java}
> 2018-02-01 10:52:33,502 INFO  [Thread-17] mapred.TestTextInputFormat 
> (TestTextInputFormat.java:verifyPartitions(317)) - 
> splits[1]=file:/work/count-mismatch2/hadoop/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/target/test-dir/TestTextInputFormat/test.bz2:3+3
>  count=99{code}
> > The input format read only 99 records but not 100 records
>  
> 2. Duplicate Record:
> 2 input splits has same BZip2 records when the input split size is small
>  
> Set the split size to 1 and tested to load 100 records (0, 1, 2..99)
>  
> {code:java}
> 2018-02-01 11:18:49,309 INFO [Thread-17] mapred.TestTextInputFormat 
> (TestTextInputFormat.java:verifyPartitions(318)) - splits[3]=file 
> /work/count-mismatch2/hadoop/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/target/test-dir/TestTextInputFormat/test.bz2:3+1
>  count=99
> 2018-02-01 11:18:49,310 WARN [Thread-17] mapred.TestTextInputFormat 
> (TestTextInputFormat.java:verifyPartitions(308)) - conflict with 1 in split 4 
> at position 8
> {code}
>  
> I experienced this error when I execute Spark (SparkSQL) job under the 
> following conditions:
> * The file size of the input files are small (around 1KB)
> * Hadoop cluster has many slave nodes (able to launch many executor tasks)
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Commented] (HADOOP-15206) BZip2 drops and duplicates records when input split size is small

2018-02-05 Thread Aki Tanaka (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-15206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16353510#comment-16353510
 ] 

Aki Tanaka commented on HADOOP-15206:
-

[~jlowe]

Thank you for your insights. I have created a patch based on your comment.

As far as I tested, all the unit tests passed and I confirmed that the issue I 
was seeing was solved.

 

I greatly appreciate any and someone take a look. Alternative proposals are 
also very welcome.

 

 

Regarding the duplicated record scenario, the record was read twice when 
BZip2Codec starts reading at position 0 (BZip2 header) and position 4 (first 
BZip2 marker).

test.bz2:0+1 -> read 100 records

test.bz2:3+4 -> read 99 records

 

2018-02-05 20:49:51,598 ERROR [Thread-3] mapred.TestTextInputFormat2 
(TestTextInputFormat2.java:verifyPartitions(324)) - 
splits[0]=file:/Users/tanakah/work/count-mismatch2/hadoop/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/target/test-dir/TestTextInputFormat/test.bz2:0+1
 count=100
2018-02-05 20:49:51,605 ERROR [Thread-3] mapred.TestTextInputFormat2 
(TestTextInputFormat2.java:verifyPartitions(326)) - 
splits[1]=file:/Users/tanakah/work/count-mismatch2/hadoop/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/target/test-dir/TestTextInputFormat/test.bz2:1+1
 count=0
2018-02-05 20:49:51,608 ERROR [Thread-3] mapred.TestTextInputFormat2 
(TestTextInputFormat2.java:verifyPartitions(326)) - 
splits[2]=file:/Users/tanakah/work/count-mismatch2/hadoop/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/target/test-dir/TestTextInputFormat/test.bz2:2+1
 count=0

2018-02-05 20:49:51,614 ERROR [Thread-3] mapred.TestTextInputFormat2 
(TestTextInputFormat2.java:verifyPartitions(313)) - read 1
2018-02-05 20:49:51,617 WARN  [Thread-3] mapred.TestTextInputFormat2 
(TestTextInputFormat2.java:verifyPartitions(315)) - conflict with 1 in split 3 
at position 7

 

 

> BZip2 drops and duplicates records when input split size is small
> -
>
> Key: HADOOP-15206
> URL: https://issues.apache.org/jira/browse/HADOOP-15206
> Project: Hadoop Common
>  Issue Type: Bug
>Affects Versions: 2.8.3, 3.0.0
>Reporter: Aki Tanaka
>Priority: Major
> Attachments: HADOOP-15206-test.patch, HADOOP-15206.001.patch
>
>
> BZip2 can drop and duplicate record when input split file is small. I 
> confirmed that this issue happens when the input split size is between 1byte 
> and 4bytes.
> I am seeing the following 2 problem behaviors.
>  
> 1. Drop record:
> BZip2 skips the first record in the input file when the input split size is 
> small
>  
> Set the split size to 3 and tested to load 100 records (0, 1, 2..99)
> {code:java}
> 2018-02-01 10:52:33,502 INFO  [Thread-17] mapred.TestTextInputFormat 
> (TestTextInputFormat.java:verifyPartitions(317)) - 
> splits[1]=file:/work/count-mismatch2/hadoop/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/target/test-dir/TestTextInputFormat/test.bz2:3+3
>  count=99{code}
> > The input format read only 99 records but not 100 records
>  
> 2. Duplicate Record:
> 2 input splits has same BZip2 records when the input split size is small
>  
> Set the split size to 1 and tested to load 100 records (0, 1, 2..99)
>  
> {code:java}
> 2018-02-01 11:18:49,309 INFO [Thread-17] mapred.TestTextInputFormat 
> (TestTextInputFormat.java:verifyPartitions(318)) - splits[3]=file 
> /work/count-mismatch2/hadoop/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/target/test-dir/TestTextInputFormat/test.bz2:3+1
>  count=99
> 2018-02-01 11:18:49,310 WARN [Thread-17] mapred.TestTextInputFormat 
> (TestTextInputFormat.java:verifyPartitions(308)) - conflict with 1 in split 4 
> at position 8
> {code}
>  
> I experienced this error when I execute Spark (SparkSQL) job under the 
> following conditions:
> * The file size of the input files are small (around 1KB)
> * Hadoop cluster has many slave nodes (able to launch many executor tasks)
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-15206) BZip2 drops and duplicates records when input split size is small

2018-02-05 Thread Aki Tanaka (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-15206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aki Tanaka updated HADOOP-15206:

Attachment: HADOOP-15206.001.patch

> BZip2 drops and duplicates records when input split size is small
> -
>
> Key: HADOOP-15206
> URL: https://issues.apache.org/jira/browse/HADOOP-15206
> Project: Hadoop Common
>  Issue Type: Bug
>Affects Versions: 2.8.3, 3.0.0
>Reporter: Aki Tanaka
>Priority: Major
> Attachments: HADOOP-15206-test.patch, HADOOP-15206.001.patch
>
>
> BZip2 can drop and duplicate record when input split file is small. I 
> confirmed that this issue happens when the input split size is between 1byte 
> and 4bytes.
> I am seeing the following 2 problem behaviors.
>  
> 1. Drop record:
> BZip2 skips the first record in the input file when the input split size is 
> small
>  
> Set the split size to 3 and tested to load 100 records (0, 1, 2..99)
> {code:java}
> 2018-02-01 10:52:33,502 INFO  [Thread-17] mapred.TestTextInputFormat 
> (TestTextInputFormat.java:verifyPartitions(317)) - 
> splits[1]=file:/work/count-mismatch2/hadoop/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/target/test-dir/TestTextInputFormat/test.bz2:3+3
>  count=99{code}
> > The input format read only 99 records but not 100 records
>  
> 2. Duplicate Record:
> 2 input splits has same BZip2 records when the input split size is small
>  
> Set the split size to 1 and tested to load 100 records (0, 1, 2..99)
>  
> {code:java}
> 2018-02-01 11:18:49,309 INFO [Thread-17] mapred.TestTextInputFormat 
> (TestTextInputFormat.java:verifyPartitions(318)) - splits[3]=file 
> /work/count-mismatch2/hadoop/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/target/test-dir/TestTextInputFormat/test.bz2:3+1
>  count=99
> 2018-02-01 11:18:49,310 WARN [Thread-17] mapred.TestTextInputFormat 
> (TestTextInputFormat.java:verifyPartitions(308)) - conflict with 1 in split 4 
> at position 8
> {code}
>  
> I experienced this error when I execute Spark (SparkSQL) job under the 
> following conditions:
> * The file size of the input files are small (around 1KB)
> * Hadoop cluster has many slave nodes (able to launch many executor tasks)
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Commented] (HADOOP-14919) BZip2 drops records when reading data in splits

2018-02-01 Thread Aki Tanaka (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-14919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16349239#comment-16349239
 ] 

Aki Tanaka commented on HADOOP-14919:
-

Hello,

Regarding this issue, I found another corner case of BZip2 codec/input stream 
behavior. Created HADOOP-15206 .

I'd like someone in this thread look at HADOOP-15206 too. Thank you.

> BZip2 drops records when reading data in splits
> ---
>
> Key: HADOOP-14919
> URL: https://issues.apache.org/jira/browse/HADOOP-14919
> Project: Hadoop Common
>  Issue Type: Bug
>Affects Versions: 2.8.0, 2.7.3, 3.0.0-alpha1
>Reporter: Aki Tanaka
>Assignee: Jason Lowe
>Priority: Critical
> Fix For: 2.9.0, 2.8.3, 2.7.5, 3.0.0
>
> Attachments: 25.bz2, HADOOP-14919-test.patch, 
> HADOOP-14919.001.patch
>
>
> BZip2 can drop records when reading data in splits. This problem was already 
> discussed before in HADOOP-11445 and HADOOP-13270. But we still have a 
> problem in corner case, causing lost data blocks.
>  
> I attached a unit test for this issue. You can reproduce the problem if you 
> run the unit test.
>  
> First, this issue happens when position of newly created stream is equal to 
> start of split. Hadoop has some test cases for this (blockEndingInCR.txt.bz2 
> file for TestLineRecordReader#testBzip2SplitStartAtBlockMarker, etc). 
> However, the issue I am reporting does not happen when we run these tests 
> because this issue happens only when the start of split byte block includes 
> both block marker and compressed data.
>  
> BZip2 block marker - 0x314159265359 
> (00110001010101011001001001100101001101011001)
>  
> blockEndingInCR.txt.bz2 (Start of Split - 136504):
> {code:java}
> $ xxd -l 6 -g 1 -b -seek 136498 
> ./hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/target/test-classes/blockEndingInCR.txt.bz2
> 0021532: 00110001 0101 01011001 00100110 01010011 01011001  1AY
> {code}
>  
> Test bz2 File (Start of Split - 203426)
> {code:java}
> $ xxd -l 7 -g 1 -b -seek 203419 25.bz2
> 0031a9b: 11100110 00101000 00101011 00100100 11001010 01101011  .(+$.k
> 0031aa1: 0010   /
> {code}
>  
> Let's say a job splits this test bz2 file into two splits at the start of 
> split (position 203426).
> The former split does not read records which start position 203426 because 
> BZip2 says the position of these dropped records is 203427. The latter split 
> does not read the records because BZip2CompressionInputStream read the block 
> from position 320955.
> Due to this behavior, records between 203427 and 320955 are lost.
> Also, if we reverted the changes in HADOOP-13270, we will not see this issue. 
> We will see HADOOP-13270 issue though.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-15206) BZip2 drops and duplicates records when input split size is small

2018-02-01 Thread Aki Tanaka (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-15206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aki Tanaka updated HADOOP-15206:

Attachment: HADOOP-15206-test.patch

> BZip2 drops and duplicates records when input split size is small
> -
>
> Key: HADOOP-15206
> URL: https://issues.apache.org/jira/browse/HADOOP-15206
> Project: Hadoop Common
>  Issue Type: Bug
>Affects Versions: 2.8.3, 3.0.0
>Reporter: Aki Tanaka
>Priority: Major
> Attachments: HADOOP-15206-test.patch
>
>
> BZip2 can drop and duplicate record when input split file is small. I 
> confirmed that this issue happens when the input split size is between 1byte 
> and 4bytes.
> I am seeing the following 2 problem behaviors.
>  
> 1. Drop record:
> BZip2 skips the first record in the input file when the input split size is 
> small
>  
> Set the split size to 3 and tested to load 100 records (0, 1, 2..99)
> {code:java}
> 2018-02-01 10:52:33,502 INFO  [Thread-17] mapred.TestTextInputFormat 
> (TestTextInputFormat.java:verifyPartitions(317)) - 
> splits[1]=file:/work/count-mismatch2/hadoop/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/target/test-dir/TestTextInputFormat/test.bz2:3+3
>  count=99{code}
> > The input format read only 99 records but not 100 records
>  
> 2. Duplicate Record:
> 2 input splits has same BZip2 records when the input split size is small
>  
> Set the split size to 1 and tested to load 100 records (0, 1, 2..99)
>  
> {code:java}
> 2018-02-01 11:18:49,309 INFO [Thread-17] mapred.TestTextInputFormat 
> (TestTextInputFormat.java:verifyPartitions(318)) - splits[3]=file 
> /work/count-mismatch2/hadoop/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/target/test-dir/TestTextInputFormat/test.bz2:3+1
>  count=99
> 2018-02-01 11:18:49,310 WARN [Thread-17] mapred.TestTextInputFormat 
> (TestTextInputFormat.java:verifyPartitions(308)) - conflict with 1 in split 4 
> at position 8
> {code}
>  
> I experienced this error when I execute Spark (SparkSQL) job under the 
> following conditions:
> * The file size of the input files are small (around 1KB)
> * Hadoop cluster has many slave nodes (able to launch many executor tasks)
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Commented] (HADOOP-15206) BZip2 drops and duplicates records when input split size is small

2018-02-01 Thread Aki Tanaka (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-15206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16349234#comment-16349234
 ] 

Aki Tanaka commented on HADOOP-15206:
-

Added a unit test that can reproduce the problem.

> BZip2 drops and duplicates records when input split size is small
> -
>
> Key: HADOOP-15206
> URL: https://issues.apache.org/jira/browse/HADOOP-15206
> Project: Hadoop Common
>  Issue Type: Bug
>Affects Versions: 2.8.3, 3.0.0
>Reporter: Aki Tanaka
>Priority: Major
> Attachments: HADOOP-15206-test.patch
>
>
> BZip2 can drop and duplicate record when input split file is small. I 
> confirmed that this issue happens when the input split size is between 1byte 
> and 4bytes.
> I am seeing the following 2 problem behaviors.
>  
> 1. Drop record:
> BZip2 skips the first record in the input file when the input split size is 
> small
>  
> Set the split size to 3 and tested to load 100 records (0, 1, 2..99)
> {code:java}
> 2018-02-01 10:52:33,502 INFO  [Thread-17] mapred.TestTextInputFormat 
> (TestTextInputFormat.java:verifyPartitions(317)) - 
> splits[1]=file:/work/count-mismatch2/hadoop/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/target/test-dir/TestTextInputFormat/test.bz2:3+3
>  count=99{code}
> > The input format read only 99 records but not 100 records
>  
> 2. Duplicate Record:
> 2 input splits has same BZip2 records when the input split size is small
>  
> Set the split size to 1 and tested to load 100 records (0, 1, 2..99)
>  
> {code:java}
> 2018-02-01 11:18:49,309 INFO [Thread-17] mapred.TestTextInputFormat 
> (TestTextInputFormat.java:verifyPartitions(318)) - splits[3]=file 
> /work/count-mismatch2/hadoop/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/target/test-dir/TestTextInputFormat/test.bz2:3+1
>  count=99
> 2018-02-01 11:18:49,310 WARN [Thread-17] mapred.TestTextInputFormat 
> (TestTextInputFormat.java:verifyPartitions(308)) - conflict with 1 in split 4 
> at position 8
> {code}
>  
> I experienced this error when I execute Spark (SparkSQL) job under the 
> following conditions:
> * The file size of the input files are small (around 1KB)
> * Hadoop cluster has many slave nodes (able to launch many executor tasks)
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Created] (HADOOP-15206) BZip2 drops and duplicates records when input split size is small

2018-02-01 Thread Aki Tanaka (JIRA)
Aki Tanaka created HADOOP-15206:
---

 Summary: BZip2 drops and duplicates records when input split size 
is small
 Key: HADOOP-15206
 URL: https://issues.apache.org/jira/browse/HADOOP-15206
 Project: Hadoop Common
  Issue Type: Bug
Affects Versions: 3.0.0, 2.8.3
Reporter: Aki Tanaka


BZip2 can drop and duplicate record when input split file is small. I confirmed 
that this issue happens when the input split size is between 1byte and 4bytes.

I am seeing the following 2 problem behaviors.

 

1. Drop record:

BZip2 skips the first record in the input file when the input split size is 
small

 

Set the split size to 3 and tested to load 100 records (0, 1, 2..99)
{code:java}
2018-02-01 10:52:33,502 INFO  [Thread-17] mapred.TestTextInputFormat 
(TestTextInputFormat.java:verifyPartitions(317)) - 
splits[1]=file:/work/count-mismatch2/hadoop/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/target/test-dir/TestTextInputFormat/test.bz2:3+3
 count=99{code}
> The input format read only 99 records but not 100 records

 

2. Duplicate Record:

2 input splits has same BZip2 records when the input split size is small

 

Set the split size to 1 and tested to load 100 records (0, 1, 2..99)

 
{code:java}
2018-02-01 11:18:49,309 INFO [Thread-17] mapred.TestTextInputFormat 
(TestTextInputFormat.java:verifyPartitions(318)) - splits[3]=file 
/work/count-mismatch2/hadoop/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/target/test-dir/TestTextInputFormat/test.bz2:3+1
 count=99
2018-02-01 11:18:49,310 WARN [Thread-17] mapred.TestTextInputFormat 
(TestTextInputFormat.java:verifyPartitions(308)) - conflict with 1 in split 4 
at position 8
{code}
 

I experienced this error when I execute Spark (SparkSQL) job under the 
following conditions:

* The file size of the input files are small (around 1KB)

* Hadoop cluster has many slave nodes (able to launch many executor tasks)

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Commented] (HADOOP-14919) BZip2 drops records when reading data in splits

2017-10-05 Thread Aki Tanaka (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-14919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16193075#comment-16193075
 ] 

Aki Tanaka commented on HADOOP-14919:
-

Thank you for the patch. I tested the patch and confirmed that the patch can 
fix the issues we saw in our production environment.
As far as I tested, I did not see any regressions or new issues.

> BZip2 drops records when reading data in splits
> ---
>
> Key: HADOOP-14919
> URL: https://issues.apache.org/jira/browse/HADOOP-14919
> Project: Hadoop Common
>  Issue Type: Bug
>Affects Versions: 2.8.0, 2.7.3, 3.0.0-alpha1
>Reporter: Aki Tanaka
>Assignee: Jason Lowe
>Priority: Critical
> Attachments: 25.bz2, HADOOP-14919.001.patch, 
> HADOOP-14919-test.patch
>
>
> BZip2 can drop records when reading data in splits. This problem was already 
> discussed before in HADOOP-11445 and HADOOP-13270. But we still have a 
> problem in corner case, causing lost data blocks.
>  
> I attached a unit test for this issue. You can reproduce the problem if you 
> run the unit test.
>  
> First, this issue happens when position of newly created stream is equal to 
> start of split. Hadoop has some test cases for this (blockEndingInCR.txt.bz2 
> file for TestLineRecordReader#testBzip2SplitStartAtBlockMarker, etc). 
> However, the issue I am reporting does not happen when we run these tests 
> because this issue happens only when the start of split byte block includes 
> both block marker and compressed data.
>  
> BZip2 block marker - 0x314159265359 
> (00110001010101011001001001100101001101011001)
>  
> blockEndingInCR.txt.bz2 (Start of Split - 136504):
> {code:java}
> $ xxd -l 6 -g 1 -b -seek 136498 
> ./hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/target/test-classes/blockEndingInCR.txt.bz2
> 0021532: 00110001 0101 01011001 00100110 01010011 01011001  1AY
> {code}
>  
> Test bz2 File (Start of Split - 203426)
> {code:java}
> $ xxd -l 7 -g 1 -b -seek 203419 25.bz2
> 0031a9b: 11100110 00101000 00101011 00100100 11001010 01101011  .(+$.k
> 0031aa1: 0010   /
> {code}
>  
> Let's say a job splits this test bz2 file into two splits at the start of 
> split (position 203426).
> The former split does not read records which start position 203426 because 
> BZip2 says the position of these dropped records is 203427. The latter split 
> does not read the records because BZip2CompressionInputStream read the block 
> from position 320955.
> Due to this behavior, records between 203427 and 320955 are lost.
> Also, if we reverted the changes in HADOOP-13270, we will not see this issue. 
> We will see HADOOP-13270 issue though.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-14919) BZip2 drops records when reading data in splits

2017-10-02 Thread Aki Tanaka (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-14919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aki Tanaka updated HADOOP-14919:

Description: 
BZip2 can drop records when reading data in splits. This problem was already 
discussed before in HADOOP-11445 and HADOOP-13270. But we still have a problem 
in corner case, causing lost data blocks.
 
I attached a unit test for this issue. You can reproduce the problem if you run 
the unit test.
 
First, this issue happens when position of newly created stream is equal to 
start of split. Hadoop has some test cases for this (blockEndingInCR.txt.bz2 
file for TestLineRecordReader#testBzip2SplitStartAtBlockMarker, etc). However, 
the issue I am reporting does not happen when we run these tests because this 
issue happens only when the start of split byte block includes both block 
marker and compressed data.
 
BZip2 block marker - 0x314159265359 
(00110001010101011001001001100101001101011001)
 
blockEndingInCR.txt.bz2 (Start of Split - 136504):
{code:java}
$ xxd -l 6 -g 1 -b -seek 136498 
./hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/target/test-classes/blockEndingInCR.txt.bz2
0021532: 00110001 0101 01011001 00100110 01010011 01011001  1AY
{code}

 
Test bz2 File (Start of Split - 203426)
{code:java}
$ xxd -l 7 -g 1 -b -seek 203419 25.bz2
0031a9b: 11100110 00101000 00101011 00100100 11001010 01101011  .(+$.k
0031aa1: 0010   /
{code}

 
Let's say a job splits this test bz2 file into two splits at the start of split 
(position 203426).
The former split does not read records which start position 203426 because 
BZip2 says the position of these dropped records is 203427. The latter split 
does not read the records because BZip2CompressionInputStream read the block 
from position 320955.
Due to this behavior, records between 203427 and 320955 are lost.

Also, if we reverted the changes in HADOOP-13270, we will not see this issue. 
We will see HADOOP-13270 issue though.

  was:
BZip2 can drop records when reading data in splits. This problem was already 
discussed before in HADOOP-11445 and HADOOP-13270. But we still have a problem 
in corner case, causing lost data blocks.
 
I attached a unit test for this issue. You can reproduce the problem if you run 
the unit test.
 
First, this issue happens when position of newly created stream is equal to 
start of split. Hadoop has some test cases for this (blockEndingInCR.txt.bz2 
file for TestLineRecordReader#testBzip2SplitStartAtBlockMarker, etc). However, 
the issue I am reporting does not happen when we run these tests because this 
issue happens only when the start of split byte block includes both block 
marker and compressed data.
 
BZip2 block marker - 0x314159265359 
(00110001010101011001001001100101001101011001)
 
blockEndingInCR.txt.bz2 (Start of Split - 136504):
{code:java}
$ xxd -l 6 -g 1 -b -seek 136498 
./hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/target/test-classes/blockEndingInCR.txt.bz2
0021532: 00110001 0101 01011001 00100110 01010011 01011001  1AY
{code}

 
Test bz2 File (Start of Split - 203426)
{code:java}
$ xxd -l 7 -g 1 -b -seek 203419 25.bz2
0031a9b: 11100110 00101000 00101011 00100100 11001010 01101011  .(+$.k
0031aa1: 0010   /
{code}

 
Let's say a job splits this test bz2 file into two splits at the start of split 
(position 203426).
The former split does not read records which start position 203426 because 
BZip2 says the position of these dropped records is 203427. The latter split 
does not read the records because BZip2CompressionInputStream read the block 
from position 320955.
Due to this behavior, records between 203427 and 320955 are lost.



> BZip2 drops records when reading data in splits
> ---
>
> Key: HADOOP-14919
> URL: https://issues.apache.org/jira/browse/HADOOP-14919
> Project: Hadoop Common
>  Issue Type: Bug
>Reporter: Aki Tanaka
> Attachments: 25.bz2, HADOOP-14919-test.patch
>
>
> BZip2 can drop records when reading data in splits. This problem was already 
> discussed before in HADOOP-11445 and HADOOP-13270. But we still have a 
> problem in corner case, causing lost data blocks.
>  
> I attached a unit test for this issue. You can reproduce the problem if you 
> run the unit test.
>  
> First, this issue happens when position of newly created stream is equal to 
> start of split. Hadoop has some test cases for this (blockEndingInCR.txt.bz2 
> file for TestLineRecordReader#testBzip2SplitStartAtBlockMarker, etc). 
> However, the issue I am reporting does not happen when we run these tests 
> because this issue happens only when the start of split byte block includes 
> both block marker and compressed data.

[jira] [Updated] (HADOOP-14919) BZip2 drops records when reading data in splits

2017-10-02 Thread Aki Tanaka (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-14919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aki Tanaka updated HADOOP-14919:

Attachment: 25.bz2

Adding the test bz2 file (The bz2 file that the attached unit test generates)

> BZip2 drops records when reading data in splits
> ---
>
> Key: HADOOP-14919
> URL: https://issues.apache.org/jira/browse/HADOOP-14919
> Project: Hadoop Common
>  Issue Type: Bug
>Reporter: Aki Tanaka
> Attachments: 25.bz2, HADOOP-14919-test.patch
>
>
> BZip2 can drop records when reading data in splits. This problem was already 
> discussed before in HADOOP-11445 and HADOOP-13270. But we still have a 
> problem in corner case, causing lost data blocks.
>  
> I attached a unit test for this issue. You can reproduce the problem if you 
> run the unit test.
>  
> First, this issue happens when position of newly created stream is equal to 
> start of split. Hadoop has some test cases for this (blockEndingInCR.txt.bz2 
> file for TestLineRecordReader#testBzip2SplitStartAtBlockMarker, etc). 
> However, the issue I am reporting does not happen when we run these tests 
> because this issue happens only when the start of split byte block includes 
> both block marker and compressed data.
>  
> BZip2 block marker - 0x314159265359 
> (00110001010101011001001001100101001101011001)
>  
> blockEndingInCR.txt.bz2 (Start of Split - 136504):
> {code:java}
> $ xxd -l 6 -g 1 -b -seek 136498 
> ./hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/target/test-classes/blockEndingInCR.txt.bz2
> 0021532: 00110001 0101 01011001 00100110 01010011 01011001  1AY
> {code}
>  
> Test bz2 File (Start of Split - 203426)
> {code:java}
> $ xxd -l 7 -g 1 -b -seek 203419 25.bz2
> 0031a9b: 11100110 00101000 00101011 00100100 11001010 01101011  .(+$.k
> 0031aa1: 0010   /
> {code}
>  
> Let's say a job splits this test bz2 file into two splits at the start of 
> split (position 203426).
> The former split does not read records which start position 203426 because 
> BZip2 says the position of these dropped records is 203427. The latter split 
> does not read the records because BZip2CompressionInputStream read the block 
> from position 320955.
> Due to this behavior, records between 203427 and 320955 are lost.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-14919) BZip2 drops records when reading data in splits

2017-10-02 Thread Aki Tanaka (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-14919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aki Tanaka updated HADOOP-14919:

Attachment: HADOOP-14919-test.patch

Add patch for the unit test.

> BZip2 drops records when reading data in splits
> ---
>
> Key: HADOOP-14919
> URL: https://issues.apache.org/jira/browse/HADOOP-14919
> Project: Hadoop Common
>  Issue Type: Bug
>Reporter: Aki Tanaka
> Attachments: HADOOP-14919-test.patch
>
>
> BZip2 can drop records when reading data in splits. This problem was already 
> discussed before in HADOOP-11445 and HADOOP-13270. But we still have a 
> problem in corner case, causing lost data blocks.
>  
> I attached a unit test for this issue. You can reproduce the problem if you 
> run the unit test.
>  
> First, this issue happens when position of newly created stream is equal to 
> start of split. Hadoop has some test cases for this (blockEndingInCR.txt.bz2 
> file for TestLineRecordReader#testBzip2SplitStartAtBlockMarker, etc). 
> However, the issue I am reporting does not happen when we run these tests 
> because this issue happens only when the start of split byte block includes 
> both block marker and compressed data.
>  
> BZip2 block marker - 0x314159265359 
> (00110001010101011001001001100101001101011001)
>  
> blockEndingInCR.txt.bz2 (Start of Split - 136504):
> {code:java}
> $ xxd -l 6 -g 1 -b -seek 136498 
> ./hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/target/test-classes/blockEndingInCR.txt.bz2
> 0021532: 00110001 0101 01011001 00100110 01010011 01011001  1AY
> {code}
>  
> Test bz2 File (Start of Split - 203426)
> {code:java}
> $ xxd -l 7 -g 1 -b -seek 203419 25.bz2
> 0031a9b: 11100110 00101000 00101011 00100100 11001010 01101011  .(+$.k
> 0031aa1: 0010   /
> {code}
>  
> Let's say a job splits this test bz2 file into two splits at the start of 
> split (position 203426).
> The former split does not read records which start position 203426 because 
> BZip2 says the position of these dropped records is 203427. The latter split 
> does not read the records because BZip2CompressionInputStream read the block 
> from position 320955.
> Due to this behavior, records between 203427 and 320955 are lost.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Created] (HADOOP-14919) BZip2 drops records when reading data in splits

2017-10-02 Thread Aki Tanaka (JIRA)
Aki Tanaka created HADOOP-14919:
---

 Summary: BZip2 drops records when reading data in splits
 Key: HADOOP-14919
 URL: https://issues.apache.org/jira/browse/HADOOP-14919
 Project: Hadoop Common
  Issue Type: Bug
Reporter: Aki Tanaka


BZip2 can drop records when reading data in splits. This problem was already 
discussed before in HADOOP-11445 and HADOOP-13270. But we still have a problem 
in corner case, causing lost data blocks.
 
I attached a unit test for this issue. You can reproduce the problem if you run 
the unit test.
 
First, this issue happens when position of newly created stream is equal to 
start of split. Hadoop has some test cases for this (blockEndingInCR.txt.bz2 
file for TestLineRecordReader#testBzip2SplitStartAtBlockMarker, etc). However, 
the issue I am reporting does not happen when we run these tests because this 
issue happens only when the start of split byte block includes both block 
marker and compressed data.
 
BZip2 block marker - 0x314159265359 
(00110001010101011001001001100101001101011001)
 
blockEndingInCR.txt.bz2 (Start of Split - 136504):
{code:java}
$ xxd -l 6 -g 1 -b -seek 136498 
./hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/target/test-classes/blockEndingInCR.txt.bz2
0021532: 00110001 0101 01011001 00100110 01010011 01011001  1AY
{code}

 
Test bz2 File (Start of Split - 203426)
{code:java}
$ xxd -l 7 -g 1 -b -seek 203419 25.bz2
0031a9b: 11100110 00101000 00101011 00100100 11001010 01101011  .(+$.k
0031aa1: 0010   /
{code}

 
Let's say a job splits this test bz2 file into two splits at the start of split 
(position 203426).
The former split does not read records which start position 203426 because 
BZip2 says the position of these dropped records is 203427. The latter split 
does not read the records because BZip2CompressionInputStream read the block 
from position 320955.
Due to this behavior, records between 203427 and 320955 are lost.




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org