[jira] [Updated] (HADOOP-14919) BZip2 drops records when reading data in splits

2017-10-31 Thread Jason Lowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-14919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Lowe updated HADOOP-14919:

       Resolution: Fixed
     Hadoop Flags: Reviewed
    Fix Version/s: 3.0.0
                   2.7.5
                   2.8.3
                   2.9.0
           Status: Resolved  (was: Patch Available)

Thanks to [~chris.douglas], [~ajisakaa], and [~tanakahda] for reviews!  I 
committed this to trunk, branch-3.0, branch-2, branch-2.8, and branch-2.7.


> BZip2 drops records when reading data in splits
> -----------------------------------------------
>
>                 Key: HADOOP-14919
>                 URL: https://issues.apache.org/jira/browse/HADOOP-14919
>             Project: Hadoop Common
>          Issue Type: Bug
>    Affects Versions: 2.8.0, 2.7.3, 3.0.0-alpha1
>            Reporter: Aki Tanaka
>            Assignee: Jason Lowe
>            Priority: Critical
>             Fix For: 2.9.0, 2.8.3, 2.7.5, 3.0.0
>
>         Attachments: 25.bz2, HADOOP-14919-test.patch, 
>                      HADOOP-14919.001.patch
>
>
> BZip2 can drop records when reading data in splits. This problem was already 
> discussed in HADOOP-11445 and HADOOP-13270, but a corner case remains that 
> causes data blocks to be lost.
>  
> I attached a unit test that reproduces the problem.
>  
> The issue occurs when the position of a newly created stream equals the 
> start of the split. Hadoop already has test cases for this situation (for 
> example, the blockEndingInCR.txt.bz2 file used by 
> TestLineRecordReader#testBzip2SplitStartAtBlockMarker). However, those tests 
> do not trigger the issue reported here, because it occurs only when the 
> bytes at the start of the split include both block marker bits and 
> compressed data.
>  
> BZip2 block marker - 0x314159265359 
> (001100010100000101011001001001100101001101011001)
>  
> blockEndingInCR.txt.bz2 (Start of Split - 136504):
> {code:java}
> $ xxd -l 6 -g 1 -b -seek 136498 ./hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/target/test-classes/blockEndingInCR.txt.bz2
> 0021532: 00110001 01000001 01011001 00100110 01010011 01011001  1AY&SY
> {code}
>  
> Test bz2 File (Start of Split - 203426)
> {code:java}
> $ xxd -l 7 -g 1 -b -seek 203419 25.bz2
> 0031a9b: 11100110 00101000 00101011 00100100 11001010 01101011  .(+$.k
> 0031aa1: 00101111                                               /
> {code}
>  
> Suppose a job divides this test bz2 file into two splits at position 203426.
> The first split does not read the records that start at position 203426, 
> because BZip2 reports the position of these records as 203427. The second 
> split does not read them either, because BZip2CompressionInputStream reads 
> the block starting at position 320955.
> As a result, the records between 203427 and 320955 are lost.
> Reverting the changes from HADOOP-13270 makes this issue go away, but the 
> HADOOP-13270 issue then reappears.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Updated] (HADOOP-14919) BZip2 drops records when reading data in splits

2017-10-04 Thread Jason Lowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-14919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Lowe updated HADOOP-14919:

Target Version/s: 2.9.0, 2.8.3, 2.7.5, 3.0.0
          Status: Patch Available  (was: Open)




[jira] [Updated] (HADOOP-14919) BZip2 drops records when reading data in splits

2017-10-04 Thread Jason Lowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-14919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Lowe updated HADOOP-14919:

Attachment: HADOOP-14919.001.patch

A lot of the troubles with split handling in this codec are related to these 
issues:
1) It seeks _backwards_ from the split start looking for a possible bzip2 block 
header and sometimes miscomputes where that should start.
2) It is reporting stream positions that can skip data that has not been 
processed yet (as in this case).

So here's my pitch at fixing this for the Nth time:
- No seeking backwards.  Any block whose start marker appears before the split 
is not the responsibility of this split's reader.  Split processing should 
never require reading data just before the split offset, as that data is 
entirely the responsibility of the previous split's reader.
- Stream position reports the byte where the block start marker begins rather 
than the byte after it ends.  That way we're always consistent about who is 
responsible for "consuming" a block start header and never report byte 
positions that may have skipped some data bits.
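
To make the ownership rule in the two points above concrete, here is a toy 
model in Python. It is not the codec's Java code: the function name 
owning_split and the list of split start offsets are hypothetical, and the 
only inputs taken from this issue are the byte where the marker begins in 
25.bz2 (203419, the offset passed to xxd) and the split position (203426).

```python
from bisect import bisect_right
from typing import Sequence


def owning_split(marker_start_byte: int, split_starts: Sequence[int]) -> int:
    """Return the index of the split that contains the marker's first byte.

    split_starts must be sorted and begin with 0. This is a toy model of the
    proposed rule, not Hadoop's implementation.
    """
    return bisect_right(split_starts, marker_start_byte) - 1


MARKER_START = 203419  # byte in which the block start marker begins (from the dump)

# Slide the split boundary across the marker: the block always has exactly
# one owner, decided only by where the marker begins.
for boundary in range(MARKER_START - 2, MARKER_START + 10):
    owner = owning_split(MARKER_START, [0, boundary])
    expected = 1 if boundary <= MARKER_START else 0
    assert owner == expected

print(owning_split(MARKER_START, [0, 203426]))  # 0: the first split owns the block
```

Because ownership depends only on the marker's first byte, a reader never 
needs to look at bytes before its own split offset in this model.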

This is probably going to break _something_ given how many attempts there have 
been at fixing this, so I greatly appreciate any and all eyes willing to take a 
look.  Alternative proposals are also welcome.

Attaching a patch that implements the approach described above.  Thanks to 
[~tanakahda] for the unit test.  I extended it to test all the split positions 
around the block start marker for this test case. Speaking of tests, Jenkins 
won't run all the related tests, so I also manually ran the 
TestLineRecordReader tests in hadoop-mapreduce-client-core and they passed.





[jira] [Updated] (HADOOP-14919) BZip2 drops records when reading data in splits

2017-10-02 Thread Aki Tanaka (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-14919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aki Tanaka updated HADOOP-14919:

Description: 
BZip2 can drop records when reading data in splits. This problem was already 
discussed in HADOOP-11445 and HADOOP-13270, but a corner case remains that 
causes data blocks to be lost.
 
I attached a unit test that reproduces the problem.
 
The issue occurs when the position of a newly created stream equals the start 
of the split. Hadoop already has test cases for this situation (for example, 
the blockEndingInCR.txt.bz2 file used by 
TestLineRecordReader#testBzip2SplitStartAtBlockMarker). However, those tests do 
not trigger the issue reported here, because it occurs only when the bytes at 
the start of the split include both block marker bits and compressed data.
 
BZip2 block marker - 0x314159265359 
(001100010100000101011001001001100101001101011001)
 
blockEndingInCR.txt.bz2 (Start of Split - 136504):
{code:java}
$ xxd -l 6 -g 1 -b -seek 136498 ./hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/target/test-classes/blockEndingInCR.txt.bz2
0021532: 00110001 01000001 01011001 00100110 01010011 01011001  1AY&SY
{code}

 
Test bz2 File (Start of Split - 203426)
{code:java}
$ xxd -l 7 -g 1 -b -seek 203419 25.bz2
0031a9b: 11100110 00101000 00101011 00100100 11001010 01101011  .(+$.k
0031aa1: 00101111                                               /
{code}
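
The difference between the two dumps is where the marker sits relative to 
byte boundaries. The following Python sketch searches a byte string for the 
48-bit marker at every bit offset. It is an illustration only, not the 
codec's code: the function name find_marker_bit_offset is hypothetical, the 
byte values are transcribed from the dumps above, and the seventh byte of the 
25.bz2 dump is taken to be 0x2F from the '/' in the ASCII column.

```python
BLOCK_MARKER = 0x314159265359  # 48-bit bzip2 block start marker
MARKER_BITS = 48


def find_marker_bit_offset(data: bytes) -> int:
    """Return the bit offset (MSB first) of the first block marker, or -1."""
    total_bits = len(data) * 8
    value = int.from_bytes(data, "big")
    mask = (1 << MARKER_BITS) - 1
    for offset in range(total_bits - MARKER_BITS + 1):
        # Shift so that the 48 bits starting at `offset` land in the low bits.
        shift = total_bits - MARKER_BITS - offset
        if (value >> shift) & mask == BLOCK_MARKER:
            return offset
    return -1


# blockEndingInCR.txt.bz2, 6 bytes starting at offset 136498
aligned = bytes([0x31, 0x41, 0x59, 0x26, 0x53, 0x59])
# 25.bz2, 7 bytes starting at offset 203419
unaligned = bytes([0xE6, 0x28, 0x2B, 0x24, 0xCA, 0x6B, 0x2F])

print(find_marker_bit_offset(aligned))    # 0
print(find_marker_bit_offset(unaligned))  # 3
```

In the first file the marker is byte-aligned and ends exactly at the split 
start (136498 + 6 = 136504). In 25.bz2 the marker starts 3 bits into byte 
203419 and ends 3 bits into byte 203425, so the remaining 5 bits of that byte 
already belong to the data that follows the marker.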

 
Suppose a job divides this test bz2 file into two splits at position 203426.
The first split does not read the records that start at position 203426, 
because BZip2 reports the position of these records as 203427. The second 
split does not read them either, because BZip2CompressionInputStream reads the 
block starting at position 320955.
As a result, the records between 203427 and 320955 are lost.

Reverting the changes from HADOOP-13270 makes this issue go away, but the 
HADOOP-13270 issue then reappears.





[jira] [Updated] (HADOOP-14919) BZip2 drops records when reading data in splits

2017-10-02 Thread Aki Tanaka (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-14919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aki Tanaka updated HADOOP-14919:

Attachment: 25.bz2

Adding the test bz2 file (the file that the attached unit test generates).




[jira] [Updated] (HADOOP-14919) BZip2 drops records when reading data in splits

2017-10-02 Thread Aki Tanaka (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-14919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aki Tanaka updated HADOOP-14919:

Attachment: HADOOP-14919-test.patch

Add patch for the unit test.
