[jira] [Updated] (MAPREDUCE-6549) multibyte delimiters with LineRecordReader cause duplicate records
[ https://issues.apache.org/jira/browse/MAPREDUCE-6549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junping Du updated MAPREDUCE-6549: -- Fix Version/s: 2.8.0 > multibyte delimiters with LineRecordReader cause duplicate records > -- > > Key: MAPREDUCE-6549 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-6549 > Project: Hadoop Map/Reduce > Issue Type: Bug > Components: mrv1, mrv2 >Affects Versions: 2.7.2 >Reporter: Dustin Cote >Assignee: Wilfred Spiegelenburg > Fix For: 2.8.0, 2.7.2, 2.6.3, 3.0.0-alpha1 > > Attachments: MAPREDUCE-6549-1.patch, MAPREDUCE-6549-2.patch, > MAPREDUCE-6549.3.patch > > > LineRecorderReader currently produces duplicate records under certain > scenarios such as: > 1) input string: "abc+++def++ghi++" > delimiter string: "+++" > test passes with all sizes of the split > 2) input string: "abc++def+++ghi++" > delimiter string: "+++" > test fails with a split size of 4 > 2) input string: "abc+++def++ghi++" > delimiter string: "++" > test fails with a split size of 5 > 3) input string "abc+++defg++hij++" > delimiter string: "++" > test fails with a split size of 4 > 4) input string "abc++def+++ghi++" > delimiter string: "++" > test fails with a split size of 9 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: mapreduce-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: mapreduce-issues-h...@hadoop.apache.org
[jira] [Updated] (MAPREDUCE-6549) multibyte delimiters with LineRecordReader cause duplicate records
[ https://issues.apache.org/jira/browse/MAPREDUCE-6549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli updated MAPREDUCE-6549: --- Fix Version/s: (was: 2.7.3) (was: 2.8.0) 2.7.2 Pulled this into 2.7.2 to keep the release up-to-date with 2.6.3. Changing fix-versions to reflect the same. > multibyte delimiters with LineRecordReader cause duplicate records > -- > > Key: MAPREDUCE-6549 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-6549 > Project: Hadoop Map/Reduce > Issue Type: Bug > Components: mrv1, mrv2 >Affects Versions: 2.7.2 >Reporter: Dustin Cote >Assignee: Wilfred Spiegelenburg > Fix For: 2.7.2, 2.6.3 > > Attachments: MAPREDUCE-6549-1.patch, MAPREDUCE-6549-2.patch, > MAPREDUCE-6549.3.patch > > > LineRecorderReader currently produces duplicate records under certain > scenarios such as: > 1) input string: "abc+++def++ghi++" > delimiter string: "+++" > test passes with all sizes of the split > 2) input string: "abc++def+++ghi++" > delimiter string: "+++" > test fails with a split size of 4 > 2) input string: "abc+++def++ghi++" > delimiter string: "++" > test fails with a split size of 5 > 3) input string "abc+++defg++hij++" > delimiter string: "++" > test fails with a split size of 4 > 4) input string "abc++def+++ghi++" > delimiter string: "++" > test fails with a split size of 9 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MAPREDUCE-6549) multibyte delimiters with LineRecordReader cause duplicate records
[ https://issues.apache.org/jira/browse/MAPREDUCE-6549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Lowe updated MAPREDUCE-6549: -- Fix Version/s: 2.7.3 2.6.3 Thanks, [~wilfreds]! Agree this should be in 2.7.3 and 2.6.3, so I committed this to branch-2.7 and branch-2.6 as well. > multibyte delimiters with LineRecordReader cause duplicate records > -- > > Key: MAPREDUCE-6549 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-6549 > Project: Hadoop Map/Reduce > Issue Type: Bug > Components: mrv1, mrv2 >Affects Versions: 2.7.2 >Reporter: Dustin Cote >Assignee: Wilfred Spiegelenburg > Fix For: 2.8.0, 2.6.3, 2.7.3 > > Attachments: MAPREDUCE-6549-1.patch, MAPREDUCE-6549-2.patch, > MAPREDUCE-6549.3.patch > > > LineRecorderReader currently produces duplicate records under certain > scenarios such as: > 1) input string: "abc+++def++ghi++" > delimiter string: "+++" > test passes with all sizes of the split > 2) input string: "abc++def+++ghi++" > delimiter string: "+++" > test fails with a split size of 4 > 2) input string: "abc+++def++ghi++" > delimiter string: "++" > test fails with a split size of 5 > 3) input string "abc+++defg++hij++" > delimiter string: "++" > test fails with a split size of 4 > 4) input string "abc++def+++ghi++" > delimiter string: "++" > test fails with a split size of 9 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MAPREDUCE-6549) multibyte delimiters with LineRecordReader cause duplicate records
[ https://issues.apache.org/jira/browse/MAPREDUCE-6549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Kanter updated MAPREDUCE-6549: - Resolution: Fixed Hadoop Flags: Reviewed Fix Version/s: 2.8.0 Status: Resolved (was: Patch Available) Thanks Wilfred and everyone who helped out on this. Committed to trunk and branch-2! > multibyte delimiters with LineRecordReader cause duplicate records > -- > > Key: MAPREDUCE-6549 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-6549 > Project: Hadoop Map/Reduce > Issue Type: Bug > Components: mrv1, mrv2 >Affects Versions: 2.7.2 >Reporter: Dustin Cote >Assignee: Wilfred Spiegelenburg > Fix For: 2.8.0 > > Attachments: MAPREDUCE-6549-1.patch, MAPREDUCE-6549-2.patch, > MAPREDUCE-6549.3.patch > > > LineRecorderReader currently produces duplicate records under certain > scenarios such as: > 1) input string: "abc+++def++ghi++" > delimiter string: "+++" > test passes with all sizes of the split > 2) input string: "abc++def+++ghi++" > delimiter string: "+++" > test fails with a split size of 4 > 2) input string: "abc+++def++ghi++" > delimiter string: "++" > test fails with a split size of 5 > 3) input string "abc+++defg++hij++" > delimiter string: "++" > test fails with a split size of 4 > 4) input string "abc++def+++ghi++" > delimiter string: "++" > test fails with a split size of 9 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MAPREDUCE-6549) multibyte delimiters with LineRecordReader cause duplicate records
[ https://issues.apache.org/jira/browse/MAPREDUCE-6549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wilfred Spiegelenburg updated MAPREDUCE-6549: - Status: Open (was: Patch Available) > multibyte delimiters with LineRecordReader cause duplicate records > -- > > Key: MAPREDUCE-6549 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-6549 > Project: Hadoop Map/Reduce > Issue Type: Bug > Components: mrv1, mrv2 >Affects Versions: 2.7.2 >Reporter: Dustin Cote >Assignee: Wilfred Spiegelenburg > Attachments: MAPREDUCE-6549-1.patch, MAPREDUCE-6549-2.patch > > > LineRecorderReader currently produces duplicate records under certain > scenarios such as: > 1) input string: "abc+++def++ghi++" > delimiter string: "+++" > test passes with all sizes of the split > 2) input string: "abc++def+++ghi++" > delimiter string: "+++" > test fails with a split size of 4 > 2) input string: "abc+++def++ghi++" > delimiter string: "++" > test fails with a split size of 5 > 3) input string "abc+++defg++hij++" > delimiter string: "++" > test fails with a split size of 4 > 4) input string "abc++def+++ghi++" > delimiter string: "++" > test fails with a split size of 9 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MAPREDUCE-6549) multibyte delimiters with LineRecordReader cause duplicate records
[ https://issues.apache.org/jira/browse/MAPREDUCE-6549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wilfred Spiegelenburg updated MAPREDUCE-6549: - Attachment: MAPREDUCE-6549.3.patch > multibyte delimiters with LineRecordReader cause duplicate records > -- > > Key: MAPREDUCE-6549 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-6549 > Project: Hadoop Map/Reduce > Issue Type: Bug > Components: mrv1, mrv2 >Affects Versions: 2.7.2 >Reporter: Dustin Cote >Assignee: Wilfred Spiegelenburg > Attachments: MAPREDUCE-6549-1.patch, MAPREDUCE-6549-2.patch, > MAPREDUCE-6549.3.patch > > > LineRecorderReader currently produces duplicate records under certain > scenarios such as: > 1) input string: "abc+++def++ghi++" > delimiter string: "+++" > test passes with all sizes of the split > 2) input string: "abc++def+++ghi++" > delimiter string: "+++" > test fails with a split size of 4 > 2) input string: "abc+++def++ghi++" > delimiter string: "++" > test fails with a split size of 5 > 3) input string "abc+++defg++hij++" > delimiter string: "++" > test fails with a split size of 4 > 4) input string "abc++def+++ghi++" > delimiter string: "++" > test fails with a split size of 9 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MAPREDUCE-6549) multibyte delimiters with LineRecordReader cause duplicate records
[ https://issues.apache.org/jira/browse/MAPREDUCE-6549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wilfred Spiegelenburg updated MAPREDUCE-6549: - Status: Patch Available (was: Open) Updated the patch to fix the NPE in the testUncompressedInputCustomDelimiterPosValue Checked the license, findbugs and other junit test failures and they are not related to the changes from this patch > multibyte delimiters with LineRecordReader cause duplicate records > -- > > Key: MAPREDUCE-6549 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-6549 > Project: Hadoop Map/Reduce > Issue Type: Bug > Components: mrv1, mrv2 >Affects Versions: 2.7.2 >Reporter: Dustin Cote >Assignee: Wilfred Spiegelenburg > Attachments: MAPREDUCE-6549-1.patch, MAPREDUCE-6549-2.patch, > MAPREDUCE-6549.3.patch > > > LineRecorderReader currently produces duplicate records under certain > scenarios such as: > 1) input string: "abc+++def++ghi++" > delimiter string: "+++" > test passes with all sizes of the split > 2) input string: "abc++def+++ghi++" > delimiter string: "+++" > test fails with a split size of 4 > 2) input string: "abc+++def++ghi++" > delimiter string: "++" > test fails with a split size of 5 > 3) input string "abc+++defg++hij++" > delimiter string: "++" > test fails with a split size of 4 > 4) input string "abc++def+++ghi++" > delimiter string: "++" > test fails with a split size of 9 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MAPREDUCE-6549) multibyte delimiters with LineRecordReader cause duplicate records
[ https://issues.apache.org/jira/browse/MAPREDUCE-6549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wilfred Spiegelenburg updated MAPREDUCE-6549: - Attachment: MAPREDUCE-6549-2.patch The issue is related to [MAPREDUCE-6481]. That jira changed the position calculation and made sure that the full records are returned by the reader as expected. It did not anticipate the record duplication. Junit tests also did not cover the use cases correctly to discover the issue. The problem is limited to multi byte delimiters only as far as I can trace. The junit tests for the multi byte delimiter only take the best case scenario into account. The input data contained the exact delimiter and no ambiguous characters. As soon as the test is changed, either the delimiter or the input data, a failure will be triggered. The issue with the failure is that it does not clearly show when and how it fails. Analysis of the test failures shows that a complex combination of input data, split and buffer size will trigger a failure. Based on testing the duplication of the record occurs only if: - the first character(s) of the delimiter are part of the record data, example: 1) the delimiter is {{\+=}} and the data contains a {{\+}} and is not followed by {{=}} 2) the delimiter is {{\+=\+=}} and the data contains {{\+=\+}} and is not followed by {{=}} - the delimiter character is found at the split boundary: the last character before the split ends - a fill of the buffer is triggered to finish processing the record The underlying problem is that we set a flag called {{needAdditionalRecord}} in the {{UncompressedSplitLineReader}} when we fill the buffer and have encountered part of a delimiter in combination with a split. We keep track of this in the ambiguous character number. However is it turns out that if the character(s) found after that point do not belong to a delimiter we do not unset the {{needAdditionalRecord}}. This causes the next record to be read twice and thus we see a duplication of records. The solution would be to unset the flag when we detect that we're not processing a delimiter. We currently only add the ambiguous characters to the record read and set the number back to 0. At the same point we need to unset the flag. The patch was developed based on junit tests that exercise the split and buffer settings in combination with multiple delimiter types using different inputs. All cases now provide a consistent count of records and correct position inside the data. > multibyte delimiters with LineRecordReader cause duplicate records > -- > > Key: MAPREDUCE-6549 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-6549 > Project: Hadoop Map/Reduce > Issue Type: Bug >Affects Versions: 2.7.2 >Reporter: Dustin Cote >Assignee: Wilfred Spiegelenburg > Attachments: MAPREDUCE-6549-1.patch, MAPREDUCE-6549-2.patch > > > LineRecorderReader currently produces duplicate records under certain > scenarios such as: > 1) input string: "abc+++def++ghi++" > delimiter string: "+++" > test passes with all sizes of the split > 2) input string: "abc++def+++ghi++" > delimiter string: "+++" > test fails with a split size of 4 > 2) input string: "abc+++def++ghi++" > delimiter string: "++" > test fails with a split size of 5 > 3) input string "abc+++defg++hij++" > delimiter string: "++" > test fails with a split size of 4 > 4) input string "abc++def+++ghi++" > delimiter string: "++" > test fails with a split size of 9 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MAPREDUCE-6549) multibyte delimiters with LineRecordReader cause duplicate records
[ https://issues.apache.org/jira/browse/MAPREDUCE-6549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wilfred Spiegelenburg updated MAPREDUCE-6549: - Status: Patch Available (was: Open) > multibyte delimiters with LineRecordReader cause duplicate records > -- > > Key: MAPREDUCE-6549 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-6549 > Project: Hadoop Map/Reduce > Issue Type: Bug >Affects Versions: 2.7.2 >Reporter: Dustin Cote >Assignee: Wilfred Spiegelenburg > Attachments: MAPREDUCE-6549-1.patch, MAPREDUCE-6549-2.patch > > > LineRecorderReader currently produces duplicate records under certain > scenarios such as: > 1) input string: "abc+++def++ghi++" > delimiter string: "+++" > test passes with all sizes of the split > 2) input string: "abc++def+++ghi++" > delimiter string: "+++" > test fails with a split size of 4 > 2) input string: "abc+++def++ghi++" > delimiter string: "++" > test fails with a split size of 5 > 3) input string "abc+++defg++hij++" > delimiter string: "++" > test fails with a split size of 4 > 4) input string "abc++def+++ghi++" > delimiter string: "++" > test fails with a split size of 9 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MAPREDUCE-6549) multibyte delimiters with LineRecordReader cause duplicate records
[ https://issues.apache.org/jira/browse/MAPREDUCE-6549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wilfred Spiegelenburg updated MAPREDUCE-6549: - Component/s: mrv2 mrv1 > multibyte delimiters with LineRecordReader cause duplicate records > -- > > Key: MAPREDUCE-6549 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-6549 > Project: Hadoop Map/Reduce > Issue Type: Bug > Components: mrv1, mrv2 >Affects Versions: 2.7.2 >Reporter: Dustin Cote >Assignee: Wilfred Spiegelenburg > Attachments: MAPREDUCE-6549-1.patch, MAPREDUCE-6549-2.patch > > > LineRecorderReader currently produces duplicate records under certain > scenarios such as: > 1) input string: "abc+++def++ghi++" > delimiter string: "+++" > test passes with all sizes of the split > 2) input string: "abc++def+++ghi++" > delimiter string: "+++" > test fails with a split size of 4 > 2) input string: "abc+++def++ghi++" > delimiter string: "++" > test fails with a split size of 5 > 3) input string "abc+++defg++hij++" > delimiter string: "++" > test fails with a split size of 4 > 4) input string "abc++def+++ghi++" > delimiter string: "++" > test fails with a split size of 9 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MAPREDUCE-6549) multibyte delimiters with LineRecordReader cause duplicate records
[ https://issues.apache.org/jira/browse/MAPREDUCE-6549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wilfred Spiegelenburg updated MAPREDUCE-6549: - Status: Open (was: Patch Available) I tried the change that you made in the patch and it fails the current tests. The patch changes one test (TestLineRecordReader.java) but we have two versions. The mapred version is unchanged and now fails. The mapreduce version works but as soon as I change the delimiter back it also fails. That means that the change does not fix the issue. it also brings the two tests out of sync which is not correct > multibyte delimiters with LineRecordReader cause duplicate records > -- > > Key: MAPREDUCE-6549 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-6549 > Project: Hadoop Map/Reduce > Issue Type: Bug >Affects Versions: 2.7.2 >Reporter: Dustin Cote >Assignee: Dustin Cote > Attachments: MAPREDUCE-6549-1.patch > > > LineRecorderReader currently produces duplicate records under certain > scenarios such as: > 1) input string: "abc+++def++ghi++" > delimiter string: "+++" > test passes with all sizes of the split > 2) input string: "abc++def+++ghi++" > delimiter string: "+++" > test fails with a split size of 4 > 2) input string: "abc+++def++ghi++" > delimiter string: "++" > test fails with a split size of 5 > 3) input string "abc+++defg++hij++" > delimiter string: "++" > test fails with a split size of 4 > 4) input string "abc++def+++ghi++" > delimiter string: "++" > test fails with a split size of 9 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MAPREDUCE-6549) multibyte delimiters with LineRecordReader cause duplicate records
[ https://issues.apache.org/jira/browse/MAPREDUCE-6549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dustin Cote updated MAPREDUCE-6549: --- Attachment: MAPREDUCE-6549-1.patch Attaching a patch to basically remove the attempt to read the last incomplete record of an input and change the tests to test a more generic, imperfect scenario. I'll add some more tests if review deems it necessary. As far as I am aware, we should drop an incomplete record at the end of the input, which now this happens with this patch in addition to the correct number of records coming up in the middle of the input (where previously there were duplicates). > multibyte delimiters with LineRecordReader cause duplicate records > -- > > Key: MAPREDUCE-6549 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-6549 > Project: Hadoop Map/Reduce > Issue Type: Bug >Affects Versions: 2.7.2 >Reporter: Dustin Cote >Assignee: Dustin Cote > Attachments: MAPREDUCE-6549-1.patch > > > LineRecorderReader currently produces duplicate records under certain > scenarios such as: > 1) input string: "abc+++def++ghi++" > delimiter string: "+++" > test passes with all sizes of the split > 2) input string: "abc++def+++ghi++" > delimiter string: "+++" > test fails with a split size of 4 > 2) input string: "abc+++def++ghi++" > delimiter string: "++" > test fails with a split size of 5 > 3) input string "abc+++defg++hij++" > delimiter string: "++" > test fails with a split size of 4 > 4) input string "abc++def+++ghi++" > delimiter string: "++" > test fails with a split size of 9 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MAPREDUCE-6549) multibyte delimiters with LineRecordReader cause duplicate records
[ https://issues.apache.org/jira/browse/MAPREDUCE-6549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dustin Cote updated MAPREDUCE-6549: --- Status: Patch Available (was: Open) [~zxu], could you review this? > multibyte delimiters with LineRecordReader cause duplicate records > -- > > Key: MAPREDUCE-6549 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-6549 > Project: Hadoop Map/Reduce > Issue Type: Bug >Affects Versions: 2.7.2 >Reporter: Dustin Cote >Assignee: Dustin Cote > Attachments: MAPREDUCE-6549-1.patch > > > LineRecorderReader currently produces duplicate records under certain > scenarios such as: > 1) input string: "abc+++def++ghi++" > delimiter string: "+++" > test passes with all sizes of the split > 2) input string: "abc++def+++ghi++" > delimiter string: "+++" > test fails with a split size of 4 > 2) input string: "abc+++def++ghi++" > delimiter string: "++" > test fails with a split size of 5 > 3) input string "abc+++defg++hij++" > delimiter string: "++" > test fails with a split size of 4 > 4) input string "abc++def+++ghi++" > delimiter string: "++" > test fails with a split size of 9 -- This message was sent by Atlassian JIRA (v6.3.4#6332)