[jira] [Updated] (MAPREDUCE-6549) multibyte delimiters with LineRecordReader cause duplicate records

2017-01-05 Thread Junping Du (JIRA)


 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du updated MAPREDUCE-6549:
--
Fix Version/s: 2.8.0

> multibyte delimiters with LineRecordReader cause duplicate records
> --
>
> Key: MAPREDUCE-6549
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6549
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: mrv1, mrv2
>Affects Versions: 2.7.2
>Reporter: Dustin Cote
>Assignee: Wilfred Spiegelenburg
> Fix For: 2.8.0, 2.7.2, 2.6.3, 3.0.0-alpha1
>
> Attachments: MAPREDUCE-6549-1.patch, MAPREDUCE-6549-2.patch, 
> MAPREDUCE-6549.3.patch
>
>
> LineRecordReader currently produces duplicate records under certain 
> scenarios such as:
> 1) input string: "abc+++def++ghi++" 
> delimiter string: "+++" 
> test passes with all sizes of the split 
> 2) input string: "abc++def+++ghi++" 
> delimiter string: "+++" 
> test fails with a split size of 4 
> 3) input string: "abc+++def++ghi++" 
> delimiter string: "++" 
> test fails with a split size of 5 
> 4) input string: "abc+++defg++hij++" 
> delimiter string: "++" 
> test fails with a split size of 4 
> 5) input string: "abc++def+++ghi++" 
> delimiter string: "++" 
> test fails with a split size of 9 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: mapreduce-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: mapreduce-issues-h...@hadoop.apache.org

[jira] [Updated] (MAPREDUCE-6549) multibyte delimiters with LineRecordReader cause duplicate records

2016-01-13 Thread Vinod Kumar Vavilapalli (JIRA)


 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated MAPREDUCE-6549:
---
Fix Version/s: (was: 2.7.3)
   (was: 2.8.0)
   2.7.2

Pulled this into 2.7.2 to keep the release up-to-date with 2.6.3. Changing 
fix-versions to reflect the same.

> multibyte delimiters with LineRecordReader cause duplicate records
> --
>
> Key: MAPREDUCE-6549
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6549
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: mrv1, mrv2
>Affects Versions: 2.7.2
>Reporter: Dustin Cote
>Assignee: Wilfred Spiegelenburg
> Fix For: 2.7.2, 2.6.3
>
> Attachments: MAPREDUCE-6549-1.patch, MAPREDUCE-6549-2.patch, 
> MAPREDUCE-6549.3.patch
>
>




[jira] [Updated] (MAPREDUCE-6549) multibyte delimiters with LineRecordReader cause duplicate records

2015-11-30 Thread Jason Lowe (JIRA)


 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Lowe updated MAPREDUCE-6549:
--
Fix Version/s: 2.7.3
   2.6.3

Thanks, [~wilfreds]!  Agree this should be in 2.7.3 and 2.6.3, so I committed 
this to branch-2.7 and branch-2.6 as well.

> multibyte delimiters with LineRecordReader cause duplicate records
> --
>
> Key: MAPREDUCE-6549
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6549
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: mrv1, mrv2
>Affects Versions: 2.7.2
>Reporter: Dustin Cote
>Assignee: Wilfred Spiegelenburg
> Fix For: 2.8.0, 2.6.3, 2.7.3
>
> Attachments: MAPREDUCE-6549-1.patch, MAPREDUCE-6549-2.patch, 
> MAPREDUCE-6549.3.patch
>
>




[jira] [Updated] (MAPREDUCE-6549) multibyte delimiters with LineRecordReader cause duplicate records

2015-11-25 Thread Robert Kanter (JIRA)


 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Kanter updated MAPREDUCE-6549:
-
   Resolution: Fixed
 Hadoop Flags: Reviewed
Fix Version/s: 2.8.0
   Status: Resolved  (was: Patch Available)

Thanks Wilfred and everyone who helped out on this.

Committed to trunk and branch-2!

> multibyte delimiters with LineRecordReader cause duplicate records
> --
>
> Key: MAPREDUCE-6549
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6549
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: mrv1, mrv2
>Affects Versions: 2.7.2
>Reporter: Dustin Cote
>Assignee: Wilfred Spiegelenburg
> Fix For: 2.8.0
>
> Attachments: MAPREDUCE-6549-1.patch, MAPREDUCE-6549-2.patch, 
> MAPREDUCE-6549.3.patch
>
>




[jira] [Updated] (MAPREDUCE-6549) multibyte delimiters with LineRecordReader cause duplicate records

2015-11-24 Thread Wilfred Spiegelenburg (JIRA)


 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wilfred Spiegelenburg updated MAPREDUCE-6549:
-
Status: Open  (was: Patch Available)

> multibyte delimiters with LineRecordReader cause duplicate records
> --
>
> Key: MAPREDUCE-6549
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6549
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: mrv1, mrv2
>Affects Versions: 2.7.2
>Reporter: Dustin Cote
>Assignee: Wilfred Spiegelenburg
> Attachments: MAPREDUCE-6549-1.patch, MAPREDUCE-6549-2.patch
>
>




[jira] [Updated] (MAPREDUCE-6549) multibyte delimiters with LineRecordReader cause duplicate records

2015-11-24 Thread Wilfred Spiegelenburg (JIRA)


 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wilfred Spiegelenburg updated MAPREDUCE-6549:
-
Attachment: MAPREDUCE-6549.3.patch

> multibyte delimiters with LineRecordReader cause duplicate records
> --
>
> Key: MAPREDUCE-6549
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6549
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: mrv1, mrv2
>Affects Versions: 2.7.2
>Reporter: Dustin Cote
>Assignee: Wilfred Spiegelenburg
> Attachments: MAPREDUCE-6549-1.patch, MAPREDUCE-6549-2.patch, 
> MAPREDUCE-6549.3.patch
>
>




[jira] [Updated] (MAPREDUCE-6549) multibyte delimiters with LineRecordReader cause duplicate records

2015-11-24 Thread Wilfred Spiegelenburg (JIRA)


 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wilfred Spiegelenburg updated MAPREDUCE-6549:
-
Status: Patch Available  (was: Open)

Updated the patch to fix the NPE in the 
testUncompressedInputCustomDelimiterPosValue test.

Checked the license, findbugs and other junit test failures; they are not 
related to the changes from this patch.

> multibyte delimiters with LineRecordReader cause duplicate records
> --
>
> Key: MAPREDUCE-6549
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6549
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: mrv1, mrv2
>Affects Versions: 2.7.2
>Reporter: Dustin Cote
>Assignee: Wilfred Spiegelenburg
> Attachments: MAPREDUCE-6549-1.patch, MAPREDUCE-6549-2.patch, 
> MAPREDUCE-6549.3.patch
>
>




[jira] [Updated] (MAPREDUCE-6549) multibyte delimiters with LineRecordReader cause duplicate records

2015-11-15 Thread Wilfred Spiegelenburg (JIRA)


 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wilfred Spiegelenburg updated MAPREDUCE-6549:
-
Attachment: MAPREDUCE-6549-2.patch

The issue is related to [MAPREDUCE-6481]. That jira changed the position 
calculation and made sure that the full records are returned by the reader as 
expected. It did not anticipate the record duplication. Junit tests also did 
not cover the use cases correctly to discover the issue.
The problem is limited to multi byte delimiters only as far as I can trace. 

The junit tests for the multi byte delimiter only take the best case scenario 
into account. The input data contained the exact delimiter and no ambiguous 
characters. As soon as the test is changed, either the delimiter or the input 
data, a failure will be triggered. The issue with the failure is that it does 
not clearly show when and how it fails. Analysis of the test failures shows 
that a complex combination of input data, split and buffer size will trigger a 
failure.

Based on testing the duplication of the record occurs only if:
- the first character(s) of the delimiter are part of the record data, example: 
  1) the delimiter is {{\+=}} and the data contains a {{\+}} and is not 
followed by {{=}}
  2) the delimiter is {{\+=\+=}} and the data contains {{\+=\+}} and is not 
followed by {{=}}
- the delimiter character is found at the split boundary: the last character 
before the split ends
- a fill of the buffer is triggered to finish processing the record

The underlying problem is that we set a flag called {{needAdditionalRecord}} in 
the {{UncompressedSplitLineReader}} when we fill the buffer and have 
encountered part of a delimiter in combination with a split. We keep track of 
this in the ambiguous character number. However, it turns out that if the 
character(s) found after that point do not belong to a delimiter we do not 
unset the {{needAdditionalRecord}} flag. This causes the next record to be read 
twice and thus we see a duplication of records.
The solution would be to unset the flag when we detect that we're not 
processing a delimiter. We currently only add the ambiguous characters to the 
record read and set the number back to 0. At the same point we need to unset 
the flag.
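
The flag handling described above can be illustrated with a small toy model. 
This is only a sketch under stated assumptions, not the Hadoop code: the class 
FlagResetSketch and its methods are hypothetical, only the name 
needAdditionalRecord comes from the description above, and the matcher is 
deliberately simplified. It feeds the data "ab+" followed by "cd+=" with the 
delimiter "+=", with the split ending right after the first "+".

```java
import java.nio.charset.StandardCharsets;

// Toy model (hypothetical, NOT the Hadoop implementation) of the flag handling.
final class FlagResetSketch {
    private final byte[] delimiter;
    private final boolean clearFlagOnMismatch; // true models the proposed fix
    private final StringBuilder record = new StringBuilder();
    private int ambiguousByteCount = 0;        // delimiter bytes matched so far
    private boolean needAdditionalRecord = false;

    FlagResetSketch(String delimiter, boolean clearFlagOnMismatch) {
        this.delimiter = delimiter.getBytes(StandardCharsets.US_ASCII);
        if (this.delimiter.length < 2) {
            throw new IllegalArgumentException("multi byte delimiter expected");
        }
        this.clearFlagOnMismatch = clearFlagOnMismatch;
    }

    // Models a buffer fill that happens exactly at the end of the split.
    void refillAtSplitEnd() {
        if (ambiguousByteCount > 0) {
            needAdditionalRecord = true; // the split may have cut a delimiter
        }
    }

    // Consumes one byte; returns true when a full delimiter was just matched.
    // Simplification: no backtracking, so delimiters with repeated prefixes
    // can be missed. That is enough to show the flag behaviour.
    boolean consume(byte b) {
        if (b == delimiter[ambiguousByteCount]) {
            ambiguousByteCount++;
            if (ambiguousByteCount == delimiter.length) {
                ambiguousByteCount = 0;
                return true;
            }
            return false;
        }
        if (ambiguousByteCount > 0) {
            // The partial match was record data after all: add it to the record.
            record.append(new String(delimiter, 0, ambiguousByteCount,
                    StandardCharsets.US_ASCII));
            ambiguousByteCount = 0;
            if (clearFlagOnMismatch) {
                needAdditionalRecord = false; // the split did not cut a delimiter
            }
            if (b == delimiter[0]) {          // this byte may start a delimiter
                ambiguousByteCount = 1;
                return false;
            }
        }
        record.append((char) b);
        return false;
    }

    // Runs the scenario and reports the flag once the record is complete.
    static boolean flagAfterRecord(boolean clearFlagOnMismatch) {
        FlagResetSketch s = new FlagResetSketch("+=", clearFlagOnMismatch);
        for (byte b : "ab+".getBytes(StandardCharsets.US_ASCII)) {
            s.consume(b);
        }
        s.refillAtSplitEnd();
        for (byte b : "cd+=".getBytes(StandardCharsets.US_ASCII)) {
            s.consume(b);
        }
        return s.needAdditionalRecord;
    }

    public static void main(String[] args) {
        System.out.println(flagAfterRecord(false)); // prints "true"
        System.out.println(flagAfterRecord(true));  // prints "false"
    }
}
```

Without the reset the flag is still set once the record "ab+cd" is complete, 
which in this model stands for the reader fetching one record too many. With 
the reset the flag is cleared as soon as the "+" turns out to be record data.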

The patch was developed based on junit tests that exercise the split and buffer 
settings in combination with multiple delimiter types using different inputs. 
All cases now provide a consistent count of records and correct position inside 
the data.
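
One way to state the expected record count for such tests is a reference 
splitter that ignores splits and buffers entirely. The sketch below is an 
illustration only and is not taken from the patch: the class ReferenceSplitter 
and the method splitRecords are hypothetical, it assumes greedy left to right 
delimiter matching, and it assumes that trailing data without a closing 
delimiter counts as a final record.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical reference splitter, not part of Hadoop or of the patch.
final class ReferenceSplitter {

    // Splits the input on the delimiter, scanning greedily from left to right.
    // Assumption: data after the last delimiter is returned as a final record.
    static List<String> splitRecords(String input, String delimiter) {
        if (delimiter.isEmpty()) {
            throw new IllegalArgumentException("delimiter must not be empty");
        }
        List<String> records = new ArrayList<>();
        int start = 0;
        while (start < input.length()) {
            int hit = input.indexOf(delimiter, start);
            if (hit < 0) {
                records.add(input.substring(start)); // trailing record
                break;
            }
            records.add(input.substring(start, hit));
            start = hit + delimiter.length();
        }
        return records;
    }

    public static void main(String[] args) {
        // Two of the inputs from the issue description.
        System.out.println(splitRecords("abc++def+++ghi++", "+++")); // [abc++def, ghi++]
        System.out.println(splitRecords("abc+++def++ghi++", "++"));  // [abc, +def, ghi]
    }
}
```

A test that loops over split and buffer sizes could then compare the number of 
records returned by the reader with the size of this list for every 
combination.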

> multibyte delimiters with LineRecordReader cause duplicate records
> --
>
> Key: MAPREDUCE-6549
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6549
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Affects Versions: 2.7.2
>Reporter: Dustin Cote
>Assignee: Wilfred Spiegelenburg
> Attachments: MAPREDUCE-6549-1.patch, MAPREDUCE-6549-2.patch
>
>




[jira] [Updated] (MAPREDUCE-6549) multibyte delimiters with LineRecordReader cause duplicate records

2015-11-15 Thread Wilfred Spiegelenburg (JIRA)


 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wilfred Spiegelenburg updated MAPREDUCE-6549:
-
Status: Patch Available  (was: Open)

> multibyte delimiters with LineRecordReader cause duplicate records
> --
>
> Key: MAPREDUCE-6549
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6549
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Affects Versions: 2.7.2
>Reporter: Dustin Cote
>Assignee: Wilfred Spiegelenburg
> Attachments: MAPREDUCE-6549-1.patch, MAPREDUCE-6549-2.patch
>
>




[jira] [Updated] (MAPREDUCE-6549) multibyte delimiters with LineRecordReader cause duplicate records

2015-11-15 Thread Wilfred Spiegelenburg (JIRA)


 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wilfred Spiegelenburg updated MAPREDUCE-6549:
-
Component/s: mrv2
 mrv1

> multibyte delimiters with LineRecordReader cause duplicate records
> --
>
> Key: MAPREDUCE-6549
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6549
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: mrv1, mrv2
>Affects Versions: 2.7.2
>Reporter: Dustin Cote
>Assignee: Wilfred Spiegelenburg
> Attachments: MAPREDUCE-6549-1.patch, MAPREDUCE-6549-2.patch
>
>




[jira] [Updated] (MAPREDUCE-6549) multibyte delimiters with LineRecordReader cause duplicate records

2015-11-15 Thread Wilfred Spiegelenburg (JIRA)


 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wilfred Spiegelenburg updated MAPREDUCE-6549:
-
Status: Open  (was: Patch Available)

I tried the change that you made in the patch and it fails the current tests.
The patch changes one test (TestLineRecordReader.java) but we have two 
versions. The mapred version is unchanged and now fails. The mapreduce version 
works but as soon as I change the delimiter back it also fails. That means that 
the change does not fix the issue.

It also brings the two tests out of sync, which is not correct.

> multibyte delimiters with LineRecordReader cause duplicate records
> --
>
> Key: MAPREDUCE-6549
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6549
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Affects Versions: 2.7.2
>Reporter: Dustin Cote
>Assignee: Dustin Cote
> Attachments: MAPREDUCE-6549-1.patch
>
>




[jira] [Updated] (MAPREDUCE-6549) multibyte delimiters with LineRecordReader cause duplicate records

2015-11-14 Thread Dustin Cote (JIRA)


 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dustin Cote updated MAPREDUCE-6549:
---
Attachment: MAPREDUCE-6549-1.patch

Attaching a patch that removes the attempt to read the last incomplete 
record of an input and changes the tests to cover a more generic, imperfect 
scenario.  I'll add some more tests if review deems it necessary.  As far as I 
am aware, we should drop an incomplete record at the end of the input. With 
this patch that now happens, and the correct number of records comes up in 
the middle of the input (where previously there were duplicates).

> multibyte delimiters with LineRecordReader cause duplicate records
> --
>
> Key: MAPREDUCE-6549
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6549
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Affects Versions: 2.7.2
>Reporter: Dustin Cote
>Assignee: Dustin Cote
> Attachments: MAPREDUCE-6549-1.patch
>
>




[jira] [Updated] (MAPREDUCE-6549) multibyte delimiters with LineRecordReader cause duplicate records

2015-11-14 Thread Dustin Cote (JIRA)


 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dustin Cote updated MAPREDUCE-6549:
---
Status: Patch Available  (was: Open)

[~zxu], could you review this?

> multibyte delimiters with LineRecordReader cause duplicate records
> --
>
> Key: MAPREDUCE-6549
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6549
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Affects Versions: 2.7.2
>Reporter: Dustin Cote
>Assignee: Dustin Cote
> Attachments: MAPREDUCE-6549-1.patch
>
>



