[jira] [Updated] (HADOOP-9867) org.apache.hadoop.mapred.LineRecordReader does not handle multibyte record delimiters well

2014-06-25 Thread Rushabh S Shah (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-9867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rushabh S Shah updated HADOOP-9867:
---

Attachment: HADOOP-9867.patch

I have been tracking this jira for a while.
I read all of Jason's comments, and this patch should address them.
I used the test case provided in Vinayakumar's patch and modified it a little to 
test exhaustively.


 org.apache.hadoop.mapred.LineRecordReader does not handle multibyte record 
 delimiters well
 --

 Key: HADOOP-9867
 URL: https://issues.apache.org/jira/browse/HADOOP-9867
 Project: Hadoop Common
  Issue Type: Bug
  Components: io
Affects Versions: 0.20.2, 0.23.9, 2.2.0
 Environment: CDH3U2 Redhat linux 5.7
Reporter: Kris Geusebroek
Assignee: Vinayakumar B
Priority: Critical
 Attachments: HADOOP-9867.patch, HADOOP-9867.patch, HADOOP-9867.patch, 
 HADOOP-9867.patch


 Having defined a record delimiter of multiple bytes in a new InputFileFormat 
 sometimes has the effect of skipping records from the input.
 This happens when the input splits are split off just after a record 
 separator. The starting point for the next split would be non-zero and 
 skipFirstLine would be true. A seek into the file is done to start - 1 and 
 the text until the first record delimiter is ignored (due to the presumption 
 that this record is already handled by the previous map task). Since the 
 record delimiter is multibyte, the seek only brings the last byte of the 
 delimiter into scope and it is not recognized as a full delimiter. So the 
 text is skipped until the next delimiter, ignoring a full record!
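
A minimal, self-contained sketch of this failure mode, for illustration only: this
is not the Hadoop code or the attached patch, and the class and method names are
made up. It uses a two-byte delimiter and an in-memory byte array in place of an
HDFS split.

public class MultibyteDelimiterBugSketch {
    // Two-byte record delimiter, standing in for any multibyte delimiter.
    static final byte[] DELIM = "\r\n".getBytes();

    // Naive "skip the partial first record" logic: scan forward from 'pos'
    // until a full delimiter is found and return the offset just past it.
    static int skipToNextDelimiter(byte[] data, int pos) {
        for (int i = pos; i <= data.length - DELIM.length; i++) {
            boolean match = true;
            for (int j = 0; j < DELIM.length; j++) {
                if (data[i + j] != DELIM[j]) { match = false; break; }
            }
            if (match) {
                return i + DELIM.length;
            }
        }
        return data.length;
    }

    public static void main(String[] args) {
        byte[] data = "rec1\r\nrec2\r\nrec3\r\n".getBytes();
        // The previous split ends exactly at the delimiter (offsets 4-5), so
        // this split starts at offset 6 and skipFirstLine is true.
        int splitStart = 6;
        // Seeking to start - 1 puts only the last byte of the delimiter ('\n')
        // in scope, so no full delimiter is recognized and the scan runs past
        // "rec2" to the delimiter that follows it.
        int resumeAt = skipToNextDelimiter(data, splitStart - 1);
        System.out.println("resumes at offset " + resumeAt);  // 12 instead of 6
        System.out.println("remaining: "
            + new String(data, resumeAt, data.length - resumeAt));  // rec3 only
    }
}

With a single-byte delimiter the same seek to start - 1 lands on the delimiter
itself, the skip stops immediately after it, and nothing is lost; only the
multibyte case drops "rec2".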



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (HADOOP-9867) org.apache.hadoop.mapred.LineRecordReader does not handle multibyte record delimiters well

2014-01-29 Thread Jason Lowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-9867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Lowe updated HADOOP-9867:
---

Target Version/s: 2.3.0  (was: )

 org.apache.hadoop.mapred.LineRecordReader does not handle multibyte record 
 delimiters well
 --

 Key: HADOOP-9867
 URL: https://issues.apache.org/jira/browse/HADOOP-9867
 Project: Hadoop Common
  Issue Type: Bug
  Components: io
Affects Versions: 0.20.2, 0.23.9, 2.2.0
 Environment: CDH3U2 Redhat linux 5.7
Reporter: Kris Geusebroek
Assignee: Vinay
Priority: Critical
 Attachments: HADOOP-9867.patch, HADOOP-9867.patch, HADOOP-9867.patch


 Having defined a record delimiter of multiple bytes in a new InputFileFormat 
 sometimes has the effect of skipping records from the input.
 This happens when the input splits are split off just after a record 
 separator. The starting point for the next split would be non-zero and 
 skipFirstLine would be true. A seek into the file is done to start - 1 and 
 the text until the first record delimiter is ignored (due to the presumption 
 that this record is already handled by the previous map task). Since the 
 record delimiter is multibyte, the seek only brings the last byte of the 
 delimiter into scope and it is not recognized as a full delimiter. So the 
 text is skipped until the next delimiter, ignoring a full record!



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Updated] (HADOOP-9867) org.apache.hadoop.mapred.LineRecordReader does not handle multibyte record delimiters well

2013-12-09 Thread Vinay (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-9867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinay updated HADOOP-9867:
--

Attachment: HADOOP-9867.patch

Attaching the updated patch based on the HADOOP-9622 changes.

 org.apache.hadoop.mapred.LineRecordReader does not handle multibyte record 
 delimiters well
 --

 Key: HADOOP-9867
 URL: https://issues.apache.org/jira/browse/HADOOP-9867
 Project: Hadoop Common
  Issue Type: Bug
  Components: io
Affects Versions: 0.20.2, 0.23.9, 2.2.0
 Environment: CDH3U2 Redhat linux 5.7
Reporter: Kris Geusebroek
Assignee: Vinay
Priority: Critical
 Attachments: HADOOP-9867.patch, HADOOP-9867.patch, HADOOP-9867.patch


 Having defined a record delimiter of multiple bytes in a new InputFileFormat 
 sometimes has the effect of skipping records from the input.
 This happens when the input splits are split off just after a record 
 separator. The starting point for the next split would be non-zero and 
 skipFirstLine would be true. A seek into the file is done to start - 1 and 
 the text until the first record delimiter is ignored (due to the presumption 
 that this record is already handled by the previous map task). Since the 
 record delimiter is multibyte, the seek only brings the last byte of the 
 delimiter into scope and it is not recognized as a full delimiter. So the 
 text is skipped until the next delimiter, ignoring a full record!



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)


[jira] [Updated] (HADOOP-9867) org.apache.hadoop.mapred.LineRecordReader does not handle multibyte record delimiters well

2013-11-20 Thread Vinay (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-9867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinay updated HADOOP-9867:
--

Attachment: HADOOP-9867.patch

Attaching a patch with the test mentioned by Jason.

The patch reads one more record if the split ends between the delimiter bytes.

Please review.
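
For illustration, a hedged sketch of the kind of check this describes (not taken
from the attached patch; the class name, helper name, and byte-array representation
are assumptions made for the example):

public class SplitBoundarySketch {
    // Returns true when the split boundary (the offset where the next split
    // begins) falls strictly inside an occurrence of the multibyte delimiter.
    // In that case the reader for this split should consume one additional
    // record, since the next split's reader skips everything up to the first
    // full delimiter it can see.
    static boolean splitEndsInsideDelimiter(byte[] data, int splitEnd, byte[] delim) {
        // Try every way the delimiter could straddle the boundary: its first k
        // bytes end exactly at splitEnd and the remaining bytes follow after.
        for (int k = 1; k < delim.length; k++) {
            int start = splitEnd - k;
            if (start < 0 || start + delim.length > data.length) {
                continue;
            }
            boolean match = true;
            for (int j = 0; j < delim.length; j++) {
                if (data[start + j] != delim[j]) { match = false; break; }
            }
            if (match) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        byte[] data = "rec1\r\nrec2\r\n".getBytes();
        byte[] delim = "\r\n".getBytes();
        System.out.println(splitEndsInsideDelimiter(data, 5, delim)); // true: between '\r' and '\n'
        System.out.println(splitEndsInsideDelimiter(data, 6, delim)); // false: after the full delimiter
    }
}

For a two-byte delimiter such as "\r\n", a boundary that falls between the '\r'
and the '\n' makes the check return true, and this split's reader then reads the
record that the next split would otherwise drop.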

 org.apache.hadoop.mapred.LineRecordReader does not handle multibyte record 
 delimiters well
 --

 Key: HADOOP-9867
 URL: https://issues.apache.org/jira/browse/HADOOP-9867
 Project: Hadoop Common
  Issue Type: Bug
  Components: io
Affects Versions: 0.20.2, 0.23.9, 2.2.0
 Environment: CDH3U2 Redhat linux 5.7
Reporter: Kris Geusebroek
Priority: Critical
 Attachments: HADOOP-9867.patch


 Having defined a record delimiter of multiple bytes in a new InputFileFormat 
 sometimes has the effect of skipping records from the input.
 This happens when the input splits are split off just after a record 
 separator. The starting point for the next split would be non-zero and 
 skipFirstLine would be true. A seek into the file is done to start - 1 and 
 the text until the first record delimiter is ignored (due to the presumption 
 that this record is already handled by the previous map task). Since the 
 record delimiter is multibyte, the seek only brings the last byte of the 
 delimiter into scope and it is not recognized as a full delimiter. So the 
 text is skipped until the next delimiter, ignoring a full record!



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (HADOOP-9867) org.apache.hadoop.mapred.LineRecordReader does not handle multibyte record delimiters well

2013-11-20 Thread Vinay (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-9867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinay updated HADOOP-9867:
--

Attachment: HADOOP-9867.patch

Updated the patch to fix a possible NPE.

 org.apache.hadoop.mapred.LineRecordReader does not handle multibyte record 
 delimiters well
 --

 Key: HADOOP-9867
 URL: https://issues.apache.org/jira/browse/HADOOP-9867
 Project: Hadoop Common
  Issue Type: Bug
  Components: io
Affects Versions: 0.20.2, 0.23.9, 2.2.0
 Environment: CDH3U2 Redhat linux 5.7
Reporter: Kris Geusebroek
Priority: Critical
 Attachments: HADOOP-9867.patch, HADOOP-9867.patch


 Having defined a record delimiter of multiple bytes in a new InputFileFormat 
 sometimes has the effect of skipping records from the input.
 This happens when the input splits are split off just after a record 
 separator. The starting point for the next split would be non-zero and 
 skipFirstLine would be true. A seek into the file is done to start - 1 and 
 the text until the first record delimiter is ignored (due to the presumption 
 that this record is already handled by the previous map task). Since the 
 record delimiter is multibyte, the seek only brings the last byte of the 
 delimiter into scope and it is not recognized as a full delimiter. So the 
 text is skipped until the next delimiter, ignoring a full record!



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (HADOOP-9867) org.apache.hadoop.mapred.LineRecordReader does not handle multibyte record delimiters well

2013-11-20 Thread Vinay (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-9867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinay updated HADOOP-9867:
--

Assignee: Vinay
  Status: Patch Available  (was: Open)

 org.apache.hadoop.mapred.LineRecordReader does not handle multibyte record 
 delimiters well
 --

 Key: HADOOP-9867
 URL: https://issues.apache.org/jira/browse/HADOOP-9867
 Project: Hadoop Common
  Issue Type: Bug
  Components: io
Affects Versions: 2.2.0, 0.23.9, 0.20.2
 Environment: CDH3U2 Redhat linux 5.7
Reporter: Kris Geusebroek
Assignee: Vinay
Priority: Critical
 Attachments: HADOOP-9867.patch, HADOOP-9867.patch


 Having defined a record delimiter of multiple bytes in a new InputFileFormat 
 sometimes has the effect of skipping records from the input.
 This happens when the input splits are split off just after a record 
 separator. The starting point for the next split would be non-zero and 
 skipFirstLine would be true. A seek into the file is done to start - 1 and 
 the text until the first record delimiter is ignored (due to the presumption 
 that this record is already handled by the previous map task). Since the 
 record delimiter is multibyte, the seek only brings the last byte of the 
 delimiter into scope and it is not recognized as a full delimiter. So the 
 text is skipped until the next delimiter, ignoring a full record!



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (HADOOP-9867) org.apache.hadoop.mapred.LineRecordReader does not handle multibyte record delimiters well

2013-11-19 Thread Jason Lowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-9867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Lowe updated HADOOP-9867:
---

 Priority: Critical  (was: Major)
 Target Version/s: 2.3.0
Affects Version/s: 0.23.9, 2.2.0

Raising severity since this involves loss of data. I also confirmed this is an 
issue on recent Hadoop versions.

 org.apache.hadoop.mapred.LineRecordReader does not handle multibyte record 
 delimiters well
 --

 Key: HADOOP-9867
 URL: https://issues.apache.org/jira/browse/HADOOP-9867
 Project: Hadoop Common
  Issue Type: Bug
  Components: io
Affects Versions: 0.20.2, 0.23.9, 2.2.0
 Environment: CDH3U2 Redhat linux 5.7
Reporter: Kris Geusebroek
Priority: Critical

 Having defined a record delimiter of multiple bytes in a new InputFileFormat 
 sometimes has the effect of skipping records from the input.
 This happens when the input splits are split off just after a record 
 separator. The starting point for the next split would be non-zero and 
 skipFirstLine would be true. A seek into the file is done to start - 1 and 
 the text until the first record delimiter is ignored (due to the presumption 
 that this record is already handled by the previous map task). Since the 
 record delimiter is multibyte, the seek only brings the last byte of the 
 delimiter into scope and it is not recognized as a full delimiter. So the 
 text is skipped until the next delimiter, ignoring a full record!



--
This message was sent by Atlassian JIRA
(v6.1#6144)