[ 
https://issues.apache.org/jira/browse/PIG-5373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam Szita updated PIG-5373:
----------------------------
    Comment: was deleted

(was: Attached [^PIG-5373.0.patch] which corrects the reading of sync markers 
using a fifo, and compares the fifo content with the expected marker.

Test case attached, which verifies in a brute force way, that such prefix 
scenarios are handled well.

[~nkollar], [~rohini] can you take a look please?)

> InterRecordReader might skip records if certain sync markers are used
> ---------------------------------------------------------------------
>
>                 Key: PIG-5373
>                 URL: https://issues.apache.org/jira/browse/PIG-5373
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.17.0
>            Reporter: Adam Szita
>            Assignee: Adam Szita
>            Priority: Major
>         Attachments: PIG-5373.0.patch
>
>
> Due to bug in InterRecordReader#skipUntilMarkerOrSplitEndOrEOF(), it can 
> happen that sync markers are not identified while reading the interim binary 
> file used to hold data between jobs.
> In such files sync markers are placed upon writing, which later help during 
> reading the data. These are random generated and it seems like that in some 
> rare combinations of markers and data preceding it, they cannot be not found. 
> This can result in reading through all the bytes (looking for the marker) and 
> reaching split end or EOF, and extracting no records at all.
> This symptom is also observable from JobHistory stats, where if a job is 
> affected by this issue, will have tasks that have HDFS_BYTES_READ or 
> FILE_BYTES_READ about equal to the number bytes of the split, but at the same 
> time having MAP_INPUT_RECORDS=0
> One such (test) example is this:
> {code:java}
> marker: [-128, -128, 4] , data: [127, -1, 2, -128, -128, -128, 4, 1, 2, 
> 3]{code}
> Due to a bug, such markers whose prefix overlap with the last data chunk are 
> not seen by the reader.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to