[jira] Updated: (PIG-930) merge join should handle compressed bz2 sorted files
[ https://issues.apache.org/jira/browse/PIG-930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Olga Natkovich updated PIG-930:
---
    Fix Version/s:     (was: 0.8.0)

Unlinking from the release. We have not really seen user asks for this.

> merge join should handle compressed bz2 sorted files
>
>
>                 Key: PIG-930
>                 URL: https://issues.apache.org/jira/browse/PIG-930
>             Project: Pig
>          Issue Type: Bug
>            Reporter: Pradeep Kamath
>
> There are two issues. First, POLoad, which is used to read the right side input, does
> not handle bz2 files right now. This needs to be fixed.
> Second, in the index map job we bindTo(startOfBlockOffSet) (this will
> internally discard the first tuple if offset > 0). Then we do the following:
> {noformat}
> While(tuple survives pipeline) {
>   Pos = getPosition()
>   getNext()
>   run the tuple through pipeline in the right side which could have filter
> }
> Emit(key, pos, filename).
> {noformat}
>
> Then in the map job which does the join, we bindTo(pos > 0 ? pos - 1 : pos)
> (we do pos - 1 because bindTo will discard the first tuple for pos > 0). Then we do
> getNext().
> Now in bz2 compressed files, getPosition() returns a position which is not
> really accurate. The problem is that it could be a position in the middle of a
> compressed bz2 block. Then when we use that position to bindTo() in the final
> map job, the code would first hunt for a bz2 block header, thus skipping the
> whole current bz2 block.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
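
The skipped-block behaviour described in the issue can be illustrated with a toy model. The sketch below is not Pig code: `read_from_plain`, `read_from_block_compressed`, and the record-index "positions" are hypothetical stand-ins, and the discard-first-tuple / `pos - 1` adjustment is deliberately left out. It only assumes what the issue states, namely that a reader over block-compressed data resumes at the next block header when handed a mid-block position.

```python
from bisect import bisect_left


def block_starts(blocks):
    """Return the starting position of each block (positions are record indexes)."""
    starts, pos = [], 0
    for block in blocks:
        starts.append(pos)
        pos += len(block)
    return starts


def read_from_plain(blocks, pos):
    """Uncompressed input: seeking to `pos` resumes exactly at that record."""
    records = [r for block in blocks for r in block]
    return records[pos:]


def read_from_block_compressed(blocks, pos):
    """Toy block-compressed input: decoding can only resume at a block header,
    so a mid-block `pos` is rounded up to the start of the next block."""
    records = [r for block in blocks for r in block]
    starts = block_starts(blocks)
    i = bisect_left(starts, pos)
    resume = starts[i] if i < len(starts) else len(records)
    return records[resume:]


# Three "compressed blocks" of sorted keys; block headers sit at positions 0, 3, 6.
blocks = [["a", "b", "c"], ["d", "e", "f"], ["g", "h", "i"]]

# Suppose the index job recorded position 4 (key "e"), which lies mid-block.
indexed_pos = 4
plain_rows = read_from_plain(blocks, indexed_pos)
compressed_rows = read_from_block_compressed(blocks, indexed_pos)
```

In this model the plain reader resumes at "e", while the block-compressed reader resumes at "g", so "e" and "f" never reach the join. A position that coincides with a block header (3 here) behaves the same in both readers, which is why an index built from block-aligned positions would not show the problem.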
[jira] Updated: (PIG-930) merge join should handle compressed bz2 sorted files
[ https://issues.apache.org/jira/browse/PIG-930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Olga Natkovich updated PIG-930:
---
    Fix Version/s: 0.8.0

Likely, this is no longer an issue in 0.7.0. Need to verify and add unit tests.

> merge join should handle compressed bz2 sorted files
>
>
>                 Key: PIG-930
>                 URL: https://issues.apache.org/jira/browse/PIG-930
>             Project: Pig
>          Issue Type: Bug
>            Reporter: Pradeep Kamath
>             Fix For: 0.8.0
>
>
> There are two issues. First, POLoad, which is used to read the right side input, does
> not handle bz2 files right now. This needs to be fixed.
> Second, in the index map job we bindTo(startOfBlockOffSet) (this will
> internally discard the first tuple if offset > 0). Then we do the following:
> {noformat}
> While(tuple survives pipeline) {
>   Pos = getPosition()
>   getNext()
>   run the tuple through pipeline in the right side which could have filter
> }
> Emit(key, pos, filename).
> {noformat}
>
> Then in the map job which does the join, we bindTo(pos > 0 ? pos - 1 : pos)
> (we do pos - 1 because bindTo will discard the first tuple for pos > 0). Then we do
> getNext().
> Now in bz2 compressed files, getPosition() returns a position which is not
> really accurate. The problem is that it could be a position in the middle of a
> compressed bz2 block. Then when we use that position to bindTo() in the final
> map job, the code would first hunt for a bz2 block header, thus skipping the
> whole current bz2 block.
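
As a starting point for the verification mentioned above, the following sketch uses Python's standard `bz2` module (not Pig's or Hadoop's reader) to show why a byte offset inside a bz2 stream is not a usable resume point by itself. The helper `can_decode_from` and the sample rows are hypothetical; the block-header marker value is the one defined by the bzip2 format.

```python
import bz2

# 48-bit marker that opens every bzip2 block.
BLOCK_MAGIC = bytes.fromhex("314159265359")


def can_decode_from(data, offset):
    """Return True if a fresh decoder accepts the stream starting at `offset`."""
    try:
        bz2.BZ2Decompressor().decompress(data[offset:])
        return True
    except OSError:
        # Raised when the bytes at `offset` do not begin a valid bzip2 stream.
        return False


# Sorted, tab-separated sample rows (made up for this illustration).
rows = "".join(f"{key:06d}\tvalue\n" for key in range(1000)).encode()
compressed = bz2.compress(rows)

stream_header = compressed[:3]      # "BZh" signature of a bzip2 stream
first_block_marker = compressed[4:10]  # marker of the first block

decodes_from_start = can_decode_from(compressed, 0)
decodes_from_middle = can_decode_from(compressed, len(compressed) // 2)
```

Decoding succeeds from offset 0 and fails from the midpoint, since the bytes there do not start a stream. A reader that must resume inside a file therefore has to search forward for a block marker first, which is consistent with the skipped-block symptom in the description. Note that later block markers are bit-aligned rather than byte-aligned, so a plain byte search for `BLOCK_MAGIC` is not sufficient in general.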