[jira] Updated: (PIG-570) Large BZip files Seem to loose data in Pig
[ https://issues.apache.org/jira/browse/PIG-570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benjamin Reed updated PIG-570:
------------------------------
    Attachment: (was: bzipTest.bz2)

Large BZip files Seem to loose data in Pig
------------------------------------------
    Key: PIG-570
    URL: https://issues.apache.org/jira/browse/PIG-570
    Project: Pig
    Issue Type: Bug
    Affects Versions: types_branch, 0.0.0, 0.1.0, site
    Environment: Pig 0.1.1/Linux / 8 Nodes hadoop 0.18.2
    Reporter: Alex Newman
    Fix For: types_branch, 0.0.0, 0.1.0, site
    Attachments: PIG-570.patch

So I don't believe bzip2 input to Pig is working, at least not with large files. It seems as though map files are getting cut off. The maps complete way too quickly, and the actual row of data that Pig tries to process often randomly gets cut and becomes incomplete.

Here are my symptoms:
- Maps seem to be completing at an unbelievably fast rate

With uncompressed data:
Status: Succeeded
Started at: Wed Dec 17 21:31:10 EST 2008
Finished at: Wed Dec 17 22:42:09 EST 2008
Finished in: 1hrs, 10mins, 59sec
map 100.00% 4670 0 0 46700 0 / 21
reduce 57.72% 130 0 13 0 0 / 4

With bzip compressed data:
Started at: Wed Dec 17 21:17:28 EST 2008
Failed at: Wed Dec 17 21:17:52 EST 2008
Failed in: 24sec
Black-listed TaskTrackers: 2
Kind % Complete Num Tasks Pending Running Complete Killed Failed/Killed Task Attempts
map 100.00% 183 0 0 15 168 54 / 22
reduce 100.00% 130 0 0 13 0 / 0

The errors we get:

java.lang.IndexOutOfBoundsException: Requested index 11 from tuple (rec A, 0HAW, CHIX, )
    at org.apache.pig.data.Tuple.getField(Tuple.java:176)
    at org.apache.pig.impl.eval.ProjectSpec.eval(ProjectSpec.java:84)
    at org.apache.pig.impl.eval.SimpleEvalSpec$1.add(SimpleEvalSpec.java:38)
    at org.apache.pig.impl.eval.EvalSpec.simpleEval(EvalSpec.java:223)
    at org.apache.pig.impl.eval.cond.CompCond.eval(CompCond.java:58)
    at org.apache.pig.impl.eval.FilterSpec$1.add(FilterSpec.java:60)
    at org.apache.pig.backend.hadoop.executionengine.mapreduceExec.PigMapReduce.run(PigMapReduce.java:117)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:227)
    at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2207)

Last 4KB
attempt_200812161759_0045_m_07_0 task_200812161759_0045_m_07 tsdhb06.factset.com FAILED
java.lang.IndexOutOfBoundsException: Requested index 11 from tuple (rec A, CSGN, VTX, VTX, 0, 20080303, 90919, 380, 1543, 206002)
    at org.apache.pig.data.Tuple.getField(Tuple.java:176)
    at org.apache.pig.impl.eval.ProjectSpec.eval(ProjectSpec.java:84)
    at org.apache.pig.impl.eval.SimpleEvalSpec$1.add(SimpleEvalSpec.java:38)
    at org.apache.pig.impl.eval.EvalSpec.simpleEval(EvalSpec.java:223)
    at org.apache.pig.impl.eval.cond.CompCond.eval(CompCond.java:58)
    at org.apache.pig.impl.eval.FilterSpec$1.add(FilterSpec.java:60)
    at org.apache.pig.backend.hadoop.executionengine.mapreduceExec.PigMapReduce.run(PigMapReduce.java:117)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:227)
    at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2207)

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-570) Large BZip files Seem to loose data in Pig
[ https://issues.apache.org/jira/browse/PIG-570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benjamin Reed updated PIG-570:
------------------------------
    Attachment: (was: PIG-570.patch)
[jira] Updated: (PIG-570) Large BZip files Seem to loose data in Pig
[ https://issues.apache.org/jira/browse/PIG-570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benjamin Reed updated PIG-570:
------------------------------
    Attachment: PIG-570.patch

Tightened up the test case and increased the number of bits used for the signature to the full 48 bits. (Since I now use the start of the block boundary as the offset, we can use the whole thing.)

    Attachments: PIG-570.patch
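The comment above mentions matching the full 48-bit signature. The patch itself is not shown in this thread, so the following is only a minimal sketch of the general idea, assuming the signature in question is the standard bzip2 compressed-block magic number `0x314159265359`. Because bzip2 blocks are not byte-aligned, the scan slides a 48-bit window one bit at a time. The class and method names (`BlockSignatureScanner`, `findSignature`, `signatureAtBitOffset`) are hypothetical and do not come from Pig.

```java
/** Sketch only: locate a 48-bit bzip2 block signature at an arbitrary bit offset. */
final class BlockSignatureScanner {
    /** Standard bzip2 compressed-block magic number (48 bits). */
    static final long BLOCK_MAGIC = 0x314159265359L;
    static final long MASK_48 = 0xFFFFFFFFFFFFL;

    /** Returns the bit offset at which the first signature starts, or -1 if none is found. */
    static long findSignature(byte[] data) {
        long window = 0;    // the most recent 48 bits seen
        long bitsRead = 0;
        for (byte raw : data) {
            int b = raw & 0xFF;
            for (int i = 7; i >= 0; i--) {   // bzip2 packs bits most-significant first
                window = ((window << 1) | ((b >> i) & 1)) & MASK_48;
                bitsRead++;
                if (bitsRead >= 48 && window == BLOCK_MAGIC) {
                    return bitsRead - 48;    // offset of the first bit of the signature
                }
            }
        }
        return -1;
    }

    /** Hypothetical helper for the demo: 7 bytes with the signature after padBits (0..8) zero bits. */
    static byte[] signatureAtBitOffset(int padBits) {
        long value = BLOCK_MAGIC << (8 - padBits);   // 56-bit layout: padding, signature, zeros
        byte[] out = new byte[7];
        for (int i = 0; i < 7; i++) {
            out[i] = (byte) (value >>> (8 * (6 - i)));
        }
        return out;
    }

    public static void main(String[] args) {
        // The signature starts 3 bits into the buffer, i.e. not on a byte boundary.
        System.out.println(findSignature(signatureAtBitOffset(3)));
    }
}
```

Comparing all 48 bits rather than a shorter prefix makes a false match inside compressed data less likely, which is presumably why the comment calls out using "the whole thing".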
[jira] Updated: (PIG-570) Large BZip files Seem to loose data in Pig
[ https://issues.apache.org/jira/browse/PIG-570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benjamin Reed updated PIG-570:
------------------------------
    Attachment: (was: PIG-570.patch)
[jira] Updated: (PIG-570) Large BZip files Seem to loose data in Pig
[ https://issues.apache.org/jira/browse/PIG-570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benjamin Reed updated PIG-570:
------------------------------
    Attachment: PIG-570.patch
                bzipTest.bz2

Fixed the bzip for the test cases to have carefully crafted bad corner cases.

    Attachments: bzipTest.bz2, PIG-570.patch
[jira] Updated: (PIG-570) Large BZip files Seem to loose data in Pig
[ https://issues.apache.org/jira/browse/PIG-570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Olga Natkovich updated PIG-570:
-------------------------------
    Resolution: Fixed
    Status: Resolved (was: Patch Available)

patch committed; thanks, Ben!

    Attachments: bzipTest.bz2, PIG-570.patch
[jira] Updated: (PIG-570) Large BZip files Seem to loose data in Pig
[ https://issues.apache.org/jira/browse/PIG-570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benjamin Reed updated PIG-570:
------------------------------
    Attachment: (was: PIG-570.patch)

    Attachments: bzipTest.bz2
[jira] Updated: (PIG-570) Large BZip files Seem to loose data in Pig
[ https://issues.apache.org/jira/browse/PIG-570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benjamin Reed updated PIG-570:
------------------------------
    Attachment: PIG-570.patch

Regenerated patch against types branch

    Attachments: bzipTest.bz2, PIG-570.patch
[jira] Updated: (PIG-570) Large BZip files Seem to loose data in Pig
[ https://issues.apache.org/jira/browse/PIG-570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benjamin Reed updated PIG-570:
------------------------------
    Attachment: PIG-570.patch

I believe the problem is due to bad position tracking. In the current version of the code, we chop up the input into blocks, but unfortunately when using bzip there are bzip block boundaries, HDFS block boundaries, and record boundaries. If the bzip block boundaries line up too closely with the HDFS block boundaries, a record could get skipped or possibly corrupted.

I was able to reproduce a problem with the attached test case; hopefully it is the same as your problem. The root cause turns out to be improper tracking of position. If we blindly use the position of the underlying stream, and a bzip block and an HDFS block line up, we may think that we have read the first record of the next slice when in fact we have only read the bzip block header.

The attached patch fixes the problem by defining the position of the stream as the position of the start of the current block header in the underlying stream.

    Attachments: PIG-570.patch
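The position-tracking problem described in the comment above can be illustrated with a deliberately simplified model. This is a sketch under stated assumptions, not the code from PIG-570.patch: it works at the level of whole bzip blocks rather than records, assumes byte-aligned 6-byte block headers, and uses hypothetical names (`SlicePositionSketch`, `blocksRead`). Each reader handles the blocks whose reported position falls inside its slice `[start, end)`.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch only: how the definition of "position" decides which slice reads a block. */
final class SlicePositionSketch {
    /** 48-bit block signature, assumed byte-aligned here for simplicity. */
    static final int HEADER_BYTES = 6;

    /**
     * Returns the header offsets of the blocks a reader for slice [start, end) would process.
     * useHeaderStart = false: position is the underlying stream offset after reading the header.
     * useHeaderStart = true:  position is the offset where the current block header starts.
     */
    static List<Integer> blocksRead(int[] headerStarts, int start, int end, boolean useHeaderStart) {
        List<Integer> read = new ArrayList<>();
        for (int h : headerStarts) {
            if (h < start) {
                continue;            // the reader seeks to its slice start and scans forward
            }
            int position = useHeaderStart ? h : h + HEADER_BYTES;
            if (position >= end) {
                break;               // the reader believes it has moved past its slice
            }
            read.add(h);
        }
        return read;
    }

    public static void main(String[] args) {
        int[] headers = {0, 96, 150};   // the block at 96 straddles the slice boundary at 100

        // Underlying stream position: the block at offset 96 is read by neither slice.
        System.out.println(blocksRead(headers, 0, 100, false) + " " + blocksRead(headers, 100, 200, false));

        // Start-of-header position: every block is read exactly once.
        System.out.println(blocksRead(headers, 0, 100, true) + " " + blocksRead(headers, 100, 200, true));
    }
}
```

In this model the first reader stops early because merely reading the header at offset 96 pushes the underlying stream past the boundary at 100, while the second reader starts scanning at 100 and never sees that header. Reporting the header's start offset instead keeps the block in the first slice.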
[jira] Updated: (PIG-570) Large BZip files Seem to loose data in Pig
[ https://issues.apache.org/jira/browse/PIG-570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benjamin Reed updated PIG-570:
------------------------------
    Attachment: bzipTest.bz2

This is the test data for the bzip unit test. It should go under test/org/apache/pig/test/data/bzipTest.bz2

    Attachments: bzipTest.bz2, PIG-570.patch
[jira] Updated: (PIG-570) Large BZip files Seem to loose data in Pig
[ https://issues.apache.org/jira/browse/PIG-570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benjamin Reed updated PIG-570:
------------------------------

    Status: Patch Available (was: Open)