[jira] Updated: (PIG-570) Large BZip files Seem to loose data in Pig

2009-01-06 Thread Benjamin Reed (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Reed updated PIG-570:
--

Attachment: (was: bzipTest.bz2)

 Large BZip files  Seem to loose data in Pig
 ---

 Key: PIG-570
 URL: https://issues.apache.org/jira/browse/PIG-570
 Project: Pig
  Issue Type: Bug
Affects Versions: types_branch, 0.0.0, 0.1.0, site
 Environment: Pig 0.1.1/Linux / 8 Nodes hadoop 0.18.2
Reporter: Alex Newman
 Fix For: types_branch, 0.0.0, 0.1.0, site

 Attachments: PIG-570.patch


 I don't believe bzip2 input to Pig is working, at least not with large 
 files. It seems as though map files are getting cut off. The maps complete 
 far too quickly, and the row of data that Pig tries to process is often 
 cut at a random point and left incomplete. Here are my symptoms:
 - Maps seem to complete at an unbelievably fast rate
 With uncompressed data:
 Status: Succeeded
 Started at: Wed Dec 17 21:31:10 EST 2008
 Finished at: Wed Dec 17 22:42:09 EST 2008
 Finished in: 1hrs, 10mins, 59sec
 map   100.00%
 4670  0   0   46700   0 / 21
 reduce   57.72%
 130   0   13  0   0 / 4
 With bzip compressed data:
 Started at: Wed Dec 17 21:17:28 EST 2008
 Failed at: Wed Dec 17 21:17:52 EST 2008
 Failed in: 24sec
 Black-listed TaskTrackers: 2
 Kind  % Complete  Num Tasks  Pending  Running  Complete  Killed  Failed/Killed Task Attempts
 map   100.00%
 183   0   0   15  168 54 / 22
 reduce   100.00%
 130   0   0   13  0 / 0
 The errors we get:
 java.lang.IndexOutOfBoundsException: Requested index 11 from tuple (rec A, 0HAW, CHIX, )
   at org.apache.pig.data.Tuple.getField(Tuple.java:176)
   at org.apache.pig.impl.eval.ProjectSpec.eval(ProjectSpec.java:84)
   at org.apache.pig.impl.eval.SimpleEvalSpec$1.add(SimpleEvalSpec.java:38)
   at org.apache.pig.impl.eval.EvalSpec.simpleEval(EvalSpec.java:223)
   at org.apache.pig.impl.eval.cond.CompCond.eval(CompCond.java:58)
   at org.apache.pig.impl.eval.FilterSpec$1.add(FilterSpec.java:60)
   at org.apache.pig.backend.hadoop.executionengine.mapreduceExec.PigMapReduce.run(PigMapReduce.java:117)
   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:227)
   at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2207)
 Last 4KB
 attempt_200812161759_0045_m_07_0  task_200812161759_0045_m_07  tsdhb06.factset.com  FAILED
 java.lang.IndexOutOfBoundsException: Requested index 11 from tuple (rec A, CSGN, VTX, VTX, 0, 20080303, 90919, 380, 1543, 206002)
   at org.apache.pig.data.Tuple.getField(Tuple.java:176)
   at org.apache.pig.impl.eval.ProjectSpec.eval(ProjectSpec.java:84)
   at org.apache.pig.impl.eval.SimpleEvalSpec$1.add(SimpleEvalSpec.java:38)
   at org.apache.pig.impl.eval.EvalSpec.simpleEval(EvalSpec.java:223)
   at org.apache.pig.impl.eval.cond.CompCond.eval(CompCond.java:58)
   at org.apache.pig.impl.eval.FilterSpec$1.add(FilterSpec.java:60)
   at org.apache.pig.backend.hadoop.executionengine.mapreduceExec.PigMapReduce.run(PigMapReduce.java:117)
   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:227)
   at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2207)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-570) Large BZip files Seem to loose data in Pig

2009-01-06 Thread Benjamin Reed (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Reed updated PIG-570:
--

Attachment: (was: PIG-570.patch)

 Large BZip files  Seem to loose data in Pig
 ---

 Key: PIG-570
 URL: https://issues.apache.org/jira/browse/PIG-570
 Project: Pig
  Issue Type: Bug
Affects Versions: types_branch, 0.0.0, 0.1.0, site
 Environment: Pig 0.1.1/Linux / 8 Nodes hadoop 0.18.2
Reporter: Alex Newman
 Fix For: types_branch, 0.0.0, 0.1.0, site





[jira] Updated: (PIG-570) Large BZip files Seem to loose data in Pig

2009-01-06 Thread Benjamin Reed (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Reed updated PIG-570:
--

Attachment: PIG-570.patch

Tightened up the test case and increased the number of bits used for the 
signature to the full 48 bits. (Since I now use the start of the block boundary 
as the offset, we can use the whole thing.)
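
For readers unfamiliar with the format, here is a minimal sketch of what matching a full 48-bit signature can look like. It assumes the "signature" above is the standard 48-bit bzip2 block header magic (0x314159265359); that is an assumption, since the comment does not say. The class and method names are hypothetical and are not taken from PIG-570.patch. Because bzip2 blocks are not byte-aligned, the scan advances one bit at a time.

```java
/** Hypothetical illustration only; not code from the PIG-570 patch. */
class BlockMagicScanner {
    // 48-bit magic that starts every bzip2 compressed block (BCD digits of pi).
    static final long BLOCK_MAGIC = 0x314159265359L;
    static final long MASK_48 = 0xFFFFFFFFFFFFL;

    /**
     * Returns the bit offset, counted from the start of data, at which the
     * first full 48-bit block magic begins, or -1 if it does not occur.
     */
    static long findBlockMagic(byte[] data) {
        long window = 0;   // sliding window holding the last 48 bits seen
        long bitsRead = 0;
        for (byte b : data) {
            // bzip2 packs bits most-significant-bit first within each byte.
            for (int i = 7; i >= 0; i--) {
                window = ((window << 1) | ((b >> i) & 1)) & MASK_48;
                bitsRead++;
                if (bitsRead >= 48 && window == BLOCK_MAGIC) {
                    return bitsRead - 48;
                }
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // A 4-byte stream header ("BZh9") followed directly by a block magic.
        byte[] data = {'B', 'Z', 'h', '9', 0x31, 0x41, 0x59, 0x26, 0x53, 0x59};
        System.out.println(findBlockMagic(data));
    }
}
```

Comparing all 48 bits rather than a shorter prefix makes it less likely that ordinary compressed data is mistaken for a block boundary.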

 Large BZip files  Seem to loose data in Pig
 ---

 Key: PIG-570
 URL: https://issues.apache.org/jira/browse/PIG-570
 Project: Pig
  Issue Type: Bug
Affects Versions: types_branch, 0.0.0, 0.1.0, site
 Environment: Pig 0.1.1/Linux / 8 Nodes hadoop 0.18.2
Reporter: Alex Newman
 Fix For: types_branch, 0.0.0, 0.1.0, site

 Attachments: PIG-570.patch





[jira] Updated: (PIG-570) Large BZip files Seem to loose data in Pig

2009-01-06 Thread Benjamin Reed (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Reed updated PIG-570:
--

Attachment: (was: PIG-570.patch)

 Large BZip files  Seem to loose data in Pig
 ---

 Key: PIG-570
 URL: https://issues.apache.org/jira/browse/PIG-570
 Project: Pig
  Issue Type: Bug
Affects Versions: types_branch, 0.0.0, 0.1.0, site
 Environment: Pig 0.1.1/Linux / 8 Nodes hadoop 0.18.2
Reporter: Alex Newman
 Fix For: types_branch, 0.0.0, 0.1.0, site





[jira] Updated: (PIG-570) Large BZip files Seem to loose data in Pig

2009-01-06 Thread Benjamin Reed (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Reed updated PIG-570:
--

Attachment: PIG-570.patch
bzipTest.bz2

Fixed the bzip file for the test cases so that it contains carefully crafted 
bad corner cases.

 Large BZip files  Seem to loose data in Pig
 ---

 Key: PIG-570
 URL: https://issues.apache.org/jira/browse/PIG-570
 Project: Pig
  Issue Type: Bug
Affects Versions: types_branch, 0.0.0, 0.1.0, site
 Environment: Pig 0.1.1/Linux / 8 Nodes hadoop 0.18.2
Reporter: Alex Newman
 Fix For: types_branch, 0.0.0, 0.1.0, site

 Attachments: bzipTest.bz2, PIG-570.patch





[jira] Updated: (PIG-570) Large BZip files Seem to loose data in Pig

2009-01-06 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-570:
---

Resolution: Fixed
Status: Resolved  (was: Patch Available)

patch committed; thanks, Ben!

 Large BZip files  Seem to loose data in Pig
 ---

 Key: PIG-570
 URL: https://issues.apache.org/jira/browse/PIG-570
 Project: Pig
  Issue Type: Bug
Affects Versions: types_branch, 0.0.0, 0.1.0, site
 Environment: Pig 0.1.1/Linux / 8 Nodes hadoop 0.18.2
Reporter: Alex Newman
 Fix For: types_branch, 0.1.0, site, 0.0.0

 Attachments: bzipTest.bz2, PIG-570.patch





[jira] Updated: (PIG-570) Large BZip files Seem to loose data in Pig

2009-01-05 Thread Benjamin Reed (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Reed updated PIG-570:
--

Attachment: (was: PIG-570.patch)

 Large BZip files  Seem to loose data in Pig
 ---

 Key: PIG-570
 URL: https://issues.apache.org/jira/browse/PIG-570
 Project: Pig
  Issue Type: Bug
Affects Versions: types_branch, 0.0.0, 0.1.0, site
 Environment: Pig 0.1.1/Linux / 8 Nodes hadoop 0.18.2
Reporter: Alex Newman
 Fix For: types_branch, 0.0.0, 0.1.0, site

 Attachments: bzipTest.bz2





[jira] Updated: (PIG-570) Large BZip files Seem to loose data in Pig

2009-01-05 Thread Benjamin Reed (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Reed updated PIG-570:
--

Attachment: PIG-570.patch

Regenerated the patch against the types branch.

 Large BZip files  Seem to loose data in Pig
 ---

 Key: PIG-570
 URL: https://issues.apache.org/jira/browse/PIG-570
 Project: Pig
  Issue Type: Bug
Affects Versions: types_branch, 0.0.0, 0.1.0, site
 Environment: Pig 0.1.1/Linux / 8 Nodes hadoop 0.18.2
Reporter: Alex Newman
 Fix For: types_branch, 0.0.0, 0.1.0, site

 Attachments: bzipTest.bz2, PIG-570.patch





[jira] Updated: (PIG-570) Large BZip files Seem to loose data in Pig

2008-12-30 Thread Benjamin Reed (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Reed updated PIG-570:
--

Attachment: PIG-570.patch

I believe the problem is due to bad position tracking. In the current version 
of the code, we chop up the input into blocks, but unfortunately when using 
bzip there are bzip block boundaries, HDFS block boundaries, and record 
boundaries. If the bzip block boundaries line up too closely with the HDFS 
block boundaries, a record could get skipped or possibly corrupted.

I was able to reproduce a problem in the attached test case; hopefully it is 
the same as your problem.

The root cause turns out to be improper tracking of position. If we blindly 
use the position of the underlying stream, and a bzip block and an HDFS block 
line up, we may think that we have read the first record of the next slice 
when in fact we have only read the bzip block header.

The attached patch fixes the problem by defining the position of the stream as 
the position of the start of the current block header in the underlying stream.
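
The effect of the two position definitions can be sketched with a deliberately simplified model. This is not Pig's code. The class, the method names, the byte-aligned 6-byte header, and the rule that a reader stops once its reported position reaches the end of its slice are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified, hypothetical model of slice ownership; not code from PIG-570.patch. */
class BlockPositionSketch {
    // Assumed header size: the 48-bit block magic, treated as byte-aligned here.
    static final int HEADER_BYTES = 6;

    /**
     * Returns the header offsets of the blocks that the reader for the slice
     * [sliceStart, sliceEnd) would process. headerStarts must be sorted.
     *
     * useHeaderStart = false: position is the underlying stream offset after
     *                         the header has been consumed (the naive rule).
     * useHeaderStart = true:  position is the start of the current block header.
     */
    static List<Long> blocksForSlice(long[] headerStarts, long sliceStart,
                                     long sliceEnd, boolean useHeaderStart) {
        List<Long> owned = new ArrayList<>();
        for (long h : headerStarts) {
            if (h < sliceStart) {
                continue; // header begins before this slice; left to the previous reader
            }
            long position = useHeaderStart ? h : h + HEADER_BYTES;
            if (position >= sliceEnd) {
                break;    // reader believes it has crossed into the next slice
            }
            owned.add(h);
        }
        return owned;
    }

    public static void main(String[] args) {
        long[] headers = {0, 98, 150}; // the block at 98 straddles the boundary at 100
        for (boolean fixed : new boolean[]{false, true}) {
            System.out.println((fixed ? "header-start: " : "naive: ")
                    + blocksForSlice(headers, 0, 100, fixed) + " "
                    + blocksForSlice(headers, 100, 200, fixed));
        }
    }
}
```

In this model the block whose header starts at offset 98 is claimed by neither slice under the naive rule, because merely reading its header moves the position past the boundary at 100. With the header-start rule it is claimed by the first slice.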

 Large BZip files  Seem to loose data in Pig
 ---

 Key: PIG-570
 URL: https://issues.apache.org/jira/browse/PIG-570
 Project: Pig
  Issue Type: Bug
Affects Versions: types_branch, 0.0.0, 0.1.0, site
 Environment: Pig 0.1.1/Linux / 8 Nodes hadoop 0.18.2
Reporter: Alex Newman
 Fix For: types_branch, 0.0.0, 0.1.0, site

 Attachments: PIG-570.patch





[jira] Updated: (PIG-570) Large BZip files Seem to loose data in Pig

2008-12-30 Thread Benjamin Reed (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Reed updated PIG-570:
--

Attachment: bzipTest.bz2

This is the test data for the bzip unit test. It should go under 
test/org/apache/pig/test/data/bzipTest.bz2

 Large BZip files  Seem to loose data in Pig
 ---

 Key: PIG-570
 URL: https://issues.apache.org/jira/browse/PIG-570
 Project: Pig
  Issue Type: Bug
Affects Versions: types_branch, 0.0.0, 0.1.0, site
 Environment: Pig 0.1.1/Linux / 8 Nodes hadoop 0.18.2
Reporter: Alex Newman
 Fix For: types_branch, 0.0.0, 0.1.0, site

 Attachments: bzipTest.bz2, PIG-570.patch






[jira] Updated: (PIG-570) Large BZip files Seem to loose data in Pig

2008-12-30 Thread Benjamin Reed (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Reed updated PIG-570:
--

Status: Patch Available  (was: Open)

 Large BZip files  Seem to loose data in Pig
 ---

 Key: PIG-570
 URL: https://issues.apache.org/jira/browse/PIG-570
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.0.0, types_branch, 0.1.0, site
 Environment: Pig 0.1.1/Linux / 8 Nodes hadoop 0.18.2
Reporter: Alex Newman
 Fix For: types_branch, 0.1.0, site, 0.0.0

 Attachments: bzipTest.bz2, PIG-570.patch


