[jira] [Created] (PIG-2061) NewPlan match() is sensitive to ordering
NewPlan match() is sensitive to ordering Key: PIG-2061 URL: https://issues.apache.org/jira/browse/PIG-2061 Project: Pig Issue Type: Bug Reporter: Koji Noguchi Priority: Minor There is no current Rule that is affected by this, but inside TestNewPlanRule.java {noformat} 155 public void testMultiNode() throws Exception { ... 175 pattern.connect(op1, op3); 176 pattern.connect(op2, op3); ... 178 Rule r = new SillyRule("basic", pattern); 179 List l = r.match(plan); 180 assertEquals(1, l.size()); {noformat} this test fails when we swap lines 175 and 176, even though the two orderings are structurally equivalent. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
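The ordering sensitivity above is the classic pitfall of comparing a node's predecessor list positionally rather than as a multiset. A minimal, hypothetical Java sketch (not Pig's actual matcher; operator names are placeholders) showing the difference between the two comparison styles:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class OrderInsensitiveMatch {
    // Positional comparison: sensitive to the order in which edges were connected.
    static boolean matchOrdered(List<String> planPreds, List<String> patternPreds) {
        return planPreds.equals(patternPreds);
    }

    // Multiset comparison: structurally equivalent plans match regardless of edge order.
    static boolean matchUnordered(List<String> planPreds, List<String> patternPreds) {
        List<String> a = new ArrayList<>(planPreds);
        List<String> b = new ArrayList<>(patternPreds);
        Collections.sort(a);
        Collections.sort(b);
        return a.equals(b);
    }

    public static void main(String[] args) {
        List<String> plan = Arrays.asList("op1", "op2");    // connect(op1, op3); connect(op2, op3)
        List<String> pattern = Arrays.asList("op2", "op1"); // same edges, swapped order
        System.out.println(matchOrdered(plan, pattern));    // false: ordering leaks in
        System.out.println(matchUnordered(plan, pattern));  // true: same structure
    }
}
```

Sorting (or any canonical ordering of predecessors) before comparison is one way a matcher could be made insensitive to the connect() order.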
[jira] [Updated] (PIG-2044) Pattern match bug in org.apache.pig.newplan.optimizer.Rule
[ https://issues.apache.org/jira/browse/PIG-2044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Noguchi updated PIG-2044: -- Attachment: PIG-2044-00.patch Taking out the 'break' statement, which made the for-loop meaningless. Added one test. > Pattern match bug in org.apache.pig.newplan.optimizer.Rule > - > > Key: PIG-2044 > URL: https://issues.apache.org/jira/browse/PIG-2044 > Project: Pig > Issue Type: Bug >Affects Versions: 0.9.0 >Reporter: Daniel Dai >Assignee: Koji Noguchi > Fix For: 0.10 > > Attachments: PIG-2044-00.patch > > > Koji found that we have a bug in org.apache.pig.newplan.optimizer.Rule. The > "break" in line 179 seems to be wrong. This multiple-branch matching is not > used in Pig, but could be a problem in the future. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
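To illustrate the class of bug being fixed (a generic sketch, not the actual Rule code): an unconditional break inside a for-loop makes every iteration after the first unreachable, so only one candidate branch is ever inspected.

```java
import java.util.Arrays;
import java.util.List;

public class BreakBug {
    // Buggy: the unconditional break means only the first branch is ever examined.
    static int countMatchesBuggy(List<Integer> branches, int target) {
        int matches = 0;
        for (int b : branches) {
            if (b == target) {
                matches++;
            }
            break; // exits after the first iteration, whatever happened
        }
        return matches;
    }

    // Fixed: with the break removed, every branch is considered.
    static int countMatchesFixed(List<Integer> branches, int target) {
        int matches = 0;
        for (int b : branches) {
            if (b == target) {
                matches++;
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        List<Integer> branches = Arrays.asList(5, 7, 7);
        System.out.println(countMatchesBuggy(branches, 7)); // 0
        System.out.println(countMatchesFixed(branches, 7)); // 2
    }
}
```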
[jira] [Commented] (PIG-2055) inconsistent behavior in parser generated during build
[ https://issues.apache.org/jira/browse/PIG-2055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13032783#comment-13032783 ] Koji Noguchi commented on PIG-2055: --- I hit this as well on my macbook. It drove me crazy. Using antlr-3.3 (instead of 3.2) seems to have fixed it for me. > inconsistent behavior in parser generated during build > - > > Key: PIG-2055 > URL: https://issues.apache.org/jira/browse/PIG-2055 > Project: Pig > Issue Type: Bug >Affects Versions: 0.9.0 >Reporter: Thejas M Nair > > On certain builds, I see that pig fails to support this syntax: > {code} > grunt> l = load 'x' using PigStorage(':'); > 2011-05-10 09:21:41,565 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR > 1200: mismatched input '(' expecting SEMI_COLON > Details at logfile: /Users/tejas/pig_trunk_cp/trunk/pig_1305044484712.log > {code} > I seem to be the only one who has seen this behavior, and I have seen it on > occasion when I build on a mac. It could be a problem with the antlr and Apple JVM > interaction. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-2044) Pattern match bug in org.apache.pig.newplan.optimizer.Rule
[ https://issues.apache.org/jira/browse/PIG-2044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Noguchi updated PIG-2044: -- Status: Patch Available (was: Open) > Pattern match bug in org.apache.pig.newplan.optimizer.Rule > - > > Key: PIG-2044 > URL: https://issues.apache.org/jira/browse/PIG-2044 > Project: Pig > Issue Type: Bug >Affects Versions: 0.9.0 >Reporter: Daniel Dai >Assignee: Koji Noguchi > Fix For: 0.10 > > Attachments: PIG-2044-00.patch > > > Koji found that we have a bug in org.apache.pig.newplan.optimizer.Rule. The > "break" in line 179 seems to be wrong. This multiple-branch matching is not > used in Pig, but could be a problem in the future. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Resolved] (PIG-2802) Wrong Schema generated when there is a dangling alias
[ https://issues.apache.org/jira/browse/PIG-2802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Noguchi resolved PIG-2802. --- Resolution: Duplicate > Wrong Schema generated when there is a dangling alias > - > > Key: PIG-2802 > URL: https://issues.apache.org/jira/browse/PIG-2802 > Project: Pig > Issue Type: Bug >Affects Versions: 0.9.2, 0.10.0 >Reporter: Anitha Raju > > Hi, > Script > {code} > A = load 'test.txt' using PigStorage() AS (x:int,y:int, z:int) ; > B = GROUP A BY x; > C = foreach B generate A.x as s; > describe C; -- C: {s: {(x: int)}} > D = FOREACH B { >E = ORDER A by y; >GENERATE A.x as s; > }; > describe D; -- D: {x: int,y: int,z: int} > {code} > Here E is a dangling alias. > Regards, > Anitha -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-3051) java.lang.IndexOutOfBoundsException failure with LimitOptimizer + ColumnPruning
[ https://issues.apache.org/jira/browse/PIG-3051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Noguchi updated PIG-3051: -- Attachment: pig-3051-v2.1-withe2etest.txt Thanks Rohini for the review. bq. But found an issue with the copy not setting the label, type and Uid. I wasn't sure why my test worked even when the above fields were not set. It turns out they are filled in by the SchemaPatcher: LimitOptimizer.reportChanges() simply returns currentPlan, so the SchemaPatcher goes through the entire currentPlan, including the newSort.mSortColPlans mentioned above, and updates them accordingly. BTW, reading back my patch, I felt that the logic of making a new copy of LOSort should be kept inside LOSort.java. Uploading a new version; the logic is the same as in the previous patch. > java.lang.IndexOutOfBoundsException failure with LimitOptimizer + > ColumnPruning > > > Key: PIG-3051 > URL: https://issues.apache.org/jira/browse/PIG-3051 > Project: Pig > Issue Type: Bug > Components: parser >Affects Versions: 0.10.0, 0.11 >Reporter: Koji Noguchi >Assignee: Koji Noguchi > Fix For: 0.11 > > Attachments: pig-3051-v1.1-withe2etest.txt, > pig-3051-v1-withouttest.txt, pig-3051-v2.1-withe2etest.txt > > > Had a user hitting > "Caused by: java.lang.IndexOutOfBoundsException: Index: 1, Size: 1" error > when he had multiple stores and limit in his code. > I couldn't reproduce this with short pig code (due to ColumnPruning somehow > not happening when shortened), but here's a snippet. > {noformat} > ... > G3 = FOREACH G2 GENERATE sortCol, FLATTEN(group) as label, (long)COUNT(G1) as > cnt; > G4 = ORDER G3 BY cnt DESC PARALLEL 25; > ONEROW = LIMIT G4 1; > U1 = FOREACH ONEROW GENERATE 3 as sortcol, 'somelabel' as label, cnt; > store U1 into 'u1' using PigStorage(); > store G4 into 'g4' using PigStorage(); > {noformat} > With '-t ColumnMapKeyPrune', job didn't hit the error. -- This message is automatically generated by JIRA. 
If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-3100) If a .pig_schema file is present, can get an index out of bounds error
[ https://issues.apache.org/jira/browse/PIG-3100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13537287#comment-13537287 ] Koji Noguchi commented on PIG-3100: --- I should have commented on PIG-3056, but when our users hit this issue, the affected records tend to contain a record separator as part of the data by mistake, and that results in a single record being split into two incomplete ones. For that case, I wasn't sure if we wanted to fill the incomplete records with nulls or have an option like PIG-3059. > If a .pig_schema file is present, can get an index out of bounds error > -- > > Key: PIG-3100 > URL: https://issues.apache.org/jira/browse/PIG-3100 > Project: Pig > Issue Type: Bug >Reporter: Jonathan Coveney >Assignee: Jonathan Coveney > Fix For: 0.12 > > Attachments: PIG-3100-0_nows.patch, PIG-3100-0.patch > > > In the case that a .pig_schema file is present, if you have a record with > fewer than expected fields, pig errors out with an index out of bounds > exception that is annoying, unnecessary, and unhelpful. > Instead of improving logging, I decided to just do what pig should do, which > is fill in the records. > Patch will include a test and the fix. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-3100) If a .pig_schema file is present, can get an index out of bounds error
[ https://issues.apache.org/jira/browse/PIG-3100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13537302#comment-13537302 ] Koji Noguchi commented on PIG-3100: --- bq. Perhaps there can be a flag or setting for PigStorage that is a "strict" mode That sounds like a nice feature to have. Come to think of it, the problem of delimiters is not unique to this .pig_schema file loading. > If a .pig_schema file is present, can get an index out of bounds error > -- > > Key: PIG-3100 > URL: https://issues.apache.org/jira/browse/PIG-3100 > Project: Pig > Issue Type: Bug >Reporter: Jonathan Coveney >Assignee: Jonathan Coveney > Fix For: 0.12 > > Attachments: PIG-3100-0_nows.patch, PIG-3100-0.patch > > > In the case that a .pig_schema file is present, if you have a record with > fewer than expected fields, pig errors out with an index out of bounds > exception that is annoying, unnecessary, and unhelpful. > Instead of improving logging, I decided to just do what pig should do, which > is fill in the records. > Patch will include a test and the fix. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (PIG-3102) Option for PigStorage load to error out when input record is incomplete (instead of filling in null)
Koji Noguchi created PIG-3102: - Summary: Option for PigStorage load to error out when input record is incomplete (instead of filling in null) Key: PIG-3102 URL: https://issues.apache.org/jira/browse/PIG-3102 Project: Pig Issue Type: New Feature Reporter: Koji Noguchi Priority: Minor Continuing from PIG-3100. If users know that all input records have the correct number of fields, then enforcing that (with an option) would let us catch input corruption early. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
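A hedged sketch of what such an option might look like (hypothetical helper names, not PigStorage's actual API): given a schema with a known field count, either pad short records with nulls (the behavior PIG-3100 introduced) or fail fast under a strict flag.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class StrictFieldCheck {
    // Hypothetical helper: split a record on a delimiter and enforce the
    // expected field count from the schema.
    static List<String> parseRecord(String line, char delim, int expectedFields, boolean strict) {
        // Pattern.quote so delimiters that are regex metacharacters split literally.
        List<String> fields = new ArrayList<>(
                Arrays.asList(line.split(Pattern.quote(String.valueOf(delim)), -1)));
        if (fields.size() < expectedFields) {
            if (strict) {
                // Strict mode: surface input corruption early instead of masking it.
                throw new IllegalArgumentException(
                        "Expected " + expectedFields + " fields, got " + fields.size());
            }
            while (fields.size() < expectedFields) {
                fields.add(null); // lenient mode: fill in nulls, as PIG-3100 does
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(parseRecord("a:b", ':', 3, false)); // [a, b, null]
        try {
            parseRecord("a:b", ':', 3, true);
        } catch (IllegalArgumentException e) {
            System.out.println("strict mode rejected the record");
        }
    }
}
```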
[jira] [Created] (PIG-3147) Spill failing with "java.lang.RuntimeException: InternalCachedBag.spill() should not be called"
Koji Noguchi created PIG-3147: - Summary: Spill failing with "java.lang.RuntimeException: InternalCachedBag.spill() should not be called" Key: PIG-3147 URL: https://issues.apache.org/jira/browse/PIG-3147 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.11 Reporter: Koji Noguchi Priority: Blocker Tried the 0.11 jar with spilling; my job failed to spill with the following stack trace. Anyone else seeing this? {noformat} java.lang.RuntimeException: InternalCachedBag.spill() should not be called at org.apache.pig.data.InternalCachedBag.spill(InternalCachedBag.java:167) at org.apache.pig.impl.util.SpillableMemoryManager.handleNotification(SpillableMemoryManager.java:243) at sun.management.NotificationEmitterSupport.sendNotification(NotificationEmitterSupport.java:138) at sun.management.MemoryImpl.createNotification(MemoryImpl.java:171) at sun.management.MemoryPoolImpl$PoolSensor.triggerAction(MemoryPoolImpl.java:272) at sun.management.Sensor.trigger(Sensor.java:120) Exception in thread "Low Memory Detector" java.lang.InternalError: Error in invoking listener at sun.management.NotificationEmitterSupport.sendNotification(NotificationEmitterSupport.java:141) at sun.management.MemoryImpl.createNotification(MemoryImpl.java:171) at sun.management.MemoryPoolImpl$PoolSensor.triggerAction(MemoryPoolImpl.java:272) at sun.management.Sensor.trigger(Sensor.java:120) {noformat} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-3147) Spill failing with "java.lang.RuntimeException: InternalCachedBag.spill() should not be called"
[ https://issues.apache.org/jira/browse/PIG-3147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13565698#comment-13565698 ] Koji Noguchi commented on PIG-3147: --- Dumping a stacktrace when an InternalCachedBag is added to SpillableMemoryManager.spillables: {noformat} java.lang.Exception: Stack trace at java.lang.Thread.dumpStack(Thread.java:1206) at org.apache.pig.impl.util.SpillableMemoryManager.registerSpillable(SpillableMemoryManager.java:296) at org.apache.pig.data.DefaultAbstractBag.markSpillableIfNecessary(DefaultAbstractBag.java:101) at org.apache.pig.data.InternalCachedBag.addDone(InternalCachedBag.java:131) at org.apache.pig.data.InternalCachedBag.iterator(InternalCachedBag.java:159) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.getNext(POProject.java:456) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:308) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POFilter.getNext(POFilter.java:95) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:308) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:241) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:308) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POSortedDistinct.getNext(POSortedDistinct.java:62) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:432) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.processInputBag(POProject.java:581) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.PORelationToExprProject.getNext(PORelationToExprProject.java:107) at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:334) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.processInput(POUserFunc.java:228) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:282) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:416) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:348) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:372) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:297) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.runPipeline(PigGenericMapReduce.java:465) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.processOnePackageOutput(PigGenericMapReduce.java:433) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.reduce(PigGenericMapReduce.java:413) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.reduce(PigGenericMapReduce.java:257) at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176) at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:649) at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:417) at org.apache.hadoop.mapred.Child$4.run(Child.java:255) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1093) at org.apache.hadoop.mapred.Child.main(Child.java:249) {noformat} Is this from a change in PIG-2923? 
> Spill failing with "java.lang.RuntimeException: InternalCachedBag.spill() > should not be called" > --- > > Key: PIG-3147 > URL: https://issues.apache.org/jira/browse/PIG-3147 > Project: Pig > Issue Type: Bug > Components: impl >Affects Versions: 0.11 >Reporter: Koji Noguchi >Priority: Blocker > > Tried 0.11 jar with spilling, my job failed to spill with the following stack > trace. Anyone else seeing this? > {noformat} > java.lang.RuntimeException: InternalCachedBag.spill() should not be called > at > o
[jira] [Commented] (PIG-2923) Lazily register bags with SpillableMemoryManager
[ https://issues.apache.org/jira/browse/PIG-2923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13565705#comment-13565705 ] Koji Noguchi commented on PIG-2923: --- Hi Dmitriy, I'm seeing a weird error when pig 0.11 tries to spill. Can this change be related? Opened PIG-3147. > Lazily register bags with SpillableMemoryManager > > > Key: PIG-2923 > URL: https://issues.apache.org/jira/browse/PIG-2923 > Project: Pig > Issue Type: Improvement >Reporter: Dmitriy V. Ryaboy >Assignee: Dmitriy V. Ryaboy > Fix For: 0.11 > > Attachments: bagspill_delayed_register.patch, bagspill_delay.patch > > > Currently, all Spillable DataBags get registered by the BagFactory at the > moment of creation. In practice, a lot of these bags will not get large > enough to be worth spilling; we can avoid a lot of memory overhead and > cheapen the process of finding a bag to spill when we do need it, by allowing > Bags themselves to register when they grow to some respectable threshold. > Related JIRAs: PIG-2917, PIG-2918 -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
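The idea in PIG-2923 can be sketched generically: instead of registering with the memory manager at creation, a bag registers itself (at most once) only when its contents cross a size threshold. This is a hypothetical illustration of the pattern, not the actual Pig DataBag code; names like `registerCallback` are placeholders.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class LazyRegisterBag {
    private final List<Object> contents = new ArrayList<>();
    private final long threshold;
    private final Consumer<LazyRegisterBag> registerCallback;
    private boolean registered = false;

    LazyRegisterBag(long threshold, Consumer<LazyRegisterBag> registerCallback) {
        this.threshold = threshold;
        this.registerCallback = registerCallback;
    }

    void add(Object tuple) {
        contents.add(tuple);
        if (!registered && contents.size() >= threshold) {
            registered = true;              // register at most once
            registerCallback.accept(this);  // stand-in for registering as spillable
        }
    }

    boolean isRegistered() { return registered; }

    // Helper for demonstration: how many times registration fires for N adds.
    static int registrationsAfterAdds(long threshold, int adds) {
        int[] count = {0};
        LazyRegisterBag bag = new LazyRegisterBag(threshold, b -> count[0]++);
        for (int i = 0; i < adds; i++) bag.add("tuple");
        return count[0];
    }

    public static void main(String[] args) {
        System.out.println(registrationsAfterAdds(3, 2));  // 0: never crossed the threshold
        System.out.println(registrationsAfterAdds(3, 10)); // 1: registered once, at the threshold
    }
}
```

Small bags that never reach the threshold never touch the manager at all, which is where the memory and scan-time savings come from.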
[jira] [Updated] (PIG-3147) Spill failing with "java.lang.RuntimeException: InternalCachedBag.spill() should not be called"
[ https://issues.apache.org/jira/browse/PIG-3147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Noguchi updated PIG-3147: -- Attachment: pig-3147-v01.txt Reading PIG-975, InternalCachedBag should not register with SpillableMemoryManager. This is just a pure guess, but uploading a patch that takes out markSpillableIfNecessary from InternalCachedBag.java. > Spill failing with "java.lang.RuntimeException: InternalCachedBag.spill() > should not be called" > --- > > Key: PIG-3147 > URL: https://issues.apache.org/jira/browse/PIG-3147 > Project: Pig > Issue Type: Bug > Components: impl >Affects Versions: 0.11 >Reporter: Koji Noguchi >Priority: Blocker > Attachments: pig-3147-v01.txt > > > Tried 0.11 jar with spilling, my job failed to spill with the following stack > trace. Anyone else seeing this? > {noformat} > java.lang.RuntimeException: InternalCachedBag.spill() should not be called > at > org.apache.pig.data.InternalCachedBag.spill(InternalCachedBag.java:167) > at > org.apache.pig.impl.util.SpillableMemoryManager.handleNotification(SpillableMemoryManager.java:243) > at > sun.management.NotificationEmitterSupport.sendNotification(NotificationEmitterSupport.java:138) > at sun.management.MemoryImpl.createNotification(MemoryImpl.java:171) > at > sun.management.MemoryPoolImpl$PoolSensor.triggerAction(MemoryPoolImpl.java:272) > at sun.management.Sensor.trigger(Sensor.java:120) > Exception in thread "Low Memory Detector" java.lang.InternalError: Error in > invoking listener > at > sun.management.NotificationEmitterSupport.sendNotification(NotificationEmitterSupport.java:141) > at sun.management.MemoryImpl.createNotification(MemoryImpl.java:171) > at > sun.management.MemoryPoolImpl$PoolSensor.triggerAction(MemoryPoolImpl.java:272) > at sun.management.Sensor.trigger(Sensor.java:120) > {noformat} -- This message is automatically generated by JIRA. 
If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (PIG-3148) OutOfMemory exception while spilling stale DefaultDataBag. Extra option to gc() before spilling large bag.
Koji Noguchi created PIG-3148: - Summary: OutOfMemory exception while spilling stale DefaultDataBag. Extra option to gc() before spilling large bag. Key: PIG-3148 URL: https://issues.apache.org/jira/browse/PIG-3148 Project: Pig Issue Type: Improvement Components: impl Reporter: Koji Noguchi Assignee: Koji Noguchi Our user reported that one of their jobs in pig 0.10 occasionally failed with 'Error: GC overhead limit exceeded' or 'Error: Java heap space', but rerunning it sometimes finished successfully. For a 1G-heap reducer, the heap dump showed two huge DefaultDataBags of 300-400 MB each when it failed with OOM. Jstack at the time of OOM always showed that spill was running. {noformat} "Low Memory Detector" daemon prio=10 tid=0xb9c11800 nid=0xa52 runnable [0xb9afc000] java.lang.Thread.State: RUNNABLE at java.io.FileOutputStream.writeBytes(Native Method) at java.io.FileOutputStream.write(FileOutputStream.java:260) at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65) at java.io.BufferedOutputStream.write(BufferedOutputStream.java:109) - locked <0xe57c6390> (a java.io.BufferedOutputStream) at java.io.DataOutputStream.write(DataOutputStream.java:90) - locked <0xe57c60b8> (a java.io.DataOutputStream) at java.io.FilterOutputStream.write(FilterOutputStream.java:80) at org.apache.pig.data.utils.SedesHelper.writeBytes(SedesHelper.java:46) at org.apache.pig.data.BinInterSedes.writeDatum(BinInterSedes.java:537) at org.apache.pig.data.BinInterSedes.writeDatum(BinInterSedes.java:435) at org.apache.pig.data.utils.SedesHelper.writeGenericTuple(SedesHelper.java:135) at org.apache.pig.data.BinInterSedes.writeTuple(BinInterSedes.java:613) at org.apache.pig.data.BinInterSedes.writeDatum(BinInterSedes.java:443) at org.apache.pig.data.DefaultDataBag.spill(DefaultDataBag.java:106) - locked <0xceb16190> (a java.util.ArrayList) at org.apache.pig.impl.util.SpillableMemoryManager.handleNotification(SpillableMemoryManager.java:243) - locked <0xbeb86318> (a 
java.util.LinkedList) at sun.management.NotificationEmitterSupport.sendNotification(NotificationEmitterSupport.java:138) at sun.management.MemoryImpl.createNotification(MemoryImpl.java:171) at sun.management.MemoryPoolImpl$PoolSensor.triggerAction(MemoryPoolImpl.java:272) at sun.management.Sensor.trigger(Sensor.java:120) {noformat} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-3148) OutOfMemory exception while spilling stale DefaultDataBag. Extra option to gc() before spilling large bag.
[ https://issues.apache.org/jira/browse/PIG-3148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13565975#comment-13565975 ] Koji Noguchi commented on PIG-3148: --- I cannot attach the original query, but to give you an idea: {noformat} A = LOAD '$INPUT' USING MyLoader('\u0001') AS ( val1, val2, val3, val4, val5, val6, val7, val8, val9, val10, val11, val12, val13, val14, val15, val16); B = GROUP A BY (val1, val2, val3, val8) PARALLEL $NUM_REDUCERS; C = FOREACH B { D = FILTER A BY (val3 == 'status1' AND val5 == 'status2'); E = D.val4; F = DISTINCT E; G = FILTER D BY val7 == 'status3'; GENERATE group.val1, group.val2, group.val8, COUNT(F), COUNT(G), SUM(G.val9), SUM(G.val10), SUM(G.val11), SUM(A.val12), SUM(A.val13), SUM(A.val15), SUM(A.val14), SUM(A.val16); } STORE C INTO '$OUTPUT' USING PigStorage('\u0001'); {noformat} Assuming that this script does not require two huge DefaultDataBags, I looked into SpillableMemoryManager.handleNotification. 'handleNotification' is called whenever a certain memory condition is met, but not necessarily after a gc(). What was happening in this user's case was: (i) the 400MB DefaultDataBag#1 goes stale. (ii) SpillableMemoryManager.handleNotification is triggered. (iii) Since gc() has not been called yet, the WeakReference is still valid and pig decides to spill, holding the lock on DefaultDataBag#1.mContents (an ArrayList). (iv) While the reduce task is working on another 400MB DefaultDataBag#2, the jvm heap gets full and gc is called. Even though no one is using the stale DefaultDataBag#1, it cannot be GC-ed since the spill is holding the lock. As a result, we end up with two DefaultDataBags, leading to OOM. > OutOfMemory exception while spilling stale DefaultDataBag. Extra option to > gc() before spilling large bag. 
> -- > > Key: PIG-3148 > URL: https://issues.apache.org/jira/browse/PIG-3148 > Project: Pig > Issue Type: Improvement > Components: impl >Reporter: Koji Noguchi >Assignee: Koji Noguchi > > Our user reported that one of their jobs in pig 0.10 occasionally failed with > 'Error: GC overhead limit exceeded' or 'Error: Java heap space', but > rerunning it sometimes finishes successfully. > For 1G heap reducer, heap dump showed it contained two huge DefaultDataBag > with 300-400MBytes each when failing with OOM. > Jstack at the time of OOM always showed that spill was running. > {noformat} > "Low Memory Detector" daemon prio=10 tid=0xb9c11800 nid=0xa52 runnable > [0xb9afc000] >java.lang.Thread.State: RUNNABLE > at java.io.FileOutputStream.writeBytes(Native Method) > at java.io.FileOutputStream.write(FileOutputStream.java:260) > at > java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65) > at java.io.BufferedOutputStream.write(BufferedOutputStream.java:109) > - locked <0xe57c6390> (a java.io.BufferedOutputStream) > at java.io.DataOutputStream.write(DataOutputStream.java:90) > - locked <0xe57c60b8> (a java.io.DataOutputStream) > at java.io.FilterOutputStream.write(FilterOutputStream.java:80) > at org.apache.pig.data.utils.SedesHelper.writeBytes(SedesHelper.java:46) > at org.apache.pig.data.BinInterSedes.writeDatum(BinInterSedes.java:537) > at org.apache.pig.data.BinInterSedes.writeDatum(BinInterSedes.java:435) > at > org.apache.pig.data.utils.SedesHelper.writeGenericTuple(SedesHelper.java:135) > at org.apache.pig.data.BinInterSedes.writeTuple(BinInterSedes.java:613) > at org.apache.pig.data.BinInterSedes.writeDatum(BinInterSedes.java:443) > at org.apache.pig.data.DefaultDataBag.spill(DefaultDataBag.java:106) > - locked <0xceb16190> (a java.util.ArrayList) > at > org.apache.pig.impl.util.SpillableMemoryManager.handleNotification(SpillableMemoryManager.java:243) > - locked <0xbeb86318> (a java.util.LinkedList) > at > 
sun.management.NotificationEmitterSupport.sendNotification(NotificationEmitterSupport.java:138) > at sun.management.MemoryImpl.createNotification(MemoryImpl.java:171) > at > sun.management.MemoryPoolImpl$PoolSensor.triggerAction(MemoryPoolImpl.java:272) > at sun.management.Sensor.trigger(Sensor.java:120) > {noformat} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
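The role of the WeakReference in step (iii) of the comment above can be shown in isolation: a weakly referenced object is not cleared until a collection actually runs, so code polling WeakReferences between GCs still sees a stale object as live. A minimal sketch (GC timing is JVM-dependent, so the post-gc print is only indicative, not guaranteed):

```java
import java.lang.ref.WeakReference;

public class StaleWeakRef {
    public static void main(String[] args) {
        Object bag = new byte[1024]; // stands in for a stale DefaultDataBag
        WeakReference<Object> ref = new WeakReference<>(bag);

        bag = null; // the bag is now stale: no strong references remain
        // ...but until a GC cycle runs, the WeakReference still reports it live.
        // That is exactly the window in which handleNotification decides to spill it.
        System.out.println("before gc, still visible: " + (ref.get() != null));

        System.gc(); // only a hint; clearing is not guaranteed on every JVM
        System.out.println("after gc, cleared: " + (ref.get() == null));
    }
}
```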
[jira] [Updated] (PIG-3148) OutOfMemory exception while spilling stale DefaultDataBag. Extra option to gc() before spilling large bag.
[ https://issues.apache.org/jira/browse/PIG-3148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Noguchi updated PIG-3148: -- Attachment: pig-3148-v01.patch Uploading a patch that adds a feature to call System.gc() before spilling when the Spillable is bigger than 'pig.spill.extragc.size.threshold'. This extra gc() is called at most once per handleNotification and is disabled by default, since adding a GC risks changing performance drastically. For the job I was looking at, adding '-Dpig.spill.extragc.size.threshold=1' let the job run successfully with no OOM errors. (Note: a separate spill issue on 0.11, predating this patch, is tracked at PIG-3147.) > OutOfMemory exception while spilling stale DefaultDataBag. Extra option to > gc() before spilling large bag. > -- > > Key: PIG-3148 > URL: https://issues.apache.org/jira/browse/PIG-3148 > Project: Pig > Issue Type: Improvement > Components: impl >Reporter: Koji Noguchi >Assignee: Koji Noguchi > Attachments: pig-3148-v01.patch > > > Our user reported that one of their jobs in pig 0.10 occasionally failed with > 'Error: GC overhead limit exceeded' or 'Error: Java heap space', but > rerunning it sometimes finishes successfully. > For 1G heap reducer, heap dump showed it contained two huge DefaultDataBag > with 300-400MBytes each when failing with OOM. > Jstack at the time of OOM always showed that spill was running. 
> {noformat} > "Low Memory Detector" daemon prio=10 tid=0xb9c11800 nid=0xa52 runnable > [0xb9afc000] >java.lang.Thread.State: RUNNABLE > at java.io.FileOutputStream.writeBytes(Native Method) > at java.io.FileOutputStream.write(FileOutputStream.java:260) > at > java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65) > at java.io.BufferedOutputStream.write(BufferedOutputStream.java:109) > - locked <0xe57c6390> (a java.io.BufferedOutputStream) > at java.io.DataOutputStream.write(DataOutputStream.java:90) > - locked <0xe57c60b8> (a java.io.DataOutputStream) > at java.io.FilterOutputStream.write(FilterOutputStream.java:80) > at org.apache.pig.data.utils.SedesHelper.writeBytes(SedesHelper.java:46) > at org.apache.pig.data.BinInterSedes.writeDatum(BinInterSedes.java:537) > at org.apache.pig.data.BinInterSedes.writeDatum(BinInterSedes.java:435) > at > org.apache.pig.data.utils.SedesHelper.writeGenericTuple(SedesHelper.java:135) > at org.apache.pig.data.BinInterSedes.writeTuple(BinInterSedes.java:613) > at org.apache.pig.data.BinInterSedes.writeDatum(BinInterSedes.java:443) > at org.apache.pig.data.DefaultDataBag.spill(DefaultDataBag.java:106) > - locked <0xceb16190> (a java.util.ArrayList) > at > org.apache.pig.impl.util.SpillableMemoryManager.handleNotification(SpillableMemoryManager.java:243) > - locked <0xbeb86318> (a java.util.LinkedList) > at > sun.management.NotificationEmitterSupport.sendNotification(NotificationEmitterSupport.java:138) > at sun.management.MemoryImpl.createNotification(MemoryImpl.java:171) > at > sun.management.MemoryPoolImpl$PoolSensor.triggerAction(MemoryPoolImpl.java:272) > at sun.management.Sensor.trigger(Sensor.java:120) > {noformat} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
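The option described in the update above can be sketched as a small policy object. This is a hypothetical illustration inferred from the comment, not the actual patch: the class name, the assumption that a non-positive threshold disables the feature, and the once-per-notification reset are all assumptions.

```java
public class ExtraGcPolicy {
    // Sketch of the pig-3148 idea: gc() before spilling only for large bags,
    // at most once per handleNotification, disabled when the threshold is unset.
    private final long sizeThreshold; // e.g. pig.spill.extragc.size.threshold; <=0 disables
    private boolean gcCalledThisNotification = false;

    ExtraGcPolicy(long sizeThreshold) {
        this.sizeThreshold = sizeThreshold;
    }

    // Reset at the start of each memory notification.
    void startNotification() {
        gcCalledThisNotification = false;
    }

    // Returns true when an extra System.gc() should run before spilling this bag.
    boolean shouldGcBeforeSpill(long spillableSize) {
        if (sizeThreshold <= 0 || gcCalledThisNotification || spillableSize < sizeThreshold) {
            return false;
        }
        gcCalledThisNotification = true;
        return true;
    }

    public static void main(String[] args) {
        ExtraGcPolicy p = new ExtraGcPolicy(100);
        p.startNotification();
        System.out.println(p.shouldGcBeforeSpill(50));  // false: below threshold
        System.out.println(p.shouldGcBeforeSpill(200)); // true: first large bag this notification
        System.out.println(p.shouldGcBeforeSpill(300)); // false: at most one gc per notification
    }
}
```

The point of the gc() is to clear stale bags (like DefaultDataBag#1 in the PIG-3148 comment) before spending I/O spilling them while holding their locks.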
[jira] [Created] (PIG-3178) Print a stacktrace when ExecutableManager hits an OOM
Koji Noguchi created PIG-3178: - Summary: Print a stacktrace when ExecutableManager hits an OOM Key: PIG-3178 URL: https://issues.apache.org/jira/browse/PIG-3178 Project: Pig Issue Type: Improvement Reporter: Koji Noguchi Assignee: Koji Noguchi Priority: Trivial When looking at a user's pig streaming job failing with OOM, the log only showed 2013-02-09 03:35:08,694 ERROR [Thread-14] org.apache.pig.impl.streaming.ExecutableManager: java.lang.OutOfMemoryError: Java heap space It would have been nice if it also showed the stack trace. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-3178) Print a stacktrace when ExecutableManager hits an OOM
[ https://issues.apache.org/jira/browse/PIG-3178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Noguchi updated PIG-3178: -- Attachment: pig-3178-trunk-v01.patch Adding printStackTrace call. Since it's only a logging change, no test is being added. > Print a stacktrace when ExecutableManager hits an OOM > - > > Key: PIG-3178 > URL: https://issues.apache.org/jira/browse/PIG-3178 > Project: Pig > Issue Type: Improvement >Reporter: Koji Noguchi >Assignee: Koji Noguchi >Priority: Trivial > Attachments: pig-3178-trunk-v01.patch > > > When looking at user's pig streaming failing with OOM, it only showed > 2013-02-09 03:35:08,694 ERROR [Thread-14] > org.apache.pig.impl.streaming.ExecutableManager: java.lang.OutOfMemoryError: > Java heap space > It would have been nice if it also showed the stack trace. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-3178) Print a stacktrace when ExecutableManager hits an OOM
[ https://issues.apache.org/jira/browse/PIG-3178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Noguchi updated PIG-3178: -- Status: Patch Available (was: Open) > Print a stacktrace when ExecutableManager hits an OOM > - > > Key: PIG-3178 > URL: https://issues.apache.org/jira/browse/PIG-3178 > Project: Pig > Issue Type: Improvement >Reporter: Koji Noguchi >Assignee: Koji Noguchi >Priority: Trivial > Attachments: pig-3178-trunk-v01.patch > > > When looking at user's pig streaming failing with OOM, it only showed > 2013-02-09 03:35:08,694 ERROR [Thread-14] > org.apache.pig.impl.streaming.ExecutableManager: java.lang.OutOfMemoryError: > Java heap space > It would have been nice if it also showed the stack trace. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-3178) Print a stacktrace when ExecutableManager hits an OOM
[ https://issues.apache.org/jira/browse/PIG-3178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13575869#comment-13575869 ] Koji Noguchi commented on PIG-3178: --- bq. Doesn't LOG.error(t); log the stacktrace? I thought it only prints out the cause. After adding the printStackTrace call, log showed {noformat} java.lang.OutOfMemoryError: Java heap space at java.util.Arrays.copyOf(Arrays.java:2271) at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113) at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93) at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:122) at org.apache.pig.builtin.PigStreaming.serialize(PigStreaming.java:76) at org.apache.pig.impl.streaming.InputHandler.putNext(InputHandler.java:66) at org.apache.pig.impl.streaming.ExecutableManager$ProcessInputThread.run(ExecutableManager.java:367) {noformat} which helped me identify the killer record filling up the heap. > Print a stacktrace when ExecutableManager hits an OOM > - > > Key: PIG-3178 > URL: https://issues.apache.org/jira/browse/PIG-3178 > Project: Pig > Issue Type: Improvement >Reporter: Koji Noguchi >Assignee: Koji Noguchi >Priority: Trivial > Attachments: pig-3178-trunk-v01.patch > > > When looking at user's pig streaming failing with OOM, it only showed > 2013-02-09 03:35:08,694 ERROR [Thread-14] > org.apache.pig.impl.streaming.ExecutableManager: java.lang.OutOfMemoryError: > Java heap space > It would have been nice if it also showed the stack trace. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-3178) Print a stacktrace when ExecutableManager hits an OOM
[ https://issues.apache.org/jira/browse/PIG-3178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Noguchi updated PIG-3178: -- Status: Open (was: Patch Available) > Print a stacktrace when ExecutableManager hits an OOM > - > > Key: PIG-3178 > URL: https://issues.apache.org/jira/browse/PIG-3178 > Project: Pig > Issue Type: Improvement >Reporter: Koji Noguchi >Assignee: Koji Noguchi >Priority: Trivial > Attachments: pig-3178-trunk-v01.patch > > > When looking at user's pig streaming failing with OOM, it only showed > 2013-02-09 03:35:08,694 ERROR [Thread-14] > org.apache.pig.impl.streaming.ExecutableManager: java.lang.OutOfMemoryError: > Java heap space > It would have been nice if it also showed the stack trace. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-3178) Print a stacktrace when ExecutableManager hits an OOM
[ https://issues.apache.org/jira/browse/PIG-3178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Noguchi updated PIG-3178: -- Attachment: pig-3178-trunk-v02.patch bq. Can you just add a message - LOG.error("Error running blah blah", t); - so that the stacktrace gets logged. Ah, i see. Uploading. Confirmed that this also logs the stacktrace. Now in syslog, even better. > Print a stacktrace when ExecutableManager hits an OOM > - > > Key: PIG-3178 > URL: https://issues.apache.org/jira/browse/PIG-3178 > Project: Pig > Issue Type: Improvement >Reporter: Koji Noguchi >Assignee: Koji Noguchi >Priority: Trivial > Attachments: pig-3178-trunk-v01.patch, pig-3178-trunk-v02.patch > > > When looking at user's pig streaming failing with OOM, it only showed > 2013-02-09 03:35:08,694 ERROR [Thread-14] > org.apache.pig.impl.streaming.ExecutableManager: java.lang.OutOfMemoryError: > Java heap space > It would have been nice if it also showed the stack trace. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
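The distinction discussed in this thread can be illustrated with nothing but the JDK (a sketch, not the Pig patch itself: the real code uses commons-logging, but the underlying difference is a throwable's toString() versus its stack frames):

```java
import java.io.PrintWriter;
import java.io.StringWriter;

public class ThrowableLogDemo {
    public static void main(String[] args) {
        Throwable t = new OutOfMemoryError("Java heap space");

        // Roughly what LOG.error(t) printed: only class name + message.
        String causeOnly = t.toString();

        // What printStackTrace (or a two-argument LOG.error("msg", t)) adds:
        // the "\tat ..." frames identifying where the OOM was thrown.
        StringWriter sw = new StringWriter();
        t.printStackTrace(new PrintWriter(sw));
        String withTrace = sw.toString();

        System.out.println(causeOnly.contains("\tat "));  // false: no frames
        System.out.println(withTrace.contains("\tat "));  // true: frames present
    }
}
```

This is why the v02 patch's two-argument call gets the frames into the log where the one-argument form did not.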
[jira] [Updated] (PIG-3178) Print a stacktrace when ExecutableManager hits an OOM
[ https://issues.apache.org/jira/browse/PIG-3178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Noguchi updated PIG-3178: -- Status: Patch Available (was: Open) > Print a stacktrace when ExecutableManager hits an OOM > - > > Key: PIG-3178 > URL: https://issues.apache.org/jira/browse/PIG-3178 > Project: Pig > Issue Type: Improvement >Reporter: Koji Noguchi >Assignee: Koji Noguchi >Priority: Trivial > Attachments: pig-3178-trunk-v01.patch, pig-3178-trunk-v02.patch > > > When looking at user's pig streaming failing with OOM, it only showed > 2013-02-09 03:35:08,694 ERROR [Thread-14] > org.apache.pig.impl.streaming.ExecutableManager: java.lang.OutOfMemoryError: > Java heap space > It would have been nice if it also showed the stack trace. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-3178) Print a stacktrace when ExecutableManager hits an OOM
[ https://issues.apache.org/jira/browse/PIG-3178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Noguchi updated PIG-3178: -- Attachment: pig-3178-trunk-v03.patch Sorry for the spam. Rohini pointed out that the last patch was based on 0.10 and not trunk. Re-uploading. > Print a stacktrace when ExecutableManager hits an OOM > - > > Key: PIG-3178 > URL: https://issues.apache.org/jira/browse/PIG-3178 > Project: Pig > Issue Type: Improvement >Reporter: Koji Noguchi >Assignee: Koji Noguchi >Priority: Trivial > Attachments: pig-3178-trunk-v01.patch, pig-3178-trunk-v02.patch, > pig-3178-trunk-v03.patch > > > When looking at user's pig streaming failing with OOM, it only showed > 2013-02-09 03:35:08,694 ERROR [Thread-14] > org.apache.pig.impl.streaming.ExecutableManager: java.lang.OutOfMemoryError: > Java heap space > It would have been nice if it also showed the stack trace. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (PIG-3179) Task Information Header only prints out the first split for each task
Koji Noguchi created PIG-3179: - Summary: Task Information Header only prints out the first split for each task Key: PIG-3179 URL: https://issues.apache.org/jira/browse/PIG-3179 Project: Pig Issue Type: Improvement Reporter: Koji Noguchi Assignee: Koji Noguchi Priority: Trivial When a task's PigSplit contains more than one wrappedSplit, it only logs the first split's file info. When debugging, I saw {noformat} = Task Information Header = Command: bash Start time: Mon Feb 11 16:41:21 UTC 2013 Input-split file: hdfs://abc.bcd.efg:8020/tmp/hij/part-r-0.bz2 Input-split start-offset: 0 Input-split length: 11854247 {noformat} but the actual error was happening while reading part-r-7.bz2. It would have been nice if the log showed all the inputs that the task was going to read. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-3179) Task Information Header only prints out the first split for each task
[ https://issues.apache.org/jira/browse/PIG-3179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Noguchi updated PIG-3179: -- Attachment: pig-3179-v01.patch Added for-loop to print all the splits. > Task Information Header only prints out the first split for each task > - > > Key: PIG-3179 > URL: https://issues.apache.org/jira/browse/PIG-3179 > Project: Pig > Issue Type: Improvement >Reporter: Koji Noguchi >Assignee: Koji Noguchi >Priority: Trivial > Attachments: pig-3179-v01.patch > > > When a task's PigSplit is containing more than wrappedSplit, it only logs the > first fileinfo. > When debugging, I saw > {noformat} > = Task Information Header = > Command: bash > Start time: Mon Feb 11 16:41:21 UTC 2013 > Input-split file: hdfs://abc.bcd.efg:8020/tmp/hij/part-r-0.bz2 > Input-split start-offset: 0Input-split length: 11854247 > {noformat} > but the actual error was happing while reading part-r-7.bz2. It would > have been nice if the log showed all the info that task was going to read. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-3179) Task Information Header only prints out the first split for each task
[ https://issues.apache.org/jira/browse/PIG-3179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Noguchi updated PIG-3179: -- Attachment: pig-3179-v02.patch Changed based on Rohini's suggestion. Added extra line printing out the number of input splits. {noformat} PigSplit contains 11 wrappedSplits. Input-split: file=hdfs://abc.def.com:8020/tmp/hij/part-r-00032.bz2 start-offset=0 length=11814548 Input-split: file=hdfs://abc.def.com:8020/tmp/hij/part-r-00033.bz2 start-offset=0 length=11953088 Input-split: file=hdfs://abc.def.com:8020/tmp/hij/part-r-00034.bz2 start-offset=0 length=12122182 Input-split: file=hdfs://abc.def... ... {noformat} > Task Information Header only prints out the first split for each task > - > > Key: PIG-3179 > URL: https://issues.apache.org/jira/browse/PIG-3179 > Project: Pig > Issue Type: Improvement >Reporter: Koji Noguchi >Assignee: Koji Noguchi >Priority: Trivial > Attachments: pig-3179-v01.patch, pig-3179-v02.patch > > > When a task's PigSplit is containing more than wrappedSplit, it only logs the > first fileinfo. > When debugging, I saw > {noformat} > = Task Information Header = > Command: bash > Start time: Mon Feb 11 16:41:21 UTC 2013 > Input-split file: hdfs://abc.bcd.efg:8020/tmp/hij/part-r-0.bz2 > Input-split start-offset: 0Input-split length: 11854247 > {noformat} > but the actual error was happing while reading part-r-7.bz2. It would > have been nice if the log showed all the info that task was going to read. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-3179) Task Information Header only prints out the first split for each task
[ https://issues.apache.org/jira/browse/PIG-3179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Noguchi updated PIG-3179: -- Attachment: pig-3179-v03.patch bq. Hi Koji Noguchi, minor thing - calling toString on Path might be redundant? Thanks Prashant. Updated patch. > Task Information Header only prints out the first split for each task > - > > Key: PIG-3179 > URL: https://issues.apache.org/jira/browse/PIG-3179 > Project: Pig > Issue Type: Improvement >Reporter: Koji Noguchi >Assignee: Koji Noguchi >Priority: Trivial > Attachments: pig-3179-v01.patch, pig-3179-v02.patch, > pig-3179-v03.patch > > > When a task's PigSplit is containing more than wrappedSplit, it only logs the > first fileinfo. > When debugging, I saw > {noformat} > = Task Information Header = > Command: bash > Start time: Mon Feb 11 16:41:21 UTC 2013 > Input-split file: hdfs://abc.bcd.efg:8020/tmp/hij/part-r-0.bz2 > Input-split start-offset: 0Input-split length: 11854247 > {noformat} > but the actual error was happing while reading part-r-7.bz2. It would > have been nice if the log showed all the info that task was going to read. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-3179) Task Information Header only prints out the first split for each task
[ https://issues.apache.org/jira/browse/PIG-3179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Noguchi updated PIG-3179: -- Attachment: pig-3179-v04.patch bq. Can you reuse the StringBuilder. i.e move StringBuilder sb = new StringBuilder(); outside of the loop and inside the loop set sb.setLength(0); Attaching an updated patch. > Task Information Header only prints out the first split for each task > - > > Key: PIG-3179 > URL: https://issues.apache.org/jira/browse/PIG-3179 > Project: Pig > Issue Type: Improvement >Reporter: Koji Noguchi >Assignee: Koji Noguchi >Priority: Trivial > Attachments: pig-3179-v01.patch, pig-3179-v02.patch, > pig-3179-v03.patch, pig-3179-v04.patch > > > When a task's PigSplit is containing more than wrappedSplit, it only logs the > first fileinfo. > When debugging, I saw > {noformat} > = Task Information Header = > Command: bash > Start time: Mon Feb 11 16:41:21 UTC 2013 > Input-split file: hdfs://abc.bcd.efg:8020/tmp/hij/part-r-0.bz2 > Input-split start-offset: 0Input-split length: 11854247 > {noformat} > but the actual error was happing while reading part-r-7.bz2. It would > have been nice if the log showed all the info that task was going to read. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
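The pattern agreed on in this review (loop over every wrapped split, reusing one StringBuilder via setLength(0) instead of allocating a new one per iteration) can be sketched as below. This is an illustration only, not the attached patch: SplitInfo is a made-up stand-in for the real split metadata.

```java
import java.util.Arrays;
import java.util.List;

public class SplitHeaderDemo {
    // Hypothetical stand-in for the per-split info the header logs.
    public static final class SplitInfo {
        public final String file;
        public final long start;
        public final long length;
        public SplitInfo(String file, long start, long length) {
            this.file = file; this.start = start; this.length = length;
        }
    }

    public static String header(List<SplitInfo> wrappedSplits) {
        StringBuilder out = new StringBuilder();
        out.append("PigSplit contains ").append(wrappedSplits.size())
           .append(" wrappedSplits.\n");
        StringBuilder sb = new StringBuilder();   // reused across iterations
        for (SplitInfo s : wrappedSplits) {
            sb.setLength(0);                      // reset instead of reallocating
            sb.append("Input-split: file=").append(s.file)
              .append(" start-offset=").append(s.start)
              .append(" length=").append(s.length);
            out.append(sb).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<SplitInfo> splits = Arrays.asList(
            new SplitInfo("hdfs://abc.def.com:8020/tmp/hij/part-r-00032.bz2", 0, 11814548),
            new SplitInfo("hdfs://abc.def.com:8020/tmp/hij/part-r-00033.bz2", 0, 11953088));
        System.out.print(header(splits));
    }
}
```

With this shape, the header lists every input the task will read, matching the sample output in the v02 comment above.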
[jira] [Commented] (PIG-3148) OutOfMemory exception while spilling stale DefaultDataBag. Extra option to gc() before spilling large bag.
[ https://issues.apache.org/jira/browse/PIG-3148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13577743#comment-13577743 ] Koji Noguchi commented on PIG-3148: --- Rohini asked me to clarify why I'm adding an extra param instead of simply calling gc() at the top of handleNotification(). The reason I added an extra param is: * When I tried just adding gc() at the top, suddenly I saw all of my mappers stuck, spending 99% of cputime on gc. I then learned that handleNotification is called much more frequently than I first anticipated when the application is using more than the threshold and has nothing much to spill. That convinced me to add an extra condition to reduce the gc() calls. * The motivation of my patch is to avoid OutOfMemory when the application is holding a reference to a large stale bag while spilling unnecessarily. For that, the bag being spilled has to be large in proportion to the heap size of the application to cause OOM. > OutOfMemory exception while spilling stale DefaultDataBag. Extra option to > gc() before spilling large bag. > -- > > Key: PIG-3148 > URL: https://issues.apache.org/jira/browse/PIG-3148 > Project: Pig > Issue Type: Improvement > Components: impl >Reporter: Koji Noguchi >Assignee: Koji Noguchi > Attachments: pig-3148-v01.patch > > > Our user reported that one of their jobs in pig 0.10 occasionally failed with > 'Error: GC overhead limit exceeded' or 'Error: Java heap space', but > rerunning it sometimes finishes successfully. > For 1G heap reducer, heap dump showed it contained two huge DefaultDataBag > with 300-400MBytes each when failing with OOM. > Jstack at the time of OOM always showed that spill was running.
> {noformat} > "Low Memory Detector" daemon prio=10 tid=0xb9c11800 nid=0xa52 runnable > [0xb9afc000] >java.lang.Thread.State: RUNNABLE > at java.io.FileOutputStream.writeBytes(Native Method) > at java.io.FileOutputStream.write(FileOutputStream.java:260) > at > java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65) > at java.io.BufferedOutputStream.write(BufferedOutputStream.java:109) > - locked <0xe57c6390> (a java.io.BufferedOutputStream) > at java.io.DataOutputStream.write(DataOutputStream.java:90) > - locked <0xe57c60b8> (a java.io.DataOutputStream) > at java.io.FilterOutputStream.write(FilterOutputStream.java:80) > at org.apache.pig.data.utils.SedesHelper.writeBytes(SedesHelper.java:46) > at org.apache.pig.data.BinInterSedes.writeDatum(BinInterSedes.java:537) > at org.apache.pig.data.BinInterSedes.writeDatum(BinInterSedes.java:435) > at > org.apache.pig.data.utils.SedesHelper.writeGenericTuple(SedesHelper.java:135) > at org.apache.pig.data.BinInterSedes.writeTuple(BinInterSedes.java:613) > at org.apache.pig.data.BinInterSedes.writeDatum(BinInterSedes.java:443) > at org.apache.pig.data.DefaultDataBag.spill(DefaultDataBag.java:106) > - locked <0xceb16190> (a java.util.ArrayList) > at > org.apache.pig.impl.util.SpillableMemoryManager.handleNotification(SpillableMemoryManager.java:243) > - locked <0xbeb86318> (a java.util.LinkedList) > at > sun.management.NotificationEmitterSupport.sendNotification(NotificationEmitterSupport.java:138) > at sun.management.MemoryImpl.createNotification(MemoryImpl.java:171) > at > sun.management.MemoryPoolImpl$PoolSensor.triggerAction(MemoryPoolImpl.java:272) > at sun.management.Sensor.trigger(Sensor.java:120) > {noformat} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-3148) OutOfMemory exception while spilling stale DefaultDataBag. Extra option to gc() before spilling large bag.
[ https://issues.apache.org/jira/browse/PIG-3148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13578507#comment-13578507 ] Koji Noguchi commented on PIG-3148: --- Thanks Dmitriy, Rohini! I like the fixed ratio suggestion. Is 5% ok? Maybe 10%? Also, do we still want a configurable flag to enable this feature? > OutOfMemory exception while spilling stale DefaultDataBag. Extra option to > gc() before spilling large bag. > -- > > Key: PIG-3148 > URL: https://issues.apache.org/jira/browse/PIG-3148 > Project: Pig > Issue Type: Improvement > Components: impl >Reporter: Koji Noguchi >Assignee: Koji Noguchi > Attachments: pig-3148-v01.patch > > > Our user reported that one of their jobs in pig 0.10 occasionally failed with > 'Error: GC overhead limit exceeded' or 'Error: Java heap space', but > rerunning it sometimes finishes successfully. > For 1G heap reducer, heap dump showed it contained two huge DefaultDataBag > with 300-400MBytes each when failing with OOM. > Jstack at the time of OOM always showed that spill was running. 
> {noformat} > "Low Memory Detector" daemon prio=10 tid=0xb9c11800 nid=0xa52 runnable > [0xb9afc000] >java.lang.Thread.State: RUNNABLE > at java.io.FileOutputStream.writeBytes(Native Method) > at java.io.FileOutputStream.write(FileOutputStream.java:260) > at > java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65) > at java.io.BufferedOutputStream.write(BufferedOutputStream.java:109) > - locked <0xe57c6390> (a java.io.BufferedOutputStream) > at java.io.DataOutputStream.write(DataOutputStream.java:90) > - locked <0xe57c60b8> (a java.io.DataOutputStream) > at java.io.FilterOutputStream.write(FilterOutputStream.java:80) > at org.apache.pig.data.utils.SedesHelper.writeBytes(SedesHelper.java:46) > at org.apache.pig.data.BinInterSedes.writeDatum(BinInterSedes.java:537) > at org.apache.pig.data.BinInterSedes.writeDatum(BinInterSedes.java:435) > at > org.apache.pig.data.utils.SedesHelper.writeGenericTuple(SedesHelper.java:135) > at org.apache.pig.data.BinInterSedes.writeTuple(BinInterSedes.java:613) > at org.apache.pig.data.BinInterSedes.writeDatum(BinInterSedes.java:443) > at org.apache.pig.data.DefaultDataBag.spill(DefaultDataBag.java:106) > - locked <0xceb16190> (a java.util.ArrayList) > at > org.apache.pig.impl.util.SpillableMemoryManager.handleNotification(SpillableMemoryManager.java:243) > - locked <0xbeb86318> (a java.util.LinkedList) > at > sun.management.NotificationEmitterSupport.sendNotification(NotificationEmitterSupport.java:138) > at sun.management.MemoryImpl.createNotification(MemoryImpl.java:171) > at > sun.management.MemoryPoolImpl$PoolSensor.triggerAction(MemoryPoolImpl.java:272) > at sun.management.Sensor.trigger(Sensor.java:120) > {noformat} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
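The fixed-ratio idea discussed above can be sketched as a simple gate (hypothetical names and a hypothetical 5% ratio taken from the discussion; the actual patch's condition may differ): force a gc() before spilling only when the candidate bag is a large fraction of the maximum heap, so the frequent notifications with little to spill never pay the gc cost.

```java
public class GcGateDemo {
    // Hypothetical ratio from the discussion above (5%); presumably tunable.
    public static final double GC_SPILL_RATIO = 0.05;

    // Only worth forcing a gc when the bag about to spill is big enough that
    // a stale, already-unreachable copy of it could by itself cause an OOM.
    public static boolean shouldGcBeforeSpill(long bagSizeBytes, long maxHeapBytes) {
        return bagSizeBytes > (long) (maxHeapBytes * GC_SPILL_RATIO);
    }

    public static void main(String[] args) {
        long heap = 1L << 30;  // pretend 1G heap, as in the reported reducers
        System.out.println(shouldGcBeforeSpill(300L << 20, heap)); // 300MB bag: gc first
        System.out.println(shouldGcBeforeSpill(1L << 20, heap));   // 1MB bag: skip gc
    }
}
```

The gate keeps the expensive System.gc() off the hot path that caused the 99%-cputime-in-gc behavior described in the comment above.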
[jira] [Updated] (PIG-3148) OutOfMemory exception while spilling stale DefaultDataBag. Extra option to gc() before spilling large bag.
[ https://issues.apache.org/jira/browse/PIG-3148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Noguchi updated PIG-3148: -- Attachment: pig-3148-v02.patch Sorry for the delay. Attaching a patch with suggested change. > OutOfMemory exception while spilling stale DefaultDataBag. Extra option to > gc() before spilling large bag. > -- > > Key: PIG-3148 > URL: https://issues.apache.org/jira/browse/PIG-3148 > Project: Pig > Issue Type: Improvement > Components: impl >Reporter: Koji Noguchi >Assignee: Koji Noguchi > Attachments: pig-3148-v01.patch, pig-3148-v02.patch > > > Our user reported that one of their jobs in pig 0.10 occasionally failed with > 'Error: GC overhead limit exceeded' or 'Error: Java heap space', but > rerunning it sometimes finishes successfully. > For 1G heap reducer, heap dump showed it contained two huge DefaultDataBag > with 300-400MBytes each when failing with OOM. > Jstack at the time of OOM always showed that spill was running. > {noformat} > "Low Memory Detector" daemon prio=10 tid=0xb9c11800 nid=0xa52 runnable > [0xb9afc000] >java.lang.Thread.State: RUNNABLE > at java.io.FileOutputStream.writeBytes(Native Method) > at java.io.FileOutputStream.write(FileOutputStream.java:260) > at > java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65) > at java.io.BufferedOutputStream.write(BufferedOutputStream.java:109) > - locked <0xe57c6390> (a java.io.BufferedOutputStream) > at java.io.DataOutputStream.write(DataOutputStream.java:90) > - locked <0xe57c60b8> (a java.io.DataOutputStream) > at java.io.FilterOutputStream.write(FilterOutputStream.java:80) > at org.apache.pig.data.utils.SedesHelper.writeBytes(SedesHelper.java:46) > at org.apache.pig.data.BinInterSedes.writeDatum(BinInterSedes.java:537) > at org.apache.pig.data.BinInterSedes.writeDatum(BinInterSedes.java:435) > at > org.apache.pig.data.utils.SedesHelper.writeGenericTuple(SedesHelper.java:135) > at 
org.apache.pig.data.BinInterSedes.writeTuple(BinInterSedes.java:613) > at org.apache.pig.data.BinInterSedes.writeDatum(BinInterSedes.java:443) > at org.apache.pig.data.DefaultDataBag.spill(DefaultDataBag.java:106) > - locked <0xceb16190> (a java.util.ArrayList) > at > org.apache.pig.impl.util.SpillableMemoryManager.handleNotification(SpillableMemoryManager.java:243) > - locked <0xbeb86318> (a java.util.LinkedList) > at > sun.management.NotificationEmitterSupport.sendNotification(NotificationEmitterSupport.java:138) > at sun.management.MemoryImpl.createNotification(MemoryImpl.java:171) > at > sun.management.MemoryPoolImpl$PoolSensor.triggerAction(MemoryPoolImpl.java:272) > at sun.management.Sensor.trigger(Sensor.java:120) > {noformat} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-2597) Move grunt from javacc to ANTRL
[ https://issues.apache.org/jira/browse/PIG-2597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13603583#comment-13603583 ] Koji Noguchi commented on PIG-2597: --- bq. Jonathan, any update on this? I'm interested in this status as well. Does Boski have a plan to continue working on this? > Move grunt from javacc to ANTRL > --- > > Key: PIG-2597 > URL: https://issues.apache.org/jira/browse/PIG-2597 > Project: Pig > Issue Type: Improvement >Reporter: Jonathan Coveney > Labels: GSoC2012 > Attachments: pig02.diff > > > Currently, the parser for queries is in ANTLR, but Grunt is still javacc. The > parser is very difficult to work with, and next to impossible to understand > or modify. ANTLR provides a much cleaner, more standard way to generate > parsers/lexers/ASTs/etc, and moving from javacc to Grunt would be huge as we > continue to add features to Pig. > This is a candidate project for Google summer of code 2012. More information > about the program can be found at > https://cwiki.apache.org/confluence/display/PIG/GSoc2012 -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (PIG-3251) Bzip2TextInputFormat requires double the memory of maximum record size
Koji Noguchi created PIG-3251: - Summary: Bzip2TextInputFormat requires double the memory of maximum record size Key: PIG-3251 URL: https://issues.apache.org/jira/browse/PIG-3251 Project: Pig Issue Type: Improvement Reporter: Koji Noguchi Assignee: Koji Noguchi Priority: Minor While looking at a user's OOM heap dump, I noticed that Pig's Bzip2TextInputFormat consumes memory in both Bzip2TextInputFormat.buffer (a ByteArrayOutputStream) and the actual Text that is returned as the line. For example, with one 160MByte record, the buffer was 268MBytes and the Text was 160MBytes. We can probably eliminate one of them. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-3251) Bzip2TextInputFormat requires double the memory of maximum record size
[ https://issues.apache.org/jira/browse/PIG-3251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Noguchi updated PIG-3251: -- Attachment: pig-3251-trunk-v01.patch In Bzip2TextInputFormat, it says {code} /** * Provide a bridge to get the bytes from the ByteArrayOutputStream without * creating a new byte array. */ private static class TextStuffer extends OutputStream { {code} However, in reality, Text just creates a new bytearray and copies the content. Attaching a patch that is similar to the approach taken by org.apache.hadoop.util.LineReader but with fewer changes, since HADOOP-4012 (added in 0.21) was a huge patch. This patch basically reads into a fixed-length buffer and appends to the Text whenever it gets full. Touching BZip2LineRecordReader makes me nervous, so I wanted the changes to be small. I need to do more testing to see if this approach works or not. > Bzip2TextInputFormat requires double the memory of maximum record size > -- > > Key: PIG-3251 > URL: https://issues.apache.org/jira/browse/PIG-3251 > Project: Pig > Issue Type: Improvement >Reporter: Koji Noguchi >Assignee: Koji Noguchi >Priority: Minor > Attachments: pig-3251-trunk-v01.patch > > > While looking at user's OOM heap dump, noticed that pig's > Bzip2TextInputFormat consumes memory at both > Bzip2TextInputFormat.buffer (ByteArrayOutputStream) > and actual Text that is returned as line. > For example, when having one record with 160MBytes, buffer was 268MBytes and > Text was 160MBytes. > We can probably eliminate one of them. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
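The "fixed-length buffer, append on full" idea can be sketched like this (an illustration only, not the attached patch: a plain byte sink stands in for Text.append, and the real code reads from the Bzip2 input stream rather than a byte array):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

public class FixedBufferLineDemo {
    // Read one '\n'-terminated record through a small fixed buffer, flushing
    // into the growing sink whenever the buffer fills. Only the sink ever
    // holds the whole record, instead of buffer + Text each holding a copy.
    public static byte[] readRecord(InputStream in, int bufSize) throws IOException {
        byte[] buf = new byte[bufSize];
        ByteArrayOutputStream sink = new ByteArrayOutputStream(); // stand-in for Text.append
        int used = 0, b;
        while ((b = in.read()) != -1 && b != '\n') {
            buf[used++] = (byte) b;
            if (used == bufSize) {      // buffer full: append and reuse it
                sink.write(buf, 0, used);
                used = 0;
            }
        }
        sink.write(buf, 0, used);       // trailing partial buffer
        return sink.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        InputStream in = new ByteArrayInputStream("one long record\nnext".getBytes());
        System.out.println(new String(readRecord(in, 4)));  // prints "one long record"
    }
}
```

The point of the pattern is that the intermediate buffer stays at a fixed, small size regardless of record length.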
[jira] [Commented] (PIG-3251) Bzip2TextInputFormat requires double the memory of maximum record size
[ https://issues.apache.org/jira/browse/PIG-3251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13606445#comment-13606445 ] Koji Noguchi commented on PIG-3251: --- bq. Let me know if you find any problem in your testing. Thanks Daniel. My initial test went well on a 0.23 cluster. It was as fast as the original and required less memory. However, the patched pig is super slow on a 1.0.2 cluster. The reason is that I'm using the Text directly as the replacement for the ByteArrayOutputStream. Without HADOOP-6109, which was committed in 0.21, Text grows linearly whereas ByteArrayOutputStream grows exponentially, requiring a lot more copies for the former. > Bzip2TextInputFormat requires double the memory of maximum record size > -- > > Key: PIG-3251 > URL: https://issues.apache.org/jira/browse/PIG-3251 > Project: Pig > Issue Type: Improvement >Reporter: Koji Noguchi >Assignee: Koji Noguchi >Priority: Minor > Attachments: pig-3251-trunk-v01.patch > > > While looking at user's OOM heap dump, noticed that pig's > Bzip2TextInputFormat consumes memory at both > Bzip2TextInputFormat.buffer (ByteArrayOutputStream) > and actual Text that is returned as line. > For example, when having one record with 160MBytes, buffer was 268MBytes and > Text was 160MBytes. > We can probably eliminate one of them. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-3251) Bzip2TextInputFormat requires double the memory of maximum record size
[ https://issues.apache.org/jira/browse/PIG-3251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Noguchi updated PIG-3251: -- Attachment: pig-3251-trunk-v02.patch (1) Current status (before any patch) ||hadoop version || PigTextInputFormat || Bzip2TextInputFormat.java || | 0.20 | [i] SLOW (missing HADOOP-6109) | (iii) Needs EXTRA MEMORY. This Jira. | | 0.23 | [ii] Good. | (iv) Needs EXTRA MEMORY. This Jira. | (2) My initial patch (pig-3251-trunk-v01.patch) changes this to ||hadoop version || PigTextInputFormat || Bzip2TextInputFormat.java || | 0.20 | [i] SLOW (missing HADOOP-6109) | (iii) Slow (missing HADOOP-6109) | | 0.23 | [ii] Good. | (iv) Good | (3) If we can backport hadoop-6109 to 0.20 + my pig-3251-trunk-v01.patch, it solves all the problems. ||hadoop version || PigTextInputFormat || Bzip2TextInputFormat.java || | 0.20+Hadoop-6109 | [i] Good| (iii) Good | | 0.23 | [ii] Good. | (iv) Good | However, I've seen a discussion about pig supporting 0.20.2 users. So I guess we can't ask them to backport HADOOP-6109 then. I think my remaining options are (a) Give up. Wait till everyone upgrades to 0.23/2.0, or backport HADOOP-6109 to hadoop 1.2* and wait till pig moves off 0.20.2/1.0.*. (b) Try to work around without touching hadoop code. I think (a) is reasonable but tried (b). This patch changes the status to the following. (4) Patch (pig-3251-trunk-v02.patch) ||hadoop version || PigTextInputFormat || Bzip2TextInputFormat.java || | 0.20 | [i] SLOW (missing HADOOP-6109) | (iii) Good | | 0.23 | [ii] Good. | (iv) Good | The penalty of not touching the hadoop code is that my patch adds two unnecessary bytearray copies when extending the Text size. But frequency is low due to exponentially increasing sizes, so I hope the overall overhead is negligible.
> Bzip2TextInputFormat requires double the memory of maximum record size > -- > > Key: PIG-3251 > URL: https://issues.apache.org/jira/browse/PIG-3251 > Project: Pig > Issue Type: Improvement >Reporter: Koji Noguchi >Assignee: Koji Noguchi >Priority: Minor > Attachments: pig-3251-trunk-v01.patch, pig-3251-trunk-v02.patch > > > While looking at user's OOM heap dump, noticed that pig's > Bzip2TextInputFormat consumes memory at both > Bzip2TextInputFormat.buffer (ByteArrayOutputStream) > and actual Text that is returned as line. > For example, when having one record with 160MBytes, buffer was 268MBytes and > Text was 160MBytes. > We can probably eliminate one of them. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-3251) Bzip2TextInputFormat requires double the memory of maximum record size
[ https://issues.apache.org/jira/browse/PIG-3251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13607860#comment-13607860 ] Koji Noguchi commented on PIG-3251: --- bq. With HADOOP-7823, can we remove Bzip2TextInputFormat and just use PigTextInputFormat? That'll (almost) have the same effect of my initial patch pig-3251-trunk-v01.patch which takes to status (2) in my previous comment. With HADOOP-7823 + HADOOP-6109, then it'll be (3). Without a doubt, HADOOP-7823 + HADOOP-6109 is the cleanest approach. > Bzip2TextInputFormat requires double the memory of maximum record size > -- > > Key: PIG-3251 > URL: https://issues.apache.org/jira/browse/PIG-3251 > Project: Pig > Issue Type: Improvement >Reporter: Koji Noguchi >Assignee: Koji Noguchi >Priority: Minor > Attachments: pig-3251-trunk-v01.patch, pig-3251-trunk-v02.patch > > > While looking at user's OOM heap dump, noticed that pig's > Bzip2TextInputFormat consumes memory at both > Bzip2TextInputFormat.buffer (ByteArrayOutputStream) > and actual Text that is returned as line. > For example, when having one record with 160MBytes, buffer was 268MBytes and > Text was 160MBytes. > We can probably eliminate one of them. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-3251) Bzip2TextInputFormat requires double the memory of maximum record size
[ https://issues.apache.org/jira/browse/PIG-3251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13607886#comment-13607886 ] Koji Noguchi commented on PIG-3251: --- bq. With HADOOP-7823, can we remove Bzip2TextInputFormat and just use PigTextInputFormat? Since our platform has moved to 0.23, I'll be happy if we can simply remove Bzip2TextInputFormat just for hadoop 0.23 or later. > Bzip2TextInputFormat requires double the memory of maximum record size > -- > > Key: PIG-3251 > URL: https://issues.apache.org/jira/browse/PIG-3251 > Project: Pig > Issue Type: Improvement >Reporter: Koji Noguchi >Assignee: Koji Noguchi >Priority: Minor > Attachments: pig-3251-trunk-v01.patch, pig-3251-trunk-v02.patch > > > While looking at user's OOM heap dump, noticed that pig's > Bzip2TextInputFormat consumes memory at both > Bzip2TextInputFormat.buffer (ByteArrayOutputStream) > and actual Text that is returned as line. > For example, when having one record with 160MBytes, buffer was 268MBytes and > Text was 160MBytes. > We can probably eliminate one of them. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-3251) Bzip2TextInputFormat requires double the memory of maximum record size
[ https://issues.apache.org/jira/browse/PIG-3251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Noguchi updated PIG-3251: -- Attachment: pig-3251-trunk-v03.patch bq. Makes sense, we shall move to the new approach for Hadoop 1.1.0+, use Bzip2TextInputFormat otherwise for backward compatibility. Would something like this work? pig-3251-trunk-v03.patch uses PigTextInputFormat even for bzip if TextInputFormat can split them. (I'll update the other FileInputLoadFunc if this change looks ok. Also, this works with 'bz2' extension but not for 'bz' unless config is added.) > Bzip2TextInputFormat requires double the memory of maximum record size > -- > > Key: PIG-3251 > URL: https://issues.apache.org/jira/browse/PIG-3251 > Project: Pig > Issue Type: Improvement >Reporter: Koji Noguchi >Assignee: Koji Noguchi >Priority: Minor > Attachments: pig-3251-trunk-v01.patch, pig-3251-trunk-v02.patch, > pig-3251-trunk-v03.patch > > > While looking at user's OOM heap dump, noticed that pig's > Bzip2TextInputFormat consumes memory at both > Bzip2TextInputFormat.buffer (ByteArrayOutputStream) > and actual Text that is returned as line. > For example, when having one record with 160MBytes, buffer was 268MBytes and > Text was 160MBytes. > We can probably eliminate one of them. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-3255) Avoid extra byte array copy in streaming deserialize
[ https://issues.apache.org/jira/browse/PIG-3255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13608227#comment-13608227 ] Koji Noguchi commented on PIG-3255: --- +1 Looks good to me. Probably another Jira, but I wonder if we really need to create new Text for every streaming outputs. Can we reuse it with value.clear() ? (But if we do this, then in most cases value.getBytes().length <> value.getLength().) > Avoid extra byte array copy in streaming deserialize > > > Key: PIG-3255 > URL: https://issues.apache.org/jira/browse/PIG-3255 > Project: Pig > Issue Type: Bug >Affects Versions: 0.11 >Reporter: Rohini Palaniswamy >Assignee: Rohini Palaniswamy > Fix For: 0.12 > > Attachments: PIG-3255-1.patch > > > PigStreaming.java: > public Tuple deserialize(byte[] bytes) throws IOException { > Text val = new Text(bytes); > return StorageUtil.textToTuple(val, fieldDel); > } > Should remove new Text(bytes) copy and construct the tuple directly from the > bytes -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (PIG-3266) Pig takes forever to parse scripts with foreach + multi level binconds
Koji Noguchi created PIG-3266: - Summary: Pig takes forever to parse scripts with foreach + multi level binconds Key: PIG-3266 URL: https://issues.apache.org/jira/browse/PIG-3266 Project: Pig Issue Type: Bug Affects Versions: 0.11, 0.10.0 Reporter: Koji Noguchi Following pig script parsing takes * 1 second in pig-0.8 * 90 seconds in pig-0.9 * forever in pig-0.10 (it's taking literally hours) {noformat} A = load 'input.txt' as (mynum:float, mychar:chararray); B = foreach A generate mychar, (mynum < 0 ? 0 : (mynum < 1 ? 1 : (mynum < 2 ? 2 : (mynum < 3 ? 3 : (mynum < 4 ? 4 : (mynum < 5 ? 5 : (mynum < 6 ? 6 : (mynum < 7 ? 7 : (mynum < 8 ? 8 : (mynum < 9 ? 9 : (mynum < 10 ? 10 : (mynum < 11 ? 11 : (mynum < 12 ? 12 : (mynum < 13 ? 13 : (mynum < 14 ? 14 : (mynum < 15 ? 15 : (mynum < 16 ? 16 : (mynum < 17 ? 17 : (mynum < 18 ? 18 : (mynum < 19 ? 19 : (mynum < 20 ? 20 : 21); dump A; {noformat} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-3266) Pig takes forever to parse scripts with foreach + multi level binconds
[ https://issues.apache.org/jira/browse/PIG-3266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13620098#comment-13620098 ] Koji Noguchi commented on PIG-3266: --- If I revert the change from PIG:1387, parsing speed comes back to 90 seconds (pig-0.9 level) {noformat} src/org/apache/pig/parser/QueryParser.g -projectable_expr: func_eval | col_ref | bin_expr | type_conversion +projectable_expr: func_eval | col_ref | bin_expr {noformat} I don't know anything about antlr, but I guess it cannot tell whether the given tokens are bin_expr or type_conversion when starting with '(' so spending extra cycles to check both. > Pig takes forever to parse scripts with foreach + multi level binconds > --- > > Key: PIG-3266 > URL: https://issues.apache.org/jira/browse/PIG-3266 > Project: Pig > Issue Type: Bug >Affects Versions: 0.10.0, 0.11 >Reporter: Koji Noguchi > > Following pig script parsing takes > * 1 second in pig-0.8 > * 90 seconds in pig-0.9 > * forever in pig-0.10 (it's taking literally hours) > {noformat} > A = load 'input.txt' as (mynum:float, mychar:chararray); > B = foreach A generate mychar, > (mynum < 0 ? 0 : > (mynum < 1 ? 1 : > (mynum < 2 ? 2 : > (mynum < 3 ? 3 : > (mynum < 4 ? 4 : > (mynum < 5 ? 5 : > (mynum < 6 ? 6 : > (mynum < 7 ? 7 : > (mynum < 8 ? 8 : > (mynum < 9 ? 9 : > (mynum < 10 ? 10 : > (mynum < 11 ? 11 : > (mynum < 12 ? 12 : > (mynum < 13 ? 13 : > (mynum < 14 ? 14 : > (mynum < 15 ? 15 : > (mynum < 16 ? 16 : > (mynum < 17 ? 17 : > (mynum < 18 ? 18 : > (mynum < 19 ? 19 : > (mynum < 20 ? 20 : 21); > dump A; > {noformat} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-3266) Pig takes forever to parse scripts with foreach + multi level binconds
[ https://issues.apache.org/jira/browse/PIG-3266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13620158#comment-13620158 ] Koji Noguchi commented on PIG-3266: --- bq. Does it finish in the end, or never? I would guess it'll finish but I don't know. It has been running for 4 hours now. > Pig takes forever to parse scripts with foreach + multi level binconds > --- > > Key: PIG-3266 > URL: https://issues.apache.org/jira/browse/PIG-3266 > Project: Pig > Issue Type: Bug >Affects Versions: 0.10.0, 0.11 >Reporter: Koji Noguchi > > Following pig script parsing takes > * 1 second in pig-0.8 > * 90 seconds in pig-0.9 > * forever in pig-0.10 (it's taking literally hours) > {noformat} > A = load 'input.txt' as (mynum:float, mychar:chararray); > B = foreach A generate mychar, > (mynum < 0 ? 0 : > (mynum < 1 ? 1 : > (mynum < 2 ? 2 : > (mynum < 3 ? 3 : > (mynum < 4 ? 4 : > (mynum < 5 ? 5 : > (mynum < 6 ? 6 : > (mynum < 7 ? 7 : > (mynum < 8 ? 8 : > (mynum < 9 ? 9 : > (mynum < 10 ? 10 : > (mynum < 11 ? 11 : > (mynum < 12 ? 12 : > (mynum < 13 ? 13 : > (mynum < 14 ? 14 : > (mynum < 15 ? 15 : > (mynum < 16 ? 16 : > (mynum < 17 ? 17 : > (mynum < 18 ? 18 : > (mynum < 19 ? 19 : > (mynum < 20 ? 20 : 21); > dump A; > {noformat} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-3266) Pig takes forever to parse scripts with foreach + multi level binconds
[ https://issues.apache.org/jira/browse/PIG-3266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13621115#comment-13621115 ] Koji Noguchi commented on PIG-3266: --- > > Does it finish in the end, or never? > I would guess it'll finish but I don't know. It has been running for 4 hours > now. > I had to kill it after 28 hours of never-ending parsing... > Pig takes forever to parse scripts with foreach + multi level binconds > --- > > Key: PIG-3266 > URL: https://issues.apache.org/jira/browse/PIG-3266 > Project: Pig > Issue Type: Bug >Affects Versions: 0.10.0, 0.11 >Reporter: Koji Noguchi > > Following pig script parsing takes > * 1 second in pig-0.8 > * 90 seconds in pig-0.9 > * forever in pig-0.10 (it's taking literally hours) > {noformat} > A = load 'input.txt' as (mynum:float, mychar:chararray); > B = foreach A generate mychar, > (mynum < 0 ? 0 : > (mynum < 1 ? 1 : > (mynum < 2 ? 2 : > (mynum < 3 ? 3 : > (mynum < 4 ? 4 : > (mynum < 5 ? 5 : > (mynum < 6 ? 6 : > (mynum < 7 ? 7 : > (mynum < 8 ? 8 : > (mynum < 9 ? 9 : > (mynum < 10 ? 10 : > (mynum < 11 ? 11 : > (mynum < 12 ? 12 : > (mynum < 13 ? 13 : > (mynum < 14 ? 14 : > (mynum < 15 ? 15 : > (mynum < 16 ? 16 : > (mynum < 17 ? 17 : > (mynum < 18 ? 18 : > (mynum < 19 ? 19 : > (mynum < 20 ? 20 : 21); > dump A; > {noformat} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-3261) User set PIG_CLASSPATH entries must be prepended to the CLASSPATH, not appended
[ https://issues.apache.org/jira/browse/PIG-3261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13621155#comment-13621155 ] Koji Noguchi commented on PIG-3261: --- I prefer with PIG_USER_CLASSPATH_FIRST. I've seen too many random users including old pig jar in their custom UDFs... In our environment, we perform QE on set of frameworks. (hadoop, pig, oozie, etc) And we tell our users, whenever they set HADOOP_USER_CLASSPATH_FIRST they are running outside of the QA-ed environment. I want the same to apply within pig with PIG_USER_CLASSPATH_FIRST. > User set PIG_CLASSPATH entries must be prepended to the CLASSPATH, not > appended > --- > > Key: PIG-3261 > URL: https://issues.apache.org/jira/browse/PIG-3261 > Project: Pig > Issue Type: Bug > Components: grunt >Affects Versions: 0.10.0 >Reporter: Harsh J >Assignee: Harsh J > Attachments: PIG-3261.patch, PIG-3261.patch > > > Currently we are doing this wrong: > {code} > if [ "$PIG_CLASSPATH" != "" ]; then > CLASSPATH=${CLASSPATH}:${PIG_CLASSPATH} > {code} > This means that anything added to CLASSPATH until that point will never be > able to get overridden by a user set environment, which is wrong behavior. > Hadoop libs for example are added to CLASSPATH, before this extension is > called in bin/pig. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-3266) Pig takes forever to parse scripts with foreach + multi level binconds
[ https://issues.apache.org/jira/browse/PIG-3266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13621178#comment-13621178 ] Koji Noguchi commented on PIG-3266: --- bq. I assume there is an infinite loop. Next time could you do a jstack before killing pig process and attach it here for the record? A bit confused. I can certainly do that, but are you saying you cannot reproduce this issue on your side using my test script? If so, I need to look at my test environment more carefully. > Pig takes forever to parse scripts with foreach + multi level binconds > --- > > Key: PIG-3266 > URL: https://issues.apache.org/jira/browse/PIG-3266 > Project: Pig > Issue Type: Bug >Affects Versions: 0.10.0, 0.11 >Reporter: Koji Noguchi > > Following pig script parsing takes > * 1 second in pig-0.8 > * 90 seconds in pig-0.9 > * forever in pig-0.10 (it's taking literally hours) > {noformat} > A = load 'input.txt' as (mynum:float, mychar:chararray); > B = foreach A generate mychar, > (mynum < 0 ? 0 : > (mynum < 1 ? 1 : > (mynum < 2 ? 2 : > (mynum < 3 ? 3 : > (mynum < 4 ? 4 : > (mynum < 5 ? 5 : > (mynum < 6 ? 6 : > (mynum < 7 ? 7 : > (mynum < 8 ? 8 : > (mynum < 9 ? 9 : > (mynum < 10 ? 10 : > (mynum < 11 ? 11 : > (mynum < 12 ? 12 : > (mynum < 13 ? 13 : > (mynum < 14 ? 14 : > (mynum < 15 ? 15 : > (mynum < 16 ? 16 : > (mynum < 17 ? 17 : > (mynum < 18 ? 18 : > (mynum < 19 ? 19 : > (mynum < 20 ? 20 : 21); > dump A; > {noformat} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-3266) Pig takes forever to parse scripts with foreach + multi level binconds
[ https://issues.apache.org/jira/browse/PIG-3266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13621230#comment-13621230 ] Koji Noguchi commented on PIG-3266: --- bq. Koji Noguchi, I think this was fixed. I don't see the issue on trunk. Just realize that. Thanks! Can you show me which jira fixed this? I should have tested with trunk before creating this jira. I think I even tried with pig-0.11 to confirm the problem. > Pig takes forever to parse scripts with foreach + multi level binconds > --- > > Key: PIG-3266 > URL: https://issues.apache.org/jira/browse/PIG-3266 > Project: Pig > Issue Type: Bug >Affects Versions: 0.10.0, 0.11 >Reporter: Koji Noguchi > > Following pig script parsing takes > * 1 second in pig-0.8 > * 90 seconds in pig-0.9 > * forever in pig-0.10 (it's taking literally hours) > {noformat} > A = load 'input.txt' as (mynum:float, mychar:chararray); > B = foreach A generate mychar, > (mynum < 0 ? 0 : > (mynum < 1 ? 1 : > (mynum < 2 ? 2 : > (mynum < 3 ? 3 : > (mynum < 4 ? 4 : > (mynum < 5 ? 5 : > (mynum < 6 ? 6 : > (mynum < 7 ? 7 : > (mynum < 8 ? 8 : > (mynum < 9 ? 9 : > (mynum < 10 ? 10 : > (mynum < 11 ? 11 : > (mynum < 12 ? 12 : > (mynum < 13 ? 13 : > (mynum < 14 ? 14 : > (mynum < 15 ? 15 : > (mynum < 16 ? 16 : > (mynum < 17 ? 17 : > (mynum < 18 ? 18 : > (mynum < 19 ? 19 : > (mynum < 20 ? 20 : 21); > dump A; > {noformat} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Resolved] (PIG-3266) Pig takes forever to parse scripts with foreach + multi level binconds
[ https://issues.apache.org/jira/browse/PIG-3266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Noguchi resolved PIG-3266. --- Resolution: Duplicate Release Note: Found it. This is a duplicate of PIG-2769. Sorry Xuefu for wasting your time on this! > Pig takes forever to parse scripts with foreach + multi level binconds > --- > > Key: PIG-3266 > URL: https://issues.apache.org/jira/browse/PIG-3266 > Project: Pig > Issue Type: Bug >Affects Versions: 0.10.0, 0.11 >Reporter: Koji Noguchi > > Following pig script parsing takes > * 1 second in pig-0.8 > * 90 seconds in pig-0.9 > * forever in pig-0.10 (it's taking literally hours) > {noformat} > A = load 'input.txt' as (mynum:float, mychar:chararray); > B = foreach A generate mychar, > (mynum < 0 ? 0 : > (mynum < 1 ? 1 : > (mynum < 2 ? 2 : > (mynum < 3 ? 3 : > (mynum < 4 ? 4 : > (mynum < 5 ? 5 : > (mynum < 6 ? 6 : > (mynum < 7 ? 7 : > (mynum < 8 ? 8 : > (mynum < 9 ? 9 : > (mynum < 10 ? 10 : > (mynum < 11 ? 11 : > (mynum < 12 ? 12 : > (mynum < 13 ? 13 : > (mynum < 14 ? 14 : > (mynum < 15 ? 15 : > (mynum < 16 ? 16 : > (mynum < 17 ? 17 : > (mynum < 18 ? 18 : > (mynum < 19 ? 19 : > (mynum < 20 ? 20 : 21); > dump A; > {noformat} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-3266) Pig takes forever to parse scripts with foreach + multi level binconds
[ https://issues.apache.org/jira/browse/PIG-3266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Noguchi updated PIG-3266: -- Release Note: (was: Found it. This is a duplicate of PIG-2769. Sorry Xuefu for wasting your time on this!) > Pig takes forever to parse scripts with foreach + multi level binconds > --- > > Key: PIG-3266 > URL: https://issues.apache.org/jira/browse/PIG-3266 > Project: Pig > Issue Type: Bug >Affects Versions: 0.10.0, 0.11 >Reporter: Koji Noguchi > > Following pig script parsing takes > * 1 second in pig-0.8 > * 90 seconds in pig-0.9 > * forever in pig-0.10 (it's taking literally hours) > {noformat} > A = load 'input.txt' as (mynum:float, mychar:chararray); > B = foreach A generate mychar, > (mynum < 0 ? 0 : > (mynum < 1 ? 1 : > (mynum < 2 ? 2 : > (mynum < 3 ? 3 : > (mynum < 4 ? 4 : > (mynum < 5 ? 5 : > (mynum < 6 ? 6 : > (mynum < 7 ? 7 : > (mynum < 8 ? 8 : > (mynum < 9 ? 9 : > (mynum < 10 ? 10 : > (mynum < 11 ? 11 : > (mynum < 12 ? 12 : > (mynum < 13 ? 13 : > (mynum < 14 ? 14 : > (mynum < 15 ? 15 : > (mynum < 16 ? 16 : > (mynum < 17 ? 17 : > (mynum < 18 ? 18 : > (mynum < 19 ? 19 : > (mynum < 20 ? 20 : 21); > dump A; > {noformat} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-2769) a simple logic causes very long compiling time on pig 0.10.0
[ https://issues.apache.org/jira/browse/PIG-2769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13622731#comment-13622731 ] Koji Noguchi commented on PIG-2769: --- bq. We should put this into 0.11 branch, maybe there will be an 0.11.2 before 12 comes out. If we can fix this in 0.11, that would be really nice. On our clusters, there were multiple users hit with this issue on 0.10. > a simple logic causes very long compiling time on pig 0.10.0 > > > Key: PIG-2769 > URL: https://issues.apache.org/jira/browse/PIG-2769 > Project: Pig > Issue Type: Bug > Components: build >Affects Versions: 0.10.0 > Environment: Apache Pig version 0.10.0-SNAPSHOT (rexported) >Reporter: Dan Li >Assignee: Nick White > Fix For: 0.12 > > Attachments: case1.tar, PIG-2769.0.patch, PIG-2769.1.patch, > PIG-2769.2.patch, > TEST-org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.TestInputSizeReducerEstimator.txt > > > We found the following simple logic will cause very long compiling time for > pig 0.10.0, while using pig 0.8.1, everything is fine. > A = load 'A.txt' using PigStorage() AS (m: int); > B = FOREACH A { > days_str = (chararray) > (m == 1 ? 31: > (m == 2 ? 28: > (m == 3 ? 31: > (m == 4 ? 30: > (m == 5 ? 31: > (m == 6 ? 30: > (m == 7 ? 31: > (m == 8 ? 31: > (m == 9 ? 30: > (m == 10 ? 31: > (m == 11 ? 30:31))); > GENERATE >days_str as days_str; > } > store B into 'B'; > and here's a simple input file example: A.txt > 1 > 2 > 3 > The pig version we used in the test > Apache Pig version 0.10.0-SNAPSHOT (rexported) -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (PIG-3270) Union onschema failing at runtime when merging incompatible types
Koji Noguchi created PIG-3270: - Summary: Union onschema failing at runtime when merging incompatible types Key: PIG-3270 URL: https://issues.apache.org/jira/browse/PIG-3270 Project: Pig Issue Type: Bug Reporter: Koji Noguchi {noformat} t1 = LOAD 'file1.txt' USING PigStorage() AS (a: chararray, b: chararray); t2 = LOAD 'file2.txt' USING PigStorage() AS (a: chararray, b: float); tout = UNION ONSCHEMA t1, t2; dump tout; {noformat} Job fails with 2013-04-09 11:37:37,817 [Thread-12] WARN org.apache.hadoop.mapred.LocalJobRunner - job_local_0001 java.lang.Exception: org.apache.pig.backend.executionengine.ExecException: ERROR 2055: Received Error while processing the map plan. at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:399) Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 2055: Received Error while processing the map plan. at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.runPipeline(PigGenericMapBase.java:311) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:278) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:64) at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144) at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:726) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:333) at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:231) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918) at java.lang.Thread.run(Thread.java:680) -- This message is automatically generated by JIRA. 
If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-3270) Union onschema failing at runtime when merging incompatible types
[ https://issues.apache.org/jira/browse/PIG-3270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13626733#comment-13626733 ] Koji Noguchi commented on PIG-3270: --- Before PIG-2071, this job would have dumped field 'b' as chararray instead of failing at the middle at runtime. Reading that jira, I'm thinking this example should have failed at compile time with better error messages. Am I understanding it correctly? > Union onschema failing at runtime when merging incompatible types > - > > Key: PIG-3270 > URL: https://issues.apache.org/jira/browse/PIG-3270 > Project: Pig > Issue Type: Bug >Reporter: Koji Noguchi > > {noformat} > t1 = LOAD 'file1.txt' USING PigStorage() AS (a: chararray, b: chararray); > t2 = LOAD 'file2.txt' USING PigStorage() AS (a: chararray, b: float); > tout = UNION ONSCHEMA t1, t2; > dump tout; > {noformat} > Job fails with > 2013-04-09 11:37:37,817 [Thread-12] WARN > org.apache.hadoop.mapred.LocalJobRunner - job_local_0001 > java.lang.Exception: org.apache.pig.backend.executionengine.ExecException: > ERROR 2055: Received Error while processing the map plan. > at > org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:399) > Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 2055: > Received Error while processing the map plan. 
> at > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.runPipeline(PigGenericMapBase.java:311) > at > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:278) > at > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:64) > at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144) > at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:726) > at org.apache.hadoop.mapred.MapTask.run(MapTask.java:333) > at > org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:231) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439) > at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) > at java.util.concurrent.FutureTask.run(FutureTask.java:138) > at > java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918) > at java.lang.Thread.run(Thread.java:680) -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (PIG-3271) POSplit ignoring error from input processing giving empty results
Koji Noguchi created PIG-3271: - Summary: POSplit ignoring error from input processing giving empty results Key: PIG-3271 URL: https://issues.apache.org/jira/browse/PIG-3271 Project: Pig Issue Type: Bug Reporter: Koji Noguchi Priority: Critical Script below fails at union onschema due to PIG-3270 but pig ignores its error and creates empty outputs with return code 0 (SUCCESS). {noformat} t1 = LOAD 'file1.txt' USING PigStorage() AS (a: chararray, b: chararray); t2 = LOAD 'file2.txt' USING PigStorage() AS (a: chararray, b: float); tout = UNION ONSCHEMA t1, t2; STORE tout INTO './out1' USING PigStorage(); STORE tout INTO './out2' USING PigStorage(); {noformat} Is POSplit ignoring the error from input processing? -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-3271) POSplit ignoring error from input processing giving empty results
[ https://issues.apache.org/jira/browse/PIG-3271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Noguchi updated PIG-3271: -- Attachment: pig-3271-v01.patch I'm having hard time tracking the code but this seems to catch the error. > POSplit ignoring error from input processing giving empty results > -- > > Key: PIG-3271 > URL: https://issues.apache.org/jira/browse/PIG-3271 > Project: Pig > Issue Type: Bug >Reporter: Koji Noguchi >Priority: Critical > Attachments: pig-3271-v01.patch > > > Script below fails at union onschema due to PIG-3270 but pig ignores its > error and creates empty outputs with return code 0 (SUCCESS). > {noformat} > t1 = LOAD 'file1.txt' USING PigStorage() AS (a: chararray, b: chararray); > t2 = LOAD 'file2.txt' USING PigStorage() AS (a: chararray, b: float); > tout = UNION ONSCHEMA t1, t2; > STORE tout INTO './out1' USING PigStorage(); > STORE tout INTO './out2' USING PigStorage(); > {noformat} > Is POSplit ignoring the error from input processing? -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-3271) POSplit ignoring error from input processing giving empty results
[ https://issues.apache.org/jira/browse/PIG-3271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Noguchi updated PIG-3271: -- Attachment: pig-3271-v02.patch bq. Ready to go with a testcase. I'm lost in this. Original example pasted on the jira failed due to PIG-3270 and should be fixed thus cannot be used for the testcase for this PIG-3271. Just to show how lost I am, created a test case that connects couple of operators to force the input processing to fail. (I need a testcase that doesn't throw an exception but returns POStatus.STATUS_ERR.) > POSplit ignoring error from input processing giving empty results > -- > > Key: PIG-3271 > URL: https://issues.apache.org/jira/browse/PIG-3271 > Project: Pig > Issue Type: Bug >Reporter: Koji Noguchi >Priority: Critical > Attachments: pig-3271-v01.patch, pig-3271-v02.patch > > > Script below fails at union onschema due to PIG-3270 but pig ignores its > error and creates empty outputs with return code 0 (SUCCESS). > {noformat} > t1 = LOAD 'file1.txt' USING PigStorage() AS (a: chararray, b: chararray); > t2 = LOAD 'file2.txt' USING PigStorage() AS (a: chararray, b: float); > tout = UNION ONSCHEMA t1, t2; > STORE tout INTO './out1' USING PigStorage(); > STORE tout INTO './out2' USING PigStorage(); > {noformat} > Is POSplit ignoring the error from input processing? -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-3270) Union onschema failing at runtime when merging incompatible types
[ https://issues.apache.org/jira/browse/PIG-3270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Noguchi updated PIG-3270: -- Attachment: pig-3270-v01.patch bq. We should not insert cast to bytes operation. It's probably in UnionOnSchemaSetter Ah, I see. I saw that job was failing at POCast(DataByteArray) but didn't know that it would work without this cast. Writing a test. > Union onschema failing at runtime when merging incompatible types > - > > Key: PIG-3270 > URL: https://issues.apache.org/jira/browse/PIG-3270 > Project: Pig > Issue Type: Bug >Reporter: Koji Noguchi > Attachments: pig-3270-v01.patch > > > {noformat} > t1 = LOAD 'file1.txt' USING PigStorage() AS (a: chararray, b: chararray); > t2 = LOAD 'file2.txt' USING PigStorage() AS (a: chararray, b: float); > tout = UNION ONSCHEMA t1, t2; > dump tout; > {noformat} > Job fails with > 2013-04-09 11:37:37,817 [Thread-12] WARN > org.apache.hadoop.mapred.LocalJobRunner - job_local_0001 > java.lang.Exception: org.apache.pig.backend.executionengine.ExecException: > ERROR 2055: Received Error while processing the map plan. > at > org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:399) > Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 2055: > Received Error while processing the map plan. 
> at > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.runPipeline(PigGenericMapBase.java:311) > at > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:278) > at > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:64) > at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144) > at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:726) > at org.apache.hadoop.mapred.MapTask.run(MapTask.java:333) > at > org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:231) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439) > at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) > at java.util.concurrent.FutureTask.run(FutureTask.java:138) > at > java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918) > at java.lang.Thread.run(Thread.java:680) -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-3270) Union onschema failing at runtime when merging incompatible types
[ https://issues.apache.org/jira/browse/PIG-3270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Noguchi updated PIG-3270: -- Attachment: pig-3270-v02.patch bq. Writing a test. Since the original job was failing at runtime due to invalid bytearray casting, I added a e2e test. > Union onschema failing at runtime when merging incompatible types > - > > Key: PIG-3270 > URL: https://issues.apache.org/jira/browse/PIG-3270 > Project: Pig > Issue Type: Bug >Reporter: Koji Noguchi > Attachments: pig-3270-v01.patch, pig-3270-v02.patch > > > {noformat} > t1 = LOAD 'file1.txt' USING PigStorage() AS (a: chararray, b: chararray); > t2 = LOAD 'file2.txt' USING PigStorage() AS (a: chararray, b: float); > tout = UNION ONSCHEMA t1, t2; > dump tout; > {noformat} > Job fails with > 2013-04-09 11:37:37,817 [Thread-12] WARN > org.apache.hadoop.mapred.LocalJobRunner - job_local_0001 > java.lang.Exception: org.apache.pig.backend.executionengine.ExecException: > ERROR 2055: Received Error while processing the map plan. > at > org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:399) > Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 2055: > Received Error while processing the map plan. 
> at > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.runPipeline(PigGenericMapBase.java:311) > at > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:278) > at > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:64) > at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144) > at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:726) > at org.apache.hadoop.mapred.MapTask.run(MapTask.java:333) > at > org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:231) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439) > at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) > at java.util.concurrent.FutureTask.run(FutureTask.java:138) > at > java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918) > at java.lang.Thread.run(Thread.java:680) -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (PIG-3293) Casting fails after Union from two data sources&loaders
Koji Noguchi created PIG-3293: - Summary: Casting fails after Union from two data sources&loaders Key: PIG-3293 URL: https://issues.apache.org/jira/browse/PIG-3293 Project: Pig Issue Type: Bug Reporter: Koji Noguchi Script similar to {noformat} A = load 'data1' using MyLoader() as (a:bytearray); B = load 'data2' as (a:bytearray); C = union onschema A,B; D = foreach C generate (chararray)a; Store D into './out'; {noformat} fails with java.lang.Exception: org.apache.pig.backend.executionengine.ExecException: ERROR 1075: Received a bytearray from the UDF. Cannot determine how to convert the bytearray to string. Both MyLoader and PigStorage use the default Utf8StorageConverter. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-3293) Casting fails after Union from two data sources&loaders
[ https://issues.apache.org/jira/browse/PIG-3293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13640568#comment-13640568 ] Koji Noguchi commented on PIG-3293: --- When two inputs are loaded by the same loader, this was handled at PIG-2493. In the case here, I can understand 'funcSpec' would be null for Union/Cast since they are coming from two loaders, but can we still use the caster if both loaders happen to have the same one (Utf8StorageConverter)? > Casting fails after Union from two data sources&loaders > --- > > Key: PIG-3293 > URL: https://issues.apache.org/jira/browse/PIG-3293 > Project: Pig > Issue Type: Bug >Reporter: Koji Noguchi > > Script similar to > {noformat} > A = load 'data1' using MyLoader() as (a:bytearray); > B = load 'data2' as (a:bytearray); > C = union onschema A,B; > D = foreach C generate (chararray)a; > Store D into './out'; > {noformat} > fails with >java.lang.Exception: org.apache.pig.backend.executionengine.ExecException: > ERROR 1075: Received a bytearray from the UDF. Cannot determine how to > convert the bytearray to string. > Both MyLoader and PigStorage use the default Utf8StorageConverter. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
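The question raised above — whether a cast can still be resolved when the FuncSpecs differ but every loader happens to report the same caster — can be sketched as a small resolution check. `resolve_common_caster` and the class-name strings are illustrative stand-ins, not Pig's actual LoadCaster API:

```python
def resolve_common_caster(caster_class_names):
    """Return the shared caster class name if every loader feeding the
    union reports the same LoadCaster implementation, else None
    (meaning the bytearray cast cannot be resolved safely)."""
    common = None
    for caster in caster_class_names:
        if caster is None:
            return None          # a loader with no caster: give up
        if common is None:
            common = caster      # first loader seen
        elif common != caster:
            return None          # loaders disagree: no safe caster
    return common
```

Under this rule, the script in the issue would cast fine, since both MyLoader and PigStorage report Utf8StorageConverter; only unions whose loaders disagree on the caster would still fail.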
[jira] [Updated] (PIG-3293) Casting fails after Union from two data sources&loaders
[ https://issues.apache.org/jira/browse/PIG-3293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Noguchi updated PIG-3293: -- Priority: Minor (was: Major) > Casting fails after Union from two data sources&loaders > --- > > Key: PIG-3293 > URL: https://issues.apache.org/jira/browse/PIG-3293 > Project: Pig > Issue Type: Bug >Reporter: Koji Noguchi >Priority: Minor > > Script similar to > {noformat} > A = load 'data1' using MyLoader() as (a:bytearray); > B = load 'data2' as (a:bytearray); > C = union onschema A,B; > D = foreach C generate (chararray)a; > Store D into './out'; > {noformat} > fails with >java.lang.Exception: org.apache.pig.backend.executionengine.ExecException: > ERROR 1075: Received a bytearray from the UDF. Cannot determine how to > convert the bytearray to string. > Both MyLoader and PigStorage use the default Utf8StorageConverter. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-3293) Casting fails after Union from two data sources&loaders
[ https://issues.apache.org/jira/browse/PIG-3293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13640810#comment-13640810 ] Koji Noguchi commented on PIG-3293: --- I may have simplified my user's issue a bit. What I was originally looking at was relations A and B each being the join of two input sets, then union'ed together. So each field from the Union was still coming from a single loader, but the cast was still failing. I'll create a separate jira for this since it's an easier fix. For this jira, may I update the error message to suggest typecasting before the union? "ERROR 1075: Received a bytearray from the UDF." is clearly wrong since no UDF is involved in this script. > Casting fails after Union from two data sources&loaders > --- > > Key: PIG-3293 > URL: https://issues.apache.org/jira/browse/PIG-3293 > Project: Pig > Issue Type: Bug >Reporter: Koji Noguchi >Priority: Minor > > Script similar to > {noformat} > A = load 'data1' using MyLoader() as (a:bytearray); > B = load 'data2' as (a:bytearray); > C = union onschema A,B; > D = foreach C generate (chararray)a; > Store D into './out'; > {noformat} > fails with >java.lang.Exception: org.apache.pig.backend.executionengine.ExecException: > ERROR 1075: Received a bytearray from the UDF. Cannot determine how to > convert the bytearray to string. > Both MyLoader and PigStorage use the default Utf8StorageConverter. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (PIG-3295) Casting from bytearray failing after Union (even when each field is from a single Loader)
Koji Noguchi created PIG-3295: - Summary: Casting from bytearray failing after Union (even when each field is from a single Loader) Key: PIG-3295 URL: https://issues.apache.org/jira/browse/PIG-3295 Project: Pig Issue Type: Bug Components: parser Reporter: Koji Noguchi Assignee: Koji Noguchi Priority: Minor One example {noformat} A = load 'data1.txt' as line:bytearray; B = load 'c1.txt' using TextLoader() as cookie1; C = load 'c2.txt' using TextLoader() as cookie2; B2 = join A by line, B by cookie1; C2 = join A by line, C by cookie2; D = union onschema B2,C2; -- D: {A::line: bytearray,B::cookie1: bytearray,C::cookie2: bytearray} E = foreach D generate (chararray) line, (chararray) cookie1, (chararray) cookie2; dump E; {noformat} This script fails at runtime with "Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 1075: Received a bytearray from the UDF. Cannot determine how to convert the bytearray to string." This is different from PIG-3293 in that each field in 'D' belongs to a single loader, whereas in PIG-3293 it came from multiple loaders. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-3295) Casting from bytearray failing after Union (even when each field is from a single Loader)
[ https://issues.apache.org/jira/browse/PIG-3295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Noguchi updated PIG-3295: -- Attachment: pig-3295-v01.patch Attaching an initial patch. Instead of having one FuncSpec per LOUnion (PIG-2493), checking each field and setting different FuncSpec when possible. > Casting from bytearray failing after Union (even when each field is from a > single Loader) > - > > Key: PIG-3295 > URL: https://issues.apache.org/jira/browse/PIG-3295 > Project: Pig > Issue Type: Bug > Components: parser >Reporter: Koji Noguchi >Assignee: Koji Noguchi >Priority: Minor > Attachments: pig-3295-v01.patch > > > One example > {noformat} > A = load 'data1.txt' as line:bytearray; > B = load 'c1.txt' using TextLoader() as cookie1; > C = load 'c2.txt' using TextLoader() as cookie2; > B2 = join A by line, B by cookie1; > C2 = join A by line, C by cookie2; > D = union onschema B2,C2; -- D: {A::line: bytearray,B::cookie1: > bytearray,C::cookie2: bytearray} > E = foreach D generate (chararray) line, (chararray) cookie1, (chararray) > cookie2; > dump E; > {noformat} > This script fails at runtime with > "Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 1075: > Received a bytearray from the UDF. Cannot determine how to convert the > bytearray to string." > This is different from PIG-3293 such that each field in 'D' belongs to a > single loader whereas on PIG-3293, it came from multiple loader. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
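The per-field idea in the patch description — one FuncSpec per output field instead of one per LOUnion — can be sketched as follows. `per_field_funcspec` and its inputs are illustrative stand-ins for LOUnion's schema walk, not the actual pig-3295-v01.patch code:

```python
def per_field_funcspec(field_sources):
    """For each output field of a union, take the loader FuncSpecs the
    field can originate from (one entry per union branch).

    A field gets a concrete FuncSpec only when every branch traces it
    back to the same loader; otherwise it stays None, meaning a later
    bytearray cast on that field cannot be resolved."""
    resolved = []
    for specs in field_sources:
        distinct = set(specs)
        resolved.append(distinct.pop() if len(distinct) == 1 else None)
    return resolved
```

In the script from this issue every field of 'D' traces to exactly one loader, so each field would receive its own FuncSpec and the casts in 'E' could be resolved; the genuinely mixed-loader case stays unresolved, which is the remaining PIG-3293 scenario.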
[jira] [Commented] (PIG-3295) Casting from bytearray failing after Union (even when each field is from a single Loader)
[ https://issues.apache.org/jira/browse/PIG-3295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13642170#comment-13642170 ] Koji Noguchi commented on PIG-3295: --- Forgot to mention, I didn't fix PIG-3293 case but updated the error message to indicate it could be from Union with multiple loaders. > Casting from bytearray failing after Union (even when each field is from a > single Loader) > - > > Key: PIG-3295 > URL: https://issues.apache.org/jira/browse/PIG-3295 > Project: Pig > Issue Type: Bug > Components: parser >Reporter: Koji Noguchi >Assignee: Koji Noguchi >Priority: Minor > Attachments: pig-3295-v01.patch > > > One example > {noformat} > A = load 'data1.txt' as line:bytearray; > B = load 'c1.txt' using TextLoader() as cookie1; > C = load 'c2.txt' using TextLoader() as cookie2; > B2 = join A by line, B by cookie1; > C2 = join A by line, C by cookie2; > D = union onschema B2,C2; -- D: {A::line: bytearray,B::cookie1: > bytearray,C::cookie2: bytearray} > E = foreach D generate (chararray) line, (chararray) cookie1, (chararray) > cookie2; > dump E; > {noformat} > This script fails at runtime with > "Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 1075: > Received a bytearray from the UDF. Cannot determine how to convert the > bytearray to string." > This is different from PIG-3293 such that each field in 'D' belongs to a > single loader whereas on PIG-3293, it came from multiple loader. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-3295) Casting from bytearray failing after Union (even when each field is from a single Loader)
[ https://issues.apache.org/jira/browse/PIG-3295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Noguchi updated PIG-3295: -- Status: Patch Available (was: Open) > Casting from bytearray failing after Union (even when each field is from a > single Loader) > - > > Key: PIG-3295 > URL: https://issues.apache.org/jira/browse/PIG-3295 > Project: Pig > Issue Type: Bug > Components: parser >Reporter: Koji Noguchi >Assignee: Koji Noguchi >Priority: Minor > Attachments: pig-3295-v01.patch > > > One example > {noformat} > A = load 'data1.txt' as line:bytearray; > B = load 'c1.txt' using TextLoader() as cookie1; > C = load 'c2.txt' using TextLoader() as cookie2; > B2 = join A by line, B by cookie1; > C2 = join A by line, C by cookie2; > D = union onschema B2,C2; -- D: {A::line: bytearray,B::cookie1: > bytearray,C::cookie2: bytearray} > E = foreach D generate (chararray) line, (chararray) cookie1, (chararray) > cookie2; > dump E; > {noformat} > This script fails at runtime with > "Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 1075: > Received a bytearray from the UDF. Cannot determine how to convert the > bytearray to string." > This is different from PIG-3293 such that each field in 'D' belongs to a > single loader whereas on PIG-3293, it came from multiple loader. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-3251) Bzip2TextInputFormat requires double the memory of maximum record size
[ https://issues.apache.org/jira/browse/PIG-3251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Noguchi updated PIG-3251: -- Attachment: pig-3251-trunk-v04.patch bq. Also, this works with 'bz2' extension but not for 'bz' unless config is added.) [~rohini] pointed out to me that it's not configurable. My bad. To keep the backward compatibility, added a wrapper codec that uses 'bz' as extension. As for selecting the InputFormat, I can also use hadoopShim and return PigTextInputFormat just for 0.23. Using hadoop's bzip codec on 0.23/2.0 would have an additional benefit of having native codec. (HADOOP-8462) > Bzip2TextInputFormat requires double the memory of maximum record size > -- > > Key: PIG-3251 > URL: https://issues.apache.org/jira/browse/PIG-3251 > Project: Pig > Issue Type: Improvement >Reporter: Koji Noguchi >Assignee: Koji Noguchi >Priority: Minor > Attachments: pig-3251-trunk-v01.patch, pig-3251-trunk-v02.patch, > pig-3251-trunk-v03.patch, pig-3251-trunk-v04.patch > > > While looking at user's OOM heap dump, noticed that pig's > Bzip2TextInputFormat consumes memory at both > Bzip2TextInputFormat.buffer (ByteArrayOutputStream) > and actual Text that is returned as line. > For example, when having one record with 160MBytes, buffer was 268MBytes and > Text was 160MBytes. > We can probably eliminate one of them. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-3251) Bzip2TextInputFormat requires double the memory of maximum record size
[ https://issues.apache.org/jira/browse/PIG-3251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13648464#comment-13648464 ] Koji Noguchi commented on PIG-3251: --- bq. Using hadoop's bzip codec on 0.23/2.0 would have an additional benefit of having native codec. (HADOOP-8462) Learned that bzip native codec so far does not support splitting (and falls back to java version for splits). > Bzip2TextInputFormat requires double the memory of maximum record size > -- > > Key: PIG-3251 > URL: https://issues.apache.org/jira/browse/PIG-3251 > Project: Pig > Issue Type: Improvement >Reporter: Koji Noguchi >Assignee: Koji Noguchi >Priority: Minor > Attachments: pig-3251-trunk-v01.patch, pig-3251-trunk-v02.patch, > pig-3251-trunk-v03.patch, pig-3251-trunk-v04.patch > > > While looking at user's OOM heap dump, noticed that pig's > Bzip2TextInputFormat consumes memory at both > Bzip2TextInputFormat.buffer (ByteArrayOutputStream) > and actual Text that is returned as line. > For example, when having one record with 160MBytes, buffer was 268MBytes and > Text was 160MBytes. > We can probably eliminate one of them. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-3251) Bzip2TextInputFormat requires double the memory of maximum record size
[ https://issues.apache.org/jira/browse/PIG-3251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Noguchi updated PIG-3251: -- Attachment: pig-3251-trunk-v05.patch Thanks Daniel. bq.is the patch ready? Ah, forgot to flag it as patch available. bq. can we just cache splittable? Makes complete sense. Changing. bq. Is it possible to wrap a codec deal with both bz2/bz? As far as I understand, hadoop has 1-to-1 mapping for the codec and extension. I don't know of a way to map multiple extensions to one codec. Or, are you suggesting I create two silly wrappers instead of one? > Bzip2TextInputFormat requires double the memory of maximum record size > -- > > Key: PIG-3251 > URL: https://issues.apache.org/jira/browse/PIG-3251 > Project: Pig > Issue Type: Improvement >Reporter: Koji Noguchi >Assignee: Koji Noguchi >Priority: Minor > Attachments: pig-3251-trunk-v01.patch, pig-3251-trunk-v02.patch, > pig-3251-trunk-v03.patch, pig-3251-trunk-v04.patch, pig-3251-trunk-v05.patch > > > While looking at user's OOM heap dump, noticed that pig's > Bzip2TextInputFormat consumes memory at both > Bzip2TextInputFormat.buffer (ByteArrayOutputStream) > and actual Text that is returned as line. > For example, when having one record with 160MBytes, buffer was 268MBytes and > Text was 160MBytes. > We can probably eliminate one of them. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
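The 1-to-1 extension-to-codec constraint discussed above can be modeled with a toy lookup. This is an illustration of the mapping problem, not Hadoop's actual CompressionCodecFactory; the wrapper codec name is hypothetical, and the longest-suffix-first rule here stands in for whatever tie-breaking the real factory does:

```python
def codec_for(path, codecs):
    """codecs maps one file extension to one codec name (Hadoop's
    1-to-1 constraint), so covering both '.bz2' and '.bz' takes two
    entries, the second being a thin wrapper codec whose only job is
    to claim the legacy extension. Longest extension is checked first
    so '.bz2' files don't accidentally match '.bz'."""
    for ext in sorted(codecs, key=len, reverse=True):
        if path.endswith(ext):
            return codecs[ext]
    return None

# One real codec plus one wrapper for the legacy extension.
BZIP_CODECS = {".bz2": "BZip2Codec", ".bz": "BZip2CodecWithBzExtension"}
```

This is why a single codec cannot be made to "deal with both bz2/bz": the registry keys on exactly one advertised extension per codec class.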
[jira] [Commented] (PIG-3251) Bzip2TextInputFormat requires double the memory of maximum record size
[ https://issues.apache.org/jira/browse/PIG-3251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13648745#comment-13648745 ] Koji Noguchi commented on PIG-3251: --- FYI, couple of tests from TestBZip are failing after applying my patch. Looking. > Bzip2TextInputFormat requires double the memory of maximum record size > -- > > Key: PIG-3251 > URL: https://issues.apache.org/jira/browse/PIG-3251 > Project: Pig > Issue Type: Improvement >Reporter: Koji Noguchi >Assignee: Koji Noguchi >Priority: Minor > Attachments: pig-3251-trunk-v01.patch, pig-3251-trunk-v02.patch, > pig-3251-trunk-v03.patch, pig-3251-trunk-v04.patch, pig-3251-trunk-v05.patch > > > While looking at user's OOM heap dump, noticed that pig's > Bzip2TextInputFormat consumes memory at both > Bzip2TextInputFormat.buffer (ByteArrayOutputStream) > and actual Text that is returned as line. > For example, when having one record with 160MBytes, buffer was 268MBytes and > Text was 160MBytes. > We can probably eliminate one of them. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-3251) Bzip2TextInputFormat requires double the memory of maximum record size
[ https://issues.apache.org/jira/browse/PIG-3251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13648788#comment-13648788 ] Koji Noguchi commented on PIG-3251: --- bq. FYI, couple of tests from TestBZip are failing after applying my patch. Looking. 3 tests failed. {noformat} Testcase: testBZ2Concatenation took 38.266 sec FAILED Expected exception: java.io.IOException junit.framework.AssertionFailedError: Expected exception: java.io.IOException Testcase: testBlockHeaderEndingWithCR took 49.539 sec FAILED expected:<82094> but was:<82093> junit.framework.AssertionFailedError: expected:<82094> but was:<82093> at org.apache.pig.test.TestBZip.testCount(TestBZip.java:256) at org.apache.pig.test.TestBZip.testBlockHeaderEndingWithCR(TestBZip.java:112) Testcase: testBlockHeaderEndingAtSplitNotByteAligned took 48.996 sec FAILED expected:<74999> but was:<101591> junit.framework.AssertionFailedError: expected:<74999> but was:<101591> at org.apache.pig.test.TestBZip.testCount(TestBZip.java:256) at org.apache.pig.test.TestBZip.testBlockHeaderEndingAtSplitNotByteAligned(TestBZip.java:88) {noformat} "testBZ2Concatenation" is expected since hadoop bzip2 codec handles concatenated bzip files (whereas pig's TestBZip is testing whether it reliably fails). Other two are worrisome to me. Asking my colleague to check. It'll take some time. Depending on what we find, we may need to change the condition for using hadoop's bzip codec. 
> Bzip2TextInputFormat requires double the memory of maximum record size > -- > > Key: PIG-3251 > URL: https://issues.apache.org/jira/browse/PIG-3251 > Project: Pig > Issue Type: Improvement >Reporter: Koji Noguchi >Assignee: Koji Noguchi >Priority: Minor > Attachments: pig-3251-trunk-v01.patch, pig-3251-trunk-v02.patch, > pig-3251-trunk-v03.patch, pig-3251-trunk-v04.patch, pig-3251-trunk-v05.patch > > > While looking at user's OOM heap dump, noticed that pig's > Bzip2TextInputFormat consumes memory at both > Bzip2TextInputFormat.buffer (ByteArrayOutputStream) > and actual Text that is returned as line. > For example, when having one record with 160MBytes, buffer was 268MBytes and > Text was 160MBytes. > We can probably eliminate one of them. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-3293) Casting fails after Union from two data sources&loaders
[ https://issues.apache.org/jira/browse/PIG-3293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Noguchi updated PIG-3293: -- Attachment: pig-3293-test-only-v01.patch bq. Must be the "caster" in D's POCast is null. Can you attach MyLoader? Attaching a test case using {noformat} public class PigStorageWithStatistics extends PigStorage { {noformat} from org.apache.pig.test. Even though both PigStorage and PigStorageWithStatistics return Utf8StorageConverter, the testcase fails with "Cannot determine how to convert the bytearray to string." Note that I created PIG-3295 for dealing with the case where casting fails even when the union comes from the same loader. Figuring out if the loaders were the same was easy by calling 'equals' on the FuncSpec instances. I don't know how to achieve this easily for comparing casters. > Casting fails after Union from two data sources&loaders > --- > > Key: PIG-3293 > URL: https://issues.apache.org/jira/browse/PIG-3293 > Project: Pig > Issue Type: Bug >Reporter: Koji Noguchi >Priority: Minor > Attachments: pig-3293-test-only-v01.patch > > > Script similar to > {noformat} > A = load 'data1' using MyLoader() as (a:bytearray); > B = load 'data2' as (a:bytearray); > C = union onschema A,B; > D = foreach C generate (chararray)a; > Store D into './out'; > {noformat} > fails with >java.lang.Exception: org.apache.pig.backend.executionengine.ExecException: > ERROR 1075: Received a bytearray from the UDF. Cannot determine how to > convert the bytearray to string. > Both MyLoader and PigStorage use the default Utf8StorageConverter. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-3310) ImplicitSplitInserter does not generate new uids for nested schema fields, leading to miscomputations
[ https://issues.apache.org/jira/browse/PIG-3310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13665415#comment-13665415 ] Koji Noguchi commented on PIG-3310: --- I also don't have a good understanding on these, but the change looks reasonable to me. [~daijy], original uid reassignment was added in PIG-1705 for the self-join. Can you take a look? > ImplicitSplitInserter does not generate new uids for nested schema fields, > leading to miscomputations > - > > Key: PIG-3310 > URL: https://issues.apache.org/jira/browse/PIG-3310 > Project: Pig > Issue Type: Bug > Components: impl >Affects Versions: 0.11.1 > Environment: Reproduced on 0.10.1, 0.11.1 and trunk >Reporter: Clément Stenac > Attachments: generate-uid-for-nested-fields.patch > > > Hi, > Consider the following example > {code} > inp = LOAD '$INPUT' AS (memberId:long, shopId:long, score:int); > tuplified = FOREACH inp GENERATE (memberId, shopId) AS tuplify, score; > D1 = FOREACH tuplified GENERATE tuplify.memberId as memberId, tuplify.shopId > as shopId, score AS score; > D2 = FOREACH tuplified GENERATE tuplify.memberId as memberId, tuplify.shopId > as shopId, score AS score; > J = JOIN D1 By shopId, D2 by shopId; > K = FOREACH J GENERATE D1::memberId AS member_id1, D2::memberId AS > member_id2, D1::shopId as shop; > EXPLAIN K; > DUMP K; > {code} > It is a bit weird written like that, but it provides a minimal reproduction > case (in the real case, the "tuplified" phase came from a multi-key grouping). > On input data: > {code} > 1 1001101 > 1 1002103 > 1 1003102 > 1 1004102 > 2 1005101 > 2 1003101 > 2 1002123 > 3 1042101 > 3 1005101 > 3 1002133 > {code} > This will give a wrongful output like .. > {code} > (1,1001,1001) > (1,1002,1002) > (1,1002,1002) > (1,1002,1002) > {code} > The second column should be a member id so (1,2,3,4,5). 
> In the initial case, there was a FILTER (member_id1 < member_id2) after K, > and computation failed because of PushUpFilter optimization mistakenly moving > the LOFilter operation before the join, at a place where it tried to work on > a tuple and failed. > My understanding of the issue is that when the ImplicitSplitInserter creates > the LOSplitOutputs, it will correctly reset the schema, and the LOSplitOutput > will regenerate uids for the fields of D1 and D2 ... but will not do that on > the tuple members. > The logical plan after the ImplicitSplitINserter will look like (simplified) > {code} >|---D1: (Name: LOForEach Schema: > memberId#124:long,shopId#125:long)ColumnPrune:InputUids=[127]ColumnPrune:OutputUids=[125, > 124] > |---tuplified: (Name: LOSplitOutput Schema: > tuplify#127:tuple(memberId#124:long,shopId#125:long))ColumnPrune:InputUids=[123]ColumnPrune:OutputUids=[127] >|---tuplified: (Name: LOSplit Schema: > tuplify#123:tuple(memberId#124:long,shopId#125:long))ColumnPrune:InputUids=[123]ColumnPrune:OutputUids=[123] > |---D2: (Name: LOForEach Schema: > memberId#124:long,shopId#125:long)ColumnPrune:InputUids=[130]ColumnPrune:OutputUids=[125, > 124] > |---tuplified: (Name: LOSplitOutput Schema: > tuplify#130:tuple(memberId#124:long,shopId#125:long))ColumnPrune:InputUids=[123]ColumnPrune:OutputUids=[130] >|---tuplified: (Name: LOSplit Schema: > tuplify#123:tuple(memberId#124:long,shopId#125:long))ColumnPrune:InputUids=[123]ColumnPrune:OutputUids=[123] > {code} > tuplified correctly gets a new uid (127 and 130) but the members of the tuple > don't. 
When they get reprojected, both branches have the same uid and the > join looks like: > {code} > |---J: (Name: LOJoin(HASH) Schema: > D1::memberId#124:long,D1::shopId#125:long,D2::memberId#139:long,D2::shopId#132:long)ColumnPrune:InputUids=[125, > 124, 132]ColumnPrune:OutputUids=[125, 124, 132] > | | > | shopId:(Name: Project Type: long Uid: 125 Input: 0 Column: 1) > | | > | shopId:(Name: Project Type: long Uid: 125 Input: 1 Column: 1) > {code} > If for example instead of reprojecting "memberId", we project "memberId+0", a > new node is created, and ultimately the two branches of the join will > correctly get separate uids. > My understanding is that LOSplitOutput.getSchema() should recurse on nested > schema fields. However, I only have a light understanding of all of the > logical plan handling, so I may be completely wrong. > Attached is a draft of patch and a test reproducing the issue. Unfortunately, > I haven't been able to run all unit tests with the "fix" (I have some weird > hangs) > I'd be happy if you could indicate if that looks like completely the wron
[jira] [Commented] (PIG-3257) Add unique identifier UDF
[ https://issues.apache.org/jira/browse/PIG-3257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13668630#comment-13668630 ] Koji Noguchi commented on PIG-3257: --- Would this ensure that the same unique identifier is reproduced when a (map) task attempt is retried? Otherwise, I'm afraid it would lead to random pig behavior when we use this id as the map-reduce key. > Add unique identifier UDF > - > > Key: PIG-3257 > URL: https://issues.apache.org/jira/browse/PIG-3257 > Project: Pig > Issue Type: Improvement > Components: internal-udfs >Reporter: Alan Gates >Assignee: Alan Gates > Fix For: 0.12 > > Attachments: PIG-3257.patch > > > It would be good to have a Pig function to generate unique identifiers. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-3257) Add unique identifier UDF
[ https://issues.apache.org/jira/browse/PIG-3257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13668705#comment-13668705 ] Koji Noguchi commented on PIG-3257: --- bq. I can't see how it would matter whether it produced random key X1 vs random key X2 for any given record. If used in mapreduce key, this can lead to incomplete/incorrect output when mappers are retried. > Add unique identifier UDF > - > > Key: PIG-3257 > URL: https://issues.apache.org/jira/browse/PIG-3257 > Project: Pig > Issue Type: Improvement > Components: internal-udfs >Reporter: Alan Gates >Assignee: Alan Gates > Fix For: 0.12 > > Attachments: PIG-3257.patch > > > It would be good to have a Pig function to generate unique identifiers. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-3257) Add unique identifier UDF
[ https://issues.apache.org/jira/browse/PIG-3257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13668717#comment-13668717 ] Koji Noguchi commented on PIG-3257: --- bq. incomplete/incorrect output I mean, this can result in missing records or redundant records. (support nightmare for me.) > Add unique identifier UDF > - > > Key: PIG-3257 > URL: https://issues.apache.org/jira/browse/PIG-3257 > Project: Pig > Issue Type: Improvement > Components: internal-udfs >Reporter: Alan Gates >Assignee: Alan Gates > Fix For: 0.12 > > Attachments: PIG-3257.patch > > > It would be good to have a Pig function to generate unique identifiers. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-3257) Add unique identifier UDF
[ https://issues.apache.org/jira/browse/PIG-3257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13669195#comment-13669195 ] Koji Noguchi commented on PIG-3257: --- With your first example, say you have _n_ input records, 1 mapper, and 2 reducers. {noformat} A = load ... B = group A by UUID(); STORE B ... {noformat} This job could successfully finish with output ranging from 0 to 2n records. For example, the sequence of events could be: # mapper0_attempt0 finishes with n outputs, and say all n uuid keys were assigned to reducer0. # reducer0_attempt0 pulls the map outputs and produces _n_ outputs. # reducer1_attempt0 tries to pull mapper0_attempt0's output and fails (could be a fetch failure or a node failure). # mapper0_attempt1 reruns, and this time all n uuid keys are assigned to reducer1. # reducer1_attempt0 pulls mapper0_attempt1's output and produces n outputs. # The job finishes successfully with 2n outputs. This is certainly unexpected to users. Now, with your second example {noformat} A = load 'over100k' using org.apache.hcatalog.pig.HCatLoader(); B = foreach A generate *, UUID(); C = group B by s; D = foreach C generate flatten(B), SUM(B.i) as sum_b; E = group B by si; F = foreach E generate flatten(B), SUM(B.f) as sum_f; G = join D by uuid, F by uuid; H = foreach G generate D::B::s, sum_b, sum_f; store H into 'output'; {noformat} let's say Pig decides to implement the two group-bys (C and E) with one map-reduce job. For simplicity, let's use 1 mapper and 2 reducers again, and assume Pig decides to partition all the group-by keys in _C_ to reducer0 and those in _E_ to reducer1. Now, using the same story as above, there could be a case where reducer0 (group-by C) gets one set of UUIDs from mapper0_attempt0 and reducer1 (group-by E) gets a completely different set of UUIDs from mapper0_attempt1. When this happens, the join _G_ would produce 0 results, which is unexpected to users. 
Of course this depends on how Pig executes the above query, but I hope it demonstrates how tricky things get when a purely random id is introduced in Hadoop. What's worse about all this is that it's a corner case that won't get caught in users' QE phases and will only manifest in production pipelines. Users would then yell at me about corrupted output from successful jobs. Hence my previous comment about a "support nightmare". > Add unique identifier UDF > - > > Key: PIG-3257 > URL: https://issues.apache.org/jira/browse/PIG-3257 > Project: Pig > Issue Type: Improvement > Components: internal-udfs >Reporter: Alan Gates >Assignee: Alan Gates > Fix For: 0.12 > > Attachments: PIG-3257.patch > > > It would be good to have a Pig function to generate unique identifiers. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
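The scenarios above all stem from the UDF being non-deterministic across task attempts: a retried map attempt produces fresh UUIDs, so the shuffle keys silently change. A minimal plain-Java sketch (illustrative only; the class and method names are made up, not taken from the attached patch) of why two attempts disagree on the same record:

```java
import java.util.UUID;

public class UuidAttempts {
    // Simulates what a UUID() UDF would emit for the same input record
    // on two different task attempts: a fresh random id each time.
    public static String keyForRecord(String record) {
        return UUID.randomUUID().toString();
    }

    public static void main(String[] args) {
        String attempt0 = keyForRecord("record-1");
        String attempt1 = keyForRecord("record-1");
        // The retried attempt yields a different key, so the record can hash
        // to a different reducer than the one that already consumed it.
        System.out.println(attempt0.equals(attempt1)); // prints "false"
    }
}
```

Any fix along the lines of the earlier comment would need the id to be a deterministic function of the task's input (for example, input split plus record offset) rather than pure randomness, so a rerun attempt reproduces the same keys.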
[jira] [Commented] (PIG-3355) ColumnMapKeyPrune bug with distinct operator
[ https://issues.apache.org/jira/browse/PIG-3355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13686815#comment-13686815 ] Koji Noguchi commented on PIG-3355: --- bq. Committed to trunk. Thanks Jeremy! [~aniket486], status is still "Patch Available"? Also, can we patch 0.11 as well so that it'll be included if we release another 0.11.* ? > ColumnMapKeyPrune bug with distinct operator > > > Key: PIG-3355 > URL: https://issues.apache.org/jira/browse/PIG-3355 > Project: Pig > Issue Type: Bug >Affects Versions: 0.9.2, 0.10.1, 0.11.1 >Reporter: Jeremy Karn >Assignee: Jeremy Karn > Attachments: PIG-3355.patch > > > We came across a bug that happens when you have a distinct operator > immediately followed by a union where the result of the union has at least > one column that will be pruned by ColumnMapKeyPrune. There's a test showing > an example script in the submitted patch. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-3385) DISTINCT no longer uses custom partitioner
[ https://issues.apache.org/jira/browse/PIG-3385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Noguchi updated PIG-3385: -- Attachment: pig-3385-v01.patch Wondering if a custom partitioner ever worked for distinct. It looks like the partitioner info is passed through POGlobalRearrange but "distinct" doesn't use it. Uploading an initial patch that just passes that info through PODistinct. This is my first time touching the backend code, so I'd appreciate it if someone could take a look. I'll upload a testcase next. > DISTINCT no longer uses custom partitioner > -- > > Key: PIG-3385 > URL: https://issues.apache.org/jira/browse/PIG-3385 > Project: Pig > Issue Type: Bug > Components: documentation >Reporter: Will Oberman >Priority: Minor > Attachments: pig-3385-v01.patch > > > From u...@pig.apache.org: It looks like an optimization was put in to make > distinct use a special partitioner which prevents the user from setting the > partitioner. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-3385) DISTINCT no longer uses custom partitioner
[ https://issues.apache.org/jira/browse/PIG-3385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Noguchi updated PIG-3385: -- Component/s: (was: documentation) impl Assignee: Koji Noguchi > DISTINCT no longer uses custom partitioner > -- > > Key: PIG-3385 > URL: https://issues.apache.org/jira/browse/PIG-3385 > Project: Pig > Issue Type: Bug > Components: impl >Reporter: Will Oberman >Assignee: Koji Noguchi >Priority: Minor > Attachments: pig-3385-v01.patch > > > From u...@pig.apache.org: It looks like an optimization was put in to make > distinct use a special partitioner which prevents the user from setting the > partitioner. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-3385) DISTINCT no longer uses custom partitioner
[ https://issues.apache.org/jira/browse/PIG-3385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Noguchi updated PIG-3385: -- Attachment: pig-3385-v02.patch Uploading a patch with a test. Noticed that the original test for custom partitioners didn't produce partition results different from the default's, so I added a silly partitioner that always returns 1 (the second reducer). > DISTINCT no longer uses custom partitioner > -- > > Key: PIG-3385 > URL: https://issues.apache.org/jira/browse/PIG-3385 > Project: Pig > Issue Type: Bug > Components: impl >Reporter: Will Oberman >Assignee: Koji Noguchi >Priority: Minor > Attachments: pig-3385-v01.patch, pig-3385-v02.patch > > > From u...@pig.apache.org: It looks like an optimization was put in to make > distinct use a special partitioner which prevents the user from setting the > partitioner. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
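For reference, the "always return 1" partitioner described above looks roughly like this. The class name is made up, and the real one would extend org.apache.hadoop.mapreduce.Partitioner<K,V>; the Hadoop dependency is left out so the sketch stands alone:

```java
// Sketch of a trivial custom partitioner like the one added to the test.
// In real Pig/Hadoop code this would extend
// org.apache.hadoop.mapreduce.Partitioner<K,V>; omitted here for brevity.
public class AlwaysSecondReducer {
    // Hadoop's contract: return a partition index in [0, numPartitions).
    public static int getPartition(Object key, Object value, int numPartitions) {
        return 1; // every record goes to the second reducer, regardless of key
    }

    public static void main(String[] args) {
        System.out.println(getPartition("anyKey", "anyValue", 2)); // prints "1"
    }
}
```

Because every record lands on reducer 1, any output found on reducer 0 proves the custom partitioner was ignored, which is exactly what the test needs to detect.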
[jira] [Created] (PIG-3435) Custom Partitioner not working with MultiQueryOptimizer
Koji Noguchi created PIG-3435: - Summary: Custom Partitioner not working with MultiQueryOptimizer Key: PIG-3435 URL: https://issues.apache.org/jira/browse/PIG-3435 Project: Pig Issue Type: Bug Components: impl Reporter: Koji Noguchi Assignee: Koji Noguchi While looking at PIG-3385, I noticed some issues in the handling of custom partitioners with multi-query optimization. {noformat} C1 = group B1 by col1 PARTITION BY org.apache.pig.test.utils.SimpleCustomPartitioner parallel 2; C2 = group B2 by col1 PARTITION BY org.apache.pig.test.utils.SimpleCustomPartitioner parallel 2; {noformat} These seem to be merged into one mapreduce job correctly, but the custom partitioner information was lost. {noformat} C1 = group B1 by col1 PARTITION BY org.apache.pig.test.utils.SimpleCustomPartitioner parallel 2; C2 = group B2 by col1 parallel 2; {noformat} These seem to be merged even though they should run with two different partitioners. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-3385) DISTINCT no longer uses custom partitioner
[ https://issues.apache.org/jira/browse/PIG-3385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13746937#comment-13746937 ] Koji Noguchi commented on PIG-3385: --- While looking at this jira, noticed custom partitioner being dropped when run with multi query optimization. Created PIG-3435. > DISTINCT no longer uses custom partitioner > -- > > Key: PIG-3385 > URL: https://issues.apache.org/jira/browse/PIG-3385 > Project: Pig > Issue Type: Bug > Components: impl >Reporter: Will Oberman >Assignee: Koji Noguchi >Priority: Minor > Attachments: pig-3385-v01.patch, pig-3385-v02.patch > > > From u...@pig.apache.org: It looks like an optimization was put in to make > distinct use a special partitioner which prevents the user from setting the > partitioner. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-3435) Custom Partitioner not working with MultiQueryOptimizer
[ https://issues.apache.org/jira/browse/PIG-3435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Noguchi updated PIG-3435: -- Attachment: pig-3435-v01.patch Looking at the multi-query optimization code and documents. I chickened out. Taking the same approach as PIG-1108 and simply skipping the MR jobs with custom partitioner. Attaching the test case soon. > Custom Partitioner not working with MultiQueryOptimizer > --- > > Key: PIG-3435 > URL: https://issues.apache.org/jira/browse/PIG-3435 > Project: Pig > Issue Type: Bug > Components: impl >Reporter: Koji Noguchi >Assignee: Koji Noguchi > Attachments: pig-3435-v01.patch > > > When looking at PIG-3385, noticed some issues in handling of custom > partitioner with multi-query optimization. > {noformat} > C1 = group B1 by col1 PARTITION BY >org.apache.pig.test.utils.SimpleCustomPartitioner parallel 2; > C2 = group B2 by col1 PARTITION BY >org.apache.pig.test.utils.SimpleCustomPartitioner parallel 2; > {noformat} > This seems to be merged to one mapreduce job correctly but custom partitioner > information was lost. > {noformat} > C1 = group B1 by col1 PARTITION BY > org.apache.pig.test.utils.SimpleCustomPartitioner parallel 2; > C2 = group B2 by col1 parallel 2; > {noformat} > This seems to be merged even though they should run on two different > partitioner. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-3435) Custom Partitioner not working with MultiQueryOptimizer
[ https://issues.apache.org/jira/browse/PIG-3435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Noguchi updated PIG-3435: -- Attachment: pig-3435-v02_skipcustompatitioner_for_merge.patch While working on the testcase, I found PIG-2627, which fixed one of the issues with custom partitioners and multiquery optimization (but not all of them). The specific case mentioned on that ticket is handled there and works, but my patch here simply skips multiquery optimization for ALL custom partitioner jobs. Since this is essentially a correctness issue, I want the fix back-ported to 0.11, and for that I kept the change simple. Can we create a separate jira for reviving custom-partitioner + multiquery optimization in later releases? > Custom Partitioner not working with MultiQueryOptimizer > --- > > Key: PIG-3435 > URL: https://issues.apache.org/jira/browse/PIG-3435 > Project: Pig > Issue Type: Bug > Components: impl >Reporter: Koji Noguchi >Assignee: Koji Noguchi > Attachments: pig-3435-v01.patch, > pig-3435-v02_skipcustompatitioner_for_merge.patch > > > When looking at PIG-3385, noticed some issues in handling of custom > partitioner with multi-query optimization. > {noformat} > C1 = group B1 by col1 PARTITION BY >org.apache.pig.test.utils.SimpleCustomPartitioner parallel 2; > C2 = group B2 by col1 PARTITION BY >org.apache.pig.test.utils.SimpleCustomPartitioner parallel 2; > {noformat} > This seems to be merged to one mapreduce job correctly but custom partitioner > information was lost. > {noformat} > C1 = group B1 by col1 PARTITION BY > org.apache.pig.test.utils.SimpleCustomPartitioner parallel 2; > C2 = group B2 by col1 parallel 2; > {noformat} > This seems to be merged even though they should run on two different > partitioner. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Assigned] (PIG-3440) MultiQuery to work with custom partitioner
[ https://issues.apache.org/jira/browse/PIG-3440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Noguchi reassigned PIG-3440: - Assignee: Koji Noguchi Taking a look. I'm not sure whether we should limit the merging to jobs with the same custom partitioner, or whether we can merge all jobs and have a single partitioner that delegates to the corresponding partitioner based on the input. (I'm still learning multi-query optimization; I may be off on this.) > MultiQuery to work with custom partitioner > -- > > Key: PIG-3440 > URL: https://issues.apache.org/jira/browse/PIG-3440 > Project: Pig > Issue Type: Bug > Components: impl >Reporter: Daniel Dai >Assignee: Koji Noguchi > > Currently Pig disable multiquery in case of custom partitioner in PIG-3435. > However, when custom partitioner are the same, we can still use multiquery. > This is the Jira ticket to track this optimization. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
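The second option mentioned in the comment, merging all jobs behind a single partitioner that delegates per input, could look roughly like the sketch below. Everything here (the class names, and the idea of tagging each record with the index of the pipeline it came from) is a hypothetical illustration, not code from Pig:

```java
import java.util.List;

// Hypothetical sketch: a merged multi-query job keeps each pipeline's
// partitioning logic, and each record carries the index of the pipeline
// it came from, so one partitioner can delegate per record.
public class DelegatingPartitioner {
    public interface PartitionFn {
        int getPartition(Object key, int numPartitions);
    }

    private final List<PartitionFn> delegates;

    public DelegatingPartitioner(List<PartitionFn> delegates) {
        this.delegates = delegates;
    }

    public int partition(int pipelineIndex, Object key, int numPartitions) {
        // Route with the partitioner belonging to the record's pipeline.
        return delegates.get(pipelineIndex).getPartition(key, numPartitions);
    }

    public static void main(String[] args) {
        DelegatingPartitioner p = new DelegatingPartitioner(List.of(
                (key, n) -> 0,                               // pipeline 0: custom, all to reducer 0
                (key, n) -> Math.abs(key.hashCode()) % n));  // pipeline 1: default hash partitioning
        System.out.println(p.partition(0, "anyKey", 2)); // prints "0"
    }
}
```

The restriction to "same custom partitioner" is the simpler alternative: it only requires comparing partitioner FuncSpecs at merge time, with no record tagging at runtime.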
[jira] [Commented] (PIG-3435) Custom Partitioner not working with MultiQueryOptimizer
[ https://issues.apache.org/jira/browse/PIG-3435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13751999#comment-13751999 ] Koji Noguchi commented on PIG-3435: --- Thanks Daniel! Can we back-port this patch to 0.11? (That was one of my motivations for keeping the patch simple.) I'll work on PIG-3440. > Custom Partitioner not working with MultiQueryOptimizer > --- > > Key: PIG-3435 > URL: https://issues.apache.org/jira/browse/PIG-3435 > Project: Pig > Issue Type: Bug > Components: impl >Reporter: Koji Noguchi >Assignee: Koji Noguchi > Fix For: 0.12 > > Attachments: pig-3435-v01.patch, > pig-3435-v02_skipcustompatitioner_for_merge.patch > > > When looking at PIG-3385, noticed some issues in handling of custom > partitioner with multi-query optimization. > {noformat} > C1 = group B1 by col1 PARTITION BY >org.apache.pig.test.utils.SimpleCustomPartitioner parallel 2; > C2 = group B2 by col1 PARTITION BY >org.apache.pig.test.utils.SimpleCustomPartitioner parallel 2; > {noformat} > This seems to be merged to one mapreduce job correctly but custom partitioner > information was lost. > {noformat} > C1 = group B1 by col1 PARTITION BY > org.apache.pig.test.utils.SimpleCustomPartitioner parallel 2; > C2 = group B2 by col1 parallel 2; > {noformat} > This seems to be merged even though they should run on two different > partitioner. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-3385) DISTINCT no longer uses custom partitioner
[ https://issues.apache.org/jira/browse/PIG-3385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13752000#comment-13752000 ] Koji Noguchi commented on PIG-3385: --- Thanks Daniel! Can we back-port this patch and PIG-3435 to 0.11? Without them, custom partitioner is almost unusable. > DISTINCT no longer uses custom partitioner > -- > > Key: PIG-3385 > URL: https://issues.apache.org/jira/browse/PIG-3385 > Project: Pig > Issue Type: Bug > Components: impl >Reporter: Will Oberman >Assignee: Koji Noguchi >Priority: Minor > Fix For: 0.12 > > Attachments: pig-3385-v01.patch, pig-3385-v02.patch > > > From u...@pig.apache.org: It looks like an optimization was put in to make > distinct use a special partitioner which prevents the user from setting the > partitioner. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
dev@pig.apache.org
[ https://issues.apache.org/jira/browse/PIG-3293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13755016#comment-13755016 ] Koji Noguchi commented on PIG-3293: --- I hit a worse case today. (1) The case I mentioned originally was a union between loaderA and loaderB, both of which return the same loadCaster, Utf8StorageConverter, with the typecast failing after the union. The one I saw today: (2) a single loader but with different arguments, resulting in a typecast error. {noformat} A = load 'data1' using LoaderA('col1') as (a:bytearray); B = load 'data1' using LoaderA('col2') as (a:bytearray); C = union ...; D = foreach C generate (chararray)a; store D ... {noformat} I wish I could simply check the classname of the loaders to establish the uniqueness of the loadcaster, but I've seen HBaseStorage return a different loadcaster depending on its input parameters. One other approach I'm considering: is it possible to push the typecast above the union, so that we can perform loader.getLoadCaster().bytesToCharArray for each input to the union? > Casting fails after Union from two data sources&loaders > --- > > Key: PIG-3293 > URL: https://issues.apache.org/jira/browse/PIG-3293 > Project: Pig > Issue Type: Bug >Reporter: Koji Noguchi >Priority: Minor > Attachments: pig-3293-test-only-v01.patch > > > Script similar to > {noformat} > A = load 'data1' using MyLoader() as (a:bytearray); > B = load 'data2' as (a:bytearray); > C = union onschema A,B; > D = foreach C generate (chararray)a; > Store D into './out'; > {noformat} > fails with >java.lang.Exception: org.apache.pig.backend.executionengine.ExecException: > ERROR 1075: Received a bytearray from the UDF. Cannot determine how to > convert the bytearray to string. > Both MyLoader and PigStorage use the default Utf8StorageConverter. -- This message is automatically generated by JIRA. 
If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
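One way to picture the "push the typecast above the union" idea from the comment is as the equivalent manual rewrite of the failing script: cast in each branch while the relation is still tied to a single loader, so each cast can resolve against that loader's own LoadCaster. (LoaderA and its column arguments are the hypothetical ones from the example above.)

{noformat}
A = load 'data1' using LoaderA('col1') as (a:bytearray);
B = load 'data1' using LoaderA('col2') as (a:bytearray);
A2 = foreach A generate (chararray)a; -- cast resolved with A's own LoadCaster
B2 = foreach B generate (chararray)a; -- cast resolved with B's own LoadCaster
C = union A2, B2;
store C into 'out';
{noformat}

After the rewrite, the union only ever sees already-cast chararrays, so no field needs a caster whose origin is ambiguous.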
dev@pig.apache.org
[ https://issues.apache.org/jira/browse/PIG-3293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13756712#comment-13756712 ] Koji Noguchi commented on PIG-3293: --- bq. Also improve the error message to indicate possible causes would help. I've updated the error message a bit in PIG-3295. However, it is still vague in that I cannot tell whether the failure was due to UDF with no loadcaster or Union with two different loaders. > Casting fails after Union from two data sources&loaders > --- > > Key: PIG-3293 > URL: https://issues.apache.org/jira/browse/PIG-3293 > Project: Pig > Issue Type: Bug >Reporter: Koji Noguchi >Priority: Minor > Attachments: pig-3293-test-only-v01.patch > > > Script similar to > {noformat} > A = load 'data1' using MyLoader() as (a:bytearray); > B = load 'data2' as (a:bytearray); > C = union onschema A,B; > D = foreach C generate (chararray)a; > Store D into './out'; > {noformat} > fails with >java.lang.Exception: org.apache.pig.backend.executionengine.ExecException: > ERROR 1075: Received a bytearray from the UDF. Cannot determine how to > convert the bytearray to string. > Both MyLoader and PigStorage use the default Utf8StorageConverter. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-3295) Casting from bytearray failing after Union (even when each field is from a single Loader)
[ https://issues.apache.org/jira/browse/PIG-3295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Noguchi updated PIG-3295: -- Attachment: pig-3295-v02.patch Just noticed my previous patch wasn't created with '--no-prefix' option. Reattaching. > Casting from bytearray failing after Union (even when each field is from a > single Loader) > - > > Key: PIG-3295 > URL: https://issues.apache.org/jira/browse/PIG-3295 > Project: Pig > Issue Type: Bug > Components: parser >Reporter: Koji Noguchi >Assignee: Koji Noguchi >Priority: Minor > Attachments: pig-3295-v01.patch, pig-3295-v02.patch > > > One example > {noformat} > A = load 'data1.txt' as line:bytearray; > B = load 'c1.txt' using TextLoader() as cookie1; > C = load 'c2.txt' using TextLoader() as cookie2; > B2 = join A by line, B by cookie1; > C2 = join A by line, C by cookie2; > D = union onschema B2,C2; -- D: {A::line: bytearray,B::cookie1: > bytearray,C::cookie2: bytearray} > E = foreach D generate (chararray) line, (chararray) cookie1, (chararray) > cookie2; > dump E; > {noformat} > This script fails at runtime with > "Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 1075: > Received a bytearray from the UDF. Cannot determine how to convert the > bytearray to string." > This is different from PIG-3293 such that each field in 'D' belongs to a > single loader whereas on PIG-3293, it came from multiple loader. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (PIG-3447) Compiler warning message dropped for CastLineageSetter and others with no enum kind
Koji Noguchi created PIG-3447: - Summary: Compiler warning message dropped for CastLineageSetter and others with no enum kind Key: PIG-3447 URL: https://issues.apache.org/jira/browse/PIG-3447 Project: Pig Issue Type: Bug Reporter: Koji Noguchi Assignee: Koji Noguchi Priority: Trivial The following compiler warning was never shown to users, for two reasons. {noformat} //./src/org/apache/pig/newplan/logical/visitor/CastLineageSetter.java 106 if(inLoadFunc == null){ 107 String msg = "Cannot resolve load function to use for casting from " + 108 DataType.findTypeName(inType) + " to " + 109 DataType.findTypeName(outType) + ". "; 110 msgCollector.collect(msg, MessageType.Warning); 111 } {noformat} # CompilationMessageCollector.logMessages or logAllMessages is not called after CastLineageSetter.visit. # CompilationMessageCollector.collect with no KIND doesn't print any messages when aggregate.warning=true (the default) -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-3447) Compiler warning message dropped for CastLineageSetter and others with no enum kind
[ https://issues.apache.org/jira/browse/PIG-3447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Noguchi updated PIG-3447: -- Attachment: pig-3447-v01.txt With the patch, it'll print out {noformat} 2013-09-03 13:58:20,625 [main] WARN org.apache.pig.PigServer - Encountered Warning NO_LOAD_FUNCTION_FOR_CASTING_BYTEARRAY 1 time(s). {noformat} If anyone still calls CompilationMessageCollector.collect without enum KIND, then it'll at least print out {noformat} 2013-08-30 22:25:20,940 [main] WARN org.apache.pig.PigServer - Encountered Warning Aggregated unknown kind messages. Please set -Daggregate.warning=false to retrieve these messages 1 time(s). {noformat} Before, it wasn't printing out anything. With -Daggregate.warning=false, it'll print out the following (even without this patch). {noformat} 2013-09-03 14:24:48,275 [main] WARN org.apache.pig.PigServer - Cannot resolve load function to use for casting from bytearray to chararray. {noformat} > Compiler warning message dropped for CastLineageSetter and others with no > enum kind > --- > > Key: PIG-3447 > URL: https://issues.apache.org/jira/browse/PIG-3447 > Project: Pig > Issue Type: Bug >Reporter: Koji Noguchi >Assignee: Koji Noguchi >Priority: Trivial > Attachments: pig-3447-v01.txt > > > Following compiler warning was never shown to users for two reasons. > {noformat} > //./src/org/apache/pig/newplan/logical/visitor/CastLineageSetter.java > 106 if(inLoadFunc == null){ > 107 String msg = "Cannot resolve load function to use for casting from > " + > 108 DataType.findTypeName(inType) + " to " + > 109 DataType.findTypeName(outType) + ". "; > 110 msgCollector.collect(msg, MessageType.Warning); > 111 } > {noformat} > # CompilationMessageCollector.logMessages or logAllMessages not being called > after CastLineageSetter.visit. > # CompilationMessageCollector.collect with no KIND don't print out any > messages when aggregate.warning=true (default) -- This message is automatically generated by JIRA. 
If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
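The aggregation behavior described in the patch comment can be sketched as follows. This is a simplified stand-in for CompilationMessageCollector with made-up names, just to show why kind-less warnings need a fallback bucket when aggregate.warning=true:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Rough sketch: with aggregation enabled, warnings are counted per kind and
// summarized at the end; a warning collected without a kind must fall into
// a fallback bucket, or it is silently dropped (the bug described above).
public class WarningAggregator {
    private final Map<String, Integer> counts = new LinkedHashMap<>();

    public void collect(String kind) {
        String bucket = (kind == null) ? "UNKNOWN_KIND" : kind;
        counts.merge(bucket, 1, Integer::sum);
    }

    public String summarize(String kind) {
        return "Encountered Warning " + kind + " "
                + counts.getOrDefault(kind, 0) + " time(s).";
    }

    public static void main(String[] args) {
        WarningAggregator agg = new WarningAggregator();
        agg.collect("NO_LOAD_FUNCTION_FOR_CASTING_BYTEARRAY");
        agg.collect(null); // a kind-less warning no longer disappears
        System.out.println(agg.summarize("NO_LOAD_FUNCTION_FOR_CASTING_BYTEARRAY"));
        System.out.println(agg.summarize("UNKNOWN_KIND"));
    }
}
```

With aggregation disabled (aggregate.warning=false), each message would instead be printed verbatim as it is collected, which matches the third log line shown above.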
[jira] [Commented] (PIG-2315) Make as clause work in generate
[ https://issues.apache.org/jira/browse/PIG-2315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13757098#comment-13757098 ] Koji Noguchi commented on PIG-2315: --- bq. because it is not working anyway. There's at least one case where it's working for our users. {noformat} a = load 'input.txt' as (nb:bag{}); b = foreach a generate flatten(nb) as (year, name:bytearray); c = filter b by name == 'user1'; dump c; {noformat} The case above works. But without the ':bytearray' in relation b, it fails. {noformat} a = load 'input.txt' as (nb:bag{}); b = foreach a generate flatten(nb) as (year, name); c = filter b by name == 'user1'; dump c; {noformat} "Front End: ERROR 1052: Cannot cast bytearray to chararray" Please keep the first case valid. (Thanks [~fuding] for this example.) The error message in the second case is misleading; it's actually failing while trying to typecast NULL to chararray. > Make as clause work in generate > --- > > Key: PIG-2315 > URL: https://issues.apache.org/jira/browse/PIG-2315 > Project: Pig > Issue Type: Bug >Reporter: Olga Natkovich >Assignee: Gianmarco De Francisci Morales > Fix For: 0.12 > > > Currently, the following syntax is supported and ignored causing confusion > with users: > A1 = foreach A1 generate a as a:chararray ; > After this statement a just retains its previous type -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-3295) Casting from bytearray failing after Union (even when each field is from a single Loader)
[ https://issues.apache.org/jira/browse/PIG-3295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13759189#comment-13759189 ] Koji Noguchi commented on PIG-3295: --- bq. How about doing more aggressively by checking LoadCaster? That was my first approach, but as I also wrote in PIG-3293, "Figuring out if the loaders were same was easy with calling 'equals' for the FuncSpec. I don't know how to achieve this easily for comparing casters." > Casting from bytearray failing after Union (even when each field is from a > single Loader) > - > > Key: PIG-3295 > URL: https://issues.apache.org/jira/browse/PIG-3295 > Project: Pig > Issue Type: Bug > Components: parser >Reporter: Koji Noguchi >Assignee: Koji Noguchi >Priority: Minor > Attachments: pig-3295-v01.patch, pig-3295-v02.patch > > > One example > {noformat} > A = load 'data1.txt' as line:bytearray; > B = load 'c1.txt' using TextLoader() as cookie1; > C = load 'c2.txt' using TextLoader() as cookie2; > B2 = join A by line, B by cookie1; > C2 = join A by line, C by cookie2; > D = union onschema B2,C2; -- D: {A::line: bytearray,B::cookie1: > bytearray,C::cookie2: bytearray} > E = foreach D generate (chararray) line, (chararray) cookie1, (chararray) > cookie2; > dump E; > {noformat} > This script fails at runtime with > "Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 1075: > Received a bytearray from the UDF. Cannot determine how to convert the > bytearray to string." > This is different from PIG-3293 such that each field in 'D' belongs to a > single loader whereas on PIG-3293, it came from multiple loader. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-3295) Casting from bytearray failing after Union (even when each field is from a single Loader)
[ https://issues.apache.org/jira/browse/PIG-3295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13759228#comment-13759228 ] Koji Noguchi commented on PIG-3295: --- bq. Can we instantiate the LoadFunc (with parameters) and then compare? Possible, but I can only compare the classnames? For the funcspec comparisons, we're comparing the classname as well as the parameters passed to the constructors. > Casting from bytearray failing after Union (even when each field is from a > single Loader) > - > > Key: PIG-3295 > URL: https://issues.apache.org/jira/browse/PIG-3295 > Project: Pig > Issue Type: Bug > Components: parser >Reporter: Koji Noguchi >Assignee: Koji Noguchi >Priority: Minor > Attachments: pig-3295-v01.patch, pig-3295-v02.patch > > > One example > {noformat} > A = load 'data1.txt' as line:bytearray; > B = load 'c1.txt' using TextLoader() as cookie1; > C = load 'c2.txt' using TextLoader() as cookie2; > B2 = join A by line, B by cookie1; > C2 = join A by line, C by cookie2; > D = union onschema B2,C2; -- D: {A::line: bytearray,B::cookie1: > bytearray,C::cookie2: bytearray} > E = foreach D generate (chararray) line, (chararray) cookie1, (chararray) > cookie2; > dump E; > {noformat} > This script fails at runtime with > "Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 1075: > Received a bytearray from the UDF. Cannot determine how to convert the > bytearray to string." > This is different from PIG-3293 such that each field in 'D' belongs to a > single loader whereas on PIG-3293, it came from multiple loader. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-3295) Casting from bytearray failing after Union (even when each field is from a single Loader)
[ https://issues.apache.org/jira/browse/PIG-3295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13760297#comment-13760297 ] Koji Noguchi commented on PIG-3295: --- bq. How about we make one exception in the case LoadCaster has only a default constructor? If you mean making an exception only for Utf8StorageConverter, that makes sense, since we have full control over the class and we know that a classname check is sufficient for equality. Let me try coming up with a new patch.
[jira] [Updated] (PIG-3295) Casting from bytearray failing after Union (even when each field is from a single Loader)
[ https://issues.apache.org/jira/browse/PIG-3295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Noguchi updated PIG-3295: -- Attachment: pig-3295-v03.patch Attaching a patch which includes Daniel's suggestion of comparing the LoadCaster (and limiting it to casters with default constructors only). Haven't run the full test suite yet. > Casting from bytearray failing after Union (even when each field is from a > single Loader) > - > > Key: PIG-3295 > URL: https://issues.apache.org/jira/browse/PIG-3295 > Project: Pig > Issue Type: Bug > Components: parser >Reporter: Koji Noguchi >Assignee: Koji Noguchi >Priority: Minor > Attachments: pig-3295-v01.patch, pig-3295-v02.patch, > pig-3295-v03.patch
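The "default constructor only" exception discussed in these comments can be sketched with plain reflection: two caster classes are treated as interchangeable only when they are the same class and that class exposes nothing but a no-arg constructor, so instances cannot differ by construction parameters. This is a hedged illustration, not Pig's actual implementation; `sameDefaultConstructedCaster` is a hypothetical helper name.

```java
import java.lang.reflect.Constructor;

public class CasterEquality {
    // Two caster classes are considered equivalent only when they are the
    // same class AND that class declares exactly one constructor taking no
    // arguments -- then a classname check alone is sufficient for equality,
    // as with Utf8StorageConverter.
    static boolean sameDefaultConstructedCaster(Class<?> a, Class<?> b) {
        if (a != b) {
            return false;
        }
        Constructor<?>[] ctors = a.getDeclaredConstructors();
        return ctors.length == 1 && ctors[0].getParameterCount() == 0;
    }

    public static void main(String[] args) {
        // Object has only the no-arg constructor, so the check passes;
        // String has many constructors, so the check conservatively fails.
        System.out.println(sameDefaultConstructedCaster(Object.class, Object.class));
        System.out.println(sameDefaultConstructedCaster(String.class, String.class));
    }
}
```

The check is deliberately conservative: any class with a parameterized constructor is rejected, because two instances of it might have been configured differently even though they share a classname.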
[jira] [Created] (PIG-3458) ScalarExpression lost with multiquery optimization
Koji Noguchi created PIG-3458: - Summary: ScalarExpression lost with multiquery optimization Key: PIG-3458 URL: https://issues.apache.org/jira/browse/PIG-3458 Project: Pig Issue Type: Bug Reporter: Koji Noguchi Assignee: Koji Noguchi Our user reported an issue where their scalar results go missing when there are two store statements. {noformat} A = load 'test1.txt' using PigStorage('\t') as (a:chararray, count:long); B = group A all; C = foreach B generate SUM(A.count) as total ; store C into 'deleteme6_C' using PigStorage(','); Z = load 'test2.txt' using PigStorage('\t') as (a:chararray, id:chararray ); Y = group Z by id; X = foreach Y generate group, C.total; store X into 'deleteme6_X' using PigStorage(','); Inputs pig> cat test1.txt a 1 b 2 c 8 d 9 pig> cat test2.txt a z b y c x pig> {noformat} Result X should contain the total count of '20', but instead it's empty. {noformat} pig> cat deleteme6_C/part-r-0 20 pig> cat deleteme6_X/part-r-0 x, y, z, pig> {noformat} This works if we take out the first "store C" statement.
[jira] [Commented] (PIG-3458) ScalarExpression lost with multiquery optimization
[ https://issues.apache.org/jira/browse/PIG-3458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13765964#comment-13765964 ] Koji Noguchi commented on PIG-3458: --- The reason it gets lost is that we store C using PigStorage, but ReadScalars tries to read it back with a hardcoded InterStorage. {noformat} ... [First mapreduce job] Reduce Plan C: Store(/.../deleteme6_C:PigStorage(',')) - scope-17 | ... [Second mapreduce job] | POUserFunc(org.apache.pig.impl.builtin.ReadScalars)[long] - scope-31 | | | |---Constant(0) - scope-29 | | | |---Constant(/.../deleteme6_C) - scope-30 {noformat} Trying to understand what the fix should be. 1. Make ReadScalars use the corresponding Loader. 2. Split relation 'C' so that we store it in both PigStorage AND InterStorage. I'm guessing the latter, but would appreciate your feedback.
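Option 2 above (storing the relation with both the user's storer and the interchange format) can be modeled as a small planner decision: whenever a relation that feeds a scalar read is stored with anything other than the interchange format, emit an extra interchange-format store for ReadScalars to consume. This is only a toy sketch of the idea, not Pig's multiquery optimizer; `planStores` is a hypothetical helper and the storer names are plain strings standing in for StoreFunc specs.

```java
import java.util.ArrayList;
import java.util.List;

public class ScalarStoreSplit {
    // Toy model of fix option 2: ReadScalars always loads with the
    // interchange format ("InterStorage"), so when the relation's only
    // store uses a different storer AND the relation is also read back as
    // a scalar, add a second, interchange-format store.
    static List<String> planStores(String userStorer, boolean usedAsScalar) {
        List<String> stores = new ArrayList<>(List.of(userStorer));
        if (usedAsScalar && !"InterStorage".equals(userStorer)) {
            stores.add("InterStorage"); // extra store feeding ReadScalars
        }
        return stores;
    }

    public static void main(String[] args) {
        // Relation C from PIG-3458: stored with PigStorage(',') and also
        // consumed as a scalar by X, so a second store is planned.
        System.out.println(planStores("PigStorage(',')", true));
        // Already in the interchange format: no duplicate store needed.
        System.out.println(planStores("InterStorage", true));
    }
}
```

The alternative (option 1, making ReadScalars use the corresponding Loader) would instead thread the user's LoadFunc into the scalar read, which is harder because the scalar reader is wired up after the user's storer choice is already compiled into the first job.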