[jira] [Created] (PIG-2061) NewPlan match() is sensitive to ordering
NewPlan match() is sensitive to ordering Key: PIG-2061 URL: https://issues.apache.org/jira/browse/PIG-2061 Project: Pig Issue Type: Bug Reporter: Koji Noguchi Priority: Minor There is no current Rule that is affected by this, but inside TestNewPlanRule.java {noformat} 155 public void testMultiNode() throws Exception { ... 175 pattern.connect(op1, op3); 176 pattern.connect(op2, op3); ... 178 Rule r = new SillyRule("basic", pattern); 179 List l = r.match(plan); 180 assertEquals(1, l.size()); {noformat} this test fails when we swap lines 175 and 176, even though the two orderings are structurally equivalent. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
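The ordering sensitivity above is the classic pitfall of comparing a node's predecessor list positionally rather than as a multiset. A minimal, hypothetical Java sketch (not Pig's actual matcher; operator names are placeholders) showing the difference between the two comparison styles:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class OrderInsensitiveMatch {
    // Positional comparison: sensitive to the order in which edges were connected.
    static boolean matchOrdered(List<String> planPreds, List<String> patternPreds) {
        return planPreds.equals(patternPreds);
    }

    // Multiset comparison: structurally equivalent plans match regardless of edge order.
    static boolean matchUnordered(List<String> planPreds, List<String> patternPreds) {
        List<String> a = new ArrayList<>(planPreds);
        List<String> b = new ArrayList<>(patternPreds);
        Collections.sort(a);
        Collections.sort(b);
        return a.equals(b);
    }

    public static void main(String[] args) {
        List<String> plan = Arrays.asList("op1", "op2");    // connect(op1, op3); connect(op2, op3)
        List<String> pattern = Arrays.asList("op2", "op1"); // same edges, swapped order
        System.out.println(matchOrdered(plan, pattern));    // false: ordering leaks in
        System.out.println(matchUnordered(plan, pattern));  // true: same structure
    }
}
```

Sorting (or any canonical ordering of predecessors) before comparison is one way a matcher could be made insensitive to the connect() order.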
[jira] [Updated] (PIG-2044) Pattern match bug in org.apache.pig.newplan.optimizer.Rule
[ https://issues.apache.org/jira/browse/PIG-2044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Noguchi updated PIG-2044: -- Attachment: PIG-2044-00.patch Taking out the 'break' statement, which made the for-loop meaningless. Added one test. > Pattern match bug in org.apache.pig.newplan.optimizer.Rule > - > > Key: PIG-2044 > URL: https://issues.apache.org/jira/browse/PIG-2044 > Project: Pig > Issue Type: Bug >Affects Versions: 0.9.0 >Reporter: Daniel Dai >Assignee: Koji Noguchi > Fix For: 0.10 > > Attachments: PIG-2044-00.patch > > > Koji found that we have a bug in org.apache.pig.newplan.optimizer.Rule. The > "break" in line 179 seems to be wrong. This multiple-branch matching is not > used in Pig, but could be a problem in the future. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
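To illustrate the class of bug being fixed (a generic sketch, not the actual Rule code): an unconditional break inside a for-loop makes every iteration after the first unreachable, so only one candidate branch is ever inspected.

```java
import java.util.Arrays;
import java.util.List;

public class BreakBug {
    // Buggy: the unconditional break means only the first branch is ever examined.
    static int countMatchesBuggy(List<Integer> branches, int target) {
        int matches = 0;
        for (int b : branches) {
            if (b == target) {
                matches++;
            }
            break; // exits after the first iteration, whatever happened
        }
        return matches;
    }

    // Fixed: with the break removed, every branch is considered.
    static int countMatchesFixed(List<Integer> branches, int target) {
        int matches = 0;
        for (int b : branches) {
            if (b == target) {
                matches++;
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        List<Integer> branches = Arrays.asList(5, 7, 7);
        System.out.println(countMatchesBuggy(branches, 7)); // 0
        System.out.println(countMatchesFixed(branches, 7)); // 2
    }
}
```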
[jira] [Commented] (PIG-2055) inconsistent behavior in parser generated during build
[ https://issues.apache.org/jira/browse/PIG-2055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13032783#comment-13032783 ] Koji Noguchi commented on PIG-2055: --- I hit this as well on my macbook. It drove me crazy. Using antlr-3.3 (instead of 3.2) seems to have fixed it for me. > inconsistent behavior in parser generated during build > - > > Key: PIG-2055 > URL: https://issues.apache.org/jira/browse/PIG-2055 > Project: Pig > Issue Type: Bug >Affects Versions: 0.9.0 >Reporter: Thejas M Nair > > On certain builds, I see that pig fails to support this syntax: > {code} > grunt> l = load 'x' using PigStorage(':'); > 2011-05-10 09:21:41,565 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR > 1200: mismatched input '(' expecting SEMI_COLON > Details at logfile: /Users/tejas/pig_trunk_cp/trunk/pig_1305044484712.log > {code} > I seem to be the only one who has seen this behavior, and I have seen it on > occasion when I build on a mac. It could be a problem with the antlr and Apple JVM > interaction. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-2044) Pattern match bug in org.apache.pig.newplan.optimizer.Rule
[ https://issues.apache.org/jira/browse/PIG-2044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Noguchi updated PIG-2044: -- Status: Patch Available (was: Open) > Pattern match bug in org.apache.pig.newplan.optimizer.Rule > - > > Key: PIG-2044 > URL: https://issues.apache.org/jira/browse/PIG-2044 > Project: Pig > Issue Type: Bug >Affects Versions: 0.9.0 >Reporter: Daniel Dai >Assignee: Koji Noguchi > Fix For: 0.10 > > Attachments: PIG-2044-00.patch > > > Koji found that we have a bug in org.apache.pig.newplan.optimizer.Rule. The > "break" in line 179 seems to be wrong. This multiple-branch matching is not > used in Pig, but could be a problem in the future. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Resolved] (PIG-2802) Wrong Schema generated when there is a dangling alias
[ https://issues.apache.org/jira/browse/PIG-2802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Noguchi resolved PIG-2802. --- Resolution: Duplicate > Wrong Schema generated when there is a dangling alias > - > > Key: PIG-2802 > URL: https://issues.apache.org/jira/browse/PIG-2802 > Project: Pig > Issue Type: Bug >Affects Versions: 0.9.2, 0.10.0 >Reporter: Anitha Raju > > Hi, > Script > {code} > A = load 'test.txt' using PigStorage() AS (x:int,y:int, z:int) ; > B = GROUP A BY x; > C = foreach B generate A.x as s; > describe C; -- C: {s: {(x: int)}} > D = FOREACH B { >E = ORDER A by y; >GENERATE A.x as s; > }; > describe D; -- D: {x: int,y: int,z: int} > {code} > Here E is a dangling alias. > Regards, > Anitha -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-3051) java.lang.IndexOutOfBoundsException failure with LimitOptimizer + ColumnPruning
[ https://issues.apache.org/jira/browse/PIG-3051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Noguchi updated PIG-3051: -- Attachment: pig-3051-v2.1-withe2etest.txt Thanks Rohini for the review. bq. But found an issue with the copy not setting the label, type and Uid. I wasn't sure why my test worked even when the above fields were not set. It turns out they are filled in by the SchemaPatcher: LimitOptimizer.reportChanges() simply returns currentPlan, so the SchemaPatcher goes through the entire currentPlan, including the newSort.mSortColPlans mentioned above, and updates them accordingly. BTW, reading back my patch, I felt that the logic of making a new copy of LOSort should be kept inside LOSort.java. Uploading a new version; the logic is the same as in the previous patch. > java.lang.IndexOutOfBoundsException failure with LimitOptimizer + > ColumnPruning > > > Key: PIG-3051 > URL: https://issues.apache.org/jira/browse/PIG-3051 > Project: Pig > Issue Type: Bug > Components: parser >Affects Versions: 0.10.0, 0.11 >Reporter: Koji Noguchi >Assignee: Koji Noguchi > Fix For: 0.11 > > Attachments: pig-3051-v1.1-withe2etest.txt, > pig-3051-v1-withouttest.txt, pig-3051-v2.1-withe2etest.txt > > > Had a user hitting > "Caused by: java.lang.IndexOutOfBoundsException: Index: 1, Size: 1" error > when he had multiple stores and limit in his code. > I couldn't reproduce this with short pig code (due to ColumnPruning somehow > not happening when shortened), but here's a snippet. > {noformat} > ... > G3 = FOREACH G2 GENERATE sortCol, FLATTEN(group) as label, (long)COUNT(G1) as > cnt; > G4 = ORDER G3 BY cnt DESC PARALLEL 25; > ONEROW = LIMIT G4 1; > U1 = FOREACH ONEROW GENERATE 3 as sortcol, 'somelabel' as label, cnt; > store U1 into 'u1' using PigStorage(); > store G4 into 'g4' using PigStorage(); > {noformat} > With '-t ColumnMapKeyPrune', job didn't hit the error. -- This message is automatically generated by JIRA. 
If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-3100) If a .pig_schema file is present, can get an index out of bounds error
[ https://issues.apache.org/jira/browse/PIG-3100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13537287#comment-13537287 ] Koji Noguchi commented on PIG-3100: --- I should have commented on PIG-3056, but when our users hit this issue, the affected records tend to contain a record separator as part of the data by mistake, and that results in a single record being split into two incomplete ones. For that case, I wasn't sure if we wanted to fill the incomplete records with nulls or have an option like PIG-3059. > If a .pig_schema file is present, can get an index out of bounds error > -- > > Key: PIG-3100 > URL: https://issues.apache.org/jira/browse/PIG-3100 > Project: Pig > Issue Type: Bug >Reporter: Jonathan Coveney >Assignee: Jonathan Coveney > Fix For: 0.12 > > Attachments: PIG-3100-0_nows.patch, PIG-3100-0.patch > > > In the case that a .pig_schema file is present, if you have a record with > fewer than expected fields, pig errors out with an index out of bounds > exception that is annoying, unnecessary, and unhelpful. > Instead of improving logging, I decided to just do what pig should do, which > is fill in the records. > Patch will include a test and the fix. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-3100) If a .pig_schema file is present, can get an index out of bounds error
[ https://issues.apache.org/jira/browse/PIG-3100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13537302#comment-13537302 ] Koji Noguchi commented on PIG-3100: --- bq. Perhaps there can be a flag or setting for PigStorage that is a "strict" mode That sounds like a nice feature to have. Come to think of it, the problem of delimiters is not unique to this .pig_schema file loading. > If a .pig_schema file is present, can get an index out of bounds error > -- > > Key: PIG-3100 > URL: https://issues.apache.org/jira/browse/PIG-3100 > Project: Pig > Issue Type: Bug >Reporter: Jonathan Coveney >Assignee: Jonathan Coveney > Fix For: 0.12 > > Attachments: PIG-3100-0_nows.patch, PIG-3100-0.patch > > > In the case that a .pig_schema file is present, if you have a record with > fewer than expected fields, pig errors out with an index out of bounds > exception that is annoying, unnecessary, and unhelpful. > Instead of improving logging, I decided to just do what pig should do, which > is fill in the records. > Patch will include a test and the fix. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (PIG-3102) Option for PigStorage load to error out when input record is incomplete (instead of filling in null)
Koji Noguchi created PIG-3102: - Summary: Option for PigStorage load to error out when input record is incomplete (instead of filling in null) Key: PIG-3102 URL: https://issues.apache.org/jira/browse/PIG-3102 Project: Pig Issue Type: New Feature Reporter: Koji Noguchi Priority: Minor Continuing from PIG-3100. If users know that all input records have the correct number of fields, then enforcing that (with an option) would let us catch input corruption early. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
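A hedged sketch of what such an option might look like (hypothetical helper names, not PigStorage's actual API): given a schema with a known field count, either pad short records with nulls (the behavior PIG-3100 introduced) or fail fast under a strict flag.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class StrictFieldCheck {
    // Hypothetical helper: split a record on a delimiter and enforce the
    // expected field count from the schema.
    static List<String> parseRecord(String line, char delim, int expectedFields, boolean strict) {
        // Pattern.quote so delimiters that are regex metacharacters split literally.
        List<String> fields = new ArrayList<>(
                Arrays.asList(line.split(Pattern.quote(String.valueOf(delim)), -1)));
        if (fields.size() < expectedFields) {
            if (strict) {
                // Strict mode: surface input corruption early instead of masking it.
                throw new IllegalArgumentException(
                        "Expected " + expectedFields + " fields, got " + fields.size());
            }
            while (fields.size() < expectedFields) {
                fields.add(null); // lenient mode: fill in nulls, as PIG-3100 does
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(parseRecord("a:b", ':', 3, false)); // [a, b, null]
        try {
            parseRecord("a:b", ':', 3, true);
        } catch (IllegalArgumentException e) {
            System.out.println("strict mode rejected the record");
        }
    }
}
```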
[jira] [Created] (PIG-3147) Spill failing with "java.lang.RuntimeException: InternalCachedBag.spill() should not be called"
Koji Noguchi created PIG-3147: - Summary: Spill failing with "java.lang.RuntimeException: InternalCachedBag.spill() should not be called" Key: PIG-3147 URL: https://issues.apache.org/jira/browse/PIG-3147 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.11 Reporter: Koji Noguchi Priority: Blocker Tried the 0.11 jar with spilling; my job failed to spill with the following stack trace. Anyone else seeing this? {noformat} java.lang.RuntimeException: InternalCachedBag.spill() should not be called at org.apache.pig.data.InternalCachedBag.spill(InternalCachedBag.java:167) at org.apache.pig.impl.util.SpillableMemoryManager.handleNotification(SpillableMemoryManager.java:243) at sun.management.NotificationEmitterSupport.sendNotification(NotificationEmitterSupport.java:138) at sun.management.MemoryImpl.createNotification(MemoryImpl.java:171) at sun.management.MemoryPoolImpl$PoolSensor.triggerAction(MemoryPoolImpl.java:272) at sun.management.Sensor.trigger(Sensor.java:120) Exception in thread "Low Memory Detector" java.lang.InternalError: Error in invoking listener at sun.management.NotificationEmitterSupport.sendNotification(NotificationEmitterSupport.java:141) at sun.management.MemoryImpl.createNotification(MemoryImpl.java:171) at sun.management.MemoryPoolImpl$PoolSensor.triggerAction(MemoryPoolImpl.java:272) at sun.management.Sensor.trigger(Sensor.java:120) {noformat} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-3147) Spill failing with "java.lang.RuntimeException: InternalCachedBag.spill() should not be called"
[ https://issues.apache.org/jira/browse/PIG-3147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13565698#comment-13565698 ] Koji Noguchi commented on PIG-3147: --- Dumping a stacktrace when an InternalCachedBag is added to SpillableMemoryManager.spillables: {noformat} java.lang.Exception: Stack trace at java.lang.Thread.dumpStack(Thread.java:1206) at org.apache.pig.impl.util.SpillableMemoryManager.registerSpillable(SpillableMemoryManager.java:296) at org.apache.pig.data.DefaultAbstractBag.markSpillableIfNecessary(DefaultAbstractBag.java:101) at org.apache.pig.data.InternalCachedBag.addDone(InternalCachedBag.java:131) at org.apache.pig.data.InternalCachedBag.iterator(InternalCachedBag.java:159) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.getNext(POProject.java:456) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:308) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POFilter.getNext(POFilter.java:95) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:308) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:241) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:308) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POSortedDistinct.getNext(POSortedDistinct.java:62) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:432) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.processInputBag(POProject.java:581) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.PORelationToExprProject.getNext(PORelationToExprProject.java:107) at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:334) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.processInput(POUserFunc.java:228) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:282) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:416) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:348) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:372) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:297) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.runPipeline(PigGenericMapReduce.java:465) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.processOnePackageOutput(PigGenericMapReduce.java:433) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.reduce(PigGenericMapReduce.java:413) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.reduce(PigGenericMapReduce.java:257) at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176) at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:649) at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:417) at org.apache.hadoop.mapred.Child$4.run(Child.java:255) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1093) at org.apache.hadoop.mapred.Child.main(Child.java:249) {noformat} Is this from a change in PIG-2923? 
> Spill failing with "java.lang.RuntimeException: InternalCachedBag.spill() > should not be called" > --- > > Key: PIG-3147 > URL: https://issues.apache.org/jira/browse/PIG-3147 > Project: Pig > Issue Type: Bug > Components: impl >Affects Versions: 0.11 >Reporter: Koji Noguchi >Priority: Blocker > > Tried 0.11 jar with spilling, my job failed to spill with the following stack > trace. Anyone else seeing this? > {noformat} > java.lang.RuntimeException: InternalCachedBag.spill() should not be called > at > o
[jira] [Commented] (PIG-2923) Lazily register bags with SpillableMemoryManager
[ https://issues.apache.org/jira/browse/PIG-2923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13565705#comment-13565705 ] Koji Noguchi commented on PIG-2923: --- Hi Dmitriy, I'm seeing a weird error when pig 0.11 tries to spill. Can this change be related? Opened PIG-3147. > Lazily register bags with SpillableMemoryManager > > > Key: PIG-2923 > URL: https://issues.apache.org/jira/browse/PIG-2923 > Project: Pig > Issue Type: Improvement >Reporter: Dmitriy V. Ryaboy >Assignee: Dmitriy V. Ryaboy > Fix For: 0.11 > > Attachments: bagspill_delayed_register.patch, bagspill_delay.patch > > > Currently, all Spillable DataBags get registered by the BagFactory at the > moment of creation. In practice, a lot of these bags will not get large > enough to be worth spilling; we can avoid a lot of memory overhead and > cheapen the process of finding a bag to spill when we do need it, by allowing > Bags themselves to register when they grow to some respectable threshold. > Related JIRAs: PIG-2917, PIG-2918 -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
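The idea in PIG-2923 can be sketched generically: instead of registering with the memory manager at creation, a bag registers itself (at most once) only when its contents cross a size threshold. This is a hypothetical illustration of the pattern, not the actual Pig DataBag code; names like `registerCallback` are placeholders.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class LazyRegisterBag {
    private final List<Object> contents = new ArrayList<>();
    private final long threshold;
    private final Consumer<LazyRegisterBag> registerCallback;
    private boolean registered = false;

    LazyRegisterBag(long threshold, Consumer<LazyRegisterBag> registerCallback) {
        this.threshold = threshold;
        this.registerCallback = registerCallback;
    }

    void add(Object tuple) {
        contents.add(tuple);
        if (!registered && contents.size() >= threshold) {
            registered = true;              // register at most once
            registerCallback.accept(this);  // stand-in for registering as spillable
        }
    }

    boolean isRegistered() { return registered; }

    // Helper for demonstration: how many times registration fires for N adds.
    static int registrationsAfterAdds(long threshold, int adds) {
        int[] count = {0};
        LazyRegisterBag bag = new LazyRegisterBag(threshold, b -> count[0]++);
        for (int i = 0; i < adds; i++) bag.add("tuple");
        return count[0];
    }

    public static void main(String[] args) {
        System.out.println(registrationsAfterAdds(3, 2));  // 0: never crossed the threshold
        System.out.println(registrationsAfterAdds(3, 10)); // 1: registered once, at the threshold
    }
}
```

Small bags that never reach the threshold never touch the manager at all, which is where the memory and scan-time savings come from.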
[jira] [Updated] (PIG-3147) Spill failing with "java.lang.RuntimeException: InternalCachedBag.spill() should not be called"
[ https://issues.apache.org/jira/browse/PIG-3147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Noguchi updated PIG-3147: -- Attachment: pig-3147-v01.txt Reading PIG-975, InternalCachedBag should not register with SpillableMemoryManager. This is just a pure guess, but uploading a patch that takes out markSpillableIfNecessary from InternalCachedBag.java. > Spill failing with "java.lang.RuntimeException: InternalCachedBag.spill() > should not be called" > --- > > Key: PIG-3147 > URL: https://issues.apache.org/jira/browse/PIG-3147 > Project: Pig > Issue Type: Bug > Components: impl >Affects Versions: 0.11 >Reporter: Koji Noguchi >Priority: Blocker > Attachments: pig-3147-v01.txt > > > Tried 0.11 jar with spilling, my job failed to spill with the following stack > trace. Anyone else seeing this? > {noformat} > java.lang.RuntimeException: InternalCachedBag.spill() should not be called > at > org.apache.pig.data.InternalCachedBag.spill(InternalCachedBag.java:167) > at > org.apache.pig.impl.util.SpillableMemoryManager.handleNotification(SpillableMemoryManager.java:243) > at > sun.management.NotificationEmitterSupport.sendNotification(NotificationEmitterSupport.java:138) > at sun.management.MemoryImpl.createNotification(MemoryImpl.java:171) > at > sun.management.MemoryPoolImpl$PoolSensor.triggerAction(MemoryPoolImpl.java:272) > at sun.management.Sensor.trigger(Sensor.java:120) > Exception in thread "Low Memory Detector" java.lang.InternalError: Error in > invoking listener > at > sun.management.NotificationEmitterSupport.sendNotification(NotificationEmitterSupport.java:141) > at sun.management.MemoryImpl.createNotification(MemoryImpl.java:171) > at > sun.management.MemoryPoolImpl$PoolSensor.triggerAction(MemoryPoolImpl.java:272) > at sun.management.Sensor.trigger(Sensor.java:120) > {noformat} -- This message is automatically generated by JIRA. 
If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (PIG-3148) OutOfMemory exception while spilling stale DefaultDataBag. Extra option to gc() before spilling large bag.
Koji Noguchi created PIG-3148: - Summary: OutOfMemory exception while spilling stale DefaultDataBag. Extra option to gc() before spilling large bag. Key: PIG-3148 URL: https://issues.apache.org/jira/browse/PIG-3148 Project: Pig Issue Type: Improvement Components: impl Reporter: Koji Noguchi Assignee: Koji Noguchi Our user reported that one of their jobs in pig 0.10 occasionally failed with 'Error: GC overhead limit exceeded' or 'Error: Java heap space', but rerunning it sometimes finished successfully. For a 1G-heap reducer, the heap dump showed two huge DefaultDataBags of 300-400 MB each when it failed with OOM. Jstack at the time of OOM always showed that spill was running. {noformat} "Low Memory Detector" daemon prio=10 tid=0xb9c11800 nid=0xa52 runnable [0xb9afc000] java.lang.Thread.State: RUNNABLE at java.io.FileOutputStream.writeBytes(Native Method) at java.io.FileOutputStream.write(FileOutputStream.java:260) at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65) at java.io.BufferedOutputStream.write(BufferedOutputStream.java:109) - locked <0xe57c6390> (a java.io.BufferedOutputStream) at java.io.DataOutputStream.write(DataOutputStream.java:90) - locked <0xe57c60b8> (a java.io.DataOutputStream) at java.io.FilterOutputStream.write(FilterOutputStream.java:80) at org.apache.pig.data.utils.SedesHelper.writeBytes(SedesHelper.java:46) at org.apache.pig.data.BinInterSedes.writeDatum(BinInterSedes.java:537) at org.apache.pig.data.BinInterSedes.writeDatum(BinInterSedes.java:435) at org.apache.pig.data.utils.SedesHelper.writeGenericTuple(SedesHelper.java:135) at org.apache.pig.data.BinInterSedes.writeTuple(BinInterSedes.java:613) at org.apache.pig.data.BinInterSedes.writeDatum(BinInterSedes.java:443) at org.apache.pig.data.DefaultDataBag.spill(DefaultDataBag.java:106) - locked <0xceb16190> (a java.util.ArrayList) at org.apache.pig.impl.util.SpillableMemoryManager.handleNotification(SpillableMemoryManager.java:243) - locked <0xbeb86318> (a 
java.util.LinkedList) at sun.management.NotificationEmitterSupport.sendNotification(NotificationEmitterSupport.java:138) at sun.management.MemoryImpl.createNotification(MemoryImpl.java:171) at sun.management.MemoryPoolImpl$PoolSensor.triggerAction(MemoryPoolImpl.java:272) at sun.management.Sensor.trigger(Sensor.java:120) {noformat} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-3148) OutOfMemory exception while spilling stale DefaultDataBag. Extra option to gc() before spilling large bag.
[ https://issues.apache.org/jira/browse/PIG-3148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13565975#comment-13565975 ] Koji Noguchi commented on PIG-3148: --- I cannot attach the original query, but to give you an idea: {noformat} A = LOAD '$INPUT' USING MyLoader('\u0001') AS ( val1, val2, val3, val4, val5, val6, val7, val8, val9, val10, val11, val12, val13, val14, val15, val16); B = GROUP A BY (val1, val2, val3, val8) PARALLEL $NUM_REDUCERS; C = FOREACH B { D = FILTER A BY (val3 == 'status1' AND val5 == 'status2'); E = D.val4; F = DISTINCT E; G = FILTER D BY val7 == 'status3'; GENERATE group.val1, group.val2, group.val8, COUNT(F), COUNT(G), SUM(G.val9), SUM(G.val10), SUM(G.val11), SUM(A.val12), SUM(A.val13), SUM(A.val15), SUM(A.val14), SUM(A.val16); } STORE C INTO '$OUTPUT' USING PigStorage('\u0001'); {noformat} Assuming that this script does not require two huge DefaultDataBags, I looked into SpillableMemoryManager.handleNotification. 'handleNotification' is called whenever a certain memory condition is met, but not necessarily after a gc(). What was happening in this user's case was: (i) the 400MB DefaultDataBag#1 goes stale. (ii) SpillableMemoryManager.handleNotification is triggered. (iii) Since gc() has not been called yet, the WeakReference is still valid and pig decides to spill, holding the lock on DefaultDataBag#1.mContents (an ArrayList). (iv) While the reduce task is working on another 400MB DefaultDataBag#2, the jvm heap gets full and gc is called. Even though no one is using the stale DefaultDataBag#1, it cannot be GC-ed since the spill is holding the lock. As a result, we end up with two DefaultDataBags, leading to OOM. > OutOfMemory exception while spilling stale DefaultDataBag. Extra option to > gc() before spilling large bag. 
> -- > > Key: PIG-3148 > URL: https://issues.apache.org/jira/browse/PIG-3148 > Project: Pig > Issue Type: Improvement > Components: impl >Reporter: Koji Noguchi >Assignee: Koji Noguchi > > Our user reported that one of their jobs in pig 0.10 occasionally failed with > 'Error: GC overhead limit exceeded' or 'Error: Java heap space', but > rerunning it sometimes finishes successfully. > For 1G heap reducer, heap dump showed it contained two huge DefaultDataBag > with 300-400MBytes each when failing with OOM. > Jstack at the time of OOM always showed that spill was running. > {noformat} > "Low Memory Detector" daemon prio=10 tid=0xb9c11800 nid=0xa52 runnable > [0xb9afc000] >java.lang.Thread.State: RUNNABLE > at java.io.FileOutputStream.writeBytes(Native Method) > at java.io.FileOutputStream.write(FileOutputStream.java:260) > at > java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65) > at java.io.BufferedOutputStream.write(BufferedOutputStream.java:109) > - locked <0xe57c6390> (a java.io.BufferedOutputStream) > at java.io.DataOutputStream.write(DataOutputStream.java:90) > - locked <0xe57c60b8> (a java.io.DataOutputStream) > at java.io.FilterOutputStream.write(FilterOutputStream.java:80) > at org.apache.pig.data.utils.SedesHelper.writeBytes(SedesHelper.java:46) > at org.apache.pig.data.BinInterSedes.writeDatum(BinInterSedes.java:537) > at org.apache.pig.data.BinInterSedes.writeDatum(BinInterSedes.java:435) > at > org.apache.pig.data.utils.SedesHelper.writeGenericTuple(SedesHelper.java:135) > at org.apache.pig.data.BinInterSedes.writeTuple(BinInterSedes.java:613) > at org.apache.pig.data.BinInterSedes.writeDatum(BinInterSedes.java:443) > at org.apache.pig.data.DefaultDataBag.spill(DefaultDataBag.java:106) > - locked <0xceb16190> (a java.util.ArrayList) > at > org.apache.pig.impl.util.SpillableMemoryManager.handleNotification(SpillableMemoryManager.java:243) > - locked <0xbeb86318> (a java.util.LinkedList) > at > 
sun.management.NotificationEmitterSupport.sendNotification(NotificationEmitterSupport.java:138) > at sun.management.MemoryImpl.createNotification(MemoryImpl.java:171) > at > sun.management.MemoryPoolImpl$PoolSensor.triggerAction(MemoryPoolImpl.java:272) > at sun.management.Sensor.trigger(Sensor.java:120) > {noformat} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
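The role of the WeakReference in step (iii) of the comment above can be shown in isolation: a weakly referenced object is not cleared until a collection actually runs, so code polling WeakReferences between GCs still sees a stale object as live. A minimal sketch (GC timing is JVM-dependent, so the post-gc print is only indicative, not guaranteed):

```java
import java.lang.ref.WeakReference;

public class StaleWeakRef {
    public static void main(String[] args) {
        Object bag = new byte[1024]; // stands in for a stale DefaultDataBag
        WeakReference<Object> ref = new WeakReference<>(bag);

        bag = null; // the bag is now stale: no strong references remain
        // ...but until a GC cycle runs, the WeakReference still reports it live.
        // That is exactly the window in which handleNotification decides to spill it.
        System.out.println("before gc, still visible: " + (ref.get() != null));

        System.gc(); // only a hint; clearing is not guaranteed on every JVM
        System.out.println("after gc, cleared: " + (ref.get() == null));
    }
}
```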
[jira] [Updated] (PIG-3148) OutOfMemory exception while spilling stale DefaultDataBag. Extra option to gc() before spilling large bag.
[ https://issues.apache.org/jira/browse/PIG-3148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Noguchi updated PIG-3148: -- Attachment: pig-3148-v01.patch Uploading a patch that adds a feature to call System.gc() before spilling when the Spillable is bigger than 'pig.spill.extragc.size.threshold'. This extra gc() is called at most once per handleNotification and is disabled by default, since adding a GC risks changing performance drastically. For the job I was looking at, adding '-Dpig.spill.extragc.size.threshold=1' let the job run successfully with no OOM errors. (Note: a separate spill issue on 0.11, predating this patch, is tracked at PIG-3147.) > OutOfMemory exception while spilling stale DefaultDataBag. Extra option to > gc() before spilling large bag. > -- > > Key: PIG-3148 > URL: https://issues.apache.org/jira/browse/PIG-3148 > Project: Pig > Issue Type: Improvement > Components: impl >Reporter: Koji Noguchi >Assignee: Koji Noguchi > Attachments: pig-3148-v01.patch > > > Our user reported that one of their jobs in pig 0.10 occasionally failed with > 'Error: GC overhead limit exceeded' or 'Error: Java heap space', but > rerunning it sometimes finishes successfully. > For 1G heap reducer, heap dump showed it contained two huge DefaultDataBag > with 300-400MBytes each when failing with OOM. > Jstack at the time of OOM always showed that spill was running. 
> {noformat} > "Low Memory Detector" daemon prio=10 tid=0xb9c11800 nid=0xa52 runnable > [0xb9afc000] >java.lang.Thread.State: RUNNABLE > at java.io.FileOutputStream.writeBytes(Native Method) > at java.io.FileOutputStream.write(FileOutputStream.java:260) > at > java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65) > at java.io.BufferedOutputStream.write(BufferedOutputStream.java:109) > - locked <0xe57c6390> (a java.io.BufferedOutputStream) > at java.io.DataOutputStream.write(DataOutputStream.java:90) > - locked <0xe57c60b8> (a java.io.DataOutputStream) > at java.io.FilterOutputStream.write(FilterOutputStream.java:80) > at org.apache.pig.data.utils.SedesHelper.writeBytes(SedesHelper.java:46) > at org.apache.pig.data.BinInterSedes.writeDatum(BinInterSedes.java:537) > at org.apache.pig.data.BinInterSedes.writeDatum(BinInterSedes.java:435) > at > org.apache.pig.data.utils.SedesHelper.writeGenericTuple(SedesHelper.java:135) > at org.apache.pig.data.BinInterSedes.writeTuple(BinInterSedes.java:613) > at org.apache.pig.data.BinInterSedes.writeDatum(BinInterSedes.java:443) > at org.apache.pig.data.DefaultDataBag.spill(DefaultDataBag.java:106) > - locked <0xceb16190> (a java.util.ArrayList) > at > org.apache.pig.impl.util.SpillableMemoryManager.handleNotification(SpillableMemoryManager.java:243) > - locked <0xbeb86318> (a java.util.LinkedList) > at > sun.management.NotificationEmitterSupport.sendNotification(NotificationEmitterSupport.java:138) > at sun.management.MemoryImpl.createNotification(MemoryImpl.java:171) > at > sun.management.MemoryPoolImpl$PoolSensor.triggerAction(MemoryPoolImpl.java:272) > at sun.management.Sensor.trigger(Sensor.java:120) > {noformat} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
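The option described in the update above can be sketched as a small policy object. This is a hypothetical illustration inferred from the comment, not the actual patch: the class name, the assumption that a non-positive threshold disables the feature, and the once-per-notification reset are all assumptions.

```java
public class ExtraGcPolicy {
    // Sketch of the pig-3148 idea: gc() before spilling only for large bags,
    // at most once per handleNotification, disabled when the threshold is unset.
    private final long sizeThreshold; // e.g. pig.spill.extragc.size.threshold; <=0 disables
    private boolean gcCalledThisNotification = false;

    ExtraGcPolicy(long sizeThreshold) {
        this.sizeThreshold = sizeThreshold;
    }

    // Reset at the start of each memory notification.
    void startNotification() {
        gcCalledThisNotification = false;
    }

    // Returns true when an extra System.gc() should run before spilling this bag.
    boolean shouldGcBeforeSpill(long spillableSize) {
        if (sizeThreshold <= 0 || gcCalledThisNotification || spillableSize < sizeThreshold) {
            return false;
        }
        gcCalledThisNotification = true;
        return true;
    }

    public static void main(String[] args) {
        ExtraGcPolicy p = new ExtraGcPolicy(100);
        p.startNotification();
        System.out.println(p.shouldGcBeforeSpill(50));  // false: below threshold
        System.out.println(p.shouldGcBeforeSpill(200)); // true: first large bag this notification
        System.out.println(p.shouldGcBeforeSpill(300)); // false: at most one gc per notification
    }
}
```

The point of the gc() is to clear stale bags (like DefaultDataBag#1 in the PIG-3148 comment) before spending I/O spilling them while holding their locks.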
[jira] [Created] (PIG-3178) Print a stacktrace when ExecutableManager hits an OOM
Koji Noguchi created PIG-3178: - Summary: Print a stacktrace when ExecutableManager hits an OOM Key: PIG-3178 URL: https://issues.apache.org/jira/browse/PIG-3178 Project: Pig Issue Type: Improvement Reporter: Koji Noguchi Assignee: Koji Noguchi Priority: Trivial When looking at a user's pig streaming job failing with OOM, the log only showed 2013-02-09 03:35:08,694 ERROR [Thread-14] org.apache.pig.impl.streaming.ExecutableManager: java.lang.OutOfMemoryError: Java heap space It would have been nice if it also showed the stack trace. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-3178) Print a stacktrace when ExecutableManager hits an OOM
[ https://issues.apache.org/jira/browse/PIG-3178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Noguchi updated PIG-3178: -- Attachment: pig-3178-trunk-v01.patch Adding printStackTrace call. Since it's only a logging change, no test is being added. > Print a stacktrace when ExecutableManager hits an OOM > - > > Key: PIG-3178 > URL: https://issues.apache.org/jira/browse/PIG-3178 > Project: Pig > Issue Type: Improvement >Reporter: Koji Noguchi >Assignee: Koji Noguchi >Priority: Trivial > Attachments: pig-3178-trunk-v01.patch > > > When looking at user's pig streaming failing with OOM, it only showed > 2013-02-09 03:35:08,694 ERROR [Thread-14] > org.apache.pig.impl.streaming.ExecutableManager: java.lang.OutOfMemoryError: > Java heap space > It would have been nice if it also showed the stack trace. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-3178) Print a stacktrace when ExecutableManager hits an OOM
[ https://issues.apache.org/jira/browse/PIG-3178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Noguchi updated PIG-3178: -- Status: Patch Available (was: Open) > Print a stacktrace when ExecutableManager hits an OOM > - > > Key: PIG-3178 > URL: https://issues.apache.org/jira/browse/PIG-3178 > Project: Pig > Issue Type: Improvement >Reporter: Koji Noguchi >Assignee: Koji Noguchi >Priority: Trivial > Attachments: pig-3178-trunk-v01.patch > > > When looking at user's pig streaming failing with OOM, it only showed > 2013-02-09 03:35:08,694 ERROR [Thread-14] > org.apache.pig.impl.streaming.ExecutableManager: java.lang.OutOfMemoryError: > Java heap space > It would have been nice if it also showed the stack trace. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-3178) Print a stacktrace when ExecutableManager hits an OOM
[ https://issues.apache.org/jira/browse/PIG-3178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13575869#comment-13575869 ] Koji Noguchi commented on PIG-3178: --- bq. Doesn't LOG.error(t); log the stacktrace? I thought it only prints out the cause. After adding the printStackTrace call, log showed {noformat} java.lang.OutOfMemoryError: Java heap space at java.util.Arrays.copyOf(Arrays.java:2271) at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113) at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93) at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:122) at org.apache.pig.builtin.PigStreaming.serialize(PigStreaming.java:76) at org.apache.pig.impl.streaming.InputHandler.putNext(InputHandler.java:66) at org.apache.pig.impl.streaming.ExecutableManager$ProcessInputThread.run(ExecutableManager.java:367) {noformat} which helped me identify the killer record filling up the heap. > Print a stacktrace when ExecutableManager hits an OOM > - > > Key: PIG-3178 > URL: https://issues.apache.org/jira/browse/PIG-3178 > Project: Pig > Issue Type: Improvement >Reporter: Koji Noguchi >Assignee: Koji Noguchi >Priority: Trivial > Attachments: pig-3178-trunk-v01.patch > > > When looking at user's pig streaming failing with OOM, it only showed > 2013-02-09 03:35:08,694 ERROR [Thread-14] > org.apache.pig.impl.streaming.ExecutableManager: java.lang.OutOfMemoryError: > Java heap space > It would have been nice if it also showed the stack trace. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-3178) Print a stacktrace when ExecutableManager hits an OOM
[ https://issues.apache.org/jira/browse/PIG-3178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Noguchi updated PIG-3178: -- Status: Open (was: Patch Available) > Print a stacktrace when ExecutableManager hits an OOM > - > > Key: PIG-3178 > URL: https://issues.apache.org/jira/browse/PIG-3178 > Project: Pig > Issue Type: Improvement >Reporter: Koji Noguchi >Assignee: Koji Noguchi >Priority: Trivial > Attachments: pig-3178-trunk-v01.patch > > > When looking at user's pig streaming failing with OOM, it only showed > 2013-02-09 03:35:08,694 ERROR [Thread-14] > org.apache.pig.impl.streaming.ExecutableManager: java.lang.OutOfMemoryError: > Java heap space > It would have been nice if it also showed the stack trace. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-3178) Print a stacktrace when ExecutableManager hits an OOM
[ https://issues.apache.org/jira/browse/PIG-3178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Noguchi updated PIG-3178: -- Attachment: pig-3178-trunk-v02.patch bq. Can you just add a message - LOG.error("Error running blah blah", t); - so that the stacktrace gets logged. Ah, i see. Uploading. Confirmed that this also logs the stacktrace. Now in syslog, even better. > Print a stacktrace when ExecutableManager hits an OOM > - > > Key: PIG-3178 > URL: https://issues.apache.org/jira/browse/PIG-3178 > Project: Pig > Issue Type: Improvement >Reporter: Koji Noguchi >Assignee: Koji Noguchi >Priority: Trivial > Attachments: pig-3178-trunk-v01.patch, pig-3178-trunk-v02.patch > > > When looking at user's pig streaming failing with OOM, it only showed > 2013-02-09 03:35:08,694 ERROR [Thread-14] > org.apache.pig.impl.streaming.ExecutableManager: java.lang.OutOfMemoryError: > Java heap space > It would have been nice if it also showed the stack trace. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
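The distinction discussed in this thread can be illustrated with nothing but the JDK (a sketch, not the Pig patch itself: the real code uses commons-logging, but the underlying difference is a throwable's toString() versus its stack frames):

```java
import java.io.PrintWriter;
import java.io.StringWriter;

public class ThrowableLogDemo {
    public static void main(String[] args) {
        Throwable t = new OutOfMemoryError("Java heap space");

        // Roughly what LOG.error(t) printed: only class name + message.
        String causeOnly = t.toString();

        // What printStackTrace (or a two-argument LOG.error("msg", t)) adds:
        // the "\tat ..." frames identifying where the OOM was thrown.
        StringWriter sw = new StringWriter();
        t.printStackTrace(new PrintWriter(sw));
        String withTrace = sw.toString();

        System.out.println(causeOnly.contains("\tat "));  // false: no frames
        System.out.println(withTrace.contains("\tat "));  // true: frames present
    }
}
```

This is why the v02 patch's two-argument call gets the frames into the log where the one-argument form did not.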
[jira] [Updated] (PIG-3178) Print a stacktrace when ExecutableManager hits an OOM
[ https://issues.apache.org/jira/browse/PIG-3178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Noguchi updated PIG-3178: -- Status: Patch Available (was: Open) > Print a stacktrace when ExecutableManager hits an OOM > - > > Key: PIG-3178 > URL: https://issues.apache.org/jira/browse/PIG-3178 > Project: Pig > Issue Type: Improvement >Reporter: Koji Noguchi >Assignee: Koji Noguchi >Priority: Trivial > Attachments: pig-3178-trunk-v01.patch, pig-3178-trunk-v02.patch > > > When looking at user's pig streaming failing with OOM, it only showed > 2013-02-09 03:35:08,694 ERROR [Thread-14] > org.apache.pig.impl.streaming.ExecutableManager: java.lang.OutOfMemoryError: > Java heap space > It would have been nice if it also showed the stack trace. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-3178) Print a stacktrace when ExecutableManager hits an OOM
[ https://issues.apache.org/jira/browse/PIG-3178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Noguchi updated PIG-3178: -- Attachment: pig-3178-trunk-v03.patch Sorry for the spam. Rohini pointed out that the last patch was based on 0.10 and not trunk. Re-uploading. > Print a stacktrace when ExecutableManager hits an OOM > - > > Key: PIG-3178 > URL: https://issues.apache.org/jira/browse/PIG-3178 > Project: Pig > Issue Type: Improvement >Reporter: Koji Noguchi >Assignee: Koji Noguchi >Priority: Trivial > Attachments: pig-3178-trunk-v01.patch, pig-3178-trunk-v02.patch, > pig-3178-trunk-v03.patch > > > When looking at user's pig streaming failing with OOM, it only showed > 2013-02-09 03:35:08,694 ERROR [Thread-14] > org.apache.pig.impl.streaming.ExecutableManager: java.lang.OutOfMemoryError: > Java heap space > It would have been nice if it also showed the stack trace. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (PIG-3179) Task Information Header only prints out the first split for each task
Koji Noguchi created PIG-3179: - Summary: Task Information Header only prints out the first split for each task Key: PIG-3179 URL: https://issues.apache.org/jira/browse/PIG-3179 Project: Pig Issue Type: Improvement Reporter: Koji Noguchi Assignee: Koji Noguchi Priority: Trivial When a task's PigSplit contains more than one wrappedSplit, it only logs the first split's file info. When debugging, I saw {noformat} = Task Information Header = Command: bash Start time: Mon Feb 11 16:41:21 UTC 2013 Input-split file: hdfs://abc.bcd.efg:8020/tmp/hij/part-r-0.bz2 Input-split start-offset: 0 Input-split length: 11854247 {noformat} but the actual error was happening while reading part-r-7.bz2. It would have been nice if the log showed all the inputs that the task was going to read. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-3179) Task Information Header only prints out the first split for each task
[ https://issues.apache.org/jira/browse/PIG-3179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Noguchi updated PIG-3179: -- Attachment: pig-3179-v01.patch Added for-loop to print all the splits. > Task Information Header only prints out the first split for each task > - > > Key: PIG-3179 > URL: https://issues.apache.org/jira/browse/PIG-3179 > Project: Pig > Issue Type: Improvement >Reporter: Koji Noguchi >Assignee: Koji Noguchi >Priority: Trivial > Attachments: pig-3179-v01.patch > > > When a task's PigSplit is containing more than wrappedSplit, it only logs the > first fileinfo. > When debugging, I saw > {noformat} > = Task Information Header = > Command: bash > Start time: Mon Feb 11 16:41:21 UTC 2013 > Input-split file: hdfs://abc.bcd.efg:8020/tmp/hij/part-r-0.bz2 > Input-split start-offset: 0Input-split length: 11854247 > {noformat} > but the actual error was happing while reading part-r-7.bz2. It would > have been nice if the log showed all the info that task was going to read. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-3179) Task Information Header only prints out the first split for each task
[ https://issues.apache.org/jira/browse/PIG-3179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Noguchi updated PIG-3179: -- Attachment: pig-3179-v02.patch Changed based on Rohini's suggestion. Added extra line printing out the number of input splits. {noformat} PigSplit contains 11 wrappedSplits. Input-split: file=hdfs://abc.def.com:8020/tmp/hij/part-r-00032.bz2 start-offset=0 length=11814548 Input-split: file=hdfs://abc.def.com:8020/tmp/hij/part-r-00033.bz2 start-offset=0 length=11953088 Input-split: file=hdfs://abc.def.com:8020/tmp/hij/part-r-00034.bz2 start-offset=0 length=12122182 Input-split: file=hdfs://abc.def... ... {noformat} > Task Information Header only prints out the first split for each task > - > > Key: PIG-3179 > URL: https://issues.apache.org/jira/browse/PIG-3179 > Project: Pig > Issue Type: Improvement >Reporter: Koji Noguchi >Assignee: Koji Noguchi >Priority: Trivial > Attachments: pig-3179-v01.patch, pig-3179-v02.patch > > > When a task's PigSplit is containing more than wrappedSplit, it only logs the > first fileinfo. > When debugging, I saw > {noformat} > = Task Information Header = > Command: bash > Start time: Mon Feb 11 16:41:21 UTC 2013 > Input-split file: hdfs://abc.bcd.efg:8020/tmp/hij/part-r-0.bz2 > Input-split start-offset: 0Input-split length: 11854247 > {noformat} > but the actual error was happing while reading part-r-7.bz2. It would > have been nice if the log showed all the info that task was going to read. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-3179) Task Information Header only prints out the first split for each task
[ https://issues.apache.org/jira/browse/PIG-3179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Noguchi updated PIG-3179: -- Attachment: pig-3179-v03.patch bq. Hi Koji Noguchi, minor thing - calling toString on Path might be redundant? Thanks Prashant. Updated patch. > Task Information Header only prints out the first split for each task > - > > Key: PIG-3179 > URL: https://issues.apache.org/jira/browse/PIG-3179 > Project: Pig > Issue Type: Improvement >Reporter: Koji Noguchi >Assignee: Koji Noguchi >Priority: Trivial > Attachments: pig-3179-v01.patch, pig-3179-v02.patch, > pig-3179-v03.patch > > > When a task's PigSplit is containing more than wrappedSplit, it only logs the > first fileinfo. > When debugging, I saw > {noformat} > = Task Information Header = > Command: bash > Start time: Mon Feb 11 16:41:21 UTC 2013 > Input-split file: hdfs://abc.bcd.efg:8020/tmp/hij/part-r-0.bz2 > Input-split start-offset: 0Input-split length: 11854247 > {noformat} > but the actual error was happing while reading part-r-7.bz2. It would > have been nice if the log showed all the info that task was going to read. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-3179) Task Information Header only prints out the first split for each task
[ https://issues.apache.org/jira/browse/PIG-3179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Noguchi updated PIG-3179: -- Attachment: pig-3179-v04.patch bq. Can you reuse the StringBuilder. i.e move StringBuilder sb = new StringBuilder(); outside of the loop and inside the loop set sb.setLength(0); Attaching an updated patch. > Task Information Header only prints out the first split for each task > - > > Key: PIG-3179 > URL: https://issues.apache.org/jira/browse/PIG-3179 > Project: Pig > Issue Type: Improvement >Reporter: Koji Noguchi >Assignee: Koji Noguchi >Priority: Trivial > Attachments: pig-3179-v01.patch, pig-3179-v02.patch, > pig-3179-v03.patch, pig-3179-v04.patch > > > When a task's PigSplit is containing more than wrappedSplit, it only logs the > first fileinfo. > When debugging, I saw > {noformat} > = Task Information Header = > Command: bash > Start time: Mon Feb 11 16:41:21 UTC 2013 > Input-split file: hdfs://abc.bcd.efg:8020/tmp/hij/part-r-0.bz2 > Input-split start-offset: 0Input-split length: 11854247 > {noformat} > but the actual error was happing while reading part-r-7.bz2. It would > have been nice if the log showed all the info that task was going to read. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
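The pattern agreed on in this review (loop over every wrapped split, reusing one StringBuilder via setLength(0) instead of allocating a new one per iteration) can be sketched as below. This is an illustration only, not the attached patch: SplitInfo is a made-up stand-in for the real split metadata.

```java
import java.util.Arrays;
import java.util.List;

public class SplitHeaderDemo {
    // Hypothetical stand-in for the per-split info the header logs.
    public static final class SplitInfo {
        public final String file;
        public final long start;
        public final long length;
        public SplitInfo(String file, long start, long length) {
            this.file = file; this.start = start; this.length = length;
        }
    }

    public static String header(List<SplitInfo> wrappedSplits) {
        StringBuilder out = new StringBuilder();
        out.append("PigSplit contains ").append(wrappedSplits.size())
           .append(" wrappedSplits.\n");
        StringBuilder sb = new StringBuilder();   // reused across iterations
        for (SplitInfo s : wrappedSplits) {
            sb.setLength(0);                      // reset instead of reallocating
            sb.append("Input-split: file=").append(s.file)
              .append(" start-offset=").append(s.start)
              .append(" length=").append(s.length);
            out.append(sb).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<SplitInfo> splits = Arrays.asList(
            new SplitInfo("hdfs://abc.def.com:8020/tmp/hij/part-r-00032.bz2", 0, 11814548),
            new SplitInfo("hdfs://abc.def.com:8020/tmp/hij/part-r-00033.bz2", 0, 11953088));
        System.out.print(header(splits));
    }
}
```

With this shape, the header lists every input the task will read, matching the sample output in the v02 comment above.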
[jira] [Commented] (PIG-3148) OutOfMemory exception while spilling stale DefaultDataBag. Extra option to gc() before spilling large bag.
[ https://issues.apache.org/jira/browse/PIG-3148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13577743#comment-13577743 ] Koji Noguchi commented on PIG-3148: --- Rohini asked me to clarify why I'm adding an extra param instead of simply calling gc() at the top of handleNotification(). The reason I added an extra param is: * When I tried just adding gc() at the top, suddenly I saw all of my mappers stuck, spending 99% of cputime on gc. I then learned that handleNotification is called much more frequently than I first anticipated when the application is using more than the threshold and has nothing much to spill. That convinced me to add an extra condition to reduce the gc() calls. * The motivation of my patch is to avoid OutOfMemory when the application is holding a reference to a large stale bag while spilling unnecessarily. For that, the bag being spilled has to be large in proportion to the heap size of the application to cause OOM. > OutOfMemory exception while spilling stale DefaultDataBag. Extra option to > gc() before spilling large bag. > -- > > Key: PIG-3148 > URL: https://issues.apache.org/jira/browse/PIG-3148 > Project: Pig > Issue Type: Improvement > Components: impl >Reporter: Koji Noguchi >Assignee: Koji Noguchi > Attachments: pig-3148-v01.patch > > > Our user reported that one of their jobs in pig 0.10 occasionally failed with > 'Error: GC overhead limit exceeded' or 'Error: Java heap space', but > rerunning it sometimes finishes successfully. > For 1G heap reducer, heap dump showed it contained two huge DefaultDataBag > with 300-400MBytes each when failing with OOM. > Jstack at the time of OOM always showed that spill was running.
> {noformat} > "Low Memory Detector" daemon prio=10 tid=0xb9c11800 nid=0xa52 runnable > [0xb9afc000] >java.lang.Thread.State: RUNNABLE > at java.io.FileOutputStream.writeBytes(Native Method) > at java.io.FileOutputStream.write(FileOutputStream.java:260) > at > java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65) > at java.io.BufferedOutputStream.write(BufferedOutputStream.java:109) > - locked <0xe57c6390> (a java.io.BufferedOutputStream) > at java.io.DataOutputStream.write(DataOutputStream.java:90) > - locked <0xe57c60b8> (a java.io.DataOutputStream) > at java.io.FilterOutputStream.write(FilterOutputStream.java:80) > at org.apache.pig.data.utils.SedesHelper.writeBytes(SedesHelper.java:46) > at org.apache.pig.data.BinInterSedes.writeDatum(BinInterSedes.java:537) > at org.apache.pig.data.BinInterSedes.writeDatum(BinInterSedes.java:435) > at > org.apache.pig.data.utils.SedesHelper.writeGenericTuple(SedesHelper.java:135) > at org.apache.pig.data.BinInterSedes.writeTuple(BinInterSedes.java:613) > at org.apache.pig.data.BinInterSedes.writeDatum(BinInterSedes.java:443) > at org.apache.pig.data.DefaultDataBag.spill(DefaultDataBag.java:106) > - locked <0xceb16190> (a java.util.ArrayList) > at > org.apache.pig.impl.util.SpillableMemoryManager.handleNotification(SpillableMemoryManager.java:243) > - locked <0xbeb86318> (a java.util.LinkedList) > at > sun.management.NotificationEmitterSupport.sendNotification(NotificationEmitterSupport.java:138) > at sun.management.MemoryImpl.createNotification(MemoryImpl.java:171) > at > sun.management.MemoryPoolImpl$PoolSensor.triggerAction(MemoryPoolImpl.java:272) > at sun.management.Sensor.trigger(Sensor.java:120) > {noformat} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-3148) OutOfMemory exception while spilling stale DefaultDataBag. Extra option to gc() before spilling large bag.
[ https://issues.apache.org/jira/browse/PIG-3148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13578507#comment-13578507 ] Koji Noguchi commented on PIG-3148: --- Thanks Dmitriy, Rohini! I like the fixed ratio suggestion. Is 5% ok? Maybe 10%? Also, do we still want a configurable flag to enable this feature? > OutOfMemory exception while spilling stale DefaultDataBag. Extra option to > gc() before spilling large bag. > -- > > Key: PIG-3148 > URL: https://issues.apache.org/jira/browse/PIG-3148 > Project: Pig > Issue Type: Improvement > Components: impl >Reporter: Koji Noguchi >Assignee: Koji Noguchi > Attachments: pig-3148-v01.patch > > > Our user reported that one of their jobs in pig 0.10 occasionally failed with > 'Error: GC overhead limit exceeded' or 'Error: Java heap space', but > rerunning it sometimes finishes successfully. > For 1G heap reducer, heap dump showed it contained two huge DefaultDataBag > with 300-400MBytes each when failing with OOM. > Jstack at the time of OOM always showed that spill was running. 
> {noformat} > "Low Memory Detector" daemon prio=10 tid=0xb9c11800 nid=0xa52 runnable > [0xb9afc000] >java.lang.Thread.State: RUNNABLE > at java.io.FileOutputStream.writeBytes(Native Method) > at java.io.FileOutputStream.write(FileOutputStream.java:260) > at > java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65) > at java.io.BufferedOutputStream.write(BufferedOutputStream.java:109) > - locked <0xe57c6390> (a java.io.BufferedOutputStream) > at java.io.DataOutputStream.write(DataOutputStream.java:90) > - locked <0xe57c60b8> (a java.io.DataOutputStream) > at java.io.FilterOutputStream.write(FilterOutputStream.java:80) > at org.apache.pig.data.utils.SedesHelper.writeBytes(SedesHelper.java:46) > at org.apache.pig.data.BinInterSedes.writeDatum(BinInterSedes.java:537) > at org.apache.pig.data.BinInterSedes.writeDatum(BinInterSedes.java:435) > at > org.apache.pig.data.utils.SedesHelper.writeGenericTuple(SedesHelper.java:135) > at org.apache.pig.data.BinInterSedes.writeTuple(BinInterSedes.java:613) > at org.apache.pig.data.BinInterSedes.writeDatum(BinInterSedes.java:443) > at org.apache.pig.data.DefaultDataBag.spill(DefaultDataBag.java:106) > - locked <0xceb16190> (a java.util.ArrayList) > at > org.apache.pig.impl.util.SpillableMemoryManager.handleNotification(SpillableMemoryManager.java:243) > - locked <0xbeb86318> (a java.util.LinkedList) > at > sun.management.NotificationEmitterSupport.sendNotification(NotificationEmitterSupport.java:138) > at sun.management.MemoryImpl.createNotification(MemoryImpl.java:171) > at > sun.management.MemoryPoolImpl$PoolSensor.triggerAction(MemoryPoolImpl.java:272) > at sun.management.Sensor.trigger(Sensor.java:120) > {noformat} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
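The fixed-ratio idea discussed above can be sketched as a simple gate (hypothetical names and a hypothetical 5% ratio taken from the discussion; the actual patch's condition may differ): force a gc() before spilling only when the candidate bag is a large fraction of the maximum heap, so the frequent notifications with little to spill never pay the gc cost.

```java
public class GcGateDemo {
    // Hypothetical ratio from the discussion above (5%); presumably tunable.
    public static final double GC_SPILL_RATIO = 0.05;

    // Only worth forcing a gc when the bag about to spill is big enough that
    // a stale, already-unreachable copy of it could by itself cause an OOM.
    public static boolean shouldGcBeforeSpill(long bagSizeBytes, long maxHeapBytes) {
        return bagSizeBytes > (long) (maxHeapBytes * GC_SPILL_RATIO);
    }

    public static void main(String[] args) {
        long heap = 1L << 30;  // pretend 1G heap, as in the reported reducers
        System.out.println(shouldGcBeforeSpill(300L << 20, heap)); // 300MB bag: gc first
        System.out.println(shouldGcBeforeSpill(1L << 20, heap));   // 1MB bag: skip gc
    }
}
```

The gate keeps the expensive System.gc() off the hot path that caused the 99%-cputime-in-gc behavior described in the comment above.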
[jira] [Updated] (PIG-3148) OutOfMemory exception while spilling stale DefaultDataBag. Extra option to gc() before spilling large bag.
[ https://issues.apache.org/jira/browse/PIG-3148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Noguchi updated PIG-3148: -- Attachment: pig-3148-v02.patch Sorry for the delay. Attaching a patch with suggested change. > OutOfMemory exception while spilling stale DefaultDataBag. Extra option to > gc() before spilling large bag. > -- > > Key: PIG-3148 > URL: https://issues.apache.org/jira/browse/PIG-3148 > Project: Pig > Issue Type: Improvement > Components: impl >Reporter: Koji Noguchi >Assignee: Koji Noguchi > Attachments: pig-3148-v01.patch, pig-3148-v02.patch > > > Our user reported that one of their jobs in pig 0.10 occasionally failed with > 'Error: GC overhead limit exceeded' or 'Error: Java heap space', but > rerunning it sometimes finishes successfully. > For 1G heap reducer, heap dump showed it contained two huge DefaultDataBag > with 300-400MBytes each when failing with OOM. > Jstack at the time of OOM always showed that spill was running. > {noformat} > "Low Memory Detector" daemon prio=10 tid=0xb9c11800 nid=0xa52 runnable > [0xb9afc000] >java.lang.Thread.State: RUNNABLE > at java.io.FileOutputStream.writeBytes(Native Method) > at java.io.FileOutputStream.write(FileOutputStream.java:260) > at > java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65) > at java.io.BufferedOutputStream.write(BufferedOutputStream.java:109) > - locked <0xe57c6390> (a java.io.BufferedOutputStream) > at java.io.DataOutputStream.write(DataOutputStream.java:90) > - locked <0xe57c60b8> (a java.io.DataOutputStream) > at java.io.FilterOutputStream.write(FilterOutputStream.java:80) > at org.apache.pig.data.utils.SedesHelper.writeBytes(SedesHelper.java:46) > at org.apache.pig.data.BinInterSedes.writeDatum(BinInterSedes.java:537) > at org.apache.pig.data.BinInterSedes.writeDatum(BinInterSedes.java:435) > at > org.apache.pig.data.utils.SedesHelper.writeGenericTuple(SedesHelper.java:135) > at 
org.apache.pig.data.BinInterSedes.writeTuple(BinInterSedes.java:613) > at org.apache.pig.data.BinInterSedes.writeDatum(BinInterSedes.java:443) > at org.apache.pig.data.DefaultDataBag.spill(DefaultDataBag.java:106) > - locked <0xceb16190> (a java.util.ArrayList) > at > org.apache.pig.impl.util.SpillableMemoryManager.handleNotification(SpillableMemoryManager.java:243) > - locked <0xbeb86318> (a java.util.LinkedList) > at > sun.management.NotificationEmitterSupport.sendNotification(NotificationEmitterSupport.java:138) > at sun.management.MemoryImpl.createNotification(MemoryImpl.java:171) > at > sun.management.MemoryPoolImpl$PoolSensor.triggerAction(MemoryPoolImpl.java:272) > at sun.management.Sensor.trigger(Sensor.java:120) > {noformat} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-2597) Move grunt from javacc to ANTRL
[ https://issues.apache.org/jira/browse/PIG-2597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13603583#comment-13603583 ] Koji Noguchi commented on PIG-2597: --- bq. Jonathan, any update on this? I'm interested in this status as well. Does Boski have a plan to continue working on this? > Move grunt from javacc to ANTRL > --- > > Key: PIG-2597 > URL: https://issues.apache.org/jira/browse/PIG-2597 > Project: Pig > Issue Type: Improvement >Reporter: Jonathan Coveney > Labels: GSoC2012 > Attachments: pig02.diff > > > Currently, the parser for queries is in ANTLR, but Grunt is still javacc. The > parser is very difficult to work with, and next to impossible to understand > or modify. ANTLR provides a much cleaner, more standard way to generate > parsers/lexers/ASTs/etc, and moving from javacc to Grunt would be huge as we > continue to add features to Pig. > This is a candidate project for Google summer of code 2012. More information > about the program can be found at > https://cwiki.apache.org/confluence/display/PIG/GSoc2012 -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (PIG-3251) Bzip2TextInputFormat requires double the memory of maximum record size
Koji Noguchi created PIG-3251: - Summary: Bzip2TextInputFormat requires double the memory of maximum record size Key: PIG-3251 URL: https://issues.apache.org/jira/browse/PIG-3251 Project: Pig Issue Type: Improvement Reporter: Koji Noguchi Assignee: Koji Noguchi Priority: Minor While looking at a user's OOM heap dump, I noticed that Pig's Bzip2TextInputFormat consumes memory in both Bzip2TextInputFormat.buffer (a ByteArrayOutputStream) and the actual Text that is returned as the line. For example, with one 160MByte record, the buffer was 268MBytes and the Text was 160MBytes. We can probably eliminate one of them. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-3251) Bzip2TextInputFormat requires double the memory of maximum record size
[ https://issues.apache.org/jira/browse/PIG-3251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Noguchi updated PIG-3251: -- Attachment: pig-3251-trunk-v01.patch In Bzip2TextInputFormat, it says {code} /** * Provide a bridge to get the bytes from the ByteArrayOutputStream without * creating a new byte array. */ private static class TextStuffer extends OutputStream { {code} However, in reality, Text just creates a new bytearray and copies the content. Attaching a patch that is similar to the approach taken by org.apache.hadoop.util.LineReader but with fewer changes, since HADOOP-4012 (added in 0.21) was a huge patch. This patch basically reads into a fixed-length buffer and appends to the Text whenever it gets full. Touching BZip2LineRecordReader makes me nervous, so I wanted the changes to be small. I need to do more testing to see if this approach works or not. > Bzip2TextInputFormat requires double the memory of maximum record size > -- > > Key: PIG-3251 > URL: https://issues.apache.org/jira/browse/PIG-3251 > Project: Pig > Issue Type: Improvement >Reporter: Koji Noguchi >Assignee: Koji Noguchi >Priority: Minor > Attachments: pig-3251-trunk-v01.patch > > > While looking at user's OOM heap dump, noticed that pig's > Bzip2TextInputFormat consumes memory at both > Bzip2TextInputFormat.buffer (ByteArrayOutputStream) > and actual Text that is returned as line. > For example, when having one record with 160MBytes, buffer was 268MBytes and > Text was 160MBytes. > We can probably eliminate one of them. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
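The "fixed-length buffer, append on full" idea can be sketched like this (an illustration only, not the attached patch: a plain byte sink stands in for Text.append, and the real code reads from the Bzip2 input stream rather than a byte array):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

public class FixedBufferLineDemo {
    // Read one '\n'-terminated record through a small fixed buffer, flushing
    // into the growing sink whenever the buffer fills. Only the sink ever
    // holds the whole record, instead of buffer + Text each holding a copy.
    public static byte[] readRecord(InputStream in, int bufSize) throws IOException {
        byte[] buf = new byte[bufSize];
        ByteArrayOutputStream sink = new ByteArrayOutputStream(); // stand-in for Text.append
        int used = 0, b;
        while ((b = in.read()) != -1 && b != '\n') {
            buf[used++] = (byte) b;
            if (used == bufSize) {      // buffer full: append and reuse it
                sink.write(buf, 0, used);
                used = 0;
            }
        }
        sink.write(buf, 0, used);       // trailing partial buffer
        return sink.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        InputStream in = new ByteArrayInputStream("one long record\nnext".getBytes());
        System.out.println(new String(readRecord(in, 4)));  // prints "one long record"
    }
}
```

The point of the pattern is that the intermediate buffer stays at a fixed, small size regardless of record length.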
[jira] [Commented] (PIG-3251) Bzip2TextInputFormat requires double the memory of maximum record size
[ https://issues.apache.org/jira/browse/PIG-3251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13606445#comment-13606445 ] Koji Noguchi commented on PIG-3251: --- bq. Let me know if you find any problem in your testing. Thanks Daniel. My initial test went well on a 0.23 cluster. It was as fast as the original and required less memory. However, the patched pig is super slow on a 1.0.2 cluster. The reason is that I'm using the Text directly as the replacement for the ByteArrayOutputStream. Without HADOOP-6109, which was committed in 0.21, Text grows linearly whereas ByteArrayOutputStream grows exponentially, requiring a lot more copies for the former. > Bzip2TextInputFormat requires double the memory of maximum record size > -- > > Key: PIG-3251 > URL: https://issues.apache.org/jira/browse/PIG-3251 > Project: Pig > Issue Type: Improvement >Reporter: Koji Noguchi >Assignee: Koji Noguchi >Priority: Minor > Attachments: pig-3251-trunk-v01.patch > > > While looking at user's OOM heap dump, noticed that pig's > Bzip2TextInputFormat consumes memory at both > Bzip2TextInputFormat.buffer (ByteArrayOutputStream) > and actual Text that is returned as line. > For example, when having one record with 160MBytes, buffer was 268MBytes and > Text was 160MBytes. > We can probably eliminate one of them. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-3251) Bzip2TextInputFormat requires double the memory of maximum record size
[ https://issues.apache.org/jira/browse/PIG-3251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Noguchi updated PIG-3251: -- Attachment: pig-3251-trunk-v02.patch (1) Current status (before any patch) ||hadoop version || PigTextInputFormat || Bzip2TextInputFormat.java || | 0.20 | [i] SLOW (missing HADOOP-6109) | (iii) Needs EXTRA MEMORY. This Jira. | | 0.23 | [ii] Good. | (iv) Needs EXTRA MEMORY. This Jira. | (2) My initial patch (pig-3251-trunk-v01.patch) changes this to ||hadoop version || PigTextInputFormat || Bzip2TextInputFormat.java || | 0.20 | [i] SLOW (missing HADOOP-6109) | (iii) Slow (missing HADOOP-6109) | | 0.23 | [ii] Good. | (iv) Good | (3) If we can backport hadoop-6109 to 0.20 + my pig-3251-trunk-v01.patch, it solves all the problems. ||hadoop version || PigTextInputFormat || Bzip2TextInputFormat.java || | 0.20+Hadoop-6109 | [i] Good| (iii) Good | | 0.23 | [ii] Good. | (iv) Good | However, I've seen a discussion about pig supporting 0.20.2 users. So I guess we can't ask them to backport HADOOP-6109 then. I think my remaining options are (a) Give up. Wait till everyone upgrades to 0.23/2.0, or backport HADOOP-6109 to hadoop 1.2* and wait till pig moves off 0.20.2/1.0.*. (b) Try to work around without touching hadoop code. I think (a) is reasonable but tried (b). This patch changes the status to the following. (4) Patch (pig-3251-trunk-v02.patch) ||hadoop version || PigTextInputFormat || Bzip2TextInputFormat.java || | 0.20 | [i] SLOW (missing HADOOP-6109) | (iii) Good | | 0.23 | [ii] Good. | (iv) Good | The penalty of not touching the hadoop code is that my patch adds two unnecessary bytearray copies when extending the Text size. But frequency is low due to exponentially increasing sizes, so I hope the overall overhead is negligible.
> Bzip2TextInputFormat requires double the memory of maximum record size > -- > > Key: PIG-3251 > URL: https://issues.apache.org/jira/browse/PIG-3251 > Project: Pig > Issue Type: Improvement >Reporter: Koji Noguchi >Assignee: Koji Noguchi >Priority: Minor > Attachments: pig-3251-trunk-v01.patch, pig-3251-trunk-v02.patch > > > While looking at user's OOM heap dump, noticed that pig's > Bzip2TextInputFormat consumes memory at both > Bzip2TextInputFormat.buffer (ByteArrayOutputStream) > and actual Text that is returned as line. > For example, when having one record with 160MBytes, buffer was 268MBytes and > Text was 160MBytes. > We can probably eliminate one of them. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-3251) Bzip2TextInputFormat requires double the memory of maximum record size
[ https://issues.apache.org/jira/browse/PIG-3251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13607860#comment-13607860 ] Koji Noguchi commented on PIG-3251: --- bq. With HADOOP-7823, can we remove Bzip2TextInputFormat and just use PigTextInputFormat? That'll (almost) have the same effect of my initial patch pig-3251-trunk-v01.patch which takes to status (2) in my previous comment. With HADOOP-7823 + HADOOP-6109, then it'll be (3). Without a doubt, HADOOP-7823 + HADOOP-6109 is the cleanest approach. > Bzip2TextInputFormat requires double the memory of maximum record size > -- > > Key: PIG-3251 > URL: https://issues.apache.org/jira/browse/PIG-3251 > Project: Pig > Issue Type: Improvement >Reporter: Koji Noguchi >Assignee: Koji Noguchi >Priority: Minor > Attachments: pig-3251-trunk-v01.patch, pig-3251-trunk-v02.patch > > > While looking at user's OOM heap dump, noticed that pig's > Bzip2TextInputFormat consumes memory at both > Bzip2TextInputFormat.buffer (ByteArrayOutputStream) > and actual Text that is returned as line. > For example, when having one record with 160MBytes, buffer was 268MBytes and > Text was 160MBytes. > We can probably eliminate one of them. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-3251) Bzip2TextInputFormat requires double the memory of maximum record size
[ https://issues.apache.org/jira/browse/PIG-3251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13607886#comment-13607886 ] Koji Noguchi commented on PIG-3251: --- bq. With HADOOP-7823, can we remove Bzip2TextInputFormat and just use PigTextInputFormat? Since our platform has moved to 0.23, I'll be happy if we can simply remove Bzip2TextInputFormat just for hadoop 0.23 or later. > Bzip2TextInputFormat requires double the memory of maximum record size > -- > > Key: PIG-3251 > URL: https://issues.apache.org/jira/browse/PIG-3251 > Project: Pig > Issue Type: Improvement >Reporter: Koji Noguchi >Assignee: Koji Noguchi >Priority: Minor > Attachments: pig-3251-trunk-v01.patch, pig-3251-trunk-v02.patch > > > While looking at user's OOM heap dump, noticed that pig's > Bzip2TextInputFormat consumes memory at both > Bzip2TextInputFormat.buffer (ByteArrayOutputStream) > and actual Text that is returned as line. > For example, when having one record with 160MBytes, buffer was 268MBytes and > Text was 160MBytes. > We can probably eliminate one of them. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-3251) Bzip2TextInputFormat requires double the memory of maximum record size
[ https://issues.apache.org/jira/browse/PIG-3251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Noguchi updated PIG-3251: -- Attachment: pig-3251-trunk-v03.patch bq. Makes sense, we shall move to the new approach for Hadoop 1.1.0+, use Bzip2TextInputFormat otherwise for backward compatibility. Would something like this work? pig-3251-trunk-v03.patch uses PigTextInputFormat even for bzip if TextInputFormat can split them. (I'll update the other FileInputLoadFunc if this change looks ok. Also, this works with 'bz2' extension but not for 'bz' unless config is added.) > Bzip2TextInputFormat requires double the memory of maximum record size > -- > > Key: PIG-3251 > URL: https://issues.apache.org/jira/browse/PIG-3251 > Project: Pig > Issue Type: Improvement >Reporter: Koji Noguchi >Assignee: Koji Noguchi >Priority: Minor > Attachments: pig-3251-trunk-v01.patch, pig-3251-trunk-v02.patch, > pig-3251-trunk-v03.patch > > > While looking at user's OOM heap dump, noticed that pig's > Bzip2TextInputFormat consumes memory at both > Bzip2TextInputFormat.buffer (ByteArrayOutputStream) > and actual Text that is returned as line. > For example, when having one record with 160MBytes, buffer was 268MBytes and > Text was 160MBytes. > We can probably eliminate one of them. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-3255) Avoid extra byte array copy in streaming deserialize
[ https://issues.apache.org/jira/browse/PIG-3255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13608227#comment-13608227 ] Koji Noguchi commented on PIG-3255: --- +1 Looks good to me. Probably another Jira, but I wonder if we really need to create new Text for every streaming outputs. Can we reuse it with value.clear() ? (But if we do this, then in most cases value.getBytes().length <> value.getLength().) > Avoid extra byte array copy in streaming deserialize > > > Key: PIG-3255 > URL: https://issues.apache.org/jira/browse/PIG-3255 > Project: Pig > Issue Type: Bug >Affects Versions: 0.11 >Reporter: Rohini Palaniswamy >Assignee: Rohini Palaniswamy > Fix For: 0.12 > > Attachments: PIG-3255-1.patch > > > PigStreaming.java: > public Tuple deserialize(byte[] bytes) throws IOException { > Text val = new Text(bytes); > return StorageUtil.textToTuple(val, fieldDel); > } > Should remove new Text(bytes) copy and construct the tuple directly from the > bytes -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (PIG-3266) Pig takes forever to parse scripts with foreach + multi level binconds
Koji Noguchi created PIG-3266: - Summary: Pig takes forever to parse scripts with foreach + multi level binconds Key: PIG-3266 URL: https://issues.apache.org/jira/browse/PIG-3266 Project: Pig Issue Type: Bug Affects Versions: 0.11, 0.10.0 Reporter: Koji Noguchi Following pig script parsing takes * 1 second in pig-0.8 * 90 seconds in pig-0.9 * forever in pig-0.10 (it's taking literally hours) {noformat} A = load 'input.txt' as (mynum:float, mychar:chararray); B = foreach A generate mychar, (mynum < 0 ? 0 : (mynum < 1 ? 1 : (mynum < 2 ? 2 : (mynum < 3 ? 3 : (mynum < 4 ? 4 : (mynum < 5 ? 5 : (mynum < 6 ? 6 : (mynum < 7 ? 7 : (mynum < 8 ? 8 : (mynum < 9 ? 9 : (mynum < 10 ? 10 : (mynum < 11 ? 11 : (mynum < 12 ? 12 : (mynum < 13 ? 13 : (mynum < 14 ? 14 : (mynum < 15 ? 15 : (mynum < 16 ? 16 : (mynum < 17 ? 17 : (mynum < 18 ? 18 : (mynum < 19 ? 19 : (mynum < 20 ? 20 : 21); dump A; {noformat} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-3266) Pig takes forever to parse scripts with foreach + multi level binconds
[ https://issues.apache.org/jira/browse/PIG-3266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13620098#comment-13620098 ] Koji Noguchi commented on PIG-3266: --- If I revert the change from PIG:1387, parsing speed comes back to 90 seconds (pig-0.9 level) {noformat} src/org/apache/pig/parser/QueryParser.g -projectable_expr: func_eval | col_ref | bin_expr | type_conversion +projectable_expr: func_eval | col_ref | bin_expr {noformat} I don't know anything about antlr, but I guess it cannot tell whether the given tokens are bin_expr or type_conversion when starting with '(' so spending extra cycles to check both. > Pig takes forever to parse scripts with foreach + multi level binconds > --- > > Key: PIG-3266 > URL: https://issues.apache.org/jira/browse/PIG-3266 > Project: Pig > Issue Type: Bug >Affects Versions: 0.10.0, 0.11 >Reporter: Koji Noguchi > > Following pig script parsing takes > * 1 second in pig-0.8 > * 90 seconds in pig-0.9 > * forever in pig-0.10 (it's taking literally hours) > {noformat} > A = load 'input.txt' as (mynum:float, mychar:chararray); > B = foreach A generate mychar, > (mynum < 0 ? 0 : > (mynum < 1 ? 1 : > (mynum < 2 ? 2 : > (mynum < 3 ? 3 : > (mynum < 4 ? 4 : > (mynum < 5 ? 5 : > (mynum < 6 ? 6 : > (mynum < 7 ? 7 : > (mynum < 8 ? 8 : > (mynum < 9 ? 9 : > (mynum < 10 ? 10 : > (mynum < 11 ? 11 : > (mynum < 12 ? 12 : > (mynum < 13 ? 13 : > (mynum < 14 ? 14 : > (mynum < 15 ? 15 : > (mynum < 16 ? 16 : > (mynum < 17 ? 17 : > (mynum < 18 ? 18 : > (mynum < 19 ? 19 : > (mynum < 20 ? 20 : 21); > dump A; > {noformat} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-3266) Pig takes forever to parse scripts with foreach + multi level binconds
[ https://issues.apache.org/jira/browse/PIG-3266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13620158#comment-13620158 ] Koji Noguchi commented on PIG-3266: --- bq. Does it finish in the end, or never? I would guess it'll finish but I don't know. It has been running for 4 hours now. > Pig takes forever to parse scripts with foreach + multi level binconds > --- > > Key: PIG-3266 > URL: https://issues.apache.org/jira/browse/PIG-3266 > Project: Pig > Issue Type: Bug >Affects Versions: 0.10.0, 0.11 >Reporter: Koji Noguchi > > Following pig script parsing takes > * 1 second in pig-0.8 > * 90 seconds in pig-0.9 > * forever in pig-0.10 (it's taking literally hours) > {noformat} > A = load 'input.txt' as (mynum:float, mychar:chararray); > B = foreach A generate mychar, > (mynum < 0 ? 0 : > (mynum < 1 ? 1 : > (mynum < 2 ? 2 : > (mynum < 3 ? 3 : > (mynum < 4 ? 4 : > (mynum < 5 ? 5 : > (mynum < 6 ? 6 : > (mynum < 7 ? 7 : > (mynum < 8 ? 8 : > (mynum < 9 ? 9 : > (mynum < 10 ? 10 : > (mynum < 11 ? 11 : > (mynum < 12 ? 12 : > (mynum < 13 ? 13 : > (mynum < 14 ? 14 : > (mynum < 15 ? 15 : > (mynum < 16 ? 16 : > (mynum < 17 ? 17 : > (mynum < 18 ? 18 : > (mynum < 19 ? 19 : > (mynum < 20 ? 20 : 21); > dump A; > {noformat} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-3266) Pig takes forever to parse scripts with foreach + multi level binconds
[ https://issues.apache.org/jira/browse/PIG-3266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13621115#comment-13621115 ] Koji Noguchi commented on PIG-3266: --- > > Does it finish in the end, or never? > I would guess it'll finish but I don't know. It has been running for 4 hours > now. > I had to kill it after 28 hours of never-ending parsing... > Pig takes forever to parse scripts with foreach + multi level binconds > --- > > Key: PIG-3266 > URL: https://issues.apache.org/jira/browse/PIG-3266 > Project: Pig > Issue Type: Bug >Affects Versions: 0.10.0, 0.11 >Reporter: Koji Noguchi > > Following pig script parsing takes > * 1 second in pig-0.8 > * 90 seconds in pig-0.9 > * forever in pig-0.10 (it's taking literally hours) > {noformat} > A = load 'input.txt' as (mynum:float, mychar:chararray); > B = foreach A generate mychar, > (mynum < 0 ? 0 : > (mynum < 1 ? 1 : > (mynum < 2 ? 2 : > (mynum < 3 ? 3 : > (mynum < 4 ? 4 : > (mynum < 5 ? 5 : > (mynum < 6 ? 6 : > (mynum < 7 ? 7 : > (mynum < 8 ? 8 : > (mynum < 9 ? 9 : > (mynum < 10 ? 10 : > (mynum < 11 ? 11 : > (mynum < 12 ? 12 : > (mynum < 13 ? 13 : > (mynum < 14 ? 14 : > (mynum < 15 ? 15 : > (mynum < 16 ? 16 : > (mynum < 17 ? 17 : > (mynum < 18 ? 18 : > (mynum < 19 ? 19 : > (mynum < 20 ? 20 : 21); > dump A; > {noformat} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-3261) User set PIG_CLASSPATH entries must be prepended to the CLASSPATH, not appended
[ https://issues.apache.org/jira/browse/PIG-3261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13621155#comment-13621155 ] Koji Noguchi commented on PIG-3261: --- I prefer with PIG_USER_CLASSPATH_FIRST. I've seen too many random users including old pig jar in their custom UDFs... In our environment, we perform QE on set of frameworks. (hadoop, pig, oozie, etc) And we tell our users, whenever they set HADOOP_USER_CLASSPATH_FIRST they are running outside of the QA-ed environment. I want the same to apply within pig with PIG_USER_CLASSPATH_FIRST. > User set PIG_CLASSPATH entries must be prepended to the CLASSPATH, not > appended > --- > > Key: PIG-3261 > URL: https://issues.apache.org/jira/browse/PIG-3261 > Project: Pig > Issue Type: Bug > Components: grunt >Affects Versions: 0.10.0 >Reporter: Harsh J >Assignee: Harsh J > Attachments: PIG-3261.patch, PIG-3261.patch > > > Currently we are doing this wrong: > {code} > if [ "$PIG_CLASSPATH" != "" ]; then > CLASSPATH=${CLASSPATH}:${PIG_CLASSPATH} > {code} > This means that anything added to CLASSPATH until that point will never be > able to get overridden by a user set environment, which is wrong behavior. > Hadoop libs for example are added to CLASSPATH, before this extension is > called in bin/pig. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-3266) Pig takes forever to parse scripts with foreach + multi level binconds
[ https://issues.apache.org/jira/browse/PIG-3266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13621178#comment-13621178 ] Koji Noguchi commented on PIG-3266: --- bq. I assume there is an infinite loop. Next time could you do a jstack before killing pig process and attach it here for the record? A bit confused. I can certainly do that, but are you saying you cannot reproduce this issue on your side using my test script? If so, I need to look at my test environment more carefully. > Pig takes forever to parse scripts with foreach + multi level binconds > --- > > Key: PIG-3266 > URL: https://issues.apache.org/jira/browse/PIG-3266 > Project: Pig > Issue Type: Bug >Affects Versions: 0.10.0, 0.11 >Reporter: Koji Noguchi > > Following pig script parsing takes > * 1 second in pig-0.8 > * 90 seconds in pig-0.9 > * forever in pig-0.10 (it's taking literally hours) > {noformat} > A = load 'input.txt' as (mynum:float, mychar:chararray); > B = foreach A generate mychar, > (mynum < 0 ? 0 : > (mynum < 1 ? 1 : > (mynum < 2 ? 2 : > (mynum < 3 ? 3 : > (mynum < 4 ? 4 : > (mynum < 5 ? 5 : > (mynum < 6 ? 6 : > (mynum < 7 ? 7 : > (mynum < 8 ? 8 : > (mynum < 9 ? 9 : > (mynum < 10 ? 10 : > (mynum < 11 ? 11 : > (mynum < 12 ? 12 : > (mynum < 13 ? 13 : > (mynum < 14 ? 14 : > (mynum < 15 ? 15 : > (mynum < 16 ? 16 : > (mynum < 17 ? 17 : > (mynum < 18 ? 18 : > (mynum < 19 ? 19 : > (mynum < 20 ? 20 : 21); > dump A; > {noformat} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-3266) Pig takes forever to parse scripts with foreach + multi level binconds
[ https://issues.apache.org/jira/browse/PIG-3266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13621230#comment-13621230 ] Koji Noguchi commented on PIG-3266: --- bq. Koji Noguchi, I think this was fixed. I don't see the issue on trunk. Just realize that. Thanks! Can you show me which jira fixed this? I should have tested with trunk before creating this jira. I think I even tried with pig-0.11 to confirm the problem. > Pig takes forever to parse scripts with foreach + multi level binconds > --- > > Key: PIG-3266 > URL: https://issues.apache.org/jira/browse/PIG-3266 > Project: Pig > Issue Type: Bug >Affects Versions: 0.10.0, 0.11 >Reporter: Koji Noguchi > > Following pig script parsing takes > * 1 second in pig-0.8 > * 90 seconds in pig-0.9 > * forever in pig-0.10 (it's taking literally hours) > {noformat} > A = load 'input.txt' as (mynum:float, mychar:chararray); > B = foreach A generate mychar, > (mynum < 0 ? 0 : > (mynum < 1 ? 1 : > (mynum < 2 ? 2 : > (mynum < 3 ? 3 : > (mynum < 4 ? 4 : > (mynum < 5 ? 5 : > (mynum < 6 ? 6 : > (mynum < 7 ? 7 : > (mynum < 8 ? 8 : > (mynum < 9 ? 9 : > (mynum < 10 ? 10 : > (mynum < 11 ? 11 : > (mynum < 12 ? 12 : > (mynum < 13 ? 13 : > (mynum < 14 ? 14 : > (mynum < 15 ? 15 : > (mynum < 16 ? 16 : > (mynum < 17 ? 17 : > (mynum < 18 ? 18 : > (mynum < 19 ? 19 : > (mynum < 20 ? 20 : 21); > dump A; > {noformat} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Resolved] (PIG-3266) Pig takes forever to parse scripts with foreach + multi level binconds
[ https://issues.apache.org/jira/browse/PIG-3266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Noguchi resolved PIG-3266. --- Resolution: Duplicate Release Note: Found it. This is a duplicate of PIG-2769. Sorry Xuefu for wasting your time on this! > Pig takes forever to parse scripts with foreach + multi level binconds > --- > > Key: PIG-3266 > URL: https://issues.apache.org/jira/browse/PIG-3266 > Project: Pig > Issue Type: Bug >Affects Versions: 0.10.0, 0.11 >Reporter: Koji Noguchi > > Following pig script parsing takes > * 1 second in pig-0.8 > * 90 seconds in pig-0.9 > * forever in pig-0.10 (it's taking literally hours) > {noformat} > A = load 'input.txt' as (mynum:float, mychar:chararray); > B = foreach A generate mychar, > (mynum < 0 ? 0 : > (mynum < 1 ? 1 : > (mynum < 2 ? 2 : > (mynum < 3 ? 3 : > (mynum < 4 ? 4 : > (mynum < 5 ? 5 : > (mynum < 6 ? 6 : > (mynum < 7 ? 7 : > (mynum < 8 ? 8 : > (mynum < 9 ? 9 : > (mynum < 10 ? 10 : > (mynum < 11 ? 11 : > (mynum < 12 ? 12 : > (mynum < 13 ? 13 : > (mynum < 14 ? 14 : > (mynum < 15 ? 15 : > (mynum < 16 ? 16 : > (mynum < 17 ? 17 : > (mynum < 18 ? 18 : > (mynum < 19 ? 19 : > (mynum < 20 ? 20 : 21); > dump A; > {noformat} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-3266) Pig takes forever to parse scripts with foreach + multi level binconds
[ https://issues.apache.org/jira/browse/PIG-3266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Noguchi updated PIG-3266: -- Release Note: (was: Found it. This is a duplicate of PIG-2769. Sorry Xuefu for wasting your time on this!) > Pig takes forever to parse scripts with foreach + multi level binconds > --- > > Key: PIG-3266 > URL: https://issues.apache.org/jira/browse/PIG-3266 > Project: Pig > Issue Type: Bug >Affects Versions: 0.10.0, 0.11 >Reporter: Koji Noguchi > > Following pig script parsing takes > * 1 second in pig-0.8 > * 90 seconds in pig-0.9 > * forever in pig-0.10 (it's taking literally hours) > {noformat} > A = load 'input.txt' as (mynum:float, mychar:chararray); > B = foreach A generate mychar, > (mynum < 0 ? 0 : > (mynum < 1 ? 1 : > (mynum < 2 ? 2 : > (mynum < 3 ? 3 : > (mynum < 4 ? 4 : > (mynum < 5 ? 5 : > (mynum < 6 ? 6 : > (mynum < 7 ? 7 : > (mynum < 8 ? 8 : > (mynum < 9 ? 9 : > (mynum < 10 ? 10 : > (mynum < 11 ? 11 : > (mynum < 12 ? 12 : > (mynum < 13 ? 13 : > (mynum < 14 ? 14 : > (mynum < 15 ? 15 : > (mynum < 16 ? 16 : > (mynum < 17 ? 17 : > (mynum < 18 ? 18 : > (mynum < 19 ? 19 : > (mynum < 20 ? 20 : 21); > dump A; > {noformat} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-2769) a simple logic causes very long compiling time on pig 0.10.0
[ https://issues.apache.org/jira/browse/PIG-2769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13622731#comment-13622731 ] Koji Noguchi commented on PIG-2769: --- bq. We should put this into 0.11 branch, maybe there will be an 0.11.2 before 12 comes out. If we can fix this in 0.11, that would be really nice. On our clusters, there were multiple users hit with this issue on 0.10. > a simple logic causes very long compiling time on pig 0.10.0 > > > Key: PIG-2769 > URL: https://issues.apache.org/jira/browse/PIG-2769 > Project: Pig > Issue Type: Bug > Components: build >Affects Versions: 0.10.0 > Environment: Apache Pig version 0.10.0-SNAPSHOT (rexported) >Reporter: Dan Li >Assignee: Nick White > Fix For: 0.12 > > Attachments: case1.tar, PIG-2769.0.patch, PIG-2769.1.patch, > PIG-2769.2.patch, > TEST-org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.TestInputSizeReducerEstimator.txt > > > We found the following simple logic will cause very long compiling time for > pig 0.10.0, while using pig 0.8.1, everything is fine. > A = load 'A.txt' using PigStorage() AS (m: int); > B = FOREACH A { > days_str = (chararray) > (m == 1 ? 31: > (m == 2 ? 28: > (m == 3 ? 31: > (m == 4 ? 30: > (m == 5 ? 31: > (m == 6 ? 30: > (m == 7 ? 31: > (m == 8 ? 31: > (m == 9 ? 30: > (m == 10 ? 31: > (m == 11 ? 30:31))); > GENERATE >days_str as days_str; > } > store B into 'B'; > and here's a simple input file example: A.txt > 1 > 2 > 3 > The pig version we used in the test > Apache Pig version 0.10.0-SNAPSHOT (rexported) -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (PIG-3270) Union onschema failing at runtime when merging incompatible types
Koji Noguchi created PIG-3270: - Summary: Union onschema failing at runtime when merging incompatible types Key: PIG-3270 URL: https://issues.apache.org/jira/browse/PIG-3270 Project: Pig Issue Type: Bug Reporter: Koji Noguchi {noformat} t1 = LOAD 'file1.txt' USING PigStorage() AS (a: chararray, b: chararray); t2 = LOAD 'file2.txt' USING PigStorage() AS (a: chararray, b: float); tout = UNION ONSCHEMA t1, t2; dump tout; {noformat} Job fails with 2013-04-09 11:37:37,817 [Thread-12] WARN org.apache.hadoop.mapred.LocalJobRunner - job_local_0001 java.lang.Exception: org.apache.pig.backend.executionengine.ExecException: ERROR 2055: Received Error while processing the map plan. at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:399) Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 2055: Received Error while processing the map plan. at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.runPipeline(PigGenericMapBase.java:311) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:278) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:64) at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144) at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:726) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:333) at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:231) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918) at java.lang.Thread.run(Thread.java:680) -- This message is automatically generated by JIRA. 
If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-3270) Union onschema failing at runtime when merging incompatible types
[ https://issues.apache.org/jira/browse/PIG-3270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13626733#comment-13626733 ] Koji Noguchi commented on PIG-3270: --- Before PIG-2071, this job would have dumped field 'b' as chararray instead of failing at the middle at runtime. Reading that jira, I'm thinking this example should have failed at compile time with better error messages. Am I understanding it correctly? > Union onschema failing at runtime when merging incompatible types > - > > Key: PIG-3270 > URL: https://issues.apache.org/jira/browse/PIG-3270 > Project: Pig > Issue Type: Bug >Reporter: Koji Noguchi > > {noformat} > t1 = LOAD 'file1.txt' USING PigStorage() AS (a: chararray, b: chararray); > t2 = LOAD 'file2.txt' USING PigStorage() AS (a: chararray, b: float); > tout = UNION ONSCHEMA t1, t2; > dump tout; > {noformat} > Job fails with > 2013-04-09 11:37:37,817 [Thread-12] WARN > org.apache.hadoop.mapred.LocalJobRunner - job_local_0001 > java.lang.Exception: org.apache.pig.backend.executionengine.ExecException: > ERROR 2055: Received Error while processing the map plan. > at > org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:399) > Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 2055: > Received Error while processing the map plan. 
> at > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.runPipeline(PigGenericMapBase.java:311) > at > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:278) > at > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:64) > at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144) > at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:726) > at org.apache.hadoop.mapred.MapTask.run(MapTask.java:333) > at > org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:231) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439) > at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) > at java.util.concurrent.FutureTask.run(FutureTask.java:138) > at > java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918) > at java.lang.Thread.run(Thread.java:680) -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (PIG-3271) POSplit ignoring error from input processing giving empty results
Koji Noguchi created PIG-3271: - Summary: POSplit ignoring error from input processing giving empty results Key: PIG-3271 URL: https://issues.apache.org/jira/browse/PIG-3271 Project: Pig Issue Type: Bug Reporter: Koji Noguchi Priority: Critical Script below fails at union onschema due to PIG-3270 but pig ignores its error and creates empty outputs with return code 0 (SUCCESS). {noformat} t1 = LOAD 'file1.txt' USING PigStorage() AS (a: chararray, b: chararray); t2 = LOAD 'file2.txt' USING PigStorage() AS (a: chararray, b: float); tout = UNION ONSCHEMA t1, t2; STORE tout INTO './out1' USING PigStorage(); STORE tout INTO './out2' USING PigStorage(); {noformat} Is POSplit ignoring the error from input processing? -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-3271) POSplit ignoring error from input processing giving empty results
[ https://issues.apache.org/jira/browse/PIG-3271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Noguchi updated PIG-3271: -- Attachment: pig-3271-v01.patch I'm having hard time tracking the code but this seems to catch the error. > POSplit ignoring error from input processing giving empty results > -- > > Key: PIG-3271 > URL: https://issues.apache.org/jira/browse/PIG-3271 > Project: Pig > Issue Type: Bug >Reporter: Koji Noguchi >Priority: Critical > Attachments: pig-3271-v01.patch > > > Script below fails at union onschema due to PIG-3270 but pig ignores its > error and creates empty outputs with return code 0 (SUCCESS). > {noformat} > t1 = LOAD 'file1.txt' USING PigStorage() AS (a: chararray, b: chararray); > t2 = LOAD 'file2.txt' USING PigStorage() AS (a: chararray, b: float); > tout = UNION ONSCHEMA t1, t2; > STORE tout INTO './out1' USING PigStorage(); > STORE tout INTO './out2' USING PigStorage(); > {noformat} > Is POSplit ignoring the error from input processing? -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-3271) POSplit ignoring error from input processing giving empty results
[ https://issues.apache.org/jira/browse/PIG-3271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Noguchi updated PIG-3271: -- Attachment: pig-3271-v02.patch bq. Ready to go with a testcase. I'm lost in this. Original example pasted on the jira failed due to PIG-3270 and should be fixed thus cannot be used for the testcase for this PIG-3271. Just to show how lost I am, created a test case that connects couple of operators to force the input processing to fail. (I need a testcase that doesn't throw an exception but returns POStatus.STATUS_ERR.) > POSplit ignoring error from input processing giving empty results > -- > > Key: PIG-3271 > URL: https://issues.apache.org/jira/browse/PIG-3271 > Project: Pig > Issue Type: Bug >Reporter: Koji Noguchi >Priority: Critical > Attachments: pig-3271-v01.patch, pig-3271-v02.patch > > > Script below fails at union onschema due to PIG-3270 but pig ignores its > error and creates empty outputs with return code 0 (SUCCESS). > {noformat} > t1 = LOAD 'file1.txt' USING PigStorage() AS (a: chararray, b: chararray); > t2 = LOAD 'file2.txt' USING PigStorage() AS (a: chararray, b: float); > tout = UNION ONSCHEMA t1, t2; > STORE tout INTO './out1' USING PigStorage(); > STORE tout INTO './out2' USING PigStorage(); > {noformat} > Is POSplit ignoring the error from input processing? -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-3270) Union onschema failing at runtime when merging incompatible types
[ https://issues.apache.org/jira/browse/PIG-3270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Noguchi updated PIG-3270: -- Attachment: pig-3270-v01.patch bq. We should not insert cast to bytes operation. It's probably in UnionOnSchemaSetter Ah, I see. I saw that job was failing at POCast(DataByteArray) but didn't know that it would work without this cast. Writing a test. > Union onschema failing at runtime when merging incompatible types > - > > Key: PIG-3270 > URL: https://issues.apache.org/jira/browse/PIG-3270 > Project: Pig > Issue Type: Bug >Reporter: Koji Noguchi > Attachments: pig-3270-v01.patch > > > {noformat} > t1 = LOAD 'file1.txt' USING PigStorage() AS (a: chararray, b: chararray); > t2 = LOAD 'file2.txt' USING PigStorage() AS (a: chararray, b: float); > tout = UNION ONSCHEMA t1, t2; > dump tout; > {noformat} > Job fails with > 2013-04-09 11:37:37,817 [Thread-12] WARN > org.apache.hadoop.mapred.LocalJobRunner - job_local_0001 > java.lang.Exception: org.apache.pig.backend.executionengine.ExecException: > ERROR 2055: Received Error while processing the map plan. > at > org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:399) > Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 2055: > Received Error while processing the map plan. 
> at > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.runPipeline(PigGenericMapBase.java:311) > at > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:278) > at > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:64) > at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144) > at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:726) > at org.apache.hadoop.mapred.MapTask.run(MapTask.java:333) > at > org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:231) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439) > at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) > at java.util.concurrent.FutureTask.run(FutureTask.java:138) > at > java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918) > at java.lang.Thread.run(Thread.java:680) -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-3270) Union onschema failing at runtime when merging incompatible types
[ https://issues.apache.org/jira/browse/PIG-3270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Noguchi updated PIG-3270: -- Attachment: pig-3270-v02.patch bq. Writing a test. Since the original job was failing at runtime due to invalid bytearray casting, I added a e2e test. > Union onschema failing at runtime when merging incompatible types > - > > Key: PIG-3270 > URL: https://issues.apache.org/jira/browse/PIG-3270 > Project: Pig > Issue Type: Bug >Reporter: Koji Noguchi > Attachments: pig-3270-v01.patch, pig-3270-v02.patch > > > {noformat} > t1 = LOAD 'file1.txt' USING PigStorage() AS (a: chararray, b: chararray); > t2 = LOAD 'file2.txt' USING PigStorage() AS (a: chararray, b: float); > tout = UNION ONSCHEMA t1, t2; > dump tout; > {noformat} > Job fails with > 2013-04-09 11:37:37,817 [Thread-12] WARN > org.apache.hadoop.mapred.LocalJobRunner - job_local_0001 > java.lang.Exception: org.apache.pig.backend.executionengine.ExecException: > ERROR 2055: Received Error while processing the map plan. > at > org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:399) > Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 2055: > Received Error while processing the map plan. 
> at > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.runPipeline(PigGenericMapBase.java:311) > at > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:278) > at > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:64) > at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144) > at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:726) > at org.apache.hadoop.mapred.MapTask.run(MapTask.java:333) > at > org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:231) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439) > at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) > at java.util.concurrent.FutureTask.run(FutureTask.java:138) > at > java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918) > at java.lang.Thread.run(Thread.java:680) -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (PIG-3293) Casting fails after Union from two data sources&loaders
Koji Noguchi created PIG-3293: - Summary: Casting fails after Union from two data sources&loaders Key: PIG-3293 URL: https://issues.apache.org/jira/browse/PIG-3293 Project: Pig Issue Type: Bug Reporter: Koji Noguchi Script similar to {noformat} A = load 'data1' using MyLoader() as (a:bytearray); B = load 'data2' as (a:bytearray); C = union onschema A,B; D = foreach C generate (chararray)a; Store D into './out'; {noformat} fails with java.lang.Exception: org.apache.pig.backend.executionengine.ExecException: ERROR 1075: Received a bytearray from the UDF. Cannot determine how to convert the bytearray to string. Both MyLoader and PigStorage use the default Utf8StorageConverter. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-3293) Casting fails after Union from two data sources&loaders
[ https://issues.apache.org/jira/browse/PIG-3293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13640568#comment-13640568 ] Koji Noguchi commented on PIG-3293: --- When two inputs are loaded by the same loader, this was handled at PIG-2493. In the case here, I can understand 'funcSpec' would be null for Union/Cast since they are coming from two loaders, but can we still use the caster if both loaders happen to have the same one (Utf8StorageConverter)? > Casting fails after Union from two data sources&loaders > --- > > Key: PIG-3293 > URL: https://issues.apache.org/jira/browse/PIG-3293 > Project: Pig > Issue Type: Bug >Reporter: Koji Noguchi > > Script similar to > {noformat} > A = load 'data1' using MyLoader() as (a:bytearray); > B = load 'data2' as (a:bytearray); > C = union onschema A,B; > D = foreach C generate (chararray)a; > Store D into './out'; > {noformat} > fails with >java.lang.Exception: org.apache.pig.backend.executionengine.ExecException: > ERROR 1075: Received a bytearray from the UDF. Cannot determine how to > convert the bytearray to string. > Both MyLoader and PigStorage use the default Utf8StorageConverter. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
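The question raised above — whether a cast can still be resolved when the FuncSpecs differ but every loader happens to report the same caster — can be sketched as a small resolution check. `resolve_common_caster` and the class-name strings are illustrative stand-ins, not Pig's actual LoadCaster API:

```python
def resolve_common_caster(caster_class_names):
    """Return the shared caster class name if every loader feeding the
    union reports the same LoadCaster implementation, else None
    (meaning the bytearray cast cannot be resolved safely)."""
    common = None
    for caster in caster_class_names:
        if caster is None:
            return None          # a loader with no caster: give up
        if common is None:
            common = caster      # first loader seen
        elif common != caster:
            return None          # loaders disagree: no safe caster
    return common
```

Under this rule, the script in the issue would cast fine, since both MyLoader and PigStorage report Utf8StorageConverter; only unions whose loaders disagree on the caster would still fail.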
[jira] [Updated] (PIG-3293) Casting fails after Union from two data sources&loaders
[ https://issues.apache.org/jira/browse/PIG-3293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Noguchi updated PIG-3293: -- Priority: Minor (was: Major) > Casting fails after Union from two data sources&loaders > --- > > Key: PIG-3293 > URL: https://issues.apache.org/jira/browse/PIG-3293 > Project: Pig > Issue Type: Bug >Reporter: Koji Noguchi >Priority: Minor > > Script similar to > {noformat} > A = load 'data1' using MyLoader() as (a:bytearray); > B = load 'data2' as (a:bytearray); > C = union onschema A,B; > D = foreach C generate (chararray)a; > Store D into './out'; > {noformat} > fails with >java.lang.Exception: org.apache.pig.backend.executionengine.ExecException: > ERROR 1075: Received a bytearray from the UDF. Cannot determine how to > convert the bytearray to string. > Both MyLoader and PigStorage use the default Utf8StorageConverter. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-3293) Casting fails after Union from two data sources&loaders
[ https://issues.apache.org/jira/browse/PIG-3293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13640810#comment-13640810 ] Koji Noguchi commented on PIG-3293: --- I may have simplified my user's issue a bit. What I was originally looking at was relations A and B each being the join of two input sets, then union'ed together. So each field from the Union was still coming from a single loader, but the cast was still failing. I'll create a separate jira for this since it's an easier fix. For this jira, may I update the error message to suggest typecasting before the union? "ERROR 1075: Received a bytearray from the UDF." is clearly wrong since no UDF is involved in this script. > Casting fails after Union from two data sources&loaders > --- > > Key: PIG-3293 > URL: https://issues.apache.org/jira/browse/PIG-3293 > Project: Pig > Issue Type: Bug >Reporter: Koji Noguchi >Priority: Minor > > Script similar to > {noformat} > A = load 'data1' using MyLoader() as (a:bytearray); > B = load 'data2' as (a:bytearray); > C = union onschema A,B; > D = foreach C generate (chararray)a; > Store D into './out'; > {noformat} > fails with >java.lang.Exception: org.apache.pig.backend.executionengine.ExecException: > ERROR 1075: Received a bytearray from the UDF. Cannot determine how to > convert the bytearray to string. > Both MyLoader and PigStorage use the default Utf8StorageConverter. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (PIG-3295) Casting from bytearray failing after Union (even when each field is from a single Loader)
Koji Noguchi created PIG-3295: - Summary: Casting from bytearray failing after Union (even when each field is from a single Loader) Key: PIG-3295 URL: https://issues.apache.org/jira/browse/PIG-3295 Project: Pig Issue Type: Bug Components: parser Reporter: Koji Noguchi Assignee: Koji Noguchi Priority: Minor One example {noformat} A = load 'data1.txt' as line:bytearray; B = load 'c1.txt' using TextLoader() as cookie1; C = load 'c2.txt' using TextLoader() as cookie2; B2 = join A by line, B by cookie1; C2 = join A by line, C by cookie2; D = union onschema B2,C2; -- D: {A::line: bytearray,B::cookie1: bytearray,C::cookie2: bytearray} E = foreach D generate (chararray) line, (chararray) cookie1, (chararray) cookie2; dump E; {noformat} This script fails at runtime with "Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 1075: Received a bytearray from the UDF. Cannot determine how to convert the bytearray to string." This is different from PIG-3293 in that each field in 'D' belongs to a single loader, whereas in PIG-3293 it came from multiple loaders. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-3295) Casting from bytearray failing after Union (even when each field is from a single Loader)
[ https://issues.apache.org/jira/browse/PIG-3295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Noguchi updated PIG-3295: -- Attachment: pig-3295-v01.patch Attaching an initial patch. Instead of having one FuncSpec per LOUnion (PIG-2493), checking each field and setting different FuncSpec when possible. > Casting from bytearray failing after Union (even when each field is from a > single Loader) > - > > Key: PIG-3295 > URL: https://issues.apache.org/jira/browse/PIG-3295 > Project: Pig > Issue Type: Bug > Components: parser >Reporter: Koji Noguchi >Assignee: Koji Noguchi >Priority: Minor > Attachments: pig-3295-v01.patch > > > One example > {noformat} > A = load 'data1.txt' as line:bytearray; > B = load 'c1.txt' using TextLoader() as cookie1; > C = load 'c2.txt' using TextLoader() as cookie2; > B2 = join A by line, B by cookie1; > C2 = join A by line, C by cookie2; > D = union onschema B2,C2; -- D: {A::line: bytearray,B::cookie1: > bytearray,C::cookie2: bytearray} > E = foreach D generate (chararray) line, (chararray) cookie1, (chararray) > cookie2; > dump E; > {noformat} > This script fails at runtime with > "Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 1075: > Received a bytearray from the UDF. Cannot determine how to convert the > bytearray to string." > This is different from PIG-3293 such that each field in 'D' belongs to a > single loader whereas on PIG-3293, it came from multiple loader. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
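The per-field idea in the patch description — one FuncSpec per output field instead of one per LOUnion — can be sketched as follows. `per_field_funcspec` and its inputs are illustrative stand-ins for LOUnion's schema walk, not the actual pig-3295-v01.patch code:

```python
def per_field_funcspec(field_sources):
    """For each output field of a union, take the loader FuncSpecs the
    field can originate from (one entry per union branch).

    A field gets a concrete FuncSpec only when every branch traces it
    back to the same loader; otherwise it stays None, meaning a later
    bytearray cast on that field cannot be resolved."""
    resolved = []
    for specs in field_sources:
        distinct = set(specs)
        resolved.append(distinct.pop() if len(distinct) == 1 else None)
    return resolved
```

In the script from this issue every field of 'D' traces to exactly one loader, so each field would receive its own FuncSpec and the casts in 'E' could be resolved; the genuinely mixed-loader case stays unresolved, which is the remaining PIG-3293 scenario.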
[jira] [Commented] (PIG-3295) Casting from bytearray failing after Union (even when each field is from a single Loader)
[ https://issues.apache.org/jira/browse/PIG-3295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13642170#comment-13642170 ] Koji Noguchi commented on PIG-3295: --- Forgot to mention, I didn't fix PIG-3293 case but updated the error message to indicate it could be from Union with multiple loaders. > Casting from bytearray failing after Union (even when each field is from a > single Loader) > - > > Key: PIG-3295 > URL: https://issues.apache.org/jira/browse/PIG-3295 > Project: Pig > Issue Type: Bug > Components: parser >Reporter: Koji Noguchi >Assignee: Koji Noguchi >Priority: Minor > Attachments: pig-3295-v01.patch > > > One example > {noformat} > A = load 'data1.txt' as line:bytearray; > B = load 'c1.txt' using TextLoader() as cookie1; > C = load 'c2.txt' using TextLoader() as cookie2; > B2 = join A by line, B by cookie1; > C2 = join A by line, C by cookie2; > D = union onschema B2,C2; -- D: {A::line: bytearray,B::cookie1: > bytearray,C::cookie2: bytearray} > E = foreach D generate (chararray) line, (chararray) cookie1, (chararray) > cookie2; > dump E; > {noformat} > This script fails at runtime with > "Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 1075: > Received a bytearray from the UDF. Cannot determine how to convert the > bytearray to string." > This is different from PIG-3293 such that each field in 'D' belongs to a > single loader whereas on PIG-3293, it came from multiple loader. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-3295) Casting from bytearray failing after Union (even when each field is from a single Loader)
[ https://issues.apache.org/jira/browse/PIG-3295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Noguchi updated PIG-3295: -- Status: Patch Available (was: Open) > Casting from bytearray failing after Union (even when each field is from a > single Loader) > - > > Key: PIG-3295 > URL: https://issues.apache.org/jira/browse/PIG-3295 > Project: Pig > Issue Type: Bug > Components: parser >Reporter: Koji Noguchi >Assignee: Koji Noguchi >Priority: Minor > Attachments: pig-3295-v01.patch > > > One example > {noformat} > A = load 'data1.txt' as line:bytearray; > B = load 'c1.txt' using TextLoader() as cookie1; > C = load 'c2.txt' using TextLoader() as cookie2; > B2 = join A by line, B by cookie1; > C2 = join A by line, C by cookie2; > D = union onschema B2,C2; -- D: {A::line: bytearray,B::cookie1: > bytearray,C::cookie2: bytearray} > E = foreach D generate (chararray) line, (chararray) cookie1, (chararray) > cookie2; > dump E; > {noformat} > This script fails at runtime with > "Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 1075: > Received a bytearray from the UDF. Cannot determine how to convert the > bytearray to string." > This is different from PIG-3293 such that each field in 'D' belongs to a > single loader whereas on PIG-3293, it came from multiple loader. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-3251) Bzip2TextInputFormat requires double the memory of maximum record size
[ https://issues.apache.org/jira/browse/PIG-3251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Noguchi updated PIG-3251: -- Attachment: pig-3251-trunk-v04.patch bq. Also, this works with 'bz2' extension but not for 'bz' unless config is added.) [~rohini] pointed out to me that it's not configurable. My bad. To keep the backward compatibility, added a wrapper codec that uses 'bz' as extension. As for selecting the InputFormat, I can also use hadoopShim and return PigTextInputFormat just for 0.23. Using hadoop's bzip codec on 0.23/2.0 would have an additional benefit of having native codec. (HADOOP-8462) > Bzip2TextInputFormat requires double the memory of maximum record size > -- > > Key: PIG-3251 > URL: https://issues.apache.org/jira/browse/PIG-3251 > Project: Pig > Issue Type: Improvement >Reporter: Koji Noguchi >Assignee: Koji Noguchi >Priority: Minor > Attachments: pig-3251-trunk-v01.patch, pig-3251-trunk-v02.patch, > pig-3251-trunk-v03.patch, pig-3251-trunk-v04.patch > > > While looking at user's OOM heap dump, noticed that pig's > Bzip2TextInputFormat consumes memory at both > Bzip2TextInputFormat.buffer (ByteArrayOutputStream) > and actual Text that is returned as line. > For example, when having one record with 160MBytes, buffer was 268MBytes and > Text was 160MBytes. > We can probably eliminate one of them. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-3251) Bzip2TextInputFormat requires double the memory of maximum record size
[ https://issues.apache.org/jira/browse/PIG-3251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13648464#comment-13648464 ] Koji Noguchi commented on PIG-3251: --- bq. Using hadoop's bzip codec on 0.23/2.0 would have an additional benefit of having native codec. (HADOOP-8462) Learned that bzip native codec so far does not support splitting (and falls back to java version for splits). > Bzip2TextInputFormat requires double the memory of maximum record size > -- > > Key: PIG-3251 > URL: https://issues.apache.org/jira/browse/PIG-3251 > Project: Pig > Issue Type: Improvement >Reporter: Koji Noguchi >Assignee: Koji Noguchi >Priority: Minor > Attachments: pig-3251-trunk-v01.patch, pig-3251-trunk-v02.patch, > pig-3251-trunk-v03.patch, pig-3251-trunk-v04.patch > > > While looking at user's OOM heap dump, noticed that pig's > Bzip2TextInputFormat consumes memory at both > Bzip2TextInputFormat.buffer (ByteArrayOutputStream) > and actual Text that is returned as line. > For example, when having one record with 160MBytes, buffer was 268MBytes and > Text was 160MBytes. > We can probably eliminate one of them. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-3251) Bzip2TextInputFormat requires double the memory of maximum record size
[ https://issues.apache.org/jira/browse/PIG-3251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Noguchi updated PIG-3251: -- Attachment: pig-3251-trunk-v05.patch Thanks Daniel. bq.is the patch ready? Ah, forgot to flag it as patch available. bq. can we just cache splittable? Makes complete sense. Changing. bq. Is it possible to wrap a codec deal with both bz2/bz? As far as I understand, hadoop has 1-to-1 mapping for the codec and extension. I don't know of a way to map multiple extensions to one codec. Or, are you suggesting I create two silly wrappers instead of one? > Bzip2TextInputFormat requires double the memory of maximum record size > -- > > Key: PIG-3251 > URL: https://issues.apache.org/jira/browse/PIG-3251 > Project: Pig > Issue Type: Improvement >Reporter: Koji Noguchi >Assignee: Koji Noguchi >Priority: Minor > Attachments: pig-3251-trunk-v01.patch, pig-3251-trunk-v02.patch, > pig-3251-trunk-v03.patch, pig-3251-trunk-v04.patch, pig-3251-trunk-v05.patch > > > While looking at user's OOM heap dump, noticed that pig's > Bzip2TextInputFormat consumes memory at both > Bzip2TextInputFormat.buffer (ByteArrayOutputStream) > and actual Text that is returned as line. > For example, when having one record with 160MBytes, buffer was 268MBytes and > Text was 160MBytes. > We can probably eliminate one of them. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
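The 1-to-1 extension-to-codec constraint discussed above can be modeled with a toy lookup. This is an illustration of the mapping problem, not Hadoop's actual CompressionCodecFactory; the wrapper codec name is hypothetical, and the longest-suffix-first rule here stands in for whatever tie-breaking the real factory does:

```python
def codec_for(path, codecs):
    """codecs maps one file extension to one codec name (Hadoop's
    1-to-1 constraint), so covering both '.bz2' and '.bz' takes two
    entries, the second being a thin wrapper codec whose only job is
    to claim the legacy extension. Longest extension is checked first
    so '.bz2' files don't accidentally match '.bz'."""
    for ext in sorted(codecs, key=len, reverse=True):
        if path.endswith(ext):
            return codecs[ext]
    return None

# One real codec plus one wrapper for the legacy extension.
BZIP_CODECS = {".bz2": "BZip2Codec", ".bz": "BZip2CodecWithBzExtension"}
```

This is why a single codec cannot be made to "deal with both bz2/bz": the registry keys on exactly one advertised extension per codec class.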
[jira] [Commented] (PIG-3251) Bzip2TextInputFormat requires double the memory of maximum record size
[ https://issues.apache.org/jira/browse/PIG-3251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13648745#comment-13648745 ] Koji Noguchi commented on PIG-3251: --- FYI, couple of tests from TestBZip are failing after applying my patch. Looking. > Bzip2TextInputFormat requires double the memory of maximum record size > -- > > Key: PIG-3251 > URL: https://issues.apache.org/jira/browse/PIG-3251 > Project: Pig > Issue Type: Improvement >Reporter: Koji Noguchi >Assignee: Koji Noguchi >Priority: Minor > Attachments: pig-3251-trunk-v01.patch, pig-3251-trunk-v02.patch, > pig-3251-trunk-v03.patch, pig-3251-trunk-v04.patch, pig-3251-trunk-v05.patch > > > While looking at user's OOM heap dump, noticed that pig's > Bzip2TextInputFormat consumes memory at both > Bzip2TextInputFormat.buffer (ByteArrayOutputStream) > and actual Text that is returned as line. > For example, when having one record with 160MBytes, buffer was 268MBytes and > Text was 160MBytes. > We can probably eliminate one of them. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-3251) Bzip2TextInputFormat requires double the memory of maximum record size
[ https://issues.apache.org/jira/browse/PIG-3251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13648788#comment-13648788 ] Koji Noguchi commented on PIG-3251: --- bq. FYI, couple of tests from TestBZip are failing after applying my patch. Looking. 3 tests failed. {noformat} Testcase: testBZ2Concatenation took 38.266 sec FAILED Expected exception: java.io.IOException junit.framework.AssertionFailedError: Expected exception: java.io.IOException Testcase: testBlockHeaderEndingWithCR took 49.539 sec FAILED expected:<82094> but was:<82093> junit.framework.AssertionFailedError: expected:<82094> but was:<82093> at org.apache.pig.test.TestBZip.testCount(TestBZip.java:256) at org.apache.pig.test.TestBZip.testBlockHeaderEndingWithCR(TestBZip.java:112) Testcase: testBlockHeaderEndingAtSplitNotByteAligned took 48.996 sec FAILED expected:<74999> but was:<101591> junit.framework.AssertionFailedError: expected:<74999> but was:<101591> at org.apache.pig.test.TestBZip.testCount(TestBZip.java:256) at org.apache.pig.test.TestBZip.testBlockHeaderEndingAtSplitNotByteAligned(TestBZip.java:88) {noformat} "testBZ2Concatenation" is expected since hadoop bzip2 codec handles concatenated bzip files (whereas pig's TestBZip is testing whether it reliably fails). Other two are worrisome to me. Asking my colleague to check. It'll take some time. Depending on what we find, we may need to change the condition for using hadoop's bzip codec. 
> Bzip2TextInputFormat requires double the memory of maximum record size > -- > > Key: PIG-3251 > URL: https://issues.apache.org/jira/browse/PIG-3251 > Project: Pig > Issue Type: Improvement >Reporter: Koji Noguchi >Assignee: Koji Noguchi >Priority: Minor > Attachments: pig-3251-trunk-v01.patch, pig-3251-trunk-v02.patch, > pig-3251-trunk-v03.patch, pig-3251-trunk-v04.patch, pig-3251-trunk-v05.patch > > > While looking at user's OOM heap dump, noticed that pig's > Bzip2TextInputFormat consumes memory at both > Bzip2TextInputFormat.buffer (ByteArrayOutputStream) > and actual Text that is returned as line. > For example, when having one record with 160MBytes, buffer was 268MBytes and > Text was 160MBytes. > We can probably eliminate one of them. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-3293) Casting fails after Union from two data sources&loaders
[ https://issues.apache.org/jira/browse/PIG-3293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Noguchi updated PIG-3293: -- Attachment: pig-3293-test-only-v01.patch bq. Must be the "caster" in D's POCast is null. Can you attach MyLoader? Attaching a test case using {noformat} public class PigStorageWithStatistics extends PigStorage { {noformat} from org.apache.pig.test. Even though both PigStorage and PigStorageWithStatistics return Utf8StorageConverter, the testcase fails with "Cannot determine how to convert the bytearray to string." Note that I created PIG-3295 for dealing with the case where casting fails even when the union comes from the same loader. Figuring out if the loaders were the same was easy by calling 'equals' on the FuncSpec instances. I don't know how to achieve this easily for comparing casters. > Casting fails after Union from two data sources&loaders > --- > > Key: PIG-3293 > URL: https://issues.apache.org/jira/browse/PIG-3293 > Project: Pig > Issue Type: Bug >Reporter: Koji Noguchi >Priority: Minor > Attachments: pig-3293-test-only-v01.patch > > > Script similar to > {noformat} > A = load 'data1' using MyLoader() as (a:bytearray); > B = load 'data2' as (a:bytearray); > C = union onschema A,B; > D = foreach C generate (chararray)a; > Store D into './out'; > {noformat} > fails with >java.lang.Exception: org.apache.pig.backend.executionengine.ExecException: > ERROR 1075: Received a bytearray from the UDF. Cannot determine how to > convert the bytearray to string. > Both MyLoader and PigStorage use the default Utf8StorageConverter. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-3310) ImplicitSplitInserter does not generate new uids for nested schema fields, leading to miscomputations
[ https://issues.apache.org/jira/browse/PIG-3310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13665415#comment-13665415 ] Koji Noguchi commented on PIG-3310: --- I also don't have a good understanding on these, but the change looks reasonable to me. [~daijy], original uid reassignment was added in PIG-1705 for the self-join. Can you take a look? > ImplicitSplitInserter does not generate new uids for nested schema fields, > leading to miscomputations > - > > Key: PIG-3310 > URL: https://issues.apache.org/jira/browse/PIG-3310 > Project: Pig > Issue Type: Bug > Components: impl >Affects Versions: 0.11.1 > Environment: Reproduced on 0.10.1, 0.11.1 and trunk >Reporter: Clément Stenac > Attachments: generate-uid-for-nested-fields.patch > > > Hi, > Consider the following example > {code} > inp = LOAD '$INPUT' AS (memberId:long, shopId:long, score:int); > tuplified = FOREACH inp GENERATE (memberId, shopId) AS tuplify, score; > D1 = FOREACH tuplified GENERATE tuplify.memberId as memberId, tuplify.shopId > as shopId, score AS score; > D2 = FOREACH tuplified GENERATE tuplify.memberId as memberId, tuplify.shopId > as shopId, score AS score; > J = JOIN D1 By shopId, D2 by shopId; > K = FOREACH J GENERATE D1::memberId AS member_id1, D2::memberId AS > member_id2, D1::shopId as shop; > EXPLAIN K; > DUMP K; > {code} > It is a bit weird written like that, but it provides a minimal reproduction > case (in the real case, the "tuplified" phase came from a multi-key grouping). > On input data: > {code} > 1 1001101 > 1 1002103 > 1 1003102 > 1 1004102 > 2 1005101 > 2 1003101 > 2 1002123 > 3 1042101 > 3 1005101 > 3 1002133 > {code} > This will give a wrongful output like .. > {code} > (1,1001,1001) > (1,1002,1002) > (1,1002,1002) > (1,1002,1002) > {code} > The second column should be a member id so (1,2,3,4,5). 
> In the initial case, there was a FILTER (member_id1 < member_id2) after K, > and computation failed because of PushUpFilter optimization mistakenly moving > the LOFilter operation before the join, at a place where it tried to work on > a tuple and failed. > My understanding of the issue is that when the ImplicitSplitInserter creates > the LOSplitOutputs, it will correctly reset the schema, and the LOSplitOutput > will regenerate uids for the fields of D1 and D2 ... but will not do that on > the tuple members. > The logical plan after the ImplicitSplitINserter will look like (simplified) > {code} >|---D1: (Name: LOForEach Schema: > memberId#124:long,shopId#125:long)ColumnPrune:InputUids=[127]ColumnPrune:OutputUids=[125, > 124] > |---tuplified: (Name: LOSplitOutput Schema: > tuplify#127:tuple(memberId#124:long,shopId#125:long))ColumnPrune:InputUids=[123]ColumnPrune:OutputUids=[127] >|---tuplified: (Name: LOSplit Schema: > tuplify#123:tuple(memberId#124:long,shopId#125:long))ColumnPrune:InputUids=[123]ColumnPrune:OutputUids=[123] > |---D2: (Name: LOForEach Schema: > memberId#124:long,shopId#125:long)ColumnPrune:InputUids=[130]ColumnPrune:OutputUids=[125, > 124] > |---tuplified: (Name: LOSplitOutput Schema: > tuplify#130:tuple(memberId#124:long,shopId#125:long))ColumnPrune:InputUids=[123]ColumnPrune:OutputUids=[130] >|---tuplified: (Name: LOSplit Schema: > tuplify#123:tuple(memberId#124:long,shopId#125:long))ColumnPrune:InputUids=[123]ColumnPrune:OutputUids=[123] > {code} > tuplified correctly gets a new uid (127 and 130) but the members of the tuple > don't. 
When they get reprojected, both branches have the same uid and the > join looks like: > {code} > |---J: (Name: LOJoin(HASH) Schema: > D1::memberId#124:long,D1::shopId#125:long,D2::memberId#139:long,D2::shopId#132:long)ColumnPrune:InputUids=[125, > 124, 132]ColumnPrune:OutputUids=[125, 124, 132] > | | > | shopId:(Name: Project Type: long Uid: 125 Input: 0 Column: 1) > | | > | shopId:(Name: Project Type: long Uid: 125 Input: 1 Column: 1) > {code} > If for example instead of reprojecting "memberId", we project "memberId+0", a > new node is created, and ultimately the two branches of the join will > correctly get separate uids. > My understanding is that LOSplitOutput.getSchema() should recurse on nested > schema fields. However, I only have a light understanding of all of the > logical plan handling, so I may be completely wrong. > Attached is a draft of patch and a test reproducing the issue. Unfortunately, > I haven't been able to run all unit tests with the "fix" (I have some weird > hangs) > I'd be happy if you could indicate if that looks like completely the wron
[jira] [Commented] (PIG-3257) Add unique identifier UDF
[ https://issues.apache.org/jira/browse/PIG-3257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13668630#comment-13668630 ] Koji Noguchi commented on PIG-3257: --- Would this ensure that the same unique identifier is reproduced when a (map) task attempt is retried? Otherwise, I'm afraid it would lead to random pig behavior when we use this id as the map-reduce key. > Add unique identifier UDF > - > > Key: PIG-3257 > URL: https://issues.apache.org/jira/browse/PIG-3257 > Project: Pig > Issue Type: Improvement > Components: internal-udfs >Reporter: Alan Gates >Assignee: Alan Gates > Fix For: 0.12 > > Attachments: PIG-3257.patch > > > It would be good to have a Pig function to generate unique identifiers. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-3257) Add unique identifier UDF
[ https://issues.apache.org/jira/browse/PIG-3257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13668705#comment-13668705 ] Koji Noguchi commented on PIG-3257: --- bq. I can't see how it would matter whether it produced random key X1 vs random key X2 for any given record. If used in mapreduce key, this can lead to incomplete/incorrect output when mappers are retried. > Add unique identifier UDF > - > > Key: PIG-3257 > URL: https://issues.apache.org/jira/browse/PIG-3257 > Project: Pig > Issue Type: Improvement > Components: internal-udfs >Reporter: Alan Gates >Assignee: Alan Gates > Fix For: 0.12 > > Attachments: PIG-3257.patch > > > It would be good to have a Pig function to generate unique identifiers. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-3257) Add unique identifier UDF
[ https://issues.apache.org/jira/browse/PIG-3257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13668717#comment-13668717 ] Koji Noguchi commented on PIG-3257: --- bq. incomplete/incorrect output I mean, this can result in missing records or redundant records. (support nightmare for me.) > Add unique identifier UDF > - > > Key: PIG-3257 > URL: https://issues.apache.org/jira/browse/PIG-3257 > Project: Pig > Issue Type: Improvement > Components: internal-udfs >Reporter: Alan Gates >Assignee: Alan Gates > Fix For: 0.12 > > Attachments: PIG-3257.patch > > > It would be good to have a Pig function to generate unique identifiers. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-3257) Add unique identifier UDF
[ https://issues.apache.org/jira/browse/PIG-3257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13669195#comment-13669195 ] Koji Noguchi commented on PIG-3257: --- With your first example, say you have _n_ input records, 1 mapper, and 2 reducers. {noformat} A = load ... B = group A by UUID(); STORE B ... {noformat} This job could successfully finish with output ranging from 0 to 2n records. For example, the sequence of events could be: # mapper0_attempt0 finishes with n outputs, and say all n uuid keys were assigned to reducer0. # reducer0_attempt0 pulls the map outputs and produces _n_ outputs. # reducer1_attempt0 tries to pull mapper0_attempt0's output and fails (could be a fetch failure or a node failure). # mapper0_attempt1 reruns, and this time all n uuid keys are assigned to reducer1. # reducer1_attempt0 pulls mapper0_attempt1's output and produces n outputs. # The job finishes successfully with 2n outputs. This is certainly unexpected to users. Now, with your second example {noformat} A = load 'over100k' using org.apache.hcatalog.pig.HCatLoader(); B = foreach A generate *, UUID(); C = group B by s; D = foreach C generate flatten(B), SUM(B.i) as sum_b; E = group B by si; F = foreach E generate flatten(B), SUM(B.f) as sum_f; G = join D by uuid, F by uuid; H = foreach G generate D::B::s, sum_b, sum_f; store H into 'output'; {noformat} let's say Pig decides to implement the two group-bys (C and E) with one map-reduce job. For simplicity, let's use 1 mapper and 2 reducers again, and assume Pig decides to partition all the group-by keys in _C_ to reducer0 and those in _E_ to reducer1. Now, using the same story as above, there could be a case where reducer0 (group-by C) gets one set of UUIDs from mapper0_attempt0 and reducer1 (group-by E) gets a completely different set of UUIDs from mapper0_attempt1. When this happens, the join _G_ would produce 0 results, which is unexpected to users. 
Of course this depends on how Pig executes the above query, but I hope it demonstrates how tricky things get when a purely random id is introduced in Hadoop. What's worse about all this is that it's a corner case that won't get caught in users' QE phases and will only manifest in production pipelines. Users would then yell at me about corrupted output from successful jobs. Hence my previous comment about a "support nightmare". > Add unique identifier UDF > - > > Key: PIG-3257 > URL: https://issues.apache.org/jira/browse/PIG-3257 > Project: Pig > Issue Type: Improvement > Components: internal-udfs >Reporter: Alan Gates >Assignee: Alan Gates > Fix For: 0.12 > > Attachments: PIG-3257.patch > > > It would be good to have a Pig function to generate unique identifiers. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
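The scenarios above all stem from the UDF being non-deterministic across task attempts: a retried map attempt produces fresh UUIDs, so the shuffle keys silently change. A minimal plain-Java sketch (illustrative only; the class and method names are made up, not taken from the attached patch) of why two attempts disagree on the same record:

```java
import java.util.UUID;

public class UuidAttempts {
    // Simulates what a UUID() UDF would emit for the same input record
    // on two different task attempts: a fresh random id each time.
    public static String keyForRecord(String record) {
        return UUID.randomUUID().toString();
    }

    public static void main(String[] args) {
        String attempt0 = keyForRecord("record-1");
        String attempt1 = keyForRecord("record-1");
        // The retried attempt yields a different key, so the record can hash
        // to a different reducer than the one that already consumed it.
        System.out.println(attempt0.equals(attempt1)); // prints "false"
    }
}
```

Any fix along the lines of the earlier comment would need the id to be a deterministic function of the task's input (for example, input split plus record offset) rather than pure randomness, so a rerun attempt reproduces the same keys.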
[jira] [Commented] (PIG-3355) ColumnMapKeyPrune bug with distinct operator
[ https://issues.apache.org/jira/browse/PIG-3355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13686815#comment-13686815 ] Koji Noguchi commented on PIG-3355: --- bq. Committed to trunk. Thanks Jeremy! [~aniket486], status is still "Patch Available"? Also, can we patch 0.11 as well so that it'll be included if we release another 0.11.* ? > ColumnMapKeyPrune bug with distinct operator > > > Key: PIG-3355 > URL: https://issues.apache.org/jira/browse/PIG-3355 > Project: Pig > Issue Type: Bug >Affects Versions: 0.9.2, 0.10.1, 0.11.1 >Reporter: Jeremy Karn >Assignee: Jeremy Karn > Attachments: PIG-3355.patch > > > We came across a bug that happens when you have a distinct operator > immediately followed by a union where the result of the union has at least > one column that will be pruned by ColumnMapKeyPrune. There's a test showing > an example script in the submitted patch. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-3385) DISTINCT no longer uses custom partitioner
[ https://issues.apache.org/jira/browse/PIG-3385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Noguchi updated PIG-3385: -- Attachment: pig-3385-v01.patch Wondering if a custom partitioner ever worked for distinct. It looks like the partitioner info is passed through POGlobalRearrange but "distinct" doesn't use it. Uploading an initial patch that just passes that info through PODistinct. This is my first time touching the backend code, so I'd appreciate it if someone could take a look. I'll upload a testcase next. > DISTINCT no longer uses custom partitioner > -- > > Key: PIG-3385 > URL: https://issues.apache.org/jira/browse/PIG-3385 > Project: Pig > Issue Type: Bug > Components: documentation >Reporter: Will Oberman >Priority: Minor > Attachments: pig-3385-v01.patch > > > From u...@pig.apache.org: It looks like an optimization was put in to make > distinct use a special partitioner which prevents the user from setting the > partitioner. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-3385) DISTINCT no longer uses custom partitioner
[ https://issues.apache.org/jira/browse/PIG-3385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Noguchi updated PIG-3385: -- Component/s: (was: documentation) impl Assignee: Koji Noguchi > DISTINCT no longer uses custom partitioner > -- > > Key: PIG-3385 > URL: https://issues.apache.org/jira/browse/PIG-3385 > Project: Pig > Issue Type: Bug > Components: impl >Reporter: Will Oberman >Assignee: Koji Noguchi >Priority: Minor > Attachments: pig-3385-v01.patch > > > From u...@pig.apache.org: It looks like an optimization was put in to make > distinct use a special partitioner which prevents the user from setting the > partitioner. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-3385) DISTINCT no longer uses custom partitioner
[ https://issues.apache.org/jira/browse/PIG-3385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Noguchi updated PIG-3385: -- Attachment: pig-3385-v02.patch Uploading a patch with a test. Noticed that the original test for custom partitioners didn't produce partition results different from the default's, so I added a silly partitioner that always returns 1 (the second reducer). > DISTINCT no longer uses custom partitioner > -- > > Key: PIG-3385 > URL: https://issues.apache.org/jira/browse/PIG-3385 > Project: Pig > Issue Type: Bug > Components: impl >Reporter: Will Oberman >Assignee: Koji Noguchi >Priority: Minor > Attachments: pig-3385-v01.patch, pig-3385-v02.patch > > > From u...@pig.apache.org: It looks like an optimization was put in to make > distinct use a special partitioner which prevents the user from setting the > partitioner. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
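For reference, the "always return 1" partitioner described above looks roughly like this. The class name is made up, and the real one would extend org.apache.hadoop.mapreduce.Partitioner<K,V>; the Hadoop dependency is left out so the sketch stands alone:

```java
// Sketch of a trivial custom partitioner like the one added to the test.
// In real Pig/Hadoop code this would extend
// org.apache.hadoop.mapreduce.Partitioner<K,V>; omitted here for brevity.
public class AlwaysSecondReducer {
    // Hadoop's contract: return a partition index in [0, numPartitions).
    public static int getPartition(Object key, Object value, int numPartitions) {
        return 1; // every record goes to the second reducer, regardless of key
    }

    public static void main(String[] args) {
        System.out.println(getPartition("anyKey", "anyValue", 2)); // prints "1"
    }
}
```

Because every record lands on reducer 1, any output found on reducer 0 proves the custom partitioner was ignored, which is exactly what the test needs to detect.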
[jira] [Created] (PIG-3435) Custom Partitioner not working with MultiQueryOptimizer
Koji Noguchi created PIG-3435: - Summary: Custom Partitioner not working with MultiQueryOptimizer Key: PIG-3435 URL: https://issues.apache.org/jira/browse/PIG-3435 Project: Pig Issue Type: Bug Components: impl Reporter: Koji Noguchi Assignee: Koji Noguchi While looking at PIG-3385, I noticed some issues in the handling of custom partitioners with multi-query optimization. {noformat} C1 = group B1 by col1 PARTITION BY org.apache.pig.test.utils.SimpleCustomPartitioner parallel 2; C2 = group B2 by col1 PARTITION BY org.apache.pig.test.utils.SimpleCustomPartitioner parallel 2; {noformat} These seem to be merged into one mapreduce job correctly, but the custom partitioner information was lost. {noformat} C1 = group B1 by col1 PARTITION BY org.apache.pig.test.utils.SimpleCustomPartitioner parallel 2; C2 = group B2 by col1 parallel 2; {noformat} These seem to be merged even though they should run with two different partitioners. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-3385) DISTINCT no longer uses custom partitioner
[ https://issues.apache.org/jira/browse/PIG-3385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13746937#comment-13746937 ] Koji Noguchi commented on PIG-3385: --- While looking at this jira, noticed custom partitioner being dropped when run with multi query optimization. Created PIG-3435. > DISTINCT no longer uses custom partitioner > -- > > Key: PIG-3385 > URL: https://issues.apache.org/jira/browse/PIG-3385 > Project: Pig > Issue Type: Bug > Components: impl >Reporter: Will Oberman >Assignee: Koji Noguchi >Priority: Minor > Attachments: pig-3385-v01.patch, pig-3385-v02.patch > > > From u...@pig.apache.org: It looks like an optimization was put in to make > distinct use a special partitioner which prevents the user from setting the > partitioner. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-3435) Custom Partitioner not working with MultiQueryOptimizer
[ https://issues.apache.org/jira/browse/PIG-3435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Noguchi updated PIG-3435: -- Attachment: pig-3435-v01.patch Looking at the multi-query optimization code and documents. I chickened out. Taking the same approach as PIG-1108 and simply skipping the MR jobs with custom partitioner. Attaching the test case soon. > Custom Partitioner not working with MultiQueryOptimizer > --- > > Key: PIG-3435 > URL: https://issues.apache.org/jira/browse/PIG-3435 > Project: Pig > Issue Type: Bug > Components: impl >Reporter: Koji Noguchi >Assignee: Koji Noguchi > Attachments: pig-3435-v01.patch > > > When looking at PIG-3385, noticed some issues in handling of custom > partitioner with multi-query optimization. > {noformat} > C1 = group B1 by col1 PARTITION BY >org.apache.pig.test.utils.SimpleCustomPartitioner parallel 2; > C2 = group B2 by col1 PARTITION BY >org.apache.pig.test.utils.SimpleCustomPartitioner parallel 2; > {noformat} > This seems to be merged to one mapreduce job correctly but custom partitioner > information was lost. > {noformat} > C1 = group B1 by col1 PARTITION BY > org.apache.pig.test.utils.SimpleCustomPartitioner parallel 2; > C2 = group B2 by col1 parallel 2; > {noformat} > This seems to be merged even though they should run on two different > partitioner. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-3435) Custom Partitioner not working with MultiQueryOptimizer
[ https://issues.apache.org/jira/browse/PIG-3435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Noguchi updated PIG-3435: -- Attachment: pig-3435-v02_skipcustompatitioner_for_merge.patch While working on the testcase, I found PIG-2627, which fixed one of the issues with custom partitioners and multiquery optimization (but not all of them). The specific case mentioned on that ticket is handled there and works, but my patch here simply skips multiquery optimization for ALL custom partitioner jobs. Since this is essentially a correctness issue, I want the fix back-ported to 0.11, and for that I kept the change simple. Can we create a separate jira for reviving custom-partitioner + multiquery optimization in later releases? > Custom Partitioner not working with MultiQueryOptimizer > --- > > Key: PIG-3435 > URL: https://issues.apache.org/jira/browse/PIG-3435 > Project: Pig > Issue Type: Bug > Components: impl >Reporter: Koji Noguchi >Assignee: Koji Noguchi > Attachments: pig-3435-v01.patch, > pig-3435-v02_skipcustompatitioner_for_merge.patch > > > When looking at PIG-3385, noticed some issues in handling of custom > partitioner with multi-query optimization. > {noformat} > C1 = group B1 by col1 PARTITION BY >org.apache.pig.test.utils.SimpleCustomPartitioner parallel 2; > C2 = group B2 by col1 PARTITION BY >org.apache.pig.test.utils.SimpleCustomPartitioner parallel 2; > {noformat} > This seems to be merged to one mapreduce job correctly but custom partitioner > information was lost. > {noformat} > C1 = group B1 by col1 PARTITION BY > org.apache.pig.test.utils.SimpleCustomPartitioner parallel 2; > C2 = group B2 by col1 parallel 2; > {noformat} > This seems to be merged even though they should run on two different > partitioner. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Assigned] (PIG-3440) MultiQuery to work with custom partitioner
[ https://issues.apache.org/jira/browse/PIG-3440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Noguchi reassigned PIG-3440: - Assignee: Koji Noguchi Taking a look. I'm not sure whether we should limit the merging to jobs with the same custom partitioner, or whether we can merge all jobs and have a single partitioner that delegates to the corresponding partitioner based on the input. (I'm still learning multi-query optimization; I may be off on this.) > MultiQuery to work with custom partitioner > -- > > Key: PIG-3440 > URL: https://issues.apache.org/jira/browse/PIG-3440 > Project: Pig > Issue Type: Bug > Components: impl >Reporter: Daniel Dai >Assignee: Koji Noguchi > > Currently Pig disable multiquery in case of custom partitioner in PIG-3435. > However, when custom partitioner are the same, we can still use multiquery. > This is the Jira ticket to track this optimization. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
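The second option mentioned in the comment, merging all jobs behind a single partitioner that delegates per input, could look roughly like the sketch below. Everything here (the class names, and the idea of tagging each record with the index of the pipeline it came from) is a hypothetical illustration, not code from Pig:

```java
import java.util.List;

// Hypothetical sketch: a merged multi-query job keeps each pipeline's
// partitioning logic, and each record carries the index of the pipeline
// it came from, so one partitioner can delegate per record.
public class DelegatingPartitioner {
    public interface PartitionFn {
        int getPartition(Object key, int numPartitions);
    }

    private final List<PartitionFn> delegates;

    public DelegatingPartitioner(List<PartitionFn> delegates) {
        this.delegates = delegates;
    }

    public int partition(int pipelineIndex, Object key, int numPartitions) {
        // Route with the partitioner belonging to the record's pipeline.
        return delegates.get(pipelineIndex).getPartition(key, numPartitions);
    }

    public static void main(String[] args) {
        DelegatingPartitioner p = new DelegatingPartitioner(List.of(
                (key, n) -> 0,                               // pipeline 0: custom, all to reducer 0
                (key, n) -> Math.abs(key.hashCode()) % n));  // pipeline 1: default hash partitioning
        System.out.println(p.partition(0, "anyKey", 2)); // prints "0"
    }
}
```

The restriction to "same custom partitioner" is the simpler alternative: it only requires comparing partitioner FuncSpecs at merge time, with no record tagging at runtime.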
[jira] [Commented] (PIG-3435) Custom Partitioner not working with MultiQueryOptimizer
[ https://issues.apache.org/jira/browse/PIG-3435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13751999#comment-13751999 ] Koji Noguchi commented on PIG-3435: --- Thanks Daniel! Can we back-port this patch to 0.11? (That was one of my motivations for keeping the patch simple.) I'll work on PIG-3440. > Custom Partitioner not working with MultiQueryOptimizer > --- > > Key: PIG-3435 > URL: https://issues.apache.org/jira/browse/PIG-3435 > Project: Pig > Issue Type: Bug > Components: impl >Reporter: Koji Noguchi >Assignee: Koji Noguchi > Fix For: 0.12 > > Attachments: pig-3435-v01.patch, > pig-3435-v02_skipcustompatitioner_for_merge.patch > > > When looking at PIG-3385, noticed some issues in handling of custom > partitioner with multi-query optimization. > {noformat} > C1 = group B1 by col1 PARTITION BY >org.apache.pig.test.utils.SimpleCustomPartitioner parallel 2; > C2 = group B2 by col1 PARTITION BY >org.apache.pig.test.utils.SimpleCustomPartitioner parallel 2; > {noformat} > This seems to be merged to one mapreduce job correctly but custom partitioner > information was lost. > {noformat} > C1 = group B1 by col1 PARTITION BY > org.apache.pig.test.utils.SimpleCustomPartitioner parallel 2; > C2 = group B2 by col1 parallel 2; > {noformat} > This seems to be merged even though they should run on two different > partitioner. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-3385) DISTINCT no longer uses custom partitioner
[ https://issues.apache.org/jira/browse/PIG-3385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13752000#comment-13752000 ] Koji Noguchi commented on PIG-3385: --- Thanks Daniel! Can we back-port this patch and PIG-3435 to 0.11? Without them, custom partitioner is almost unusable. > DISTINCT no longer uses custom partitioner > -- > > Key: PIG-3385 > URL: https://issues.apache.org/jira/browse/PIG-3385 > Project: Pig > Issue Type: Bug > Components: impl >Reporter: Will Oberman >Assignee: Koji Noguchi >Priority: Minor > Fix For: 0.12 > > Attachments: pig-3385-v01.patch, pig-3385-v02.patch > > > From u...@pig.apache.org: It looks like an optimization was put in to make > distinct use a special partitioner which prevents the user from setting the > partitioner. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
dev@pig.apache.org
[ https://issues.apache.org/jira/browse/PIG-3293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13755016#comment-13755016 ] Koji Noguchi commented on PIG-3293: --- I hit a worse case today. (1) The case I mentioned originally was a union between loaderA and loaderB, both of which return the same loadCaster, Utf8StorageConverter, with the typecast failing after the union. The one I saw today: (2) a single loader but with different arguments, resulting in a typecast error. {noformat} A = load 'data1' using LoaderA('col1') as (a:bytearray); B = load 'data1' using LoaderA('col2') as (a:bytearray); C = union ...; D = foreach C generate (chararray)a; store D ... {noformat} I wish I could simply check the classname of the loaders to establish the uniqueness of the loadcaster, but I've seen HBaseStorage return a different loadcaster depending on its input parameters. One other approach I'm considering: is it possible to push the typecast above the union, so that we can perform loader.getLoadCaster().bytesToCharArray for each input to the union? > Casting fails after Union from two data sources&loaders > --- > > Key: PIG-3293 > URL: https://issues.apache.org/jira/browse/PIG-3293 > Project: Pig > Issue Type: Bug >Reporter: Koji Noguchi >Priority: Minor > Attachments: pig-3293-test-only-v01.patch > > > Script similar to > {noformat} > A = load 'data1' using MyLoader() as (a:bytearray); > B = load 'data2' as (a:bytearray); > C = union onschema A,B; > D = foreach C generate (chararray)a; > Store D into './out'; > {noformat} > fails with >java.lang.Exception: org.apache.pig.backend.executionengine.ExecException: > ERROR 1075: Received a bytearray from the UDF. Cannot determine how to > convert the bytearray to string. > Both MyLoader and PigStorage use the default Utf8StorageConverter. -- This message is automatically generated by JIRA. 
If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
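One way to picture the "push the typecast above the union" idea from the comment is as the equivalent manual rewrite of the failing script: cast in each branch while the relation is still tied to a single loader, so each cast can resolve against that loader's own LoadCaster. (LoaderA and its column arguments are the hypothetical ones from the example above.)

{noformat}
A = load 'data1' using LoaderA('col1') as (a:bytearray);
B = load 'data1' using LoaderA('col2') as (a:bytearray);
A2 = foreach A generate (chararray)a; -- cast resolved with A's own LoadCaster
B2 = foreach B generate (chararray)a; -- cast resolved with B's own LoadCaster
C = union A2, B2;
store C into 'out';
{noformat}

After the rewrite, the union only ever sees already-cast chararrays, so no field needs a caster whose origin is ambiguous.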
dev@pig.apache.org
[ https://issues.apache.org/jira/browse/PIG-3293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13756712#comment-13756712 ] Koji Noguchi commented on PIG-3293: --- bq. Also improve the error message to indicate possible causes would help. I've updated the error message a bit in PIG-3295. However, it is still vague in that I cannot tell whether the failure was due to UDF with no loadcaster or Union with two different loaders. > Casting fails after Union from two data sources&loaders > --- > > Key: PIG-3293 > URL: https://issues.apache.org/jira/browse/PIG-3293 > Project: Pig > Issue Type: Bug >Reporter: Koji Noguchi >Priority: Minor > Attachments: pig-3293-test-only-v01.patch > > > Script similar to > {noformat} > A = load 'data1' using MyLoader() as (a:bytearray); > B = load 'data2' as (a:bytearray); > C = union onschema A,B; > D = foreach C generate (chararray)a; > Store D into './out'; > {noformat} > fails with >java.lang.Exception: org.apache.pig.backend.executionengine.ExecException: > ERROR 1075: Received a bytearray from the UDF. Cannot determine how to > convert the bytearray to string. > Both MyLoader and PigStorage use the default Utf8StorageConverter. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-3295) Casting from bytearray failing after Union (even when each field is from a single Loader)
[ https://issues.apache.org/jira/browse/PIG-3295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Noguchi updated PIG-3295: -- Attachment: pig-3295-v02.patch Just noticed my previous patch wasn't created with '--no-prefix' option. Reattaching. > Casting from bytearray failing after Union (even when each field is from a > single Loader) > - > > Key: PIG-3295 > URL: https://issues.apache.org/jira/browse/PIG-3295 > Project: Pig > Issue Type: Bug > Components: parser >Reporter: Koji Noguchi >Assignee: Koji Noguchi >Priority: Minor > Attachments: pig-3295-v01.patch, pig-3295-v02.patch > > > One example > {noformat} > A = load 'data1.txt' as line:bytearray; > B = load 'c1.txt' using TextLoader() as cookie1; > C = load 'c2.txt' using TextLoader() as cookie2; > B2 = join A by line, B by cookie1; > C2 = join A by line, C by cookie2; > D = union onschema B2,C2; -- D: {A::line: bytearray,B::cookie1: > bytearray,C::cookie2: bytearray} > E = foreach D generate (chararray) line, (chararray) cookie1, (chararray) > cookie2; > dump E; > {noformat} > This script fails at runtime with > "Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 1075: > Received a bytearray from the UDF. Cannot determine how to convert the > bytearray to string." > This is different from PIG-3293 such that each field in 'D' belongs to a > single loader whereas on PIG-3293, it came from multiple loader. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (PIG-3447) Compiler warning message dropped for CastLineageSetter and others with no enum kind
Koji Noguchi created PIG-3447: - Summary: Compiler warning message dropped for CastLineageSetter and others with no enum kind Key: PIG-3447 URL: https://issues.apache.org/jira/browse/PIG-3447 Project: Pig Issue Type: Bug Reporter: Koji Noguchi Assignee: Koji Noguchi Priority: Trivial The following compiler warning was never shown to users, for two reasons. {noformat} //./src/org/apache/pig/newplan/logical/visitor/CastLineageSetter.java 106 if(inLoadFunc == null){ 107 String msg = "Cannot resolve load function to use for casting from " + 108 DataType.findTypeName(inType) + " to " + 109 DataType.findTypeName(outType) + ". "; 110 msgCollector.collect(msg, MessageType.Warning); 111 } {noformat} # CompilationMessageCollector.logMessages or logAllMessages is not called after CastLineageSetter.visit. # CompilationMessageCollector.collect with no KIND doesn't print any messages when aggregate.warning=true (the default) -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-3447) Compiler warning message dropped for CastLineageSetter and others with no enum kind
[ https://issues.apache.org/jira/browse/PIG-3447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Noguchi updated PIG-3447: -- Attachment: pig-3447-v01.txt With the patch, it'll print out {noformat} 2013-09-03 13:58:20,625 [main] WARN org.apache.pig.PigServer - Encountered Warning NO_LOAD_FUNCTION_FOR_CASTING_BYTEARRAY 1 time(s). {noformat} If anyone still calls CompilationMessageCollector.collect without enum KIND, then it'll at least print out {noformat} 2013-08-30 22:25:20,940 [main] WARN org.apache.pig.PigServer - Encountered Warning Aggregated unknown kind messages. Please set -Daggregate.warning=false to retrieve these messages 1 time(s). {noformat} Before, it wasn't printing out anything. With -Daggregate.warning=false, it'll print out the following (even without this patch). {noformat} 2013-09-03 14:24:48,275 [main] WARN org.apache.pig.PigServer - Cannot resolve load function to use for casting from bytearray to chararray. {noformat} > Compiler warning message dropped for CastLineageSetter and others with no > enum kind > --- > > Key: PIG-3447 > URL: https://issues.apache.org/jira/browse/PIG-3447 > Project: Pig > Issue Type: Bug >Reporter: Koji Noguchi >Assignee: Koji Noguchi >Priority: Trivial > Attachments: pig-3447-v01.txt > > > Following compiler warning was never shown to users for two reasons. > {noformat} > //./src/org/apache/pig/newplan/logical/visitor/CastLineageSetter.java > 106 if(inLoadFunc == null){ > 107 String msg = "Cannot resolve load function to use for casting from > " + > 108 DataType.findTypeName(inType) + " to " + > 109 DataType.findTypeName(outType) + ". "; > 110 msgCollector.collect(msg, MessageType.Warning); > 111 } > {noformat} > # CompilationMessageCollector.logMessages or logAllMessages not being called > after CastLineageSetter.visit. > # CompilationMessageCollector.collect with no KIND don't print out any > messages when aggregate.warning=true (default) -- This message is automatically generated by JIRA. 
If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
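The aggregation behavior described in the patch comment can be sketched as follows. This is a simplified stand-in for CompilationMessageCollector with made-up names, just to show why kind-less warnings need a fallback bucket when aggregate.warning=true:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Rough sketch: with aggregation enabled, warnings are counted per kind and
// summarized at the end; a warning collected without a kind must fall into
// a fallback bucket, or it is silently dropped (the bug described above).
public class WarningAggregator {
    private final Map<String, Integer> counts = new LinkedHashMap<>();

    public void collect(String kind) {
        String bucket = (kind == null) ? "UNKNOWN_KIND" : kind;
        counts.merge(bucket, 1, Integer::sum);
    }

    public String summarize(String kind) {
        return "Encountered Warning " + kind + " "
                + counts.getOrDefault(kind, 0) + " time(s).";
    }

    public static void main(String[] args) {
        WarningAggregator agg = new WarningAggregator();
        agg.collect("NO_LOAD_FUNCTION_FOR_CASTING_BYTEARRAY");
        agg.collect(null); // a kind-less warning no longer disappears
        System.out.println(agg.summarize("NO_LOAD_FUNCTION_FOR_CASTING_BYTEARRAY"));
        System.out.println(agg.summarize("UNKNOWN_KIND"));
    }
}
```

With aggregation disabled (aggregate.warning=false), each message would instead be printed verbatim as it is collected, which matches the third log line shown above.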
[jira] [Commented] (PIG-2315) Make as clause work in generate
[ https://issues.apache.org/jira/browse/PIG-2315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13757098#comment-13757098 ] Koji Noguchi commented on PIG-2315: --- bq. because it is not working anyway. There's at least one case where it's working for our users. {noformat} a = load 'input.txt' as (nb:bag{}); b = foreach a generate flatten(nb) as (year, name:bytearray); c = filter b by name == 'user1'; dump c; {noformat} The case above works. But without the ':bytearray' in relation b, it fails. {noformat} a = load 'input.txt' as (nb:bag{}); b = foreach a generate flatten(nb) as (year, name); c = filter b by name == 'user1'; dump c; {noformat} "Front End: ERROR 1052: Cannot cast bytearray to chararray" Please keep the first case valid. (Thanks [~fuding] for this example.) The error message in the second case is misleading; it's actually failing while trying to typecast NULL to chararray. > Make as clause work in generate > --- > > Key: PIG-2315 > URL: https://issues.apache.org/jira/browse/PIG-2315 > Project: Pig > Issue Type: Bug >Reporter: Olga Natkovich >Assignee: Gianmarco De Francisci Morales > Fix For: 0.12 > > > Currently, the following syntax is supported and ignored causing confusion > with users: > A1 = foreach A1 generate a as a:chararray ; > After this statement a just retains its previous type -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-3295) Casting from bytearray failing after Union (even when each field is from a single Loader)
[ https://issues.apache.org/jira/browse/PIG-3295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13759189#comment-13759189 ] Koji Noguchi commented on PIG-3295: --- bq. How about doing more aggressively by checking LoadCaster? That was my first approach, but as I also wrote in PIG-3293, "Figuring out if the loaders were same was easy with calling 'equals' for the FuncSpec. I don't know how to achieve this easily for comparing casters." > Casting from bytearray failing after Union (even when each field is from a > single Loader) > - > > Key: PIG-3295 > URL: https://issues.apache.org/jira/browse/PIG-3295 > Project: Pig > Issue Type: Bug > Components: parser >Reporter: Koji Noguchi >Assignee: Koji Noguchi >Priority: Minor > Attachments: pig-3295-v01.patch, pig-3295-v02.patch > > > One example > {noformat} > A = load 'data1.txt' as line:bytearray; > B = load 'c1.txt' using TextLoader() as cookie1; > C = load 'c2.txt' using TextLoader() as cookie2; > B2 = join A by line, B by cookie1; > C2 = join A by line, C by cookie2; > D = union onschema B2,C2; -- D: {A::line: bytearray,B::cookie1: > bytearray,C::cookie2: bytearray} > E = foreach D generate (chararray) line, (chararray) cookie1, (chararray) > cookie2; > dump E; > {noformat} > This script fails at runtime with > "Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 1075: > Received a bytearray from the UDF. Cannot determine how to convert the > bytearray to string." > This is different from PIG-3293 such that each field in 'D' belongs to a > single loader whereas on PIG-3293, it came from multiple loader. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-3295) Casting from bytearray failing after Union (even when each field is from a single Loader)
[ https://issues.apache.org/jira/browse/PIG-3295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13759228#comment-13759228 ] Koji Noguchi commented on PIG-3295: --- bq. Can we instantiate the LoadFunc (with parameters) and then compare? Possible, but I can only compare the classnames? For the funcspec comparisons, we're comparing the classname as well as the parameters passed to the constructors. > Casting from bytearray failing after Union (even when each field is from a > single Loader) > - > > Key: PIG-3295 > URL: https://issues.apache.org/jira/browse/PIG-3295 > Project: Pig > Issue Type: Bug > Components: parser >Reporter: Koji Noguchi >Assignee: Koji Noguchi >Priority: Minor > Attachments: pig-3295-v01.patch, pig-3295-v02.patch > > > One example > {noformat} > A = load 'data1.txt' as line:bytearray; > B = load 'c1.txt' using TextLoader() as cookie1; > C = load 'c2.txt' using TextLoader() as cookie2; > B2 = join A by line, B by cookie1; > C2 = join A by line, C by cookie2; > D = union onschema B2,C2; -- D: {A::line: bytearray,B::cookie1: > bytearray,C::cookie2: bytearray} > E = foreach D generate (chararray) line, (chararray) cookie1, (chararray) > cookie2; > dump E; > {noformat} > This script fails at runtime with > "Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 1075: > Received a bytearray from the UDF. Cannot determine how to convert the > bytearray to string." > This is different from PIG-3293 such that each field in 'D' belongs to a > single loader whereas on PIG-3293, it came from multiple loader. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-3295) Casting from bytearray failing after Union (even when each field is from a single Loader)
[ https://issues.apache.org/jira/browse/PIG-3295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13760297#comment-13760297 ] Koji Noguchi commented on PIG-3295: --- bq. How about we make one exception in the case LoadCaster has only a default constructor? If you mean making an exception only for Utf8StorageConverter, that makes sense, since we have full control over the class and we know that a classname check is sufficient for equality. Let me try coming up with a new patch.
[jira] [Updated] (PIG-3295) Casting from bytearray failing after Union (even when each field is from a single Loader)
[ https://issues.apache.org/jira/browse/PIG-3295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Noguchi updated PIG-3295: -- Attachment: pig-3295-v03.patch Attaching a patch which includes Daniel's suggestion of comparing the LoadCaster (and limiting it to casters with default constructors only). Haven't run the full test suite yet. > Casting from bytearray failing after Union (even when each field is from a > single Loader) > - > > Key: PIG-3295 > URL: https://issues.apache.org/jira/browse/PIG-3295 > Project: Pig > Issue Type: Bug > Components: parser >Reporter: Koji Noguchi >Assignee: Koji Noguchi >Priority: Minor > Attachments: pig-3295-v01.patch, pig-3295-v02.patch, > pig-3295-v03.patch
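The "default constructor only" exception discussed in these comments can be sketched with plain reflection: two caster classes are treated as interchangeable only when they are the same class and that class exposes nothing but a no-arg constructor, so instances cannot differ by construction parameters. This is a hedged illustration, not Pig's actual implementation; `sameDefaultConstructedCaster` is a hypothetical helper name.

```java
import java.lang.reflect.Constructor;

public class CasterEquality {
    // Two caster classes are considered equivalent only when they are the
    // same class AND that class declares exactly one constructor taking no
    // arguments -- then a classname check alone is sufficient for equality,
    // as with Utf8StorageConverter.
    static boolean sameDefaultConstructedCaster(Class<?> a, Class<?> b) {
        if (a != b) {
            return false;
        }
        Constructor<?>[] ctors = a.getDeclaredConstructors();
        return ctors.length == 1 && ctors[0].getParameterCount() == 0;
    }

    public static void main(String[] args) {
        // Object has only the no-arg constructor, so the check passes;
        // String has many constructors, so the check conservatively fails.
        System.out.println(sameDefaultConstructedCaster(Object.class, Object.class));
        System.out.println(sameDefaultConstructedCaster(String.class, String.class));
    }
}
```

The check is deliberately conservative: any class with a parameterized constructor is rejected, because two instances of it might have been configured differently even though they share a classname.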
[jira] [Created] (PIG-3458) ScalarExpression lost with multiquery optimization
Koji Noguchi created PIG-3458: - Summary: ScalarExpression lost with multiquery optimization Key: PIG-3458 URL: https://issues.apache.org/jira/browse/PIG-3458 Project: Pig Issue Type: Bug Reporter: Koji Noguchi Assignee: Koji Noguchi Our user reported an issue where their scalar results go missing when there are two store statements. {noformat} A = load 'test1.txt' using PigStorage('\t') as (a:chararray, count:long); B = group A all; C = foreach B generate SUM(A.count) as total ; store C into 'deleteme6_C' using PigStorage(','); Z = load 'test2.txt' using PigStorage('\t') as (a:chararray, id:chararray ); Y = group Z by id; X = foreach Y generate group, C.total; store X into 'deleteme6_X' using PigStorage(','); Inputs pig> cat test1.txt a 1 b 2 c 8 d 9 pig> cat test2.txt a z b y c x pig> {noformat} Result X should contain the total count of '20', but instead it's empty. {noformat} pig> cat deleteme6_C/part-r-0 20 pig> cat deleteme6_X/part-r-0 x, y, z, pig> {noformat} This works if we take out the first "store C" statement.
[jira] [Commented] (PIG-3458) ScalarExpression lost with multiquery optimization
[ https://issues.apache.org/jira/browse/PIG-3458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13765964#comment-13765964 ] Koji Noguchi commented on PIG-3458: --- The reason it gets lost is that we store C using PigStorage, but ReadScalars tries to read it back with a hardcoded InterStorage. {noformat} ... [First mapreduce job] Reduce Plan C: Store(/.../deleteme6_C:PigStorage(',')) - scope-17 | ... [Second mapreduce job] | POUserFunc(org.apache.pig.impl.builtin.ReadScalars)[long] - scope-31 | | | |---Constant(0) - scope-29 | | | |---Constant(/.../deleteme6_C) - scope-30 {noformat} Trying to understand what the fix should be. 1. Make ReadScalars use the corresponding Loader. 2. Split relation 'C' so that we store it in both PigStorage AND InterStorage. I'm guessing the latter, but would appreciate your feedback.
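Option 2 above (storing the relation with both the user's storer and the interchange format) can be modeled as a small planner decision: whenever a relation that feeds a scalar read is stored with anything other than the interchange format, emit an extra interchange-format store for ReadScalars to consume. This is only a toy sketch of the idea, not Pig's multiquery optimizer; `planStores` is a hypothetical helper and the storer names are plain strings standing in for StoreFunc specs.

```java
import java.util.ArrayList;
import java.util.List;

public class ScalarStoreSplit {
    // Toy model of fix option 2: ReadScalars always loads with the
    // interchange format ("InterStorage"), so when the relation's only
    // store uses a different storer AND the relation is also read back as
    // a scalar, add a second, interchange-format store.
    static List<String> planStores(String userStorer, boolean usedAsScalar) {
        List<String> stores = new ArrayList<>(List.of(userStorer));
        if (usedAsScalar && !"InterStorage".equals(userStorer)) {
            stores.add("InterStorage"); // extra store feeding ReadScalars
        }
        return stores;
    }

    public static void main(String[] args) {
        // Relation C from PIG-3458: stored with PigStorage(',') and also
        // consumed as a scalar by X, so a second store is planned.
        System.out.println(planStores("PigStorage(',')", true));
        // Already in the interchange format: no duplicate store needed.
        System.out.println(planStores("InterStorage", true));
    }
}
```

The alternative (option 1, making ReadScalars use the corresponding Loader) would instead thread the user's LoadFunc into the scalar read, which is harder because the scalar reader is wired up after the user's storer choice is already compiled into the first job.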