[jira] Commented: (PIG-1166) A bit change of the interface of Tuple DataBag ( make the set and append method return this)

2009-12-21 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12793118#action_12793118
 ] 

Jeff Zhang commented on PIG-1166:
-

A better example to illustrate my idea is building a DataBag.

The current method:

{code}
BagFactory BAGFACTORY = BagFactory.getInstance();
TupleFactory TUPLEFACTORY = TupleFactory.getInstance();
DataBag bag = BAGFACTORY.newDefaultBag();
Tuple tuple_1 = TUPLEFACTORY.newTuple(1);
tuple_1.set(0, item_1);
bag.add(tuple_1);
Tuple tuple_2 = TUPLEFACTORY.newTuple(1);
tuple_2.set(0, item_2);
bag.add(tuple_2);
{code}

If we change the interface, the code can be written as follows:
{code}
BagFactory BAGFACTORY = BagFactory.getInstance();
TupleFactory TUPLEFACTORY = TupleFactory.getInstance();
DataBag bag = BAGFACTORY.newDefaultBag();
bag.add(TUPLEFACTORY.newTuple(1).set(0,item_1)).add(TUPLEFACTORY.newTuple(1).set(0,item_2));
{code}

In my opinion, the second snippet is more readable and concise.
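
For reference, the "return this" pattern proposed above can be sketched outside of Pig. The classes below (FluentTuple, FluentBag) are hypothetical stand-ins, not Pig's Tuple and DataBag; they only show how chaining becomes possible once set and add return the receiver.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a tuple whose set() returns the receiver.
final class FluentTuple {
    private final Object[] fields;

    FluentTuple(int size) {
        this.fields = new Object[size];
    }

    // Returning 'this' instead of void is what makes chaining possible.
    FluentTuple set(int index, Object value) {
        fields[index] = value;
        return this;
    }

    Object get(int index) {
        return fields[index];
    }
}

// Hypothetical stand-in for a bag whose add() returns the receiver.
final class FluentBag {
    private final List<FluentTuple> tuples = new ArrayList<>();

    FluentBag add(FluentTuple tuple) {
        tuples.add(tuple);
        return this;
    }

    FluentTuple get(int index) {
        return tuples.get(index);
    }

    int size() {
        return tuples.size();
    }
}
```

With these stand-ins, a two-tuple bag is built in one expression: new FluentBag().add(new FluentTuple(1).set(0, "item_1")).add(new FluentTuple(1).set(0, "item_2")).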

 A bit change of the interface of Tuple  DataBag ( make the set and append 
 method return this)
 --

 Key: PIG-1166
 URL: https://issues.apache.org/jira/browse/PIG-1166
 Project: Pig
  Issue Type: Improvement
Reporter: Jeff Zhang
Priority: Minor

 When people write unit tests for UDFs, they always need to build a tuple or 
 bag. If we change the interfaces of Tuple and DataBag so that the set and 
 append methods return this, the amount of code decreases. For example, people 
 currently have to write the following code to build a Tuple:
 {code}
 Tuple tuple=TupleFactory.getInstance().newTuple(3);
 tuple.set(0,item_0);
 tuple.set(1,item_1);
 tuple.set(2,item_2);
 {code}
 If we change the interface so that the set and append methods return this, we 
 can rewrite the above code like this:
 {code}
 Tuple tuple=TupleFactory.getInstance().newTuple(3);
 tuple.set(0,item_0).set(1,item_1).set(2,item_2);
 {code}
 This interface change won't cause a backward-compatibility problem, and I think 
 there's no performance problem either.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-919) Type mismatch in key from map: expected org.apache.pig.impl.io.NullableBytesWritable, recieved org.apache.pig.impl.io.NullableText when doing simple group

2009-12-21 Thread George Mavromatis (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12793121#action_12793121
 ] 

George Mavromatis commented on PIG-919:
---

 This was closed Fixed when it should have been closed Won't Fix or Later.

Can we then resolve it with the correct resolution (Won't Fix or Later)?

 Where are you seeing this error?

I am seeing it in product code that I cannot refer to here. It has occurred 
twice, one instance of which was referred to this ticket and closed. I will 
send you more information offline.

 Type mismatch in key from map: expected 
 org.apache.pig.impl.io.NullableBytesWritable, recieved 
 org.apache.pig.impl.io.NullableText when doing simple group
 --

 Key: PIG-919
 URL: https://issues.apache.org/jira/browse/PIG-919
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.3.0
Reporter: Viraj Bhat
Assignee: Viraj Bhat
 Fix For: 0.3.0

 Attachments: GenHashList.java, mapscript.pig, mymapudf.jar


 I have a Pig script, which takes in a student file and generates a bag of 
 maps.  I later want to group on the value of the key name0 which 
 corresponds to the first name of the student.
 {code}
 register mymapudf.jar;
 data = LOAD '/user/viraj/studenttab10k' AS 
 (somename:chararray,age:long,marks:float);
 genmap = foreach data generate flatten(mymapudf.GenHashList(somename,' ')) as 
 bp:map[], age, marks;
 getfirstnames = foreach genmap generate bp#'name0' as firstname, age, marks;
 filternonnullfirstnames = filter getfirstnames by firstname is not null;
 groupgenmap = group filternonnullfirstnames by firstname;
 dump groupgenmap;
 {code}
 When I execute this code, I get an error in the Map Phase:
 ===
 java.io.IOException: Type mismatch in key from map: expected 
 org.apache.pig.impl.io.NullableBytesWritable, recieved 
 org.apache.pig.impl.io.NullableText
   at 
 org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:415)
   at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.collect(PigMapReduce.java:108)
   at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:253)
   at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:242)
   at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.map(PigMapReduce.java:93)
   at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47)
   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:227)
   at 
 org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2209)
 ===




[jira] Commented: (PIG-1149) Allow instantiation of SampleLoaders with parametrized LoadFuncs

2009-12-21 Thread Thejas M Nair (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12793269#action_12793269
 ] 

Thejas M Nair commented on PIG-1149:


+1 to the lsr branch version.
But the FIXME comment in the test case is not correct. There does not have to 
be  1 samples sampled for every map, if the number of rows are very small. 
Though this behavior differs from the earlier trunk version of the 
Poisson sampler, it satisfies the requirements described in 
http://wiki.apache.org/pig/PigSampler and PIG-1062.
I can remove the FIXME comment as part of the patch I am going to submit to fix 
the other test case.


 Allow instantiation of SampleLoaders with parametrized LoadFuncs
 

 Key: PIG-1149
 URL: https://issues.apache.org/jira/browse/PIG-1149
 Project: Pig
  Issue Type: Bug
Reporter: Dmitriy V. Ryaboy
Assignee: Dmitriy V. Ryaboy
Priority: Minor
 Fix For: 0.7.0

 Attachments: pig_1149.patch, pig_1149_lsr-branch.patch


 Currently, it is not possible to instantiate a SampleLoader with something 
 like PigStorage(':').  We should allow passing parameters to the loaders 
 being sampled.
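
 To make the request concrete: a loader spec such as PigStorage(':') has to be split into a class name and constructor arguments before the wrapped loader can be instantiated. The sketch below is a minimal illustration under that assumption; LoaderSpec and its parse method are hypothetical names and are not the code in the attached patches.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: splits "ClassName('arg1','arg2')" into its parts.
final class LoaderSpec {
    // Group 1: class name; group 2: raw text between the parentheses, if present.
    private static final Pattern SPEC =
        Pattern.compile("^\\s*([\\w.$]+)\\s*(?:\\((.*)\\))?\\s*$");
    // Matches one single-quoted argument; escaped quotes are not handled.
    private static final Pattern ARG = Pattern.compile("'([^']*)'");

    final String className;
    final List<String> args;

    private LoaderSpec(String className, List<String> args) {
        this.className = className;
        this.args = args;
    }

    static LoaderSpec parse(String spec) {
        Matcher m = SPEC.matcher(spec);
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a loader spec: " + spec);
        }
        List<String> args = new ArrayList<>();
        if (m.group(2) != null) {
            Matcher a = ARG.matcher(m.group(2));
            while (a.find()) {
                args.add(a.group(1));
            }
        }
        return new LoaderSpec(m.group(1), args);
    }
}
```

 A sampler that receives the parsed class name and argument list could then construct the wrapped loader reflectively instead of assuming a no-argument constructor.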




[jira] Resolved: (PIG-1110) Handle compressed file formats -- Gz, BZip with the new proposal

2009-12-21 Thread Pradeep Kamath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pradeep Kamath resolved PIG-1110.
-

  Resolution: Fixed
Hadoop Flags: [Incompatible change, Reviewed]  (was: [Incompatible change])

+1, Patch committed to load-store-redesign branch - thanks Richard!

 Handle compressed file formats -- Gz, BZip with the new proposal
 

 Key: PIG-1110
 URL: https://issues.apache.org/jira/browse/PIG-1110
 Project: Pig
  Issue Type: Sub-task
Reporter: Richard Ding
Assignee: Richard Ding
 Attachments: PIG-1110.patch, PIG-1110.patch, PIG_1110_Jeff.patch







[jira] Commented: (PIG-1094) Fix unit tests corresponding to source changes so far

2009-12-21 Thread Pradeep Kamath (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12793286#action_12793286
 ] 

Pradeep Kamath commented on PIG-1094:
-

+1, PIG-1094_4.patch checked in, thanks Richard!

 Fix unit tests corresponding to source changes so far
 -

 Key: PIG-1094
 URL: https://issues.apache.org/jira/browse/PIG-1094
 Project: Pig
  Issue Type: Sub-task
Reporter: Pradeep Kamath
Assignee: Pradeep Kamath
 Attachments: PIG-1094.patch, PIG-1094_2.patch, PIG-1094_3.patch, 
 PIG-1094_4.patch


 The check-ins so far on the load-store-redesign branch have not addressed unit 
 test failures due to interface changes. This jira tracks the task of 
 making the common-case unit tests work with the new interfaces. Some aspects 
 of the new proposal, such as using the LoadCaster interface for casting and 
 making local mode work, have not been completed yet. Tests that fail for those 
 reasons will not be fixed in this jira; they will be addressed in the jiras 
 corresponding to those tasks.




[jira] Updated: (PIG-1165) Signature of loader does not set correctly for order by

2009-12-21 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-1165:


Attachment: PIG-1165-1.patch

 Signature of loader does not set correctly for order by
 ---

 Key: PIG-1165
 URL: https://issues.apache.org/jira/browse/PIG-1165
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.6.0
Reporter: Daniel Dai
Assignee: Daniel Dai
 Fix For: 0.6.0

 Attachments: PIG-1165-1.patch


 In Pig, we need to set a signature for each LoadFunc. Currently, we use the alias 
 of the LOAD statement in the Pig script as the signature of the LoadFunc. One use 
 case is that a LoadFunc uses the signature to retrieve the pruned columns of 
 its specific loader. However, for an order by statement, we do not set the 
 signature for the loader correctly, so we do not prune the loader 
 correctly. 
 For example, the following script produces the wrong result:
 {code}
 a = load '1.txt' as (a0, a1);
 b = order a by a1;
 c = order b by a1;
 d = foreach c generate a1;
 dump d;
 {code}
 1.txt:
 {code}
 1   a
 2   b
 3   c
 6   d
 5   e
 {code}
 expected result:
 a
 b
 c
 d
 e
 current result:
 1
 2
 3
 5
 6
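
 As a cross-check of the expected result, the same pipeline can be replayed in plain Java: sort the rows by the second column, then project that column. This is only an illustration of what the script should return; it involves no Pig code, and OrderByCheck is a made-up class name.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class OrderByCheck {
    // Rows of 1.txt as (a0, a1) pairs, in file order.
    static final String[][] ROWS = {
        {"1", "a"}, {"2", "b"}, {"3", "c"}, {"6", "d"}, {"5", "e"}
    };

    // Equivalent of "order by a1" followed by "generate a1".
    // Sorting twice on the same key, as the script does, gives the same order.
    static List<String> orderThenProject() {
        String[][] sorted = ROWS.clone();
        Arrays.sort(sorted, Comparator.comparing((String[] row) -> row[1]));
        List<String> out = new ArrayList<>();
        for (String[] row : sorted) {
            out.add(row[1]);
        }
        return out;
    }
}
```

 OrderByCheck.orderThenProject() yields a, b, c, d, e, which matches the expected result listed above rather than the values of the first column.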




[jira] Updated: (PIG-1165) Signature of loader does not set correctly for order by

2009-12-21 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-1165:


Status: Patch Available  (was: Open)

 Signature of loader does not set correctly for order by
 ---

 Key: PIG-1165
 URL: https://issues.apache.org/jira/browse/PIG-1165
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.6.0
Reporter: Daniel Dai
Assignee: Daniel Dai
 Fix For: 0.6.0

 Attachments: PIG-1165-1.patch


 In Pig, we need to set a signature for each LoadFunc. Currently, we use the alias 
 of the LOAD statement in the Pig script as the signature of the LoadFunc. One use 
 case is that a LoadFunc uses the signature to retrieve the pruned columns of 
 its specific loader. However, for an order by statement, we do not set the 
 signature for the loader correctly, so we do not prune the loader 
 correctly. 
 For example, the following script produces the wrong result:
 {code}
 a = load '1.txt' as (a0, a1);
 b = order a by a1;
 c = order b by a1;
 d = foreach c generate a1;
 dump d;
 {code}
 1.txt:
 {code}
 1   a
 2   b
 3   c
 6   d
 5   e
 {code}
 expected result:
 a
 b
 c
 d
 e
 current result:
 1
 2
 3
 5
 6
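
 One way to picture the failure described in this issue is a lookup keyed by signature: pruning information is stored under the alias of the LOAD statement, and a loader that was handed a different signature does not find it. The sketch below is a loose model under that assumption; PrunedColumnRegistry and the signatures used with it are hypothetical and do not reflect Pig's actual classes.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical registry mapping a loader signature to its required-column mask.
final class PrunedColumnRegistry {
    private final Map<String, boolean[]> requiredBySignature = new HashMap<>();

    void register(String signature, boolean[] requiredColumns) {
        requiredBySignature.put(signature, requiredColumns.clone());
    }

    // Returns the mask stored for this signature, or null if none was stored.
    boolean[] lookup(String signature) {
        boolean[] mask = requiredBySignature.get(signature);
        return mask == null ? null : mask.clone();
    }
}
```

 In this model, a loader registered under signature "a" with only a1 required would load a single column, while a loader asking under any other signature gets null and would fall back to loading every column.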




[jira] Commented: (PIG-1143) Poisson Sample Loader should compute the number of samples required only once

2009-12-21 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12793296#action_12793296
 ] 

Olga Natkovich commented on PIG-1143:
-

+1 on the code changes. There is an extra debug trace in the code that I will 
remove as part of the commit.

 Poisson Sample Loader should compute the number of samples required only once
 -

 Key: PIG-1143
 URL: https://issues.apache.org/jira/browse/PIG-1143
 Project: Pig
  Issue Type: Bug
Reporter: Sriranjan Manjunath
Assignee: Sriranjan Manjunath
 Attachments: PIG_1143.patch, PIG_1143.patch.1


 The current poisson sampler forces each of the maps to compute the sample 
 number. This is redundant and causes issues when a large directory is 
 specified in the join. The sampler should be changed to calculate the sample 
 count only once and this information should be shared with the remaining 
 mappers.
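
 The proposed behaviour, computing the sample count once and letting every mapper read the stored value, can be sketched as a memoized computation. This is a single-process illustration only; in a real job the value would have to reach the mappers some other way (for instance through the job configuration), and SampleCountHolder is a hypothetical name, not the class in the patch.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.LongSupplier;

// Hypothetical holder that runs an expensive computation at most once.
final class SampleCountHolder {
    private final LongSupplier expensiveComputation;
    private final AtomicInteger computations = new AtomicInteger();
    private Long cached; // null until the first call to get()

    SampleCountHolder(LongSupplier expensiveComputation) {
        this.expensiveComputation = expensiveComputation;
    }

    // The first caller pays for the computation; later callers reuse the result.
    synchronized long get() {
        if (cached == null) {
            cached = expensiveComputation.getAsLong();
            computations.incrementAndGet();
        }
        return cached;
    }

    int timesComputed() {
        return computations.get();
    }
}
```

 However many callers ask for the count, the supplier runs a single time, which is the property the issue asks for.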




Locking trunk for commits to merge on load-store-redesign branch

2009-12-21 Thread Pradeep Kamath
Hi,

  PIG-1143 and PIG-1149 need special handling on the load-store-redesign
branch. PIG-1143 should not be applied to the branch since the code is
not applicable and for PIG-1149 there is a separate patch. I am
beginning a merge of load-store-redesign branch with current head of
trunk. These two patches will be committed to trunk once I complete the
merge and in the svn commit message for the merge on load-store-redesign
branch, I will record the revision after these two patches are
committed. This is so that the next time we merge from trunk to
load-store-redesign branch we merge from the point after these patches.
This mail is to let all committers know that they should hold off
commits till this process is done at which point I will send an all
clear email.

 

Thanks,

Pradeep



[jira] Updated: (PIG-1143) Poisson Sample Loader should compute the number of samples required only once

2009-12-21 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-1143:


Fix Version/s: 0.6.0

 Poisson Sample Loader should compute the number of samples required only once
 -

 Key: PIG-1143
 URL: https://issues.apache.org/jira/browse/PIG-1143
 Project: Pig
  Issue Type: Bug
Reporter: Sriranjan Manjunath
Assignee: Sriranjan Manjunath
 Fix For: 0.6.0

 Attachments: PIG_1143.patch, PIG_1143.patch.1


 The current poisson sampler forces each of the maps to compute the sample 
 number. This is redundant and causes issues when a large directory is 
 specified in the join. The sampler should be changed to calculate the sample 
 count only once and this information should be shared with the remaining 
 mappers.




[jira] Commented: (PIG-1149) Allow instantiation of SampleLoaders with parametrized LoadFuncs

2009-12-21 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12793321#action_12793321
 ] 

Olga Natkovich commented on PIG-1149:
-

patch pig_1149.patch is committed to the trunk.

 Allow instantiation of SampleLoaders with parametrized LoadFuncs
 

 Key: PIG-1149
 URL: https://issues.apache.org/jira/browse/PIG-1149
 Project: Pig
  Issue Type: Bug
Reporter: Dmitriy V. Ryaboy
Assignee: Dmitriy V. Ryaboy
Priority: Minor
 Fix For: 0.7.0

 Attachments: pig_1149.patch, pig_1149_lsr-branch.patch


 Currently, it is not possible to instantiate a SampleLoader with something 
 like PigStorage(':').  We should allow passing parameters to the loaders 
 being sampled.




[jira] Commented: (PIG-1143) Poisson Sample Loader should compute the number of samples required only once

2009-12-21 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12793322#action_12793322
 ] 

Olga Natkovich commented on PIG-1143:
-

patch committed to the trunk. Will commit to 0.6 branch tomorrow.

 Poisson Sample Loader should compute the number of samples required only once
 -

 Key: PIG-1143
 URL: https://issues.apache.org/jira/browse/PIG-1143
 Project: Pig
  Issue Type: Bug
Reporter: Sriranjan Manjunath
Assignee: Sriranjan Manjunath
 Fix For: 0.6.0

 Attachments: PIG_1143.patch, PIG_1143.patch.1


 The current poisson sampler forces each of the maps to compute the sample 
 number. This is redundant and causes issues when a large directory is 
 specified in the join. The sampler should be changed to calculate the sample 
 count only once and this information should be shared with the remaining 
 mappers.




DONE - trunk open for commits RE: Locking trunk for commits to merge on load-store-redesign branch

2009-12-21 Thread Pradeep Kamath
Hi,
  The process outlined below is now completed and the trunk is open for
commits which do not conflict with load-store-redesign branch.

Thanks,
Pradeep

-Original Message-
From: Pradeep Kamath [mailto:prade...@yahoo-inc.com] 
Sent: Monday, December 21, 2009 10:50 AM
To: pig-dev@hadoop.apache.org
Subject: Locking trunk for commits to merge on load-store-redesign
branch

Hi,

  PIG-1143 and PIG-1149 need special handling on the load-store-redesign
branch. PIG-1143 should not be applied to the branch since the code is
not applicable and for PIG-1149 there is a separate patch. I am
beginning a merge of load-store-redesign branch with current head of
trunk. These two patches will be committed to trunk once I complete the
merge and in the svn commit message for the merge on load-store-redesign
branch, I will record the revision after these two patches are
committed. This is so that the next time we merge from trunk to
load-store-redesign branch we merge from the point after these patches.
This mail is to let all committers know that they should hold off
commits till this process is done at which point I will send an all
clear email.

 

Thanks,

Pradeep



[jira] Updated: (PIG-1149) Allow instantiation of SampleLoaders with parametrized LoadFuncs

2009-12-21 Thread Pradeep Kamath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pradeep Kamath updated PIG-1149:


  Resolution: Fixed
Hadoop Flags: [Reviewed]
  Status: Resolved  (was: Patch Available)

Patch for lsr branch also committed, thanks Dmitriy!

 Allow instantiation of SampleLoaders with parametrized LoadFuncs
 

 Key: PIG-1149
 URL: https://issues.apache.org/jira/browse/PIG-1149
 Project: Pig
  Issue Type: Bug
Reporter: Dmitriy V. Ryaboy
Assignee: Dmitriy V. Ryaboy
Priority: Minor
 Fix For: 0.7.0

 Attachments: pig_1149.patch, pig_1149_lsr-branch.patch


 Currently, it is not possible to instantiate a SampleLoader with something 
 like PigStorage(':').  We should allow passing parameters to the loaders 
 being sampled.




[jira] Updated: (PIG-1158) pig command line -M option doesn't support table union correctly (comma seperated paths)

2009-12-21 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-1158:


Resolution: Fixed
Status: Resolved  (was: Patch Available)

patch committed to the trunk. Thanks, Richard

 pig command line -M option doesn't support table union correctly (comma 
 seperated paths)
 

 Key: PIG-1158
 URL: https://issues.apache.org/jira/browse/PIG-1158
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.6.0
Reporter: Jing Huang
Assignee: Richard Ding
 Fix For: 0.7.0

 Attachments: PIG-1158.patch


 For example, load (1.txt,2.txt) USING 
 org.apache.hadoop.zebra.pig.TableLoader()
 I see this error on standard output:
 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2100: 
 hdfs://gsbl90380.blue.ygrid.yahoo.com/user/hadoopqa/1.txt,2.txt does not 
 exist.
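
 The error message shows the whole string 1.txt,2.txt being resolved as one path. A fix along the lines the title suggests would split the location on commas and qualify each piece separately. The sketch below illustrates that idea with plain strings; CommaPaths, qualifyAll and the hdfs://namenode/user/demo base used with it are hypothetical, and this is not the attached patch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper that qualifies each comma-separated path on its own.
final class CommaPaths {
    static List<String> qualifyAll(String base, String location) {
        String prefix = base.endsWith("/") ? base : base + "/";
        List<String> out = new ArrayList<>();
        for (String part : location.split(",")) {
            String path = part.trim();
            if (path.isEmpty()) {
                continue; // skip empty segments such as a trailing comma
            }
            // Paths that already carry a scheme or are absolute are left untouched.
            boolean alreadyQualified = path.contains("://") || path.startsWith("/");
            out.add(alreadyQualified ? path : prefix + path);
        }
        return out;
    }
}
```

 Note that a plain split on commas would also cut through brace globs such as {a,b}, so a real implementation would need to treat those separately.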




[jira] Updated: (PIG-1141) Make streaming work with the new load-store interfaces

2009-12-21 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1141:
--

Attachment: PIG-1141.patch

This patch made changes following Alan's comments.

 Make streaming work with the new load-store interfaces 
 ---

 Key: PIG-1141
 URL: https://issues.apache.org/jira/browse/PIG-1141
 Project: Pig
  Issue Type: Sub-task
Reporter: Richard Ding
Assignee: Richard Ding
 Attachments: PIG-1141.patch, PIG-1141.patch, PIG-1141.patch







[jira] Commented: (PIG-1166) A bit change of the interface of Tuple DataBag ( make the set and append method return this)

2009-12-21 Thread Dmitriy V. Ryaboy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12793350#action_12793350
 ] 

Dmitriy V. Ryaboy commented on PIG-1166:


+1 to the idea, and we don't have to stop at Tuple and Bag factories. There are 
plenty of other places that this can be useful in (like all of the Logical and 
Physical operators).

 A bit change of the interface of Tuple  DataBag ( make the set and append 
 method return this)
 --

 Key: PIG-1166
 URL: https://issues.apache.org/jira/browse/PIG-1166
 Project: Pig
  Issue Type: Improvement
Reporter: Jeff Zhang
Priority: Minor

 When people write unit tests for UDFs, they always need to build a tuple or 
 bag. If we change the interfaces of Tuple and DataBag so that the set and 
 append methods return this, the amount of code decreases. For example, people 
 currently have to write the following code to build a Tuple:
 {code}
 Tuple tuple=TupleFactory.getInstance().newTuple(3);
 tuple.set(0,item_0);
 tuple.set(1,item_1);
 tuple.set(2,item_2);
 {code}
 If we change the interface so that the set and append methods return this, we 
 can rewrite the above code like this:
 {code}
 Tuple tuple=TupleFactory.getInstance().newTuple(3);
 tuple.set(0,item_0).set(1,item_1).set(2,item_2);
 {code}
 This interface change won't cause a backward-compatibility problem, and I think 
 there's no performance problem either.




[jira] Updated: (PIG-1159) merge join right side table does not support comma seperated paths

2009-12-21 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-1159:


Resolution: Fixed
Status: Resolved  (was: Patch Available)

patch committed. Thanks, Richard

 merge join right side table does not support comma seperated paths
 --

 Key: PIG-1159
 URL: https://issues.apache.org/jira/browse/PIG-1159
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.6.0
Reporter: Jing Huang
Assignee: Richard Ding
 Fix For: 0.7.0

 Attachments: PIG-1159.patch


 For example, this is my script (join_jira1.pig):
 register /grid/0/dev/hadoopqa/jars/zebra.jar;
 --a1 = load '1.txt' as (a:int, 
 b:float,c:long,d:double,e:chararray,f:bytearray,r1(f1:chararray,f2:chararray),m1:map[]);
 --a2 = load '2.txt' as (a:int, 
 b:float,c:long,d:double,e:chararray,f:bytearray,r1(f1:chararray,f2:chararray),m1:map[]);
 --sort1 = order a1 by a parallel 6;
 --sort2 = order a2 by a parallel 5;
 --store sort1 into 'asort1' using 
 org.apache.hadoop.zebra.pig.TableStorer('[a,b,c,d]');
 --store sort2 into 'asort2' using 
 org.apache.hadoop.zebra.pig.TableStorer('[a,b,c,d]');
 --store sort1 into 'asort3' using 
 org.apache.hadoop.zebra.pig.TableStorer('[a,b,c,d]');
 --store sort2 into 'asort4' using 
 org.apache.hadoop.zebra.pig.TableStorer('[a,b,c,d]');
 joinl = LOAD 'asort1,asort2' USING 
 org.apache.hadoop.zebra.pig.TableLoader('a,b,c,d', 'sorted');
 joinr = LOAD 'asort3,asort4' USING 
 org.apache.hadoop.zebra.pig.TableLoader('a,b,c,d', 'sorted');
 joina = join joinl by a, joinr by a using merge ;
 dump joina;
 ==
 here is the log:
 Backend error message
 -
 java.lang.IllegalArgumentException: Pathname 
 /user/hadoopqa/asort3,hdfs:/gsbl90380.blue.ygrid.yahoo.com/user/hadoopqa/asort4
  from 
 hdfs://gsbl90380.blue.ygrid.yahoo.com/user/hadoopqa/asort3,hdfs:/gsbl90380.blue.ygrid.yahoo.com/user/hadoopqa/asort4
  is not a valid DFS filename.
 at 
 org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:158)
 at 
 org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:453)
 at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:648)
 at 
 org.apache.pig.backend.hadoop.datastorage.HDataStorage.isContainer(HDataStorage.java:203)
 at 
 org.apache.pig.backend.hadoop.datastorage.HDataStorage.asElement(HDataStorage.java:131)
 at 
 org.apache.pig.backend.hadoop.datastorage.HDataStorage.asElement(HDataStorage.java:147)
 at 
 org.apache.pig.impl.io.FileLocalizer.fullPath(FileLocalizer.java:534)
 at org.apache.pig.impl.io.FileLocalizer.open(FileLocalizer.java:338)
 at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMergeJoin.seekInRightStream(POMergeJoin.java:398)
 at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMergeJoin.getNext(POMergeJoin.java:184)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:253)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:244)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapOnly$Map.map(PigMapOnly.java:65)
 at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
 at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
 at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
 at org.apache.hadoop.mapred.Child.main(Child.java:159)
 Pig Stack Trace
 ---
 ERROR 6015: During execution, encountered a Hadoop error.
 org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to 
 open iterator for alias joina
 at org.apache.pig.PigServer.openIterator(PigServer.java:482)
 at 
 org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:539)
 at 
 org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:241)
 at 
 org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:168)
 at 
 org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:144)
 at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:89)
 at org.apache.pig.Main.main(Main.java:386)
 Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 6015: 
 During execution, encountered a Hadoop error.
 at 
 .apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:158)
 at 
 .apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:453)
 at .apache.hadoop.fs.FileSystem.exists(FileSystem.java:648)at 
 

[jira] Commented: (PIG-1141) Make streaming work with the new load-store interfaces

2009-12-21 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12793354#action_12793354
 ] 

Alan Gates commented on PIG-1141:
-

+1, changes look good.

 Make streaming work with the new load-store interfaces 
 ---

 Key: PIG-1141
 URL: https://issues.apache.org/jira/browse/PIG-1141
 Project: Pig
  Issue Type: Sub-task
Reporter: Richard Ding
Assignee: Richard Ding
 Attachments: PIG-1141.patch, PIG-1141.patch, PIG-1141.patch







[jira] Commented: (PIG-1102) Collect number of spills per job

2009-12-21 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12793369#action_12793369
 ] 

Olga Natkovich commented on PIG-1102:
-

A few questions/comments on the patch:

(1) I think the count should default to 0, not -1.
(2) Does the increment of the count have to be combined with the warn statement? Does this 
mean that users will see this many warnings? If so, should we combine this with 
the spill message we already print?
(3) I thought we discussed incrementing per buffer, not per record, and 
approximating that based on the buffer size. I did not see code that does this.
(4) I don't think you correctly separated the bags that proactively spill from the 
bags that are spilled by the memory manager. All the bags created by 
DefaultBagFactory get registered with SpillableMemoryManager and belong to the 
second category.
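
Points (1) and (3) can be read as: start the counter at 0 and count spilled buffers rather than spilled records, estimating the number of buffers from the amount of data written. A minimal sketch of that reading follows; SpillCounter, the record-based buffer size, and the rounding rule are assumptions for illustration, not what the patch does.

```java
// Hypothetical counter that approximates spills per buffer instead of per record.
final class SpillCounter {
    private final long recordsPerBuffer;
    private long spilledRecords; // starts at 0, as suggested in (1)

    SpillCounter(long recordsPerBuffer) {
        if (recordsPerBuffer <= 0) {
            throw new IllegalArgumentException("recordsPerBuffer must be positive");
        }
        this.recordsPerBuffer = recordsPerBuffer;
    }

    void recordsSpilled(long count) {
        spilledRecords += count;
    }

    // Approximate number of buffers written: ceil(spilledRecords / recordsPerBuffer).
    long approximateSpills() {
        return (spilledRecords + recordsPerBuffer - 1) / recordsPerBuffer;
    }
}
```

With a buffer of 1000 records, spilling 2500 records is reported as 3 spills instead of 2500 increments.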


 Collect number of spills per job
 

 Key: PIG-1102
 URL: https://issues.apache.org/jira/browse/PIG-1102
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Sriranjan Manjunath
 Fix For: 0.7.0

 Attachments: PIG_1102.patch


 Memory shortage is one of the main performance issues in Pig. Knowing when we 
 spill to disk is useful for understanding query performance and also for 
 seeing how certain changes in Pig affect it.
 Other interesting stats to collect would be average CPU usage and max memory 
 usage, but I am not sure whether this information is easily retrievable.
 Using Hadoop counters for this would make sense.




[jira] Resolved: (PIG-1141) Make streaming work with the new load-store interfaces

2009-12-21 Thread Pradeep Kamath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pradeep Kamath resolved PIG-1141.
-

  Resolution: Fixed
Hadoop Flags: [Incompatible change, Reviewed]

Patch committed to load-store-redesign branch, thanks Richard!

 Make streaming work with the new load-store interfaces 
 ---

 Key: PIG-1141
 URL: https://issues.apache.org/jira/browse/PIG-1141
 Project: Pig
  Issue Type: Sub-task
Reporter: Richard Ding
Assignee: Richard Ding
 Attachments: PIG-1141.patch, PIG-1141.patch, PIG-1141.patch







[jira] Commented: (PIG-1167) [zebra] Zebra does not support Hadoop Globs

2009-12-21 Thread Yan Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12793419#action_12793419
 ] 

Yan Zhou commented on PIG-1167:
---

Zebra's TableLoader, the implementation of Pig's LoadFunc, does support globs by 
design, but the Map/Reduce interface does not. TableLoader just expands the glob 
and passes the paths to the Map/Reduce interface so that a union of the underlying 
Zebra tables is loaded. It looks like there is not enough test coverage in this area.
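
The expansion step described here, turning one glob into a list of concrete paths, can be illustrated for the brace form alone. The sketch handles only non-nested {a,b} alternatives and does no file-system matching; BraceGlob is a hypothetical class, the paths used with it are made up, and a real loader would more likely rely on Hadoop's FileSystem.globStatus.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical expander for non-nested {a,b} alternatives in a path pattern.
final class BraceGlob {
    static List<String> expand(String pattern) {
        int open = pattern.indexOf('{');
        int close = open < 0 ? -1 : pattern.indexOf('}', open);
        List<String> out = new ArrayList<>();
        if (open < 0 || close < 0) {
            out.add(pattern); // nothing to expand
            return out;
        }
        String head = pattern.substring(0, open);
        String tail = pattern.substring(close + 1);
        // Expand the first group, then recurse to handle any later groups.
        for (String alternative : pattern.substring(open + 1, close).split(",", -1)) {
            out.addAll(expand(head + alternative + tail));
        }
        return out;
    }
}
```

A group with a single alternative, such as {x}, simply expands to that one path.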

 [zebra] Zebra does not support Hadoop Globs
 ---

 Key: PIG-1167
 URL: https://issues.apache.org/jira/browse/PIG-1167
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Jay Tang

 Passing the following path to Zebra causes an error, but it works with Hadoop 
 directly: /projects/FETL/sample/ABF1/{2009120204}




[jira] Commented: (PIG-1167) [zebra] Zebra does not support Hadoop Globs

2009-12-21 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12793420#action_12793420
 ] 

Olga Natkovich commented on PIG-1167:
-

Just to clarify: Pig does support it for all loaders that use the default Pig 
slicer. Zebra, however, uses its own slicer, and that's why it is not getting 
this functionality for free.





[jira] Commented: (PIG-1165) Signature of loader does not set correctly for order by

2009-12-21 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12793428#action_12793428
 ] 

Hadoop QA commented on PIG-1165:


+1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12428644/PIG-1165-1.patch
  against trunk revision 892939.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 3 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

+1 core tests.  The patch passed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/149/testReport/
Findbugs warnings: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/149/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/149/console

This message is automatically generated.

 Signature of loader does not set correctly for order by
 ---

 Key: PIG-1165
 URL: https://issues.apache.org/jira/browse/PIG-1165
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.6.0
Reporter: Daniel Dai
Assignee: Daniel Dai
 Fix For: 0.6.0

 Attachments: PIG-1165-1.patch


 In Pig, we need to set a signature for each LoadFunc. Currently, we use the alias 
 of the LOAD statement in the Pig script as the signature of the LoadFunc. One use 
 case is that, in LoadFunc, we use the signature to retrieve the pruned columns of 
 each specific loader. However, for an order by statement, we do not set the 
 signature for the loader correctly, so the loader is not pruned correctly. 
 For example, the following script produces the wrong result:
 {code}
 a = load '1.txt' as (a0, a1);
 b = order a by a1;
 c = order b by a1;
 d = foreach c generate a1;
 dump d;
 {code}
 1.txt:
 {code}
 1   a
 2   b
 3   c
 6   d
 5   e
 {code}
 expected result:
 a
 b
 c
 d
 e
 current result:
 1
 2
 3
 5
 6
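
 One plausible reading of the output above, sketched in plain Java: the plan 
 expects the loader to emit only a1, so it reads field 0 of each row, but a 
 loader that looks up its pruned columns under the wrong signature finds 
 nothing and emits the full row, so field 0 is a0. All names here 
 (SignaturePruneSketch, REQUIRED, load) are hypothetical and are not Pig's 
 implementation.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of signature-keyed column pruning; not Pig code.
final class SignaturePruneSketch {

    // Column indexes the optimizer wants to keep, keyed by loader signature.
    static final Map<String, int[]> REQUIRED = new HashMap<>();

    // Keeps only the columns registered under the given signature. If nothing
    // is registered under that signature, the row is returned unpruned.
    static String[] load(String signature, String[] row) {
        int[] keep = REQUIRED.get(signature);
        if (keep == null) {
            return row;
        }
        String[] pruned = new String[keep.length];
        for (int i = 0; i < keep.length; i++) {
            pruned[i] = row[keep[i]];
        }
        return pruned;
    }

    public static void main(String[] args) {
        REQUIRED.put("a", new int[] {1});      // keep only a1 for alias "a"
        String[] row = {"1", "a"};             // first line of 1.txt
        // Downstream code reads field 0, assuming pruning already happened.
        System.out.println(load("a", row)[0]);      // prints a
        System.out.println(load("unset", row)[0]);  // prints 1
    }
}
```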




[jira] Commented: (PIG-1165) Signature of loader does not set correctly for order by

2009-12-21 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12793434#action_12793434
 ] 

Olga Natkovich commented on PIG-1165:
-

+1; changes look good!





[jira] Updated: (PIG-1165) Signature of loader does not set correctly for order by

2009-12-21 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-1165:


  Resolution: Fixed
Hadoop Flags: [Reviewed]
  Status: Resolved  (was: Patch Available)

Patch committed to both trunk and 0.6 branch.





[jira] Updated: (PIG-1153) [zebra] spliting columns at different levels in a complex record column into different column groups throws exception

2009-12-21 Thread Yan Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1153:
--

Status: Open  (was: Patch Available)

 [zebra] spliting columns at different levels in a complex record column into 
 different column groups throws exception
 -

 Key: PIG-1153
 URL: https://issues.apache.org/jira/browse/PIG-1153
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.7.0
Reporter: Xuefu Zhang
Assignee: Yan Zhou
 Attachments: PIG-1153.patch, PIG-1153.patch


 The following code sample:
   String strSch = "r1:record(f1:int, f2:int), r2:record(f5:int, r3:record(f3:float, f4))";
   String strStorage = "[r1.f1, r2.r3.f3, r2.f5]; [r1.f2, r2.r3.f4]";
   Partition p = new Partition(schema.toString(), strStorage, null);
 gives the following exception:
 org.apache.hadoop.zebra.parser.ParseException: Different Split Types Set on the same field: r2.f5




[jira] Updated: (PIG-1153) [zebra] spliting columns at different levels in a complex record column into different column groups throws exception

2009-12-21 Thread Yan Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1153:
--

Status: Patch Available  (was: Open)





[jira] Updated: (PIG-1164) [zebra]smoke test

2009-12-21 Thread Jing Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jing Huang updated PIG-1164:


Attachment: (was: smoke.patch)

 [zebra]smoke test
 -

 Key: PIG-1164
 URL: https://issues.apache.org/jira/browse/PIG-1164
 Project: Pig
  Issue Type: Test
Affects Versions: 0.6.0
Reporter: Jing Huang
 Fix For: 0.7.0


 Change the Zebra build.xml file to add a smoke target, 
 and add env.sh and a run script under the zebra/src/test/smoke dir.




[jira] Created: (PIG-1168) Dump produces wrong results

2009-12-21 Thread Ankur (JIRA)
Dump produces wrong results
---

 Key: PIG-1168
 URL: https://issues.apache.org/jira/browse/PIG-1168
 Project: Pig
  Issue Type: Bug
Reporter: Ankur


For a map-only job, dump just re-executes every Pig Latin statement from the 
beginning, assuming that they will produce the same result. This assumption is 
not valid if UDFs are invoked. Consider the following script:

raw = LOAD '$input' USING PigStorage() AS (text_string:chararray);
DUMP raw;

ccm = FOREACH raw GENERATE MyUDF(text_string);
DUMP ccm;

bug = FOREACH ccm GENERATE ccmObj;

DUMP bug;

The UDF MyUDF generates a tuple with one of the fields being a randomly 
generated UUID. So even though one would expect relations 'ccm' and 'bug' to 
contain identical data, they are different because of re-execution from the 
beginning. This breaks the application logic.
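
The effect can be reproduced outside Pig with a small plain-Java analogy. This 
is only a sketch: myUdf stands in for MyUDF (whose real code is not shown 
here), and runPipeline stands in for one full re-execution of the script. 
Running the pipeline twice yields different UUIDs, while reusing the result of 
a single run does not.

```java
import java.util.Arrays;
import java.util.List;
import java.util.UUID;
import java.util.stream.Collectors;

// Plain-Java analogy for illustration; this is not how Pig executes scripts.
final class ReExecutionSketch {

    // Stand-in for MyUDF: tags each input with a freshly generated random UUID.
    static String myUdf(String text) {
        return text + ":" + UUID.randomUUID();
    }

    // Stand-in for executing the script from the beginning.
    static List<String> runPipeline(List<String> raw) {
        return raw.stream().map(ReExecutionSketch::myUdf).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> raw = Arrays.asList("x", "y");

        // Each DUMP re-executes the pipeline, so the UUIDs are regenerated.
        List<String> ccm = runPipeline(raw);
        List<String> bug = runPipeline(raw);
        System.out.println("re-executed results equal: " + ccm.equals(bug));

        // Reusing one materialized result keeps the data identical.
        List<String> reused = ccm;
        System.out.println("materialized results equal: " + ccm.equals(reused));
    }
}
```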





[jira] Commented: (PIG-1153) [zebra] spliting columns at different levels in a complex record column into different column groups throws exception

2009-12-21 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12793522#action_12793522
 ] 

Hadoop QA commented on PIG-1153:


+1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12428683/PIG-1153.patch
  against trunk revision 893053.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 5 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

+1 core tests.  The patch passed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/150/testReport/
Findbugs warnings: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/150/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/150/console

This message is automatically generated.

