[jira] Commented: (PIG-794) Use Avro serialization in Pig

2010-08-31 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12904551#action_12904551
 ] 

Jeff Zhang commented on PIG-794:


I did some experiments with Avro; AvroStorage_2.patch contains the detailed 
implementation.

Here I use Avro as the data storage between map-reduce jobs, replacing 
InterStorage (which is already optimized compared to BinStorage).
I use a simple Pig script which will be translated into 2 map-reduce jobs:
{code}
a = load '/a.txt';
b = load '/b.txt';
c = join a by $0, b by $0;
d = group c by $0;
dump d;
{code}

The following table shows my experiment result (1 master + 3 slaves):

|| Storage || Time spent on job_1 || Output size of job_1 || Mapper task number of job_2 || Time spent on job_2 || Total time spent on Pig script ||
| AvroStorage | 5 min 57 sec | 7.97 GB | 120 | 16 min 50 sec | 22 min 47 sec |
| InterStorage | 4 min 33 sec | 9.55 GB | 143 | 17 min 17 sec | 21 min 50 sec |

The experiment shows that AvroStorage has a more compact format than 
InterStorage (according to the output size of job_1), but more serialization 
overhead (according to the time spent on job_1). I think the time spent on 
job_2 using AvroStorage is less than that using InterStorage because the input 
of job_2 (the output of job_1) is much smaller with AvroStorage, so it needs 
fewer mapper tasks.
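A quick arithmetic check of that explanation (the numbers are copied from the table; the proportionality assumption is mine, not stated in the ticket): mappers are allocated one per input split, so with a fixed split size job_2's mapper count should shrink in roughly the same ratio as job_1's output size.

```java
// Hypothetical sanity check, not part of any attached patch: with a fixed
// split size, mapper count should be roughly proportional to input size.
public class MapperRatio {
    // AvroStorage vs. InterStorage output size of job_1 (GB)
    public static double outputRatio() { return 7.97 / 9.55; }
    // AvroStorage vs. InterStorage mapper task number of job_2
    public static double mapperRatio() { return 120.0 / 143.0; }

    public static void main(String[] args) {
        // Both ratios come out near 0.83, consistent with the explanation above.
        System.out.printf("output ratio=%.3f, mapper ratio=%.3f%n",
                          outputRatio(), mapperRatio());
    }
}
```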

Overall, AvroStorage is not as good as expected.
One reason may be that I do not use Avro's API correctly (I hope the Avro folks 
can review my code); another may be that Avro's serialization performance is 
not that good.
BTW, I use Avro trunk.


 Use Avro serialization in Pig
 -

 Key: PIG-794
 URL: https://issues.apache.org/jira/browse/PIG-794
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.2.0
Reporter: Rakesh Setty
Assignee: Dmitriy V. Ryaboy
 Attachments: avro-0.1-dev-java_r765402.jar, AvroStorage.patch, 
 AvroStorage_2.patch, jackson-asl-0.9.4.jar, PIG-794.patch


 We would like to use Avro serialization in Pig to pass data between MR jobs 
 instead of the current BinStorage. Attached is an implementation of 
 AvroBinStorage which performs significantly better compared to BinStorage on 
 our benchmarks.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-794) Use Avro serialization in Pig

2010-08-31 Thread Jeff Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Zhang updated PIG-794:
---

Attachment: AvroStorage_3.patch

 Use Avro serialization in Pig
 -

 Key: PIG-794
 URL: https://issues.apache.org/jira/browse/PIG-794
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.2.0
Reporter: Rakesh Setty
Assignee: Dmitriy V. Ryaboy
 Attachments: avro-0.1-dev-java_r765402.jar, AvroStorage.patch, 
 AvroStorage_2.patch, AvroStorage_3.patch, AvroTest.java, 
 jackson-asl-0.9.4.jar, PIG-794.patch





[jira] Commented: (PIG-794) Use Avro serialization in Pig

2010-08-31 Thread Dmitriy V. Ryaboy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12904615#action_12904615
 ] 

Dmitriy V. Ryaboy commented on PIG-794:
---

Jeff, have you checked out Scott Carey's work here: 
https://issues.apache.org/jira/browse/AVRO-592 ?

 Use Avro serialization in Pig
 -

 Key: PIG-794
 URL: https://issues.apache.org/jira/browse/PIG-794
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.2.0
Reporter: Rakesh Setty
Assignee: Dmitriy V. Ryaboy
 Attachments: avro-0.1-dev-java_r765402.jar, AvroStorage.patch, 
 AvroStorage_2.patch, AvroStorage_3.patch, AvroTest.java, 
 jackson-asl-0.9.4.jar, PIG-794.patch





[jira] Commented: (PIG-794) Use Avro serialization in Pig

2010-08-31 Thread Scott Carey (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12904674#action_12904674
 ] 

Scott Carey commented on PIG-794:
-

AVRO-592 creates an AvroStorage class for writing and reading M/R inputs and 
outputs but does not deal with intermediate M/R output.  I have some updates to 
that in progress that simplify it more.   Some aspects may be re-usable for 
this too.   

One thing to note is that Avro cannot be completely optimal for intermediate 
M/R output because the Hadoop API for this has a performance flaw that prevents 
efficient use of buffers and input/output streams there.  This would affect 
InterStorage as well though.

I'll take a look at the patch here and see if I can see any performance 
optimizations.
Note, that there are still several performance optimizations left to do in Avro 
itself.  For example, the BinaryDecoder has been optimized, but not the Encoder 
yet.

Also, I'm somewhat blocked on AVRO-592 due to the lack of Pig 0.7 Maven 
availability.



 Use Avro serialization in Pig
 -

 Key: PIG-794
 URL: https://issues.apache.org/jira/browse/PIG-794
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.2.0
Reporter: Rakesh Setty
Assignee: Dmitriy V. Ryaboy
 Attachments: avro-0.1-dev-java_r765402.jar, AvroStorage.patch, 
 AvroStorage_2.patch, AvroStorage_3.patch, AvroTest.java, 
 jackson-asl-0.9.4.jar, PIG-794.patch





[jira] Commented: (PIG-794) Use Avro serialization in Pig

2010-08-31 Thread Scott Carey (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12904680#action_12904680
 ] 

Scott Carey commented on PIG-794:
-

So, a quick summary of the differences I can see:

h5. Schema usage:
This creates a 'generic' Avro schema that can be used for any Pig data.  Each 
field in a Tuple is a union of all possible Pig types, and each Tuple is a list 
of fields.  It does not preserve the field names or types -- these are not 
important for intermediate data anyway.

AVRO-592 translates the Pig schema into a specific Avro schema that persists 
the field names and types, so that
STORE foo INTO 'file' USING AvroStorage();
will create a file from which
foo2 = LOAD 'file' USING AvroStorage();
will be able to re-create the exact schema for use in a script.

h5. Serialization and Deserialization:
This uses the same style as Avro's GenericRecord, which traverses the schema on 
the fly and writes fields for each record.

AVRO-592 constructs a state machine for each specific schema to optimally 
traverse a Tuple to serialize a record or create a Tuple when deserializing.  
This should be faster but the code is definitely harder to read (but easy to 
unit test -- AVRO-592 has 98% unit test code coverage on that portion).
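The difference can be sketched in plain Java (hypothetical names and a toy encoding, not AVRO-592's actual code): the schema is traversed once to "compile" an array of per-field writers, so the per-record loop does no type dispatch at all.

```java
import java.util.List;

// Toy sketch of precompiled per-field serialization: the switch on field type
// runs once per schema, not once per field per record.
public class PrecompiledWriters {
    public interface FieldWriter { void write(Object field, StringBuilder out); }

    // "Compile" a schema (here just a list of type tags) into writers once.
    public static FieldWriter[] compile(List<String> fieldTypes) {
        FieldWriter[] writers = new FieldWriter[fieldTypes.size()];
        for (int i = 0; i < writers.length; i++) {
            switch (fieldTypes.get(i)) {
                case "int":       writers[i] = (f, o) -> o.append("i:").append(f); break;
                case "chararray": writers[i] = (f, o) -> o.append("s:").append(f); break;
                default:          writers[i] = (f, o) -> o.append("?:").append(f); break;
            }
        }
        return writers;
    }

    // Per-record path: just run the precompiled writers, no per-field dispatch.
    public static String writeRecord(FieldWriter[] writers, Object[] tuple) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < writers.length; i++) {
            writers[i].write(tuple[i], out);
            out.append(';');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        FieldWriter[] w = compile(List.of("int", "chararray"));
        System.out.println(writeRecord(w, new Object[]{1, "a"}));
    }
}
```

The generic-record style instead re-inspects the schema for every field of every record, which is simpler to read but does redundant work per record.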


Integrating these should not be too hard.  I'll try and put my latest version 
of AVRO-592 up there late today or tomorrow.




 Use Avro serialization in Pig
 -

 Key: PIG-794
 URL: https://issues.apache.org/jira/browse/PIG-794
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.2.0
Reporter: Rakesh Setty
Assignee: Dmitriy V. Ryaboy
 Attachments: avro-0.1-dev-java_r765402.jar, AvroStorage.patch, 
 AvroStorage_2.patch, AvroStorage_3.patch, AvroTest.java, 
 jackson-asl-0.9.4.jar, PIG-794.patch





[jira] Commented: (PIG-794) Use Avro serialization in Pig

2010-08-31 Thread Doug Cutting (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12904687#action_12904687
 ] 

Doug Cutting commented on PIG-794:
--

A few comments about the attached code:
 - Is there a reason you don't subclass GenericDatumReader and 
GenericDatumWriter, overriding readRecord() and writeRecord()?  That would 
simplify things and better guarantee that you're conforming to a schema.  
Currently, e.g., your writeMap() doesn't appear to write a valid Avro map, 
writeArray() doesn't write a valid Avro array, etc., so the data written is not 
interoperable.
 - My guess is that a lot of time is spent in findSchemaIndex().  If that's 
right, you might improve this in various ways, e.g.:
 -- sort this by the most common types; the order in Pig's DataType.java is 
probably a good one.
 -- try using a static Map<Class,Integer> cache of indexes
 - Have you run this under a profiler?
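A minimal sketch of that cache idea (the names and the union-branch list are mine, for illustration; Pig's actual branch order would follow DataType.java): the union-branch index is computed once per runtime class, and every later field of that class is a single map lookup.

```java
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a static Map<Class,Integer> index cache for a union
// schema. Not thread-safe as written; a ConcurrentHashMap would fix that.
public class SchemaIndexCache {
    private static final Map<Class<?>, Integer> CACHE = new IdentityHashMap<>();

    // Stand-in for the union's branches, commonest types first.
    private static final List<Class<?>> UNION_BRANCHES =
        List.of(Integer.class, Long.class, String.class, Double.class);

    public static int findSchemaIndex(Object datum) {
        return CACHE.computeIfAbsent(datum.getClass(), c -> {
            // Linear scan happens only on the first datum of each class.
            for (int i = 0; i < UNION_BRANCHES.size(); i++)
                if (UNION_BRANCHES.get(i).isAssignableFrom(c)) return i;
            throw new IllegalArgumentException("no union branch for " + c);
        });
    }

    public static void main(String[] args) {
        System.out.println(findSchemaIndex("hello")); // resolves to the String branch
    }
}
```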

I don't see where this specifies an Avro schema for Pig data.  It's possible to 
construct a generic schema for all Pig data.  In this, a Bag should be a record 
with a single field, an array of Tuples.  A Tuple should be a record with a 
single field, an array of a union of all types.  Given such a schema, one could 
then write a DatumReader/Writer using the control logic of Pig's 
DataReaderWriter (i.e., a switch based on the value of DataType.findType()), 
but, instead of calling DataInput/Output methods, use Encoder/Decoder methods, 
with a ValidatingEncoder (at least while debugging) to ensure you conform to 
that schema.
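For illustration only (this is my sketch, not a schema from any attached patch, and the type list is abbreviated), such a generic schema could look roughly like this in Avro's JSON schema syntax, with Tuple a record holding one array of a union of all types, and Bag a record holding one array of Tuples:

```json
{
  "type": "record", "name": "Tuple",
  "fields": [
    { "name": "fields",
      "type": { "type": "array", "items": [
        "null", "boolean", "int", "long", "float", "double", "string", "bytes",
        "Tuple",
        { "type": "record", "name": "Bag",
          "fields": [
            { "name": "tuples",
              "type": { "type": "array", "items": "Tuple" } } ] }
      ] } }
  ]
}
```

A ValidatingEncoder constructed against a schema like this would then reject any write sequence that does not conform to it.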

Alternately, in Avro 1.4 (snapshot in Maven now, release this week, hopefully) 
Avro arrays can be arbitrary Collection implementations.  Bag already 
implements all of the required Collection methods -- clear(), add(), size(), 
iterator() -- so there's no reason I can see for Bag not to implement 
Collection<Tuple>.  So then one could subclass GenericData, GenericDatumReader 
& Writer, overriding:

{code}
protected boolean isRecord(Object datum) {
  return datum instanceof Tuple || datum instanceof Bag;
}
protected void writeRecord(Schema schema, Object datum, Encoder out) throws 
IOException {
  if (TUPLE_NAME.equals(schema.getFullName()))
    datum = ((Tuple)datum).getAll();
  writeArray(schema.getFields().get(0).getType(), datum, out);
}
protected Object readRecord(Object old, Schema expected, ResolvingDecoder in) 
throws IOException {
  Object result;
  if (TUPLE_NAME.equals(expected.getFullName())) {
    old = new ArrayList();
    result = new Tuple(old);
  } else {
    old = result = new Bag();
  }
  readArray(old, expected.getFields().get(0).getType(), in);
  return result;
}
{code}
Finally, if you knew the schema for the dataset being processed, rather than 
using a fully-general Pig schema, then you could translate that schema to an 
Avro schema.  In most cases this schema would not have a huge, 
compute-intensive-to-write union in it.  Or you might use something like what 
Scott's proposed in AVRO-592.


 Use Avro serialization in Pig
 -

 Key: PIG-794
 URL: https://issues.apache.org/jira/browse/PIG-794
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.2.0
Reporter: Rakesh Setty
Assignee: Dmitriy V. Ryaboy
 Attachments: avro-0.1-dev-java_r765402.jar, AvroStorage.patch, 
 AvroStorage_2.patch, AvroStorage_3.patch, AvroTest.java, 
 jackson-asl-0.9.4.jar, PIG-794.patch





[jira] Updated: (PIG-1429) Add Boolean Data Type to Pig

2010-08-31 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-1429:


Fix Version/s: (was: 0.8.0)

Unlinking because we are branching for release today

 Add Boolean Data Type to Pig
 

 Key: PIG-1429
 URL: https://issues.apache.org/jira/browse/PIG-1429
 Project: Pig
  Issue Type: New Feature
  Components: data
Affects Versions: 0.7.0
Reporter: Russell Jurney
Assignee: Russell Jurney
 Attachments: working_boolean.patch

   Original Estimate: 8h
  Remaining Estimate: 8h

 Pig needs a Boolean data type.  Pig-1097 is dependent on doing this.  
 I volunteer.  Is there anything beyond the work in src/org/apache/pig/data/ 
 plus unit tests to make this work?  




[jira] Updated: (PIG-1549) Provide utility to construct CNF form of predicates

2010-08-31 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-1549:


Fix Version/s: (was: 0.8.0)

Unlinking from 0.8 release since we are about to branch

 Provide utility to construct CNF form of predicates
 ---

 Key: PIG-1549
 URL: https://issues.apache.org/jira/browse/PIG-1549
 Project: Pig
  Issue Type: New Feature
  Components: impl
Affects Versions: 0.8.0
Reporter: Swati Jain
Assignee: Swati Jain
 Attachments: 0001-Add-CNF-utility-class.patch


 Provide utility to construct CNF form of predicates




[jira] Resolved: (PIG-1530) PIG Logical Optimization: Push LOFilter above LOCogroup

2010-08-31 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich resolved PIG-1530.
-

Resolution: Duplicate

Xuefu is addressing this issue as part of 
https://issues.apache.org/jira/browse/PIG-1575.

  PIG Logical Optimization: Push LOFilter above LOCogroup
 

 Key: PIG-1530
 URL: https://issues.apache.org/jira/browse/PIG-1530
 Project: Pig
  Issue Type: New Feature
  Components: impl
Reporter: Swati Jain
Assignee: Swati Jain
Priority: Minor
 Fix For: 0.8.0


 Consider the following:
 {noformat}
 A = load 'any file' USING PigStorage(',') as (a1:int,a2:int,a3:int);
 B = load 'any file' USING PigStorage(',') as (b1:int,b2:int,b3:int);
 G = COGROUP A by (a1,a2) , B by (b1,b2);
 D = Filter G by group.$0 + 5 > group.$1;
 explain D;
 {noformat}
 In the above example, LOFilter can be pushed above LOCogroup. Note there are 
 some tricky NULL issues to think about when the Cogroup is not of type INNER 
 (Similar to issues that need to be thought through when pushing LOFilter on 
 the right side of a LeftOuterJoin).
 Also note that typically the LOFilter in user programs will be below a 
 ForEach-Cogroup pair. To make this really useful, we need to also implement 
 LOFilter pushed across ForEach. 




[jira] Updated: (PIG-1494) PIG Logical Optimization: Use CNF in PushUpFilter

2010-08-31 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-1494:



Unlinking from 0.8 since we are about to branch for release

 PIG Logical Optimization: Use CNF in PushUpFilter
 -

 Key: PIG-1494
 URL: https://issues.apache.org/jira/browse/PIG-1494
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.7.0
Reporter: Swati Jain
Assignee: Swati Jain
Priority: Minor

 The PushUpFilter rule is not able to handle complicated boolean expressions.
 For example, the SplitFilter rule splits one LOFilter into two by AND, but it 
 is not able to split an LOFilter whose top-level operator is OR. For example:
 *ex script:*
 A = load 'file_a' USING PigStorage(',') as (a1:int,a2:int,a3:int);
 B = load 'file_b' USING PigStorage(',') as (b1:int,b2:int,b3:int);
 C = load 'file_c' USING PigStorage(',') as (c1:int,c2:int,c3:int);
 J1 = JOIN B by b1, C by c1;
 J2 = JOIN J1 by $0, A by a1;
 D = *Filter J2 by ( (c1 > 10) AND (a3+b3 > 10) ) OR (c2 == 5);*
 explain D;
 In the above example, the PushUpFilter is not able to push any filter 
 condition across any join, as the condition contains columns from all 
 branches (inputs). But if we convert this expression into Conjunctive Normal 
 Form (CNF), then we would be able to push the condition (c1 > 10) OR (c2 == 5) 
 below both join conditions. Here is the CNF expression for the highlighted 
 line:
 ( (c1 > 10) OR (c2 == 5) ) AND ( (a3+b3 > 10) OR (c2 == 5) )
 *Suggestion:* It would be a good idea to convert LOFilter's boolean 
 expression into CNF; it would then be easy to push parts (conjuncts) of the 
 LOFilter boolean expression selectively. We would also no longer require the 
 SplitFilter rule if we were to add this utility to the PushUpFilter rule 
 itself.
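The equivalence of the original predicate and its CNF rewrite can be verified by brute force (my own check, not from the ticket). Writing p = (c1 > 10), q = (a3+b3 > 10), r = (c2 == 5), the rewrite is just the distributive law (p AND q) OR r = (p OR r) AND (q OR r):

```java
// Exhaustive truth-table check that the CNF rewrite preserves the predicate.
public class CnfEquivalence {
    public static boolean original(boolean p, boolean q, boolean r) {
        return (p && q) || r;        // ((c1>10) AND (a3+b3>10)) OR (c2==5)
    }
    public static boolean cnf(boolean p, boolean q, boolean r) {
        return (p || r) && (q || r); // ((c1>10) OR (c2==5)) AND ((a3+b3>10) OR (c2==5))
    }
    public static boolean equivalent() {
        for (int i = 0; i < 8; i++) {        // all 8 truth assignments
            boolean p = (i & 1) != 0, q = (i & 2) != 0, r = (i & 4) != 0;
            if (original(p, q, r) != cnf(p, q, r)) return false;
        }
        return true;
    }
    public static void main(String[] args) {
        System.out.println(equivalent());
    }
}
```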




[jira] Created: (PIG-1582) upgrade commons-logging version with ivy

2010-08-31 Thread Giridharan Kesavan (JIRA)
upgrade commons-logging version with ivy


 Key: PIG-1582
 URL: https://issues.apache.org/jira/browse/PIG-1582
 Project: Pig
  Issue Type: Improvement
  Components: build
Reporter: Giridharan Kesavan


to upgrade the commons-logging version for pig from 1.0.3 to 1.1.1




[jira] Updated: (PIG-1582) upgrade commons-logging version with ivy

2010-08-31 Thread Giridharan Kesavan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Giridharan Kesavan updated PIG-1582:


Attachment: pig-1582.patch

 upgrade commons-logging version with ivy
 

 Key: PIG-1582
 URL: https://issues.apache.org/jira/browse/PIG-1582
 Project: Pig
  Issue Type: Improvement
  Components: build
Reporter: Giridharan Kesavan
Assignee: Giridharan Kesavan
 Attachments: pig-1582.patch





[jira] Updated: (PIG-1582) upgrade commons-logging version with ivy

2010-08-31 Thread Giridharan Kesavan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Giridharan Kesavan updated PIG-1582:


Status: Patch Available  (was: Open)

 upgrade commons-logging version with ivy
 

 Key: PIG-1582
 URL: https://issues.apache.org/jira/browse/PIG-1582
 Project: Pig
  Issue Type: Improvement
  Components: build
Reporter: Giridharan Kesavan
Assignee: Giridharan Kesavan
 Attachments: pig-1582.patch





[jira] Updated: (PIG-1582) upgrade commons-logging version with ivy

2010-08-31 Thread Giridharan Kesavan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Giridharan Kesavan updated PIG-1582:


   Status: Resolved  (was: Patch Available)
Fix Version/s: 0.8.0
   Resolution: Fixed

 upgrade commons-logging version with ivy
 

 Key: PIG-1582
 URL: https://issues.apache.org/jira/browse/PIG-1582
 Project: Pig
  Issue Type: Improvement
  Components: build
Reporter: Giridharan Kesavan
Assignee: Giridharan Kesavan
 Fix For: 0.8.0

 Attachments: pig-1582.patch





[jira] Created: (PIG-1583) piggybank unit test TestLookupInFiles is broken

2010-08-31 Thread Daniel Dai (JIRA)
piggybank unit test TestLookupInFiles is broken
---

 Key: PIG-1583
 URL: https://issues.apache.org/jira/browse/PIG-1583
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.8.0
Reporter: Daniel Dai
Assignee: Daniel Dai
 Fix For: 0.8.0
 Attachments: PIG-1583-1.patch

Error message:
10/08/31 09:32:12 INFO mapred.TaskInProgress: Error from 
attempt_20100831093139211_0001_m_00_3: 
org.apache.pig.backend.executionengine.ExecException: ERROR 2078: Caught 
error from UDF: org.apache.pig.piggybank.evaluation.string.LookupInFiles 
[LookupInFiles : Cannot open file one]
at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:262)
at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:283)
at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:355)
at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:291)
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:236)
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:231)
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:53)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
at org.apache.hadoop.mapred.Child.main(Child.java:170)
Caused by: java.io.IOException: LookupInFiles : Cannot open file one
at 
org.apache.pig.piggybank.evaluation.string.LookupInFiles.init(LookupInFiles.java:92)
at 
org.apache.pig.piggybank.evaluation.string.LookupInFiles.exec(LookupInFiles.java:115)
at 
org.apache.pig.piggybank.evaluation.string.LookupInFiles.exec(LookupInFiles.java:49)
at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:229)
... 10 more
Caused by: java.io.IOException: hdfs://localhost:47453/user/hadoopqa/one 
does not exist
at 
org.apache.pig.impl.io.FileLocalizer.openDFSFile(FileLocalizer.java:224)
at 
org.apache.pig.impl.io.FileLocalizer.openDFSFile(FileLocalizer.java:172)
at 
org.apache.pig.piggybank.evaluation.string.LookupInFiles.init(LookupInFiles.java:89)
... 13 more





[jira] Updated: (PIG-1583) piggybank unit test TestLookupInFiles is broken

2010-08-31 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-1583:


Attachment: (was: PIG-1583-1.patch)

 piggybank unit test TestLookupInFiles is broken
 ---

 Key: PIG-1583
 URL: https://issues.apache.org/jira/browse/PIG-1583
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.8.0
Reporter: Daniel Dai
Assignee: Daniel Dai
 Fix For: 0.8.0

 Attachments: PIG-1583-1.patch





[jira] Updated: (PIG-1583) piggybank unit test TestLookupInFiles is broken

2010-08-31 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-1583:


Attachment: PIG-1583-1.patch

 piggybank unit test TestLookupInFiles is broken
 ---

 Key: PIG-1583
 URL: https://issues.apache.org/jira/browse/PIG-1583
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.8.0
Reporter: Daniel Dai
Assignee: Daniel Dai
 Fix For: 0.8.0

 Attachments: PIG-1583-1.patch





[jira] Commented: (PIG-1583) piggybank unit test TestLookupInFiles is broken

2010-08-31 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12904783#action_12904783
 ] 

Xuefu Zhang commented on PIG-1583:
--

+1 Patch Looks Good.

 piggybank unit test TestLookupInFiles is broken
 ---

 Key: PIG-1583
 URL: https://issues.apache.org/jira/browse/PIG-1583
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.8.0
Reporter: Daniel Dai
Assignee: Daniel Dai
 Fix For: 0.8.0

 Attachments: PIG-1583-1.patch





[jira] Updated: (PIG-1583) piggybank unit test TestLookupInFiles is broken

2010-08-31 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang updated PIG-1583:
-

Status: Patch Available  (was: Open)

 piggybank unit test TestLookupInFiles is broken
 ---

 Key: PIG-1583
 URL: https://issues.apache.org/jira/browse/PIG-1583
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.8.0
Reporter: Daniel Dai
Assignee: Daniel Dai
 Fix For: 0.8.0

 Attachments: PIG-1583-1.patch


 Error message:
 10/08/31 09:32:12 INFO mapred.TaskInProgress: Error from 
 attempt_20100831093139211_0001_m_00_3: 
 org.apache.pig.backend.executionengine.ExecException: ERROR 2078: Caught 
 error from UDF: org.apache.pig.piggybank.evaluation.string.LookupInFiles 
 [LookupInFiles : Cannot open file one]
 at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:262)
 at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:283)
 at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:355)
 at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:291)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:236)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:231)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:53)
 at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
 at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
 at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
 at org.apache.hadoop.mapred.Child.main(Child.java:170)
 Caused by: java.io.IOException: LookupInFiles : Cannot open file one
 at 
 org.apache.pig.piggybank.evaluation.string.LookupInFiles.init(LookupInFiles.java:92)
 at 
 org.apache.pig.piggybank.evaluation.string.LookupInFiles.exec(LookupInFiles.java:115)
 at 
 org.apache.pig.piggybank.evaluation.string.LookupInFiles.exec(LookupInFiles.java:49)
 at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:229)
 ... 10 more
 Caused by: java.io.IOException: hdfs://localhost:47453/user/hadoopqa/one 
 does not exist
 at 
 org.apache.pig.impl.io.FileLocalizer.openDFSFile(FileLocalizer.java:224)
 at 
 org.apache.pig.impl.io.FileLocalizer.openDFSFile(FileLocalizer.java:172)
 at 
 org.apache.pig.piggybank.evaluation.string.LookupInFiles.init(LookupInFiles.java:89)
 ... 13 more

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1506) Need to clarify the difference between null handling in JOIN and COGROUP

2010-08-31 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12904785#action_12904785
 ] 

Olga Natkovich commented on PIG-1506:
-

This is what we need to document:

In the case of GROUP/COGROUP, data with a NULL key from the same input is 
grouped together. For instance:

Input data:

joe 5   2.5
sam 3.0
bob 3.5

script:

A = load 'small' as (name, age, gpa);
B = group A by age;
dump B;

Output:

(5,{(joe,5,2.5)})
(,{(sam,,3.0),(bob,,3.5)})

Note that both records with null age are grouped together.

However, data with null keys from different inputs is considered different and 
will generate multiple tuples in case of cogroup. For instance:

Input: Self cogroup on the same input.

Script:

A = load 'small' as (name, age, gpa);
B = load 'small' as (name, age, gpa);
C = cogroup A by age, B by age;
dump C;

Output:

(5,{(joe,5,2.5)},{(joe,5,2.5)})
(,{(sam,,3.0),(bob,,3.5)},{})
(,{},{(sam,,3.0),(bob,,3.5)})

Note that there are 2 tuples in the output corresponding to the null key: one 
that contains tuples from the first input (with no match from the second) and 
one the other way around.

JOIN adds another interesting twist to this because it follows the SQL 
standard: by default, JOIN is an inner join, which throws away all 
the nulls.

Input: the same as for COGROUP

Script:

A = load 'small' as (name, age, gpa);
B = load 'small' as (name, age, gpa);
C = join A by age, B by age;
dump C;

Output:

(joe,5,2.5,joe,5,2.5)

Note that all tuples that had NULL key got filtered out.
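The three behaviors above can be modeled in a few lines of Python (an illustrative simulation of the described semantics on the same toy data, not Pig code):

```python
# Toy relation mirroring the 'small' input above; None stands for a null age.
rows = [("joe", 5, 2.5), ("sam", None, 3.0), ("bob", None, 3.5)]

def group_by_age(rel):
    # GROUP: null keys from the SAME input collapse into a single group.
    groups = {}
    for t in rel:
        groups.setdefault(t[1], []).append(t)
    return groups

def cogroup_by_age(a, b):
    # COGROUP: a null key from one input never matches a null key from the
    # other input, so each input contributes its own null-key output tuple.
    out = []
    keys = {t[1] for t in a if t[1] is not None} | {t[1] for t in b if t[1] is not None}
    for k in sorted(keys):
        out.append((k, [t for t in a if t[1] == k], [t for t in b if t[1] == k]))
    out.append((None, [t for t in a if t[1] is None], []))
    out.append((None, [], [t for t in b if t[1] is None]))
    return out

def inner_join_by_age(a, b):
    # Inner JOIN follows SQL: rows with a null key are dropped outright.
    return [x + y for x in a for y in b
            if x[1] is not None and x[1] == y[1]]
```

Running the simulated self-cogroup on `rows` yields one tuple for key 5 and two separate tuples for the null key, while the inner join keeps only the `joe` row, matching the dumps shown above.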


 Need to clarify the difference between null handling in JOIN and COGROUP
 

 Key: PIG-1506
 URL: https://issues.apache.org/jira/browse/PIG-1506
 Project: Pig
  Issue Type: Improvement
  Components: documentation
Reporter: Olga Natkovich
Assignee: Corinne Chandel
 Fix For: 0.8.0




-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1399) Logical Optimizer: Expression optimizor rule

2010-08-31 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12904808#action_12904808
 ] 

Alan Gates commented on PIG-1399:
-

 [exec] +1 overall.  
 [exec] 
 [exec] +1 @author.  The patch does not contain any @author tags.
 [exec] 
 [exec] +1 tests included.  The patch appears to include 6 new or 
modified tests.
 [exec] 
 [exec] +1 javadoc.  The javadoc tool did not generate any warning 
messages.
 [exec] 
 [exec] +1 javac.  The applied patch does not increase the total number 
of javac compiler warnings.
 [exec] 
 [exec] +1 findbugs.  The patch does not introduce any new Findbugs 
warnings.
 [exec] 
 [exec] +1 release audit.  The applied patch does not increase the 
total number of release audit warnings.
 [exec] 
 [exec] 

 Logical Optimizer: Expression optimizor rule
 

 Key: PIG-1399
 URL: https://issues.apache.org/jira/browse/PIG-1399
 Project: Pig
  Issue Type: Sub-task
  Components: impl
Affects Versions: 0.7.0
Reporter: Daniel Dai
Assignee: Yan Zhou
 Fix For: 0.8.0

 Attachments: newPatchFindbugsWarnings.html, PIG-1399.patch, 
 PIG-1399.patch, PIG-1399.patch, PIG-1399.patch, PIG-1399.patch, 
 PIG-1399.patch, PIG-1399.patch


 We can optimize expression in several ways:
 1. Constant pre-calculation
 Example:
 B = filter A by a0 > 5+7;
 => B = filter A by a0 > 12;
 2. Boolean expression optimization
 Example:
 B = filter A by not (not(a0>5) or a1>0);
 => B = filter A by a0>5 and a1<=0;
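Both rules can be sketched over a toy expression tree (an illustrative sketch with invented node names, not Pig's optimizer classes; negating the comparison itself, e.g. turning not(a1>0) into a1<=0, is a further rewrite not shown):

```python
# Nodes: ('const', v) | ('col', name) | ('not', e) | (op, left, right)
# with op in {'+', '>', 'and', 'or'}.

def fold(e):
    """Constant pre-calculation: evaluate subtrees built only of constants."""
    if e[0] in ('const', 'col'):
        return e
    if e[0] == 'not':
        inner = fold(e[1])
        return ('const', not inner[1]) if inner[0] == 'const' else ('not', inner)
    op, l, r = e[0], fold(e[1]), fold(e[2])
    if l[0] == 'const' and r[0] == 'const':
        fns = {'+': lambda a, b: a + b, '>': lambda a, b: a > b,
               'and': lambda a, b: a and b, 'or': lambda a, b: a or b}
        return ('const', fns[op](l[1], r[1]))
    return (op, l, r)

def push_not(e):
    """Boolean simplification: double-negation elimination plus De Morgan."""
    if e[0] == 'not':
        inner = e[1]
        if inner[0] == 'not':                      # not(not(x)) -> x
            return push_not(inner[1])
        if inner[0] in ('and', 'or'):              # De Morgan
            flip = 'or' if inner[0] == 'and' else 'and'
            return (flip, push_not(('not', inner[1])), push_not(('not', inner[2])))
        return e
    if e[0] in ('and', 'or'):
        return (e[0], push_not(e[1]), push_not(e[2]))
    return e
```

Folding `a0 > 5+7` produces `a0 > 12`, and pushing the negation through `not(not(a0>5) or a1>0)` produces `a0>5 and not(a1>0)`, mirroring the two rules in the description.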

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1506) Need to clarify the difference between null handling in JOIN and COGROUP

2010-08-31 Thread Scott Carey (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12904819#action_12904819
 ] 

Scott Carey commented on PIG-1506:
--

The SQL behavior of the above for an outer join would be to have five rows 
output -- just like COGROUP would if flattened.  So that seems fine to me.  A 
self-join should be the same as a COGROUP with yourself, which is different 
from a simple GROUP.

However, there is a problem with inner join and nulls.
Pig JOIN is not like SQL with respect to nulls on multi-column joins.  (I have 
not tried on trunk however)

In SQL, if ANY of the columns in a multi-column join is null, the row is not 
output. 

Try:

{code}
A = load 'small' as (name, age, gpa);
B = load 'small' as (name, age, gpa);
C = join A by (name,age), B by (name,age);
dump C;
{code}

The result for SQL would be one row of the form 
joe 5 2.5 joe 5 2.5
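The any-null-column rule can be simulated directly (an illustrative Python model of the SQL semantics being described, not Pig's implementation):

```python
# SQL composite-key inner join: a row whose key contains ANY null column
# cannot match anything and is excluded from the output.
rows = [("joe", 5, 2.5), ("sam", None, 3.0), ("bob", None, 3.5)]

def sql_inner_join(a, b, key):
    def k(t):
        vals = tuple(t[i] for i in key)
        return None if any(v is None for v in vals) else vals
    return [x + y for x in a for y in b
            if k(x) is not None and k(x) == k(y)]
```

A self-join of `rows` on `(name, age)` keeps only the `joe` row, the single row the comment predicts for SQL.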



 Need to clarify the difference between null handling in JOIN and COGROUP
 

 Key: PIG-1506
 URL: https://issues.apache.org/jira/browse/PIG-1506
 Project: Pig
  Issue Type: Improvement
  Components: documentation
Reporter: Olga Natkovich
Assignee: Corinne Chandel
 Fix For: 0.8.0




-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1584) deal with inner cogroup

2010-08-31 Thread Olga Natkovich (JIRA)
deal with inner cogroup
---

 Key: PIG-1584
 URL: https://issues.apache.org/jira/browse/PIG-1584
 Project: Pig
  Issue Type: Bug
Reporter: Olga Natkovich
 Fix For: 0.9.0


The current implementation of inner in the case of cogroup is in conflict with 
join. We need to decide whether to fix inner cogroup or just remove the 
functionality if it is not widely used.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1583) piggybank unit test TestLookupInFiles is broken

2010-08-31 Thread Giridharan Kesavan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Giridharan Kesavan updated PIG-1583:


Status: Open  (was: Patch Available)

submitting to hudson 

 piggybank unit test TestLookupInFiles is broken
 ---

 Key: PIG-1583
 URL: https://issues.apache.org/jira/browse/PIG-1583
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.8.0
Reporter: Daniel Dai
Assignee: Daniel Dai
 Fix For: 0.8.0

 Attachments: PIG-1583-1.patch


 Error message:
 10/08/31 09:32:12 INFO mapred.TaskInProgress: Error from 
 attempt_20100831093139211_0001_m_00_3: 
 org.apache.pig.backend.executionengine.ExecException: ERROR 2078: Caught 
 error from UDF: org.apache.pig.piggybank.evaluation.string.LookupInFiles 
 [LookupInFiles : Cannot open file one]
 at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:262)
 at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:283)
 at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:355)
 at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:291)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:236)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:231)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:53)
 at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
 at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
 at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
 at org.apache.hadoop.mapred.Child.main(Child.java:170)
 Caused by: java.io.IOException: LookupInFiles : Cannot open file one
 at 
 org.apache.pig.piggybank.evaluation.string.LookupInFiles.init(LookupInFiles.java:92)
 at 
 org.apache.pig.piggybank.evaluation.string.LookupInFiles.exec(LookupInFiles.java:115)
 at 
 org.apache.pig.piggybank.evaluation.string.LookupInFiles.exec(LookupInFiles.java:49)
 at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:229)
 ... 10 more
 Caused by: java.io.IOException: hdfs://localhost:47453/user/hadoopqa/one 
 does not exist
 at 
 org.apache.pig.impl.io.FileLocalizer.openDFSFile(FileLocalizer.java:224)
 at 
 org.apache.pig.impl.io.FileLocalizer.openDFSFile(FileLocalizer.java:172)
 at 
 org.apache.pig.piggybank.evaluation.string.LookupInFiles.init(LookupInFiles.java:89)
 ... 13 more

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1583) piggybank unit test TestLookupInFiles is broken

2010-08-31 Thread Giridharan Kesavan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Giridharan Kesavan updated PIG-1583:


Status: Patch Available  (was: Open)

 piggybank unit test TestLookupInFiles is broken
 ---

 Key: PIG-1583
 URL: https://issues.apache.org/jira/browse/PIG-1583
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.8.0
Reporter: Daniel Dai
Assignee: Daniel Dai
 Fix For: 0.8.0

 Attachments: PIG-1583-1.patch


 Error message:
 10/08/31 09:32:12 INFO mapred.TaskInProgress: Error from 
 attempt_20100831093139211_0001_m_00_3: 
 org.apache.pig.backend.executionengine.ExecException: ERROR 2078: Caught 
 error from UDF: org.apache.pig.piggybank.evaluation.string.LookupInFiles 
 [LookupInFiles : Cannot open file one]
 at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:262)
 at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:283)
 at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:355)
 at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:291)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:236)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:231)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:53)
 at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
 at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
 at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
 at org.apache.hadoop.mapred.Child.main(Child.java:170)
 Caused by: java.io.IOException: LookupInFiles : Cannot open file one
 at 
 org.apache.pig.piggybank.evaluation.string.LookupInFiles.init(LookupInFiles.java:92)
 at 
 org.apache.pig.piggybank.evaluation.string.LookupInFiles.exec(LookupInFiles.java:115)
 at 
 org.apache.pig.piggybank.evaluation.string.LookupInFiles.exec(LookupInFiles.java:49)
 at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:229)
 ... 10 more
 Caused by: java.io.IOException: hdfs://localhost:47453/user/hadoopqa/one 
 does not exist
 at 
 org.apache.pig.impl.io.FileLocalizer.openDFSFile(FileLocalizer.java:224)
 at 
 org.apache.pig.impl.io.FileLocalizer.openDFSFile(FileLocalizer.java:172)
 at 
 org.apache.pig.piggybank.evaluation.string.LookupInFiles.init(LookupInFiles.java:89)
 ... 13 more

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1506) Need to clarify the difference between null handling in JOIN and COGROUP

2010-08-31 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12904829#action_12904829
 ] 

Olga Natkovich commented on PIG-1506:
-

I verified that 0.8 code does deal correctly with multi-column keys with nulls

 Need to clarify the difference between null handling in JOIN and COGROUP
 

 Key: PIG-1506
 URL: https://issues.apache.org/jira/browse/PIG-1506
 Project: Pig
  Issue Type: Improvement
  Components: documentation
Reporter: Olga Natkovich
Assignee: Corinne Chandel
 Fix For: 0.8.0




-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1501) need to investigate the impact of compression on pig performance

2010-08-31 Thread Yan Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1501:
--

Release Note: 
This feature will save HDFS space used to store the intermediate data used by 
PIG and potentially improve query execution speed. In general, the more 
intermediate data generated, the more storage and speedup benefits.

There are no backward compatibility issues as a result of this feature.

Two java properties are used to control the behavior:

pig.tmpfilecompression, which defaults to false, tells whether the temporary 
files should be compressed or not. If true, then

pig.tmpfilecompression.codec specifies which compression codec to use. 
Currently, PIG only accepts gz and lzo as possible values. Since LZO is 
under the GPL license, Hadoop may need to be configured to use the LZO codec. 
Please refer to http://code.google.com/p/hadoop-gpl-compression/wiki/FAQ for 
details.


An example is the following test.pig script:

register pigperf.jar;
A = load '/user/pig/tests/data/pigmix/page_views' using 
org.apache.pig.test.udf.storefunc.PigPerformanceLoader()
as (user, action, timespent:long, query_term, ip_addr, timestamp, 
estimated_revenue, page_info, page_links);
B1 = filter A by timespent == 4;
B = load '/user/pig/tests/data/pigmix/queryterm' as (query_term);
C = join B1 by query_term, B by query_term using 'skewed' parallel 300;
D = distinct C parallel 300;
store D into 'output.lzo';

which is launched as follows:

java -cp /grid/0/gs/conf/current:/grid/0/jars/pig.jar 
-Djava.library.path=/grid/0/gs/hadoop/current/lib/native/Linux-i386-32 
-Dpig.tmpfilecompression=true -Dpig.tmpfilecompression.codec=lzo 
org.apache.pig.Main ./test.pig
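The gating the note describes can be sketched as follows (an illustrative Python reading of the two properties; the `gz` fallback when no codec is named is an assumption, not stated above):

```python
def tmpfile_codec(props):
    """Decide which codec, if any, applies to Pig's temporary files.

    Mirrors the documented behavior: compression is off unless
    pig.tmpfilecompression is "true", and only gz/lzo are accepted.
    """
    if props.get("pig.tmpfilecompression", "false").lower() != "true":
        return None  # compression disabled by default
    codec = props.get("pig.tmpfilecompression.codec", "gz")  # assumed default
    if codec not in ("gz", "lzo"):
        raise ValueError("only 'gz' and 'lzo' are accepted")
    return codec
```

With the launch flags shown above (`-Dpig.tmpfilecompression=true -Dpig.tmpfilecompression.codec=lzo`), this sketch selects `lzo`.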

  was:
This feature will save HDFS space used to store the intermediate data used by 
PIG and potentially improve query execution speed. In general, the more 
intermediate data generated, the more storage and speedup benefits.

There are no backward compatibility issues as result of this feature.

Two java properties are used to control the behavoir:

pig.tmpfilecompression, default to false, tells if the temporary files should 
be compressed or not.  If true, then

pig.tmpfilecompression.codec specifies which compression codec to use. 
Currently, PIG only accepts gz and lzo as possible values. Since LZO is 
under GPL license, Hadoop may need to be configured to use LZO codec. Please 
refer to http://code.google.com/p/hadoop-gpl-compression/wiki/FAQ for details.


An example is the following test.pig script:

register pigperf.jar;
A = load '/user/pig/tests/data/pigmix/page_views' using 
org.apache.pig.test.udf.storefunc.PigPerformanceLoader()
as (user, action, timespent:long, query_term, ip_addr, timestamp, 
estimated_revenue, page_info, page_links);
B1 = filter A by timespent == 4;
B = load '/user/pig/tests/data/pigmix/queryterm' as (query_term);
C = join B1 by query_term, B by query_term using 'skewed' parallel 300;
D = distinct C parallel 300;
store D into 'output.lzo';

which is launched as follows:

java -cp /grid/0/gs/conf/current:/grid/0/jars/pig.jar 
-Djava.library.path=/grid/0/gs/hadoop/current/lib/native/Linux-i386-32 
-Dpig.tmpfilecompression=true -Dpig.tmpfilecompression.codec=lzo 
org.apache.pig.Main ./test.pig




 need to investigate the impact of compression on pig performance
 

 Key: PIG-1501
 URL: https://issues.apache.org/jira/browse/PIG-1501
 Project: Pig
  Issue Type: Test
Reporter: Olga Natkovich
Assignee: Yan Zhou
 Fix For: 0.8.0

 Attachments: compress_perf_data.txt, compress_perf_data_2.txt, 
 PIG-1501.patch, PIG-1501.patch, PIG-1501.patch


 We would like to understand how compressing map results as well as 
 reducer output in a chain of MR jobs impacts performance. We can use PigMix 
 queries for this investigation.
[jira] Created: (PIG-1585) Add new properties to help and documentation

2010-08-31 Thread Olga Natkovich (JIRA)
Add new properties to help and documentation


 Key: PIG-1585
 URL: https://issues.apache.org/jira/browse/PIG-1585
 Project: Pig
  Issue Type: Bug
Reporter: Olga Natkovich
Assignee: Olga Natkovich
 Fix For: 0.8.0


New properties:

Compression:

pig.tmpfilecompression, which defaults to false, tells whether the temporary 
files should be compressed or not. If true, then 
pig.tmpfilecompression.codec specifies which compression codec to use. 
Currently, PIG only accepts gz and lzo as possible values. Since LZO is 
under the GPL license, Hadoop may need to be configured to use the LZO codec. 
Please refer to http://code.google.com/p/hadoop-gpl-compression/wiki/FAQ for 
details. 

Combining small files:

pig.noSplitCombination - disables combining multiple small files to the block 
size


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1572) change default datatype when relations are used as scalar to bytearray

2010-08-31 Thread Thejas M Nair (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thejas M Nair updated PIG-1572:
---

Attachment: PIG-1572.2.patch

PIG-1572.2.patch 
- Fixed loss of lineage information in translation during explain call
- Added cast on output of ReadScalars so that type information is not lost 
during schema reset from optimizer.

Unit tests and test-patch has passed. Patch is ready for review.

 [exec] +1 overall.  
 [exec] 
 [exec] +1 @author.  The patch does not contain any @author tags.
 [exec] 
 [exec] +1 tests included.  The patch appears to include 3 new or 
modified tests.
 [exec] 
 [exec] +1 javadoc.  The javadoc tool did not generate any warning 
messages.
 [exec] 
 [exec] +1 javac.  The applied patch does not increase the total number 
of javac compiler warnings.
 [exec] 
 [exec] +1 findbugs.  The patch does not introduce any new Findbugs 
warnings.
 [exec] 
 [exec] +1 release audit.  The applied patch does not increase the 
total number of release audit warnings.


 change default datatype when relations are used as scalar to bytearray
 --

 Key: PIG-1572
 URL: https://issues.apache.org/jira/browse/PIG-1572
 Project: Pig
  Issue Type: Bug
Reporter: Thejas M Nair
Assignee: Thejas M Nair
 Fix For: 0.8.0

 Attachments: PIG-1572.1.patch, PIG-1572.2.patch


 When relations are cast to scalar, the current default type is chararray. 
 This is inconsistent with the behavior in rest of pig-latin.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1501) need to investigate the impact of compression on pig performance

2010-08-31 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12904848#action_12904848
 ] 

Olga Natkovich commented on PIG-1501:
-

Ashutosh,

The reason it is off by default is that the default compression is gzip, 
which is really slow and most of the time not what you want. Because of the 
licensing issue with lzo, users need to set it up on their own. Once they do 
the setup, they can enable the compression.

 need to investigate the impact of compression on pig performance
 

 Key: PIG-1501
 URL: https://issues.apache.org/jira/browse/PIG-1501
 Project: Pig
  Issue Type: Test
Reporter: Olga Natkovich
Assignee: Yan Zhou
 Fix For: 0.8.0

 Attachments: compress_perf_data.txt, compress_perf_data_2.txt, 
 PIG-1501.patch, PIG-1501.patch, PIG-1501.patch


 We would like to understand how compressing map results as well as 
 reducer output in a chain of MR jobs impacts performance. We can use PigMix 
 queries for this investigation.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1586) Parameter subsitution using -param option runs into problems when substituing entire pig statements in a shell script (maybe this is a bash problem)

2010-08-31 Thread Viraj Bhat (JIRA)
Parameter subsitution using -param option runs into problems when substituing 
entire pig statements in a shell script (maybe this is a bash problem)


 Key: PIG-1586
 URL: https://issues.apache.org/jira/browse/PIG-1586
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Viraj Bhat


I have a Pig script as a template:

{code}
register Countwords.jar;
A = $INPUT;
B = FOREACH A GENERATE
examples.udf.SubString($0,0,1),
$1 as num;
C = GROUP B BY $0;
D = FOREACH C GENERATE group, SUM(B.num);
STORE D INTO $OUTPUT;
{code}


I attempt to do Parameter substitutions using the following:

Using Shell script:

{code}
#!/bin/bash
java -cp ~/pig-svn/trunk/pig.jar:$HADOOP_CONF_DIR org.apache.pig.Main -r -file 
sub.pig \
 -param INPUT=(foreach (COGROUP(load '/user/viraj/dataset1' USING 
PigStorage() AS (word:chararray,num:int)) by (word),(load 
'/user/viraj/dataset2' USING PigStorage() AS (word:chararray,num:int)) by 
(word)) generate flatten(examples.udf.CountWords(\\$0,\\$1,\\$2))) \
 -param OUTPUT=\'/user/viraj/output\' USING PigStorage()
{code}

{code}
register Countwords.jar;

A = (foreach (COGROUP(load '/user/viraj/dataset1' USING PigStorage() AS 
(word:chararray,num:int)) by (word),(load '/user/viraj/dataset2' USING 
PigStorage() AS (word:chararray,num:int)) by (word)) generate 
flatten(examples.udf.CountWords(runsub.sh,,)));
B = FOREACH A GENERATE
examples.udf.SubString($0,0,1),
$1 as num;
C = GROUP B BY $0;
D = FOREACH C GENERATE group, SUM(B.num);

STORE D INTO /user/viraj/output;
{code}

The shell substitutes the $0 before passing it to java. 
a) Is there a workaround for this?  
b) Is this a Pig param problem?


Viraj
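The expansion can be reproduced outside Pig: inside double quotes, bash rewrites $0/$1/$2 before java ever sees the parameter, while single quotes pass the text through untouched (a minimal sketch of the quoting issue; `examples.udf.CountWords` is the UDF from the report):

```python
import subprocess

# Double quotes: bash expands $0/$1/$2 first (in `bash -c` they default to
# "bash" and empty strings), which is how the script name leaked into the
# -param value in the report above.
dq = subprocess.run(
    ["bash", "-c", 'echo "flatten(examples.udf.CountWords($0,$1,$2))"'],
    capture_output=True, text=True).stdout.strip()

# Single quotes: the positional parameters survive literally.
sq = subprocess.run(
    ["bash", "-c", "echo 'flatten(examples.udf.CountWords($0,$1,$2))'"],
    capture_output=True, text=True).stdout.strip()
```

So one workaround for (a) is to single-quote the whole `-param` value (or escape each `$` as `\$`) so the shell never sees the positional parameters; the substitution happens in bash, not in Pig.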



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1586) Parameter subsitution using -param option runs into problems when substituing entire pig statements in a shell script (maybe this is a bash problem)

2010-08-31 Thread Viraj Bhat (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Bhat updated PIG-1586:


Description: 
I have a Pig script as a template:

{code}
register Countwords.jar;
A = $INPUT;
B = FOREACH A GENERATE
examples.udf.SubString($0,0,1),
$1 as num;
C = GROUP B BY $0;
D = FOREACH C GENERATE group, SUM(B.num);
STORE D INTO $OUTPUT;
{code}


I attempt to do Parameter substitutions using the following:

Using Shell script:

{code}
#!/bin/bash
java -cp ~/pig-svn/trunk/pig.jar:$HADOOP_CONF_DIR org.apache.pig.Main -r -file 
sub.pig \
 -param INPUT=(foreach (COGROUP(load '/user/viraj/dataset1' USING 
PigStorage() AS (word:chararray,num:int)) by (word),(load 
'/user/viraj/dataset2' USING PigStorage() AS (word:chararray,num:int)) by 
(word)) generate flatten(examples.udf.CountWords(\\$0,\\$1,\\$2))) \
 -param OUTPUT=\'/user/viraj/output\' USING PigStorage()
{code}

{code}
register Countwords.jar;

A = (foreach (COGROUP(load '/user/viraj/dataset1' USING PigStorage() AS 
(word:chararray,num:int)) by (word),(load '/user/viraj/dataset2' USING 
PigStorage() AS (word:chararray,num:int)) by (word)) generate 
flatten(examples.udf.CountWords(runsub.sh,,)));
B = FOREACH A GENERATE
examples.udf.SubString($0,0,1),
$1 as num;
C = GROUP B BY $0;
D = FOREACH C GENERATE group, SUM(B.num);

STORE D INTO /user/viraj/output;
{code}

The shell substitutes the $0 before passing it to java. 
a) Is there a workaround for this?  
b) Is this a Pig param problem?


Viraj

  was:
I have a Pig script as a template:

{code}
register Countwords.jar;
A = $INPUT;
B = FOREACH A GENERATE
examples.udf.SubString($0,0,1),
$1 as num;
C = GROUP B BY $0;
D = FOREACH C GENERATE group, SUM(B.num);
STORE D INTO $OUTPUT;
{code}


I attempt to do Parameter substitutions using the following:

Using Shell script:

{code}
#!/bin/bash
java -cp ~/pig-svn/trunk/pig.jar:$HADOOP_CONF_DIR org.apache.pig.Main -r -file 
sub.pig \
 -param INPUT=(foreach (COGROUP(load '/user/viraj/dataset1' USING 
PigStorage() AS (word:chararray,num:int)) by (word),(load 
'/user/viraj/dataset2' USING PigStorage() AS (word:chararray,num:int)) by 
(word)) generate flatten(examples.udf.CountWords(\\$0,\\$1,\\$2))) \
 -param OUTPUT=\'/user/viraj/output\' USING PigStorage()
{code}

register Countwords.jar;

A = (foreach (COGROUP(load '/user/viraj/dataset1' USING PigStorage() AS 
(word:chararray,num:int)) by (word),(load '/user/viraj/dataset2' USING 
PigStorage() AS (word:chararray,num:int)) by (word)) generate 
flatten(examples.udf.CountWords(runsub.sh,,)));
B = FOREACH A GENERATE
examples.udf.SubString($0,0,1),
$1 as num;
C = GROUP B BY $0;
D = FOREACH C GENERATE group, SUM(B.num);

STORE D INTO /user/viraj/output;
{code}

The shell substitutes the $0 before passing it to java. 
a) Is there a workaround for this?  
b) Is this a Pig param problem?


Viraj




 Parameter subsitution using -param option runs into problems when substituing 
 entire pig statements in a shell script (maybe this is a bash problem)
 

 Key: PIG-1586
 URL: https://issues.apache.org/jira/browse/PIG-1586
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Viraj Bhat

 I have a Pig script as a template:
 {code}
 register Countwords.jar;
 A = $INPUT;
 B = FOREACH A GENERATE
 examples.udf.SubString($0,0,1),
 $1 as num;
 C = GROUP B BY $0;
 D = FOREACH C GENERATE group, SUM(B.num);
 STORE D INTO $OUTPUT;
 {code}
 I attempt to do Parameter substitutions using the following:
 Using Shell script:
 {code}
 #!/bin/bash
 java -cp ~/pig-svn/trunk/pig.jar:$HADOOP_CONF_DIR org.apache.pig.Main -r 
 -file sub.pig \
  -param INPUT=(foreach (COGROUP(load '/user/viraj/dataset1' 
 USING PigStorage() AS (word:chararray,num:int)) by (word),(load 
 '/user/viraj/dataset2' USING PigStorage() AS (word:chararray,num:int)) by 
 (word)) generate flatten(examples.udf.CountWords(\\$0,\\$1,\\$2))) \
  -param OUTPUT=\'/user/viraj/output\' USING PigStorage()
 {code}
 {code}
 register Countwords.jar;
 A = (foreach (COGROUP(load '/user/viraj/dataset1' USING PigStorage() AS 
 (word:chararray,num:int)) by (word),(load '/user/viraj/dataset2' USING 
 PigStorage() AS (word:chararray,num:int)) by (word)) generate 
 flatten(examples.udf.CountWords(runsub.sh,,)));
 B = FOREACH A GENERATE
 examples.udf.SubString($0,0,1),
 $1 as num;
 C = GROUP B BY $0;
 D = FOREACH C GENERATE group, SUM(B.num);
 STORE D INTO /user/viraj/output;
 {code}
 The shell substitutes the $0 before passing it to java. 
 a) Is there a workaround for this?  
 b) Is this a Pig param problem?  
 Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a 

[jira] Created: (PIG-1587) Cloning utility functions for new logical plan

2010-08-31 Thread Daniel Dai (JIRA)
Cloning utility functions for new logical plan
--

 Key: PIG-1587
 URL: https://issues.apache.org/jira/browse/PIG-1587
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.8.0
Reporter: Daniel Dai
 Fix For: 0.9.0


We sometimes need to copy a logical operator/plan when writing an optimization 
rule. Currently, copying an operator/plan is awkward. We need some 
utilities to facilitate this process. Swati contributed PIG-1510, but we feel it 
still does not address most use cases. I propose adding some more utilities to the 
new logical plan:

all LogicalExpressions:
{code}
copy(LogicalExpressionPlan newPlan, boolean keepUid);
{code}
* Do a shallow copy of the logical expression operator (except for fieldSchema, 
uidOnlySchema, ProjectExpression.attachedRelationalOp)
* Set the plan to newPlan
* If keepUid is true, further copy uidOnlyFieldSchema
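The contract above can be sketched with a toy class (the class and field names below are illustrative stand-ins of mine, not Pig's actual operator classes): a shallow copy that re-attaches the operator to the new plan and carries the uid-only schema only when keepUid is set.

```java
// Illustrative sketch only: "ExprOp" stands in for a logical expression
// operator; Pig's real classes carry more state than this.
class ExprOp {
    final String name;    // ordinary state, shallow-copied
    Object plan;          // the plan this operator is attached to
    Long uidFieldSchema;  // uid-related state, copied only on request

    ExprOp(String name, Object plan, Long uidFieldSchema) {
        this.name = name;
        this.plan = plan;
        this.uidFieldSchema = uidFieldSchema;
    }

    // Mirrors copy(newPlan, keepUid): shallow copy, attach to newPlan,
    // and keep the uid-only schema only when keepUid is true.
    ExprOp copy(Object newPlan, boolean keepUid) {
        return new ExprOp(name, newPlan, keepUid ? uidFieldSchema : null);
    }
}
```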

all LogicalRelationalOperators:
{code}
copy(LogicalPlan newPlan, boolean keepUid);
{code}
* Do a shallow copy of the logical relational operator (except for schema, uid 
related fields)
* Set the plan to newPlan;
* If the operator has an inner plan/expression plan, copy the whole inner plan 
with the same keepUid flag (in particular, LOInnerLoad will copy its inner 
project with the same keepUid flag)
* If keepUid is true, further copy uid related fields (LOUnion.uidMapping, 
LOCogroup.groupKeyUidOnlySchema, LOCogroup.generatedInputUids)

LogicalExpressionPlan.java
{code}
LogicalExpressionPlan copy(LogicalRelationalOperator attachedRelationalOp, 
boolean keepUid);
{code}
* Copy expression operator along with connection with the same keepUid flag
* Set all ProjectExpression.attachedRelationalOp to attachedRelationalOp 
parameter

{code}
List&lt;Operator&gt; merge(LogicalExpressionPlan plan);
{code}
* Merge plan into the current logical expression plan as an independent tree
* Return the sources of this independent tree


LogicalPlan.java
{code}
LogicalPlan copy(boolean keepUid);
{code}
* Main use case to copy inner plan of ForEach
* Copy all relational operators along with their connections
* Copy all expression plans inside relational operators, setting plan and 
attachedRelationalOp properly

{code}
List&lt;Operator&gt; merge(LogicalPlan plan);
{code}
* Merge plan into the current logical plan as an independent tree
* Return the sources of this independent tree





[jira] Assigned: (PIG-1586) Parameter substitution using -param option runs into problems when substituting entire pig statements in a shell script (maybe this is a bash problem)

2010-08-31 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich reassigned PIG-1586:
---

Assignee: Viraj Bhat

Viraj volunteered to print the line that Pig gets as part of parameter 
substitution to see if the escapes and quotes are eaten by the shell. Thanks, 
Viraj.

 Parameter substitution using -param option runs into problems when substituting 
 entire pig statements in a shell script (maybe this is a bash problem)
 

 Key: PIG-1586
 URL: https://issues.apache.org/jira/browse/PIG-1586
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Viraj Bhat
Assignee: Viraj Bhat

 I have a Pig script as a template:
 {code}
 register Countwords.jar;
 A = $INPUT;
 B = FOREACH A GENERATE
 examples.udf.SubString($0,0,1),
 $1 as num;
 C = GROUP B BY $0;
 D = FOREACH C GENERATE group, SUM(B.num);
 STORE D INTO $OUTPUT;
 {code}
 I attempt to do parameter substitution using the following shell script:
 {code}
 #!/bin/bash
 java -cp ~/pig-svn/trunk/pig.jar:$HADOOP_CONF_DIR org.apache.pig.Main -r 
 -file sub.pig \
  -param INPUT=(foreach (COGROUP(load '/user/viraj/dataset1' 
 USING PigStorage() AS (word:chararray,num:int)) by (word),(load 
 '/user/viraj/dataset2' USING PigStorage() AS (word:chararray,num:int)) by 
 (word)) generate flatten(examples.udf.CountWords(\\$0,\\$1,\\$2))) \
  -param OUTPUT=\'/user/viraj/output\' USING PigStorage()
 {code}
 {code}
 register Countwords.jar;
 A = (foreach (COGROUP(load '/user/viraj/dataset1' USING PigStorage() AS 
 (word:chararray,num:int)) by (word),(load '/user/viraj/dataset2' USING 
 PigStorage() AS (word:chararray,num:int)) by (word)) generate 
 flatten(examples.udf.CountWords(runsub.sh,,)));
 B = FOREACH A GENERATE
 examples.udf.SubString($0,0,1),
 $1 as num;
 C = GROUP B BY $0;
 D = FOREACH C GENERATE group, SUM(B.num);
 STORE D INTO /user/viraj/output;
 {code}
 The shell substitutes the $0 before passing it to java. 
 a) Is there a workaround for this?  
 b) Is this a Pig param problem?
 Viraj




[jira] Created: (PIG-1588) Parameter pre-processing of values containing pig positional variables ($0, $1 etc)

2010-08-31 Thread Laukik Chitnis (JIRA)
Parameter pre-processing of values containing pig positional variables ($0, $1 
etc)
---

 Key: PIG-1588
 URL: https://issues.apache.org/jira/browse/PIG-1588
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.7.0
Reporter: Laukik Chitnis
 Fix For: 0.7.0


Pig 0.7 requires the positional variables to be escaped by a \\ when passed as 
part of a parameter value (either through a command-line param or through a 
param_file), which was not the case in Pig 0.6. Assuming that this was not an 
intended breakage of backward compatibility (I could not find it in the release 
notes), this would be a bug.

For example, we need to pass
INPUT=CountWords(\\$0,\\$1,\\$2)

instead of simply
INPUT=CountWords($0,$1,$2)
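One way to keep the shell's own escaping out of the picture entirely (a suggestion of mine, not something proposed in the ticket) is to put the value in a file and pass it with Pig's -param_file option; a quoted heredoc delimiter stops the shell from expanding the positional variables:

```shell
# Write the parameter to a file; the quoted 'EOF' delimiter prevents the
# shell from expanding $0/$1/$2 inside the heredoc body.
cat > params.txt <<'EOF'
INPUT=CountWords($0,$1,$2)
EOF

cat params.txt
# prints: INPUT=CountWords($0,$1,$2)

# Hypothetical invocation (script name assumed):
# pig -param_file params.txt script.pig
```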






[jira] Resolved: (PIG-1588) Parameter pre-processing of values containing pig positional variables ($0, $1 etc)

2010-08-31 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich resolved PIG-1588.
-

Resolution: Duplicate

This is a duplicate of https://issues.apache.org/jira/browse/PIG-1586, and at this 
point we do not believe that either is a bug in Pig. Viraj is verifying that, 
but we think the shell removes the escapes before handing the value to Pig.

 Parameter pre-processing of values containing pig positional variables ($0, 
 $1 etc)
 ---

 Key: PIG-1588
 URL: https://issues.apache.org/jira/browse/PIG-1588
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.7.0
Reporter: Laukik Chitnis
 Fix For: 0.7.0


 Pig 0.7 requires the positional variables to be escaped by a \\ when passed 
 as part of a parameter value (either through a command-line param or through a 
 param_file), which was not the case in Pig 0.6. Assuming that this was not an 
 intended breakage of backward compatibility (I could not find it in the release 
 notes), this would be a bug.
 For example, we need to pass
 INPUT=CountWords(\\$0,\\$1,\\$2)
 instead of simply
 INPUT=CountWords($0,$1,$2)




[jira] Resolved: (PIG-1537) Column pruner causes wrong results when using both Custom Store UDF and PigStorage

2010-08-31 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich resolved PIG-1537.
-

Resolution: Fixed

 Column pruner causes wrong results when using both Custom Store UDF and 
 PigStorage
 --

 Key: PIG-1537
 URL: https://issues.apache.org/jira/browse/PIG-1537
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Viraj Bhat
Assignee: Daniel Dai
 Fix For: 0.8.0


 I have a script of this pattern which uses two StoreFuncs:
 {code}
 register loader.jar
 register piggy-bank/java/build/storage.jar;
 %DEFAULT OUTPUTDIR /user/viraj/prunecol/
 ss_sc_0 = LOAD '/data/click/20100707/0' USING Loader() AS (a, b, c);
 ss_sc_filtered_0 = FILTER ss_sc_0 BY
 a#'id' matches '1.*' OR
 a#'id' matches '2.*' OR
 a#'id' matches '3.*' OR
 a#'id' matches '4.*';
 ss_sc_1 = LOAD '/data/click/20100707/1' USING Loader() AS (a, b, c);
 ss_sc_filtered_1 = FILTER ss_sc_1 BY
 a#'id' matches '65.*' OR
 a#'id' matches '466.*' OR
 a#'id' matches '043.*' OR
 a#'id' matches '044.*' OR
 a#'id' matches '0650.*' OR
 a#'id' matches '001.*';
 ss_sc_all = UNION ss_sc_filtered_0,ss_sc_filtered_1;
 ss_sc_all_proj = FOREACH ss_sc_all GENERATE
 a#'query' as query,
 a#'testid' as testid,
 a#'timestamp' as timestamp,
 a,
 b,
 c;
 ss_sc_all_ord = ORDER ss_sc_all_proj BY query,testid,timestamp PARALLEL 10;
 ss_sc_all_map = FOREACH ss_sc_all_ord  GENERATE a, b, c;
 STORE ss_sc_all_map INTO '$OUTPUTDIR/data/20100707' using Storage();
 ss_sc_all_map_count = group ss_sc_all_map all;
 count = FOREACH ss_sc_all_map_count GENERATE 'record_count' as 
 record_count,COUNT($1);
 STORE count INTO '$OUTPUTDIR/count/20100707' using PigStorage('\u0009');
 {code}
 I run this script using:
 a) java -cp pig0.7.jar script.pig
 b) java -cp pig0.7.jar -t PruneColumns script.pig
 What I observe is that the alias count produces the same number of records, 
 but ss_sc_all_map has different sizes when run with the above two options.
 Is this due to the fact that two store funcs are used?
 Viraj




[jira] Updated: (PIG-747) Logical to Physical Plan Translation fails when temporary alias are created within foreach

2010-08-31 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-747:
---

Fix Version/s: 0.9.0
   (was: 0.8.0)

 Logical to Physical Plan Translation fails when temporary alias are created 
 within foreach
 --

 Key: PIG-747
 URL: https://issues.apache.org/jira/browse/PIG-747
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.4.0
Reporter: Viraj Bhat
Assignee: Daniel Dai
 Fix For: 0.9.0

 Attachments: physicalplan.txt, physicalplanprob.pig, PIG-747-1.patch


 Consider the following Pig script, which calculates a new column F inside the 
 foreach:
 {code}
 A = load 'physicalplan.txt' as (col1,col2,col3);
 B = foreach A {
D = col1/col2;
E = col3/col2;
F = E - (D*D);
generate
F as newcol;
 };
 dump B;
 {code}
 This gives the following error:
 ===
 Caused by: 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.LogicalToPhysicalTranslatorException:
  ERROR 2015: Invalid physical operators in the physical plan
 at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.LogToPhyTranslationVisitor.visit(LogToPhyTranslationVisitor.java:377)
 at 
 org.apache.pig.impl.logicalLayer.LOMultiply.visit(LOMultiply.java:63)
 at 
 org.apache.pig.impl.logicalLayer.LOMultiply.visit(LOMultiply.java:29)
 at 
 org.apache.pig.impl.plan.DependencyOrderWalkerWOSeenChk.walk(DependencyOrderWalkerWOSeenChk.java:68)
 at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.LogToPhyTranslationVisitor.visit(LogToPhyTranslationVisitor.java:908)
 at 
 org.apache.pig.impl.logicalLayer.LOForEach.visit(LOForEach.java:122)
 at org.apache.pig.impl.logicalLayer.LOForEach.visit(LOForEach.java:41)
 at 
 org.apache.pig.impl.plan.DependencyOrderWalker.walk(DependencyOrderWalker.java:68)
 at org.apache.pig.impl.plan.PlanVisitor.visit(PlanVisitor.java:51)
 at 
 org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.compile(HExecutionEngine.java:246)
 ... 10 more
 Caused by: org.apache.pig.impl.plan.PlanException: ERROR 0: Attempt to give 
 operator of type 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.Divide
  multiple outputs.  This operator does not support multiple outputs.
 at 
 org.apache.pig.impl.plan.OperatorPlan.connect(OperatorPlan.java:158)
 at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.plans.PhysicalPlan.connect(PhysicalPlan.java:89)
 at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.LogToPhyTranslationVisitor.visit(LogToPhyTranslationVisitor.java:373)
 ... 19 more
 ===
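 A possible workaround (my sketch, not from the ticket): inlining the intermediate 
 expressions so that no alias like D is referenced twice gives each division its 
 own operator in the inner plan, which may sidestep the multiple-outputs error:
 {code}
 A = load 'physicalplan.txt' as (col1,col2,col3);
 B = foreach A generate (col3/col2) - ((col1/col2)*(col1/col2)) as newcol;
 dump B;
 {code}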




[jira] Updated: (PIG-1319) New logical optimization rules

2010-08-31 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-1319:


Fix Version/s: 0.9.0
   (was: 0.8.0)

 New logical optimization rules
 --

 Key: PIG-1319
 URL: https://issues.apache.org/jira/browse/PIG-1319
 Project: Pig
  Issue Type: New Feature
  Components: impl
Affects Versions: 0.7.0
Reporter: Daniel Dai
Assignee: Daniel Dai
 Fix For: 0.9.0


 In [PIG-1178|https://issues.apache.org/jira/browse/PIG-1178], we build a new 
 logical optimization framework. One design goal for the new logical optimizer 
 is to make it easier to add new logical optimization rules. In this Jira, we 
 keep track of the development of these new logical optimization rules.




Does Pig Re-Use FileInputLoadFuncs Objects?

2010-08-31 Thread Russell Jurney
Pardon the cross-post: Does Pig ever re-use FileInputLoadFunc objects?  We
suspect state is being retained between different stores, but we don't
actually know this.  Figured I'd ask to verify the hunch.

Our load func for our in-house format works fine with Pig scripts
normally... but I have a pig script that looks like this:

LOAD thing1
SPLIT thing1 INTO thing2, thing3
STORE thing2 INTO thing2
STORE thing3 INTO thing3

LOAD thing4
SPLIT thing4 INTO thing5, thing6
STORE thing5 INTO thing5
STORE thing6 INTO thing6


And it works via PigStorage, but not via our FileInputLoadFunc.

Russ
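A generic illustration of the suspected hazard (toy classes of mine, not Pig's actual LoadFunc API): if the framework reuses one loader instance across inputs, any per-input state that is only initialized once leaks into the next input, whereas state reset at the start of each input does not.

```java
import java.util.ArrayList;
import java.util.List;

// Toy stand-in for a load func that keeps a per-input buffer as an
// instance field; whether it misbehaves on reuse depends entirely on
// whether that buffer is reset when a new input starts.
class StatefulLoader {
    private final List<String> buffer = new ArrayList<>();

    // Simulates starting a new input; a "leaky" loader skips the clear().
    void prepareForInput(boolean resetState) {
        if (resetState) {
            buffer.clear();
        }
    }

    void read(String record) { buffer.add(record); }

    int recordsSeen() { return buffer.size(); }
}
```

If a caller reuses one instance for a second input without resetting, the second input silently sees the first input's records; resetting per input keeps the runs independent.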