[jira] Assigned: (PIG-847) Setting twoLevelAccessRequired field in a bag schema should not be required to access fields in the tuples of the bag

2010-09-20 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates reassigned PIG-847:
--

Assignee: Alan Gates  (was: Richard Ding)

> Setting twoLevelAccessRequired field in a bag schema should not be required 
> to access fields in the tuples of the bag
> -
>
> Key: PIG-847
> URL: https://issues.apache.org/jira/browse/PIG-847
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.3.0
>Reporter: Pradeep Kamath
>Assignee: Alan Gates
>
> Currently Pig interprets the result type of a relation as a bag, and the 
> schema of the relation directly contains the schema describing the fields in 
> the tuples of the relation. However, when a UDF wants to return a bag, when 
> there is a bag in the input data, or when the user creates a bag constant, 
> the schema of the bag has one field schema, which is that of the tuple; the 
> tuple's schema has the types of the fields. To be able to access the fields 
> of the bag directly in such a case, by using something like 
> bagalias.fieldalias, the schema of the bag must have twoLevelAccessRequired 
> set to true so that Pig's type system can traverse the tuple schema and get 
> to the field in question. This is confusing - we should try to see if we can 
> avoid needing this extra flag. One possible solution is to treat bags the 
> same way whether they represent relations or real bags. Another is to 
> introduce a special "relation" datatype for the result type of a relation, 
> so that the bag type would be used only for true bags. In this case, we 
> would always need a bag schema to have a tuple schema which would 
> describe the fields. 
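
A minimal Pig Latin sketch of the access pattern in question (MyBagUDF and 
the field names are hypothetical, for illustration only):

{code}
-- hypothetical UDF returning a bag; its schema wraps a single tuple schema
A = LOAD 'data' AS (x:int);
B = FOREACH A GENERATE MyBagUDF(x) AS b:{t:(f1:int, f2:int)};
-- dotted access into the tuple inside the bag; today this resolves only if
-- twoLevelAccessRequired is set to true on b's schema
C = FOREACH B GENERATE b.f1;
{code}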

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Assigned: (PIG-1371) Pig should handle deep casting of complex types

2010-09-20 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates reassigned PIG-1371:
---

Assignee: Alan Gates  (was: Richard Ding)

> Pig should handle deep casting of complex types 
> 
>
> Key: PIG-1371
> URL: https://issues.apache.org/jira/browse/PIG-1371
> Project: Pig
>  Issue Type: Bug
>Reporter: Pradeep Kamath
>Assignee: Alan Gates
> Attachments: PIG-1371-partial.patch
>
>
> Consider input data in BinStorage format which has a field of bag type - 
> bg:{t:(i:int)}. In the load statement if the schema specified has the type 
> for this field specified as bg:{t:(c:chararray)}, the current behavior is 
> that Pig thinks of the field to be of type specified in the load statement 
> (bg:{t:(c:chararray)}) but no deep cast from bag of int (the real data) to 
> bag of chararray (the user specified schema) is made.
> There are two issues currently:
> 1) The TypeCastInserter only compares the byte 'type' between the 
> loader-presented schema and the user-specified schema to decide whether to 
> introduce a cast. In the above case, since both schemas have the type "bag", 
> no cast is inserted. This check has to be extended to consider the full 
> FieldSchema (with inner subschema) in order to decide whether a cast is 
> needed.
> 2) POCast should be changed to handle casting a complex type to the type 
> specified in the user-supplied FieldSchema. There is one issue to be 
> considered here: if the user specified the cast type to be 
> bg:{t:(i:int, j:int)} and the real data had only one field, what should the 
> result of the cast be:
>  * A bag with two fields - the int field and a null? - In this approach pig 
> is assuming the lone field in the data is the first field which might be 
> incorrect if it in fact is the second field.
>  * A null bag to indicate that the bag is of unknown value - this is the one 
> I personally prefer
>  * The cast throws an IncompatibleCastException
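
A short Pig Latin sketch of the scenario described above (the file name is 
hypothetical):

{code}
-- assume 'data' was stored by BinStorage with schema bg:{t:(i:int)}
A = LOAD 'data' USING BinStorage() AS (bg:{t:(c:chararray)});
-- Pig now reports bg as a bag of chararray, but no deep cast from the
-- underlying bag of int to a bag of chararray is actually inserted
{code}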




[jira] Created: (PIG-1634) Multiple names for the "group" field

2010-09-20 Thread Viraj Bhat (JIRA)
Multiple names for the "group" field


 Key: PIG-1634
 URL: https://issues.apache.org/jira/browse/PIG-1634
 Project: Pig
  Issue Type: New Feature
Affects Versions: 0.7.0, 0.6.0, 0.5.0, 0.4.0, 0.3.0, 0.2.0, 0.1.0
Reporter: Viraj Bhat


I am hoping that in Pig, if I type

{quote}
c = cogroup a by foo, b by bar
{quote}

then the fields c.group, c.foo and c.bar should all map to c.$0.

This would improve the readability of the Pig script.

Here's a real usecase:
{code}
---
pages = LOAD 'pages.dat'  AS (url, pagerank);

visits = LOAD 'user_log.dat'  AS (user_id, url);

page_visits = COGROUP pages BY url, visits BY url;

frequent_visits = FILTER page_visits BY COUNT(visits) >= 2;

answer = FOREACH frequent_visits  GENERATE url, FLATTEN(pages.pagerank);
---
{code}

(The important part is the final GENERATE statement, which references   the 
field "url", which was the grouping field in the earlier COGROUP.)  To get it  
to work I have to write it in a less intuitive way.
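
For reference, the current workaround is to refer to the grouping field by 
its built-in name, group (a sketch based on the script above):

{code}
-- current behavior: the COGROUP key is only addressable as 'group'
answer = FOREACH frequent_visits GENERATE group AS url, FLATTEN(pages.pagerank);
{code}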

Maybe with the new parser changes in Pig 0.9 it would be easier to specify that.
Viraj




[jira] Created: (PIG-1633) Using an alias within Nested Foreach causes indeterminate behaviour

2010-09-20 Thread Viraj Bhat (JIRA)
Using an alias within Nested Foreach causes indeterminate behaviour


 Key: PIG-1633
 URL: https://issues.apache.org/jira/browse/PIG-1633
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.7.0, 0.6.0, 0.5.0, 0.4.0
Reporter: Viraj Bhat


I have created a RANDOMINT function which generates random numbers between 0 
and a specified value. For example, RANDOMINT(4) gives random numbers between 
0 and 3 (inclusive).

{code}
$hadoop fs -cat rand.dat
f
g
h
i
j
k
l
m
{code}

The pig script is as follows:
{code}
register math.jar;
A = load 'rand.dat' using PigStorage() as (data);

B = foreach A {
r = math.RANDOMINT(4);
generate
data,
r as random,
((r == 3)?1:0) as quarter;
};

dump B;
{code}

The results are as follows:
{code}
{color:red} 
(f,0,0)
(g,3,0)
(h,0,0)
(i,2,0)
(j,3,0)
(k,2,0)
(l,0,1)
(m,1,0)
{color} 
{code}

Note that (j,3,0) appears because the expression bound to r is evaluated 
separately in the nested block and in the generate clause, and the two 
evaluations produce different values.

Modifying the script as below solves the issue. The M/R jobs from both 
scripts are the same; it is just a matter of convenience. 
{code}
A = load 'rand.dat' using PigStorage() as (data);

B = foreach A generate
data,
math.RANDOMINT(4) as r;

C = foreach B generate
data,
r,
((r == 3)?1:0) as quarter;

dump C;
{code}

Is this issue related to PIG-747?
Viraj




[jira] Updated: (PIG-1632) The core jar in the tarball contains the kitchen sink

2010-09-20 Thread Eli Collins (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eli Collins updated PIG-1632:
-

Status: Patch Available  (was: Open)

> The core jar in the tarball contains the kitchen sink 
> --
>
> Key: PIG-1632
> URL: https://issues.apache.org/jira/browse/PIG-1632
> Project: Pig
>  Issue Type: Bug
>  Components: build
>Affects Versions: 0.8.0, 0.9.0
>Reporter: Eli Collins
> Fix For: site, 0.9.0
>
> Attachments: pig-1632-1.patch
>
>
> The core jar in the tarball contains the kitchen sink; it's not the same 
> core jar built by 'ant jar'. This is problematic: other projects that want 
> to depend on the Pig core jar just want Pig core, but 
> pig-0.8.0-SNAPSHOT-core.jar in the tarball contains a bunch of other stuff 
> (hadoop, com.google, commons, etc.) that may conflict with packages already 
> on a user's classpath.
> {noformat}
> pig1 (trunk)$ jar tvf build/pig-0.8.0-SNAPSHOT-core.jar |grep -v pig|wc -l
> 12
> pig1 (trunk)$ tar xvzf build/pig-0.8.0-SNAPSHOT.tar.gz
> ...
> pig1 (trunk)$ jar tvf pig-0.8.0-SNAPSHOT/pig-0.8.0-SNAPSHOT-core.jar |grep -v 
> pig|wc -l
> 4819
> {noformat}
> How about restricting the core jar to just Pig classes?




[jira] Updated: (PIG-1632) The core jar in the tarball contains the kitchen sink

2010-09-20 Thread Eli Collins (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eli Collins updated PIG-1632:
-

Attachment: pig-1632-1.patch

Attached patch updates the package target so that the tarball, and therefore 
the Pig release, contains just the Pig core jar. If a Pig release needs to 
bundle Hadoop and other dependencies, perhaps we could put those jars in lib 
instead of the core jar.

Running things like the tests out of a tarball that includes just the core 
jar works, since these dependencies come in via ivy. Is there anything else 
that needs to be tested? 

> The core jar in the tarball contains the kitchen sink 
> --
>
> Key: PIG-1632
> URL: https://issues.apache.org/jira/browse/PIG-1632
> Project: Pig
>  Issue Type: Bug
>  Components: build
>Affects Versions: 0.8.0, 0.9.0
>Reporter: Eli Collins
> Fix For: site, 0.9.0
>
> Attachments: pig-1632-1.patch
>
>
> The core jar in the tarball contains the kitchen sink; it's not the same 
> core jar built by 'ant jar'. This is problematic: other projects that want 
> to depend on the Pig core jar just want Pig core, but 
> pig-0.8.0-SNAPSHOT-core.jar in the tarball contains a bunch of other stuff 
> (hadoop, com.google, commons, etc.) that may conflict with packages already 
> on a user's classpath.
> {noformat}
> pig1 (trunk)$ jar tvf build/pig-0.8.0-SNAPSHOT-core.jar |grep -v pig|wc -l
> 12
> pig1 (trunk)$ tar xvzf build/pig-0.8.0-SNAPSHOT.tar.gz
> ...
> pig1 (trunk)$ jar tvf pig-0.8.0-SNAPSHOT/pig-0.8.0-SNAPSHOT-core.jar |grep -v 
> pig|wc -l
> 4819
> {noformat}
> How about restricting the core jar to just Pig classes?




[jira] Created: (PIG-1632) The core jar in the tarball contains the kitchen sink

2010-09-20 Thread Eli Collins (JIRA)
The core jar in the tarball contains the kitchen sink 
--

 Key: PIG-1632
 URL: https://issues.apache.org/jira/browse/PIG-1632
 Project: Pig
  Issue Type: Bug
  Components: build
Affects Versions: 0.8.0, 0.9.0
Reporter: Eli Collins
 Fix For: site, 0.9.0


The core jar in the tarball contains the kitchen sink; it's not the same core 
jar built by 'ant jar'. This is problematic: other projects that want to 
depend on the Pig core jar just want Pig core, but pig-0.8.0-SNAPSHOT-core.jar 
in the tarball contains a bunch of other stuff (hadoop, com.google, commons, 
etc.) that may conflict with packages already on a user's classpath.

{noformat}
pig1 (trunk)$ jar tvf build/pig-0.8.0-SNAPSHOT-core.jar |grep -v pig|wc -l
12
pig1 (trunk)$ tar xvzf build/pig-0.8.0-SNAPSHOT.tar.gz
...
pig1 (trunk)$ jar tvf pig-0.8.0-SNAPSHOT/pig-0.8.0-SNAPSHOT-core.jar |grep -v 
pig|wc -l
4819
{noformat}

How about restricting the core jar to just Pig classes?




[jira] Assigned: (PIG-1579) Intermittent unit test failure for TestScriptUDF.testPythonScriptUDFNullInputOutput

2010-09-20 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich reassigned PIG-1579:
---

Assignee: Daniel Dai

> Intermittent unit test failure for 
> TestScriptUDF.testPythonScriptUDFNullInputOutput
> ---
>
> Key: PIG-1579
> URL: https://issues.apache.org/jira/browse/PIG-1579
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.8.0
>Reporter: Daniel Dai
>Assignee: Daniel Dai
> Fix For: 0.8.0
>
> Attachments: PIG-1579-1.patch
>
>
> Error message:
> org.apache.pig.backend.executionengine.ExecException: ERROR 0: Error 
> executing function: Traceback (most recent call last):
>   File "", line 5, in multStr
> TypeError: can't multiply sequence by non-int of type 'NoneType'
> at 
> org.apache.pig.scripting.jython.JythonFunction.exec(JythonFunction.java:107)
> at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:229)
> at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:295)
> at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:346)
> at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:291)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:236)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:231)
> at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:53)
> at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
> at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
> at org.apache.hadoop.mapred.Child.main(Child.java:170)




[jira] Created: (PIG-1631) Support to 2 level nested foreach

2010-09-20 Thread Viraj Bhat (JIRA)
Support to 2 level nested foreach
-

 Key: PIG-1631
 URL: https://issues.apache.org/jira/browse/PIG-1631
 Project: Pig
  Issue Type: New Feature
Affects Versions: 0.7.0
Reporter: Viraj Bhat


What I would like to do is generate certain metrics for every listing 
impression in the context of a page, such as clicks on the page. So I first 
group to get clicks and impressions together. Now I want to iterate through 
the mini-table (one per serve-id) and compute metrics. Since a nested foreach 
within a foreach is not supported, I ended up writing a UDF that took both 
bags and computed the metric. It would have been more elegant to keep the 
logic of iterating over the records in the Pig script. 

Here is some pseudocode of how I would have liked to write it:

{code}
-- Let us say in our page context there was click on rank 2 for which there 
were 3 ads 
A1 = LOAD '...' AS (page_id, rank); -- clicks. 
A2 = Load '...' AS (page_id, rank); -- impressions

B = COGROUP A1 by (page_id), A2 by (page_id); 

-- Let us say B contains the following schema 
-- (group, {(A1...)} {(A2...)})  
-- Each record in B would be:
-- page_id_1, {(page_id_1, 2)} {(page_id_1, 1) (page_id_1, 2) (page_id_1, 3)}

C = FOREACH B GENERATE {
        -- This won't work in current Pig either. Basically, I would like a
        -- mini-table which represents an entire serve.
        D = FLATTEN(A1), FLATTEN(A2);
        FOREACH D GENERATE
                page_id_1,
                A2::rank,
                -- This UDF returns a value (like v1, v2, v3 depending on
                -- A1::rank and A2::rank)
                SOMEUDF(A1::rank, A2::rank);
};
-- output:
-- page_id, 1, v1
-- page_id, 2, v2
-- page_id, 3, v3

DUMP C;
{code}

P.S.: I understand that I could alternatively have flattened the fields of B, 
done a GROUP on page_id, and then iterated through the records calling 
'SOMEUDF' appropriately, but that would be 2 map-reduce operations AFAIK. 
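
The two-job alternative mentioned in the P.S. might look roughly like this (a 
sketch; the aliases are illustrative):

{code}
-- flatten the cogrouped bags, then regroup by page_id: two M/R jobs
B2 = FOREACH B GENERATE FLATTEN(A1), FLATTEN(A2);
C2 = GROUP B2 BY page_id;
-- a UDF would then iterate over each group to compute the metric
{code}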




[jira] Created: (PIG-1630) Support param_files to be loaded into HDFS

2010-09-20 Thread Viraj Bhat (JIRA)
Support param_files to be loaded into HDFS
--

 Key: PIG-1630
 URL: https://issues.apache.org/jira/browse/PIG-1630
 Project: Pig
  Issue Type: New Feature
Affects Versions: 0.7.0
Reporter: Viraj Bhat


I want to place the parameters of a Pig script in a param_file. 

But instead of this file being on the local file system where I run my java 
command, I want it to be on HDFS.

{code}
$ java -cp pig.jar org.apache.pig.Main -param_file hdfs://namenode/paramfile 
myscript.pig
{code}

Viraj




[jira] Updated: (PIG-1461) support union operation that merges based on column names

2010-09-20 Thread Thejas M Nair (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thejas M Nair updated PIG-1461:
---

Release Note: 
Documentation for UNION ONSCHEMA:

Use the keyword ONSCHEMA with union so that the union is based on the column 
names of the input relations, not on column position.
If the following requirements are not met, the statement will throw an error:

* All inputs to the union should have a non-null schema.
* The data types of columns with the same name in different input schemas 
should be compatible. Numeric types are compatible, and if columns having the 
same name in different input schemas have different numeric types, an 
implicit conversion will happen. The bytearray type is considered compatible 
with all other types; a cast will be added to convert to the other type. Bags 
or tuples having different inner schemas are considered incompatible.

Example -

grunt> L1 = load 'f1' as (a : int, b : float);
grunt> dump L1;
(11,12.0)
(21,22.0)

grunt> L2 = load 'f2' as (a : long, c : chararray);
grunt> dump L2;
(11,a)
(12,b)
(13,c)

grunt> U = union onschema L1, L2;
grunt> describe U ;
U : {a : long, b : float, c : chararray}

grunt> dump U;
(11,12.0,)
(21,22.0,)
(11,,a)
(12,,b)
(13,,c)

Note:
- Aliases such as 'nm::c1' and 'c1' in two separate relations specified in 
'union onschema' are considered mergeable, and in the schema of the union the 
merged column alias will be 'c1'.
- Aliases such as 'nm1::c1' and 'nm2::c1' in two separate relations specified 
in 'union onschema' will not be merged together; in the schema of the union 
there will be two columns with these names.

Example -

> describe f;
f: {l1::a: int, l1::b: int, l1::c: int}
> describe l1;
l1: {a: int, b: int}

> u = union onschema f,l1;
> describe u;
u: {a: int, b: int, l1::c: int} 

Like the default union, 'union onschema' also supports 2 or more inputs.



  was:
Documentation for UNION ONSCHEMA:

Use the keyword ONSCHEMA with union so that the union is based on column names 
of the input relations, and not column position.
If the following requirements are not met, the statement will throw an error :

* All inputs to the union should have a non null schema.
* The data type for columns with same name in different input schemas 
should be compatible. Numeric types are compatible, and if column having same 
name in different input schemas have different numeric types , an implicit 
conversion will happen. bytearray type is considered compatible with all other 
types, a cast will be added to convert to other type. Bags or tuples having 
different inner schema are considered incompatible.

Example -

grunt> L1 = load 'f1' using (a : int, b : float);
grunt> dump L1;
(11,12.0)
(21,22.0)

grunt> L2 = load 'f1' using (a : long, c : chararray);
grunt> dump L2;
(11,a)
(12,b)
(13,c)

grunt> U = union onschema L1, L2;
grunt> describe U ;
U : {a : long, b : float, c : chararray}

grunt> dump U;
(11,12.0,)
(21,22.0,)
(11,,a)
(12,,b)
(13,,c)

Like the default union, 'union onschema' also supports 2 or more inputs.




Adding release note section of PIG-1610 to this release note.


> support union operation that merges based on column names
> -
>
> Key: PIG-1461
> URL: https://issues.apache.org/jira/browse/PIG-1461
> Project: Pig
>  Issue Type: New Feature
>  Components: impl
>Affects Versions: 0.8.0
>Reporter: Thejas M Nair
>Assignee: Thejas M Nair
> Fix For: 0.8.0
>
> Attachments: PIG-1461.1.patch, PIG-1461.2.patch, PIG-1461.patch
>
>
> When the data has a schema, it often makes sense to union on the column 
> names in the schema rather than on the position of the columns. 
> The behavior of the existing union operator should remain backward compatible.
> This feature can be supported using either a new operator or by extending 
> union to support a 'using' clause. I am thinking of having a new operator 
> called either unionschema or merge. Does anybody have any other suggestions 
> for the syntax?
> example -
> L1 = load 'x' as (a,b);
> L2 = load 'y' as (b,c);
> U = unionschema L1, L2;
> describe U;
> U: {a:bytearray, b:bytearray, c:bytearray}




[jira] Updated: (PIG-1616) 'union onschema' does not create output with the correct schema when udfs are involved

2010-09-20 Thread Thejas M Nair (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thejas M Nair updated PIG-1616:
---

  Status: Resolved  (was: Patch Available)
Hadoop Flags: [Reviewed]
  Resolution: Fixed

Patch committed to trunk and 0.8 branch.

> 'union onschema' does not create output with the correct schema when udfs 
> are involved
> --
>
> Key: PIG-1616
> URL: https://issues.apache.org/jira/browse/PIG-1616
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.8.0
>Reporter: Thejas M Nair
>Assignee: Thejas M Nair
> Fix For: 0.8.0
>
> Attachments: PIG-1616.1.patch
>
>
> 'union onschema' creates a merged schema based on the input schemas. It does 
> that in the query parser, and at that stage the udf return type used is the 
> default return type. The actual return type for the udf is determined later 
> in the TypeCheckingVisitor using EvalFunc.getArgsToFuncMapping().
> 'union onschema' should use the final type for its input relation to create 
> the merged schema.
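
A Pig Latin sketch of the failure mode described above (MyUdf and the aliases 
are hypothetical; assume MyUdf's getArgsToFuncMapping() resolves to a 
non-default return type):

{code}
A = LOAD 'a' AS (x:chararray);
B = FOREACH A GENERATE MyUdf(x) AS y;
C = LOAD 'c' AS (y:chararray);
-- the merged schema is built in the parser using MyUdf's default return
-- type, before the type checker resolves the actual one
U = UNION ONSCHEMA B, C;
{code}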




[jira] Commented: (PIG-1616) 'union onschema' does not use create output with correct schema when udfs are involved

2010-09-20 Thread Richard Ding (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12912696#action_12912696
 ] 

Richard Ding commented on PIG-1616:
---

+1

> 'union onschema' does not create output with the correct schema when udfs 
> are involved
> --
>
> Key: PIG-1616
> URL: https://issues.apache.org/jira/browse/PIG-1616
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.8.0
>Reporter: Thejas M Nair
>Assignee: Thejas M Nair
> Fix For: 0.8.0
>
> Attachments: PIG-1616.1.patch
>
>
> 'union onschema' creates a merged schema based on the input schemas. It does 
> that in the query parser, and at that stage the udf return type used is the 
> default return type. The actual return type for the udf is determined later 
> in the TypeCheckingVisitor using EvalFunc.getArgsToFuncMapping().
> 'union onschema' should use the final type for its input relation to create 
> the merged schema.




[jira] Updated: (PIG-1617) 'group all' should always use one reducer

2010-09-20 Thread Thejas M Nair (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thejas M Nair updated PIG-1617:
---

  Status: Resolved  (was: Patch Available)
Hadoop Flags: [Reviewed]
  Resolution: Fixed

Patch committed to trunk and 0.8 branch.


> 'group all' should always use one reducer
> -
>
> Key: PIG-1617
> URL: https://issues.apache.org/jira/browse/PIG-1617
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.8.0
>Reporter: Thejas M Nair
>Assignee: Thejas M Nair
> Fix For: 0.8.0
>
> Attachments: PIG-1617.1.patch
>
>
> 'group all' sends all rows to a single reducer, so it does not make sense to 
> spawn more than one reducer for it. But if a higher value of parallelism is 
> specified, or if the input is large enough that the changes in PIG-1249 
> result in a larger value being set, additional reducers are spawned that 
> don't do anything useful.
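
A minimal illustration in Pig Latin:

{code}
-- every row maps to the single key 'all', so only one reducer receives data;
-- the other nine reducers requested here would sit idle
B = GROUP A ALL PARALLEL 10;
{code}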




[jira] Commented: (PIG-1617) 'group all' should always use one reducer

2010-09-20 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12912655#action_12912655
 ] 

Olga Natkovich commented on PIG-1617:
-

Looks good. +1

> 'group all' should always use one reducer
> -
>
> Key: PIG-1617
> URL: https://issues.apache.org/jira/browse/PIG-1617
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.8.0
>Reporter: Thejas M Nair
>Assignee: Thejas M Nair
> Fix For: 0.8.0
>
> Attachments: PIG-1617.1.patch
>
>
> 'group all' sends all rows to a single reducer, so it does not make sense to 
> spawn more than one reducer for it. But if a higher value of parallelism is 
> specified, or if the input is large enough that the changes in PIG-1249 
> result in a larger value being set, additional reducers are spawned that 
> don't do anything useful.




[jira] Updated: (PIG-348) -j command line option doesn't work

2010-09-20 Thread Corinne Chandel (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Corinne Chandel updated PIG-348:


Status: Resolved  (was: Patch Available)
Resolution: Fixed

Nothing to update in docs. Closing.

> -j command line option doesn't work
> ---
>
> Key: PIG-348
> URL: https://issues.apache.org/jira/browse/PIG-348
> Project: Pig
>  Issue Type: Improvement
>  Components: documentation
>Reporter: Amir Youssefi
>Assignee: Corinne Chandel
> Fix For: 0.8.0
>
> Attachments: PIG-348.path, PIG-348_1.patch
>
>
> According to:
> $ pig --help 
> ...
> -j, -jar jarfile load jarfile
> ...
> yet 
> $ pig -j my.jar
> doesn't work in place of:
> register my.jar 
> in Pig script. 




[jira] Commented: (PIG-1112) FLATTEN eliminates the alias

2010-09-20 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12912624#action_12912624
 ] 

Alan Gates commented on PIG-1112:
-

In the example above, the user specified that he expects two fields to come 
out of the flatten of ladder. This seems equivalent to saying A = load 
'ladder' as (third, second). So I propose that when users give field names 
(and possibly types) in an AS that is attached to a flatten, Pig takes that 
to be the schema of the flattened data.
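
A sketch of the proposed semantics (not current behavior):

{code}
-- the names (and optional types) in the AS attached to FLATTEN become the
-- schema of the flattened data
B = FOREACH A GENERATE FLATTEN(ladder) AS (third, second);
-- equivalent in effect to: A = LOAD 'ladder' AS (third, second);
{code}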

> FLATTEN eliminates the alias
> 
>
> Key: PIG-1112
> URL: https://issues.apache.org/jira/browse/PIG-1112
> Project: Pig
>  Issue Type: Bug
>Reporter: Ankur
>Assignee: Alan Gates
> Fix For: 0.9.0
>
>
> If the schema for a field of type 'bag' is only partially defined, then 
> FLATTEN() incorrectly eliminates the field and throws an error. 
> Consider the following example:-
> A = LOAD 'sample' using PigStorage() as (first:chararray, second:chararray, 
> ladder:bag{});  
> B = FOREACH A GENERATE first,FLATTEN(ladder) as third,second; 
>   
> C = GROUP B by (first,third);
> This throws the error
>  ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1000: Error during parsing. 
> Invalid alias: third in {first: chararray,second: chararray}




[jira] Updated: (PIG-1616) 'union onschema' does not create output with the correct schema when udfs are involved

2010-09-20 Thread Thejas M Nair (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thejas M Nair updated PIG-1616:
---

Status: Patch Available  (was: Open)

> 'union onschema' does not create output with the correct schema when udfs 
> are involved
> --
>
> Key: PIG-1616
> URL: https://issues.apache.org/jira/browse/PIG-1616
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.8.0
>Reporter: Thejas M Nair
>Assignee: Thejas M Nair
> Fix For: 0.8.0
>
> Attachments: PIG-1616.1.patch
>
>
> 'union onschema' creates a merged schema based on the input schemas. It does 
> that in the query parser, and at that stage the udf return type used is the 
> default return type. The actual return type for the udf is determined later 
> in the TypeCheckingVisitor using EvalFunc.getArgsToFuncMapping().
> 'union onschema' should use the final type for its input relation to create 
> the merged schema.




[jira] Updated: (PIG-1616) 'union onschema' does not create output with the correct schema when udfs are involved

2010-09-20 Thread Thejas M Nair (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thejas M Nair updated PIG-1616:
---

Attachment: PIG-1616.1.patch

PIG-1616.1.patch - calls LogicalPlanValidationExecutor.validate() to set the 
actual types, before the merged schema for 'union onschema' is created.
Passes unit tests and test-patch. Ready for review.


> 'union onschema' does not create output with the correct schema when udfs 
> are involved
> --
>
> Key: PIG-1616
> URL: https://issues.apache.org/jira/browse/PIG-1616
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.8.0
>Reporter: Thejas M Nair
>Assignee: Thejas M Nair
> Fix For: 0.8.0
>
> Attachments: PIG-1616.1.patch
>
>
> 'union onschema' creates a merged schema based on the input schemas. It does 
> that in the query parser, and at that stage the udf return type used is the 
> default return type. The actual return type for the udf is determined later 
> in the TypeCheckingVisitor using EvalFunc.getArgsToFuncMapping().
> 'union onschema' should use the final type for its input relation to create 
> the merged schema.




[jira] Updated: (PIG-1617) 'group all' should always use one reducer

2010-09-20 Thread Thejas M Nair (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thejas M Nair updated PIG-1617:
---

Status: Patch Available  (was: Open)

> 'group all' should always use one reducer
> -
>
> Key: PIG-1617
> URL: https://issues.apache.org/jira/browse/PIG-1617
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.8.0
>Reporter: Thejas M Nair
>Assignee: Thejas M Nair
> Fix For: 0.8.0
>
> Attachments: PIG-1617.1.patch
>
>
> 'group all' sends all rows to a single reducer, so it does not make sense to 
> spawn more than one reducer for it. But if a higher value of parallelism is 
> specified, or if the input is large enough that the changes in PIG-1249 
> result in a larger value being set, additional reducers are spawned that 
> don't do anything useful.




[jira] Updated: (PIG-1617) 'group all' should always use one reducer

2010-09-20 Thread Thejas M Nair (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thejas M Nair updated PIG-1617:
---

Attachment: PIG-1617.1.patch

PIG-1617.1.patch - sets the parallelism of LOCogroup to 1 for a group on a 
constant (including 'group all').
Passes unit tests and test-patch. Ready for review.


> 'group all' should always use one reducer
> -
>
> Key: PIG-1617
> URL: https://issues.apache.org/jira/browse/PIG-1617
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.8.0
>Reporter: Thejas M Nair
>Assignee: Thejas M Nair
> Fix For: 0.8.0
>
> Attachments: PIG-1617.1.patch
>
>
> 'group all' sends all rows to a single reducer, so it does not make sense to 
> spawn more than one reducer for it. But if a higher value of parallelism is 
> specified, or if the input is large enough that the changes in PIG-1249 
> result in a larger value being set, additional reducers are spawned that 
> don't do anything useful.
