[jira] Assigned: (PIG-847) Setting twoLevelAccessRequired field in a bag schema should not be required to access fields in the tuples of the bag
[ https://issues.apache.org/jira/browse/PIG-847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Gates reassigned PIG-847: -- Assignee: Alan Gates (was: Richard Ding) > Setting twoLevelAccessRequired field in a bag schema should not be required > to access fields in the tuples of the bag > - > > Key: PIG-847 > URL: https://issues.apache.org/jira/browse/PIG-847 > Project: Pig > Issue Type: Improvement >Affects Versions: 0.3.0 >Reporter: Pradeep Kamath >Assignee: Alan Gates > > Currently Pig interprets the result type of a relation as a bag, and the > schema of the relation directly contains the schema describing the fields in > the tuples of the relation. However, when a udf wants to return a bag, or if > there is a bag in input data, or if the user creates a bag constant, the > schema of the bag has one field schema which is that of the tuple. The > tuple's schema has the types of the fields. To be able to access the fields > from the bag directly in such a case, the schema of the bag must > have twoLevelAccessRequired set to true so that pig's type system can > traverse the tuple schema and get to the field in question. This is confusing > - we should try and see if we can avoid needing this extra flag. A possible > solution is to treat bags the same way - whether they represent relations or > real bags. Another way is to introduce a special "relation" datatype for the > result type of a relation; the bag type would then be used only for true bags. In > this case, we would always need the bag schema to have a tuple schema which would > describe the fields. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
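The extra level the flag guards can be seen in a small model. The following is an illustrative Python sketch of the two schema shapes described above; the dict layout and function are hypothetical, not Pig's actual Schema classes:

```python
# Illustrative model (not Pig's actual Schema classes) of why the
# twoLevelAccessRequired flag is needed today.

def field_type(bag_schema, field_name):
    """Look up a field's type inside a bag schema.

    A "true" bag schema has a single tuple field whose own schema holds
    the columns, so an extra level must be traversed when the
    two_level_access flag is set.
    """
    fields = bag_schema["fields"]
    if bag_schema.get("two_level_access"):
        # Descend through the lone tuple field to reach the columns.
        fields = fields[0]["schema"]["fields"]
    for f in fields:
        if f["name"] == field_name:
            return f["type"]
    raise KeyError(field_name)

# Schema of a relation: columns are stored directly.
relation_schema = {"fields": [{"name": "x", "type": "int"}]}

# Schema of a true bag (e.g. a UDF return value): columns are nested
# one level down, inside the tuple's schema.
true_bag_schema = {
    "two_level_access": True,
    "fields": [{"name": "t", "type": "tuple",
                "schema": {"fields": [{"name": "x", "type": "int"}]}}],
}

print(field_type(relation_schema, "x"))   # int
print(field_type(true_bag_schema, "x"))   # int
```

Removing the flag would mean picking one of these shapes for all bags, which is exactly the choice the issue discusses.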
[jira] Assigned: (PIG-1371) Pig should handle deep casting of complex types
[ https://issues.apache.org/jira/browse/PIG-1371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Gates reassigned PIG-1371: --- Assignee: Alan Gates (was: Richard Ding) > Pig should handle deep casting of complex types > > > Key: PIG-1371 > URL: https://issues.apache.org/jira/browse/PIG-1371 > Project: Pig > Issue Type: Bug >Reporter: Pradeep Kamath >Assignee: Alan Gates > Attachments: PIG-1371-partial.patch > > > Consider input data in BinStorage format which has a field of bag type - > bg:{t:(i:int)}. In the load statement if the schema specified has the type > for this field specified as bg:{t:(c:chararray)}, the current behavior is > that Pig thinks of the field to be of the type specified in the load statement > (bg:{t:(c:chararray)}) but no deep cast from bag of int (the real data) to > bag of chararray (the user specified schema) is made. > There are two issues currently: > 1) The TypeCastInserter only considers the byte 'type' between the loader > presented schema and user specified schema to decide whether to introduce a > cast or not. In the above case since both schemas have the type "bag" no cast > is inserted. This check has to be extended to consider the full FieldSchema > (with inner subschema) in order to decide whether a cast is needed. > 2) POCast should be changed to handle casting a complex type to the type > specified in the user supplied FieldSchema. Here there is one issue to be > considered - if the user specified the cast type to be bg:{t:(i:int, j:int)} > and the real data had only one field, what should the result of the cast be: > * A bag with two fields - the int field and a null? - In this approach pig > is assuming the lone field in the data is the first field, which might be > incorrect if it in fact is the second field. 
> * A null bag to indicate that the bag is of unknown value - this is the one > I personally prefer > * The cast throws an IncompatibleCastException -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
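A deep cast of the kind discussed can be sketched in plain Python. This is an illustration only (not Pig's POCast code), and it takes the "null bag on mismatch" option the reporter prefers, which is just one of the three options listed:

```python
# Sketch of a deep cast from a bag of tuples to a target field-type list,
# e.g. bg:{t:(i:int)} data cast to bg:{t:(c:chararray)}.
# Simplified stand-ins for Pig's type conversions:
CASTS = {"int": int, "long": int, "float": float, "chararray": str}

def deep_cast_bag(bag, target_field_types):
    """Cast every tuple in a bag to the target field types.

    Returns None (a null bag) when a tuple's arity doesn't match the
    target schema, since guessing which fields are missing could be wrong.
    """
    out = []
    for tup in bag:
        if len(tup) != len(target_field_types):
            return None
        out.append(tuple(CASTS[t](v) for v, t in zip(tup, target_field_types)))
    return out

# Bag of int tuples cast to a bag of chararray tuples:
print(deep_cast_bag([(1,), (2,)], ["chararray"]))   # [('1',), ('2',)]
# One-field tuples cast to a two-field schema: null bag.
print(deep_cast_bag([(1,)], ["int", "int"]))        # None
```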
[jira] Created: (PIG-1634) Multiple names for the "group" field
Multiple names for the "group" field Key: PIG-1634 URL: https://issues.apache.org/jira/browse/PIG-1634 Project: Pig Issue Type: New Feature Affects Versions: 0.7.0, 0.6.0, 0.5.0, 0.4.0, 0.3.0, 0.2.0, 0.1.0 Reporter: Viraj Bhat I am hoping that in Pig if I type {quote} c = cogroup a by foo, b by bar, the fields c.group, c.foo and c.bar should all map to c.$0 {quote} This would improve the readability of the Pig script. Here's a real use case: {code} --- pages = LOAD 'pages.dat' AS (url, pagerank); visits = LOAD 'user_log.dat' AS (user_id, url); page_visits = COGROUP pages BY url, visits BY url; frequent_visits = FILTER page_visits BY COUNT(visits) >= 2; answer = FOREACH frequent_visits GENERATE url, FLATTEN(pages.pagerank); --- {code} (The important part is the final GENERATE statement, which references the field "url", which was the grouping field in the earlier COGROUP.) To get it to work I have to write it in a less intuitive way. Maybe with the new parser changes in Pig 0.9 it would be easier to specify that. Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
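What is being asked for amounts to an alias synonym table. A toy Python model (hypothetical, not Pig's implementation) of mapping group, foo and bar all to column 0 of c:

```python
# Hypothetical alias table for: c = cogroup a by foo, b by bar
# group/foo/bar are all synonyms for column 0; a and b name the two bags.
alias_to_column = {"group": 0, "foo": 0, "bar": 0, "a": 1, "b": 2}

def resolve(row, alias):
    # Look the alias up in the synonym table and fetch that column.
    return row[alias_to_column[alias]]

# One cogrouped row: (key, bag of a-tuples, bag of b-tuples)
row = ("key1", [("key1", 10)], [("key1", 20)])
assert resolve(row, "group") == resolve(row, "foo") == resolve(row, "bar") == "key1"
```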
[jira] Created: (PIG-1633) Using an alias within Nested Foreach causes indeterminate behaviour
Using an alias within Nested Foreach causes indeterminate behaviour Key: PIG-1633 URL: https://issues.apache.org/jira/browse/PIG-1633 Project: Pig Issue Type: Bug Affects Versions: 0.7.0, 0.6.0, 0.5.0, 0.4.0 Reporter: Viraj Bhat I have created a RANDOMINT function which generates random numbers between 0 and a specified value; for example, RANDOMINT(4) gives random numbers between 0 and 3 (inclusive) {code} $hadoop fs -cat rand.dat f g h i j k l m {code} The pig script is as follows: {code} register math.jar; A = load 'rand.dat' using PigStorage() as (data); B = foreach A { r = math.RANDOMINT(4); generate data, r as random, ((r == 3)?1:0) as quarter; }; dump B; {code} The results are as follows: {code} {color:red} (f,0,0) (g,3,0) (h,0,0) (i,2,0) (j,3,0) (k,2,0) (l,0,1) (m,1,0) {color} {code} Notice the row (j,3,0): r is evaluated separately in the generate clause and in the conditional, and the two evaluations produce different random values (random is 3, yet quarter is 0). Modifying the script as below avoids the issue. The M/R jobs from both scripts are the same. It is just a matter of convenience. {code} A = load 'rand.dat' using PigStorage() as (data); B = foreach A generate data, math.RANDOMINT(4) as r; C = foreach B generate data, r, ((r == 3)?1:0) as quarter; dump C; {code} Is this issue related to PIG-747? Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
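The misbehaviour can be reproduced in plain Python (an illustration, not Pig's evaluation code): if the nested alias r is re-evaluated per use, the generated value and the (r == 3 ? 1 : 0) test can disagree.

```python
import random

def buggy_row(data, rng):
    # Two separate evaluations of the "same" nested expression r,
    # as in the first script's foreach block:
    r_for_output = rng.randrange(4)
    r_for_test = rng.randrange(4)
    return (data, r_for_output, 1 if r_for_test == 3 else 0)

def fixed_row(data, rng):
    # Evaluate once and reuse the value (the second script in the report):
    r = rng.randrange(4)
    return (data, r, 1 if r == 3 else 0)

rng = random.Random(0)
rows = [fixed_row(d, rng) for d in "fghijklm"]
# With a single evaluation, quarter always agrees with random:
assert all(quarter == (1 if r == 3 else 0) for _, r, quarter in rows)
```

In the buggy variant nothing ties the two draws together, which is exactly how a row like (j,3,0) can appear.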
[jira] Updated: (PIG-1632) The core jar in the tarball contains the kitchen sink
[ https://issues.apache.org/jira/browse/PIG-1632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eli Collins updated PIG-1632: - Status: Patch Available (was: Open) > The core jar in the tarball contains the kitchen sink > -- > > Key: PIG-1632 > URL: https://issues.apache.org/jira/browse/PIG-1632 > Project: Pig > Issue Type: Bug > Components: build >Affects Versions: 0.8.0, 0.9.0 >Reporter: Eli Collins > Fix For: site, 0.9.0 > > Attachments: pig-1632-1.patch > > > The core jar in the tarball contains the kitchen sink, it's not the same core > jar built by ant jar. This is problematic since other projects that want to > depend on the pig core jar just want pig core, but > pig-0.8.0-SNAPSHOT-core.jar in the tarball contains a bunch of other stuff > (hadoop, com.google, commons, etc) that may conflict with the packages also > on a user's classpath. > {noformat} > pig1 (trunk)$ jar tvf build/pig-0.8.0-SNAPSHOT-core.jar |grep -v pig|wc -l > 12 > pig1 (trunk)$ tar xvzf build/pig-0.8.0-SNAPSHOT.tar.gz > ... > pig1 (trunk)$ jar tvf pig-0.8.0-SNAPSHOT/pig-0.8.0-SNAPSHOT-core.jar |grep -v > pig|wc -l > 4819 > {noformat} > How about restricting the core jar to just Pig classes? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1632) The core jar in the tarball contains the kitchen sink
[ https://issues.apache.org/jira/browse/PIG-1632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eli Collins updated PIG-1632: - Attachment: pig-1632-1.patch Attached patch updates the package target so that the tarball, and therefore Pig release, just contain the Pig core jar. If a Pig release needs to bundle Hadoop and a bunch of other stuff perhaps we could put those jars in lib instead of the core jar. Running things like the tests out of a tarball that just includes the core jar works as these come in via ivy, anything else that needs to be tested? > The core jar in the tarball contains the kitchen sink > -- > > Key: PIG-1632 > URL: https://issues.apache.org/jira/browse/PIG-1632 > Project: Pig > Issue Type: Bug > Components: build >Affects Versions: 0.8.0, 0.9.0 >Reporter: Eli Collins > Fix For: site, 0.9.0 > > Attachments: pig-1632-1.patch > > > The core jar in the tarball contains the kitchen sink, it's not the same core > jar built by ant jar. This is problematic since other projects that want to > depend on the pig core jar just want pig core, but > pig-0.8.0-SNAPSHOT-core.jar in the tarball contains a bunch of other stuff > (hadoop, com.google, commons, etc) that may conflict with the packages also > on a user's classpath. > {noformat} > pig1 (trunk)$ jar tvf build/pig-0.8.0-SNAPSHOT-core.jar |grep -v pig|wc -l > 12 > pig1 (trunk)$ tar xvzf build/pig-0.8.0-SNAPSHOT.tar.gz > ... > pig1 (trunk)$ jar tvf pig-0.8.0-SNAPSHOT/pig-0.8.0-SNAPSHOT-core.jar |grep -v > pig|wc -l > 4819 > {noformat} > How about restricting the core jar to just Pig classes? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-1632) The core jar in the tarball contains the kitchen sink
The core jar in the tarball contains the kitchen sink -- Key: PIG-1632 URL: https://issues.apache.org/jira/browse/PIG-1632 Project: Pig Issue Type: Bug Components: build Affects Versions: 0.8.0, 0.9.0 Reporter: Eli Collins Fix For: site, 0.9.0 The core jar in the tarball contains the kitchen sink, it's not the same core jar built by ant jar. This is problematic since other projects that want to depend on the pig core jar just want pig core, but pig-0.8.0-SNAPSHOT-core.jar in the tarball contains a bunch of other stuff (hadoop, com.google, commons, etc) that may conflict with the packages also on a user's classpath. {noformat} pig1 (trunk)$ jar tvf build/pig-0.8.0-SNAPSHOT-core.jar |grep -v pig|wc -l 12 pig1 (trunk)$ tar xvzf build/pig-0.8.0-SNAPSHOT.tar.gz ... pig1 (trunk)$ jar tvf pig-0.8.0-SNAPSHOT/pig-0.8.0-SNAPSHOT-core.jar |grep -v pig|wc -l 4819 {noformat} How about restricting the core jar to just Pig classes? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Assigned: (PIG-1579) Intermittent unit test failure for TestScriptUDF.testPythonScriptUDFNullInputOutput
[ https://issues.apache.org/jira/browse/PIG-1579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich reassigned PIG-1579: --- Assignee: Daniel Dai > Intermittent unit test failure for > TestScriptUDF.testPythonScriptUDFNullInputOutput > --- > > Key: PIG-1579 > URL: https://issues.apache.org/jira/browse/PIG-1579 > Project: Pig > Issue Type: Bug >Affects Versions: 0.8.0 >Reporter: Daniel Dai >Assignee: Daniel Dai > Fix For: 0.8.0 > > Attachments: PIG-1579-1.patch > > > Error message: > org.apache.pig.backend.executionengine.ExecException: ERROR 0: Error > executing function: Traceback (most recent call last): > File "", line 5, in multStr > TypeError: can't multiply sequence by non-int of type 'NoneType' > at > org.apache.pig.scripting.jython.JythonFunction.exec(JythonFunction.java:107) > at > org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:229) > at > org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:295) > at > org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:346) > at > org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:291) > at > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:236) > at > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:231) > at > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:53) > at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144) > at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621) > at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305) > at org.apache.hadoop.mapred.Child.main(Child.java:170) -- This message is automatically generated by JIRA. 
- You can reply to this email to add a comment to the issue online.
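The traceback above comes from a Python UDF multiplying a sequence by a null (None) input. The body of multStr is not shown in the report, so the following is a guess at its shape, sketching the null-safe fix:

```python
# Hypothetical reconstruction of the failing UDF: multiplying a string
# by None raises "can't multiply sequence by non-int of type 'NoneType'".
def mult_str(s, num):
    # Pig passes nulls through as None; guard before using them.
    if s is None or num is None:
        return None
    return s * num

print(mult_str("ab", 3))     # ababab
print(mult_str("ab", None))  # None
```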
[jira] Created: (PIG-1631) Support for 2-level nested foreach
Support for 2-level nested foreach - Key: PIG-1631 URL: https://issues.apache.org/jira/browse/PIG-1631 Project: Pig Issue Type: New Feature Affects Versions: 0.7.0 Reporter: Viraj Bhat What I would like to do is generate certain metrics for every listing impression in the context of a page, like clicks on the page etc. So, I first group by to get clicks and impressions together. Now, I would want to iterate through the mini-table (one per serve-id) and compute metrics. Since nested foreach within foreach is not supported I ended up writing a UDF that took both the bags and computed the metric. It would have been elegant to keep the logic of iterating over the records outside in the PIG script. Here is some pseudocode of how I would have liked to write it: {code} -- Let us say in our page context there was click on rank 2 for which there were 3 ads A1 = LOAD '...' AS (page_id, rank); -- clicks. A2 = Load '...' AS (page_id, rank); -- impressions B = COGROUP A1 by (page_id), A2 by (page_id); -- Let us say B contains the following schema -- (group, {(A1...)} {(A2...)}) -- Each record in B would be: -- page_id_1, {(page_id_1, 2)} {(page_id_1, 1) (page_id_1, 2) (page_id_1, 3)} C = FOREACH B GENERATE { D = FLATTEN(A1), FLATTEN(A2); -- This won't work in current pig as well. Basically, I would like a mini-table which represents an entire serve. FOREACH D GENERATE page_id_1, A2::rank, SOMEUDF(A1::rank, A2::rank); -- This UDF returns a value (like v1, v2, v3 depending on A1::rank and A2::rank) }; # output # page_id, 1, v1 # page_id, 2, v2 # page_id, 3, v3 DUMP C; {code} P.S: I understand that I could have alternatively flattened the fields of B and then done a GROUP on page_id and then iterated through the records calling 'SOMEUDF' appropriately, but that would be 2 map-reduce operations AFAIK. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
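What the requested nested foreach would compute can be sketched in plain Python. SOMEUDF's real logic is not given, so a hypothetical metric function stands in for it:

```python
from collections import defaultdict

clicks = [("p1", 2)]                              # A1: (page_id, rank)
impressions = [("p1", 1), ("p1", 2), ("p1", 3)]   # A2: (page_id, rank)

def metric(click_ranks, imp_rank):
    # Hypothetical stand-in for SOMEUDF: 1 if this impression was clicked.
    return 1 if imp_rank in click_ranks else 0

# The COGROUP: one (click-ranks, impression-ranks) mini-table per page.
by_page = defaultdict(lambda: ([], []))
for p, r in clicks:
    by_page[p][0].append(r)
for p, r in impressions:
    by_page[p][1].append(r)

out = []
for page, (click_ranks, imp_ranks) in by_page.items():
    # The inner "foreach over the mini-table" the reporter asks for:
    for r in imp_ranks:
        out.append((page, r, metric(click_ranks, r)))

print(out)  # [('p1', 1, 0), ('p1', 2, 1), ('p1', 3, 0)]
```

The single grouping pass followed by per-group iteration is why, in Pig, a nested foreach could do this in one map-reduce job.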
[jira] Created: (PIG-1630) Support param_files to be loaded into HDFS
Support param_files to be loaded into HDFS -- Key: PIG-1630 URL: https://issues.apache.org/jira/browse/PIG-1630 Project: Pig Issue Type: New Feature Affects Versions: 0.7.0 Reporter: Viraj Bhat I want to place the parameters of a Pig script in a param_file. But instead of this file being in the local file system where I run my java command, I want this to be on HDFS. {code} $ java -cp pig.jar org.apache.pig.Main -param_file hdfs://namenode/paramfile myscript.pig {code} Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1461) support union operation that merges based on column names
[ https://issues.apache.org/jira/browse/PIG-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thejas M Nair updated PIG-1461: --- Release Note: Documentation for UNION ONSCHEMA: Use the keyword ONSCHEMA with union so that the union is based on the column names of the input relations rather than column position. If the following requirements are not met, the statement will throw an error: * All inputs to the union should have a non-null schema. * The data type for columns with the same name in different input schemas should be compatible. Numeric types are compatible, and if columns having the same name in different input schemas have different numeric types, an implicit conversion will happen. The bytearray type is considered compatible with all other types; a cast will be added to convert to the other type. Bags or tuples having different inner schema are considered incompatible. Example - grunt> L1 = load 'f1' as (a : int, b : float); grunt> dump L1; (11,12.0) (21,22.0) grunt> L2 = load 'f1' as (a : long, c : chararray); grunt> dump L2; (11,a) (12,b) (13,c) grunt> U = union onschema L1, L2; grunt> describe U ; U : {a : long, b : float, c : chararray} grunt> dump U; (11,12.0,) (21,22.0,) (11,,a) (12,,b) (13,,c) Note: - Aliases such as 'nm::c1' and 'c1' in two separate relations specified in 'union onschema' are considered mergeable, and in the schema of the union the merged column alias will be 'c1'. - Aliases such as 'nm1::c1' and 'nm2::c1' in two separate relations specified in 'union onschema' will not be merged together; in the schema of the union there will be two columns with these names. Example - > describe f; f: {l1::a: int, l1::b: int, l1::c: int} > describe l1; l1: {a: int, b: int} > u = union onschema f,l1; > desc u; u: {a: int, b: int, l1::c: int} Like the default union, 'union onschema' also supports 2 or more inputs. 
was: Documentation for UNION ONSCHEMA: Use the keyword ONSCHEMA with union so that the union is based on column names of the input relations, and not column position. If the following requirements are not met, the statement will throw an error : * All inputs to the union should have a non null schema. * The data type for columns with same name in different input schemas should be compatible. Numeric types are compatible, and if column having same name in different input schemas have different numeric types , an implicit conversion will happen. bytearray type is considered compatible with all other types, a cast will be added to convert to other type. Bags or tuples having different inner schema are considered incompatible. Example - grunt> L1 = load 'f1' using (a : int, b : float); grunt> dump L1; (11,12.0) (21,22.0) grunt> L2 = load 'f1' using (a : long, c : chararray); grunt> dump L2; (11,a) (12,b) (13,c) grunt> U = union onschema L1, L2; grunt> describe U ; U : {a : long, b : float, c : chararray} grunt> dump U; (11,12.0,) (21,22.0,) (11,,a) (12,,b) (13,,c) Like the default union, 'union onschema' also supports 2 or more inputs. Adding release note section of PIG-1610 to this release note. > support union operation that merges based on column names > - > > Key: PIG-1461 > URL: https://issues.apache.org/jira/browse/PIG-1461 > Project: Pig > Issue Type: New Feature > Components: impl >Affects Versions: 0.8.0 >Reporter: Thejas M Nair >Assignee: Thejas M Nair > Fix For: 0.8.0 > > Attachments: PIG-1461.1.patch, PIG-1461.2.patch, PIG-1461.patch > > > When the data has schema, it often makes sense to union on column names in > schema rather than the position of the columns. > The behavior of existing union operator should remain backward compatible . > This feature can be supported using either a new operator or extending union > to support 'using' clause . I am thinking of having a new operator called > either unionschema or merge . 
Does anybody have any other suggestions for the > syntax? > example - > L1 = load 'x' as (a,b); > L2 = load 'y' as (b,c); > U = unionschema L1, L2; > describe U; > U: {a:bytearray, b:bytearray, c:bytearray} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
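The ONSCHEMA column-merging rules in the release note above can be sketched as follows. This is a simplified Python model, not Pig's actual implementation:

```python
# Relative widths for implicit numeric widening.
NUMERIC_WIDTH = {"int": 1, "long": 2, "float": 3, "double": 4}

def merge_type(t1, t2):
    if t1 == t2:
        return t1
    # bytearray is compatible with everything; a cast is inserted.
    if t1 == "bytearray":
        return t2
    if t2 == "bytearray":
        return t1
    # Numeric types are compatible and widen to the larger one.
    if t1 in NUMERIC_WIDTH and t2 in NUMERIC_WIDTH:
        return t1 if NUMERIC_WIDTH[t1] > NUMERIC_WIDTH[t2] else t2
    raise TypeError(f"incompatible types {t1} and {t2}")

def union_onschema(*schemas):
    merged = {}  # column name -> type, in first-seen order
    for schema in schemas:
        for name, typ in schema:
            merged[name] = merge_type(merged[name], typ) if name in merged else typ
    return list(merged.items())

# The example from the release note: {a:int, b:float} with {a:long, c:chararray}
L1 = [("a", "int"), ("b", "float")]
L2 = [("a", "long"), ("c", "chararray")]
print(union_onschema(L1, L2))  # [('a', 'long'), ('b', 'float'), ('c', 'chararray')]
```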
[jira] Updated: (PIG-1616) 'union onschema' does not create output with correct schema when udfs are involved
[ https://issues.apache.org/jira/browse/PIG-1616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thejas M Nair updated PIG-1616: --- Status: Resolved (was: Patch Available) Hadoop Flags: [Reviewed] Resolution: Fixed Patch committed to trunk and 0.8 branch. > 'union onschema' does not create output with correct schema when udfs are > involved > -- > > Key: PIG-1616 > URL: https://issues.apache.org/jira/browse/PIG-1616 > Project: Pig > Issue Type: Bug >Affects Versions: 0.8.0 >Reporter: Thejas M Nair >Assignee: Thejas M Nair > Fix For: 0.8.0 > > Attachments: PIG-1616.1.patch > > > 'union onschema' creates a merged schema based on the input schemas. It does > that in the queryparser, and at that stage the udf return type used is the > default return type. The actual return type for the udf is determined later > in the TypeCheckingVisitor using EvalFunc.getArgsToFuncMapping(). > 'union onschema' should use the final type for its input relation to create > the merged schema. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1616) 'union onschema' does not create output with correct schema when udfs are involved
[ https://issues.apache.org/jira/browse/PIG-1616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12912696#action_12912696 ] Richard Ding commented on PIG-1616: --- +1 > 'union onschema' does not create output with correct schema when udfs are > involved > -- > > Key: PIG-1616 > URL: https://issues.apache.org/jira/browse/PIG-1616 > Project: Pig > Issue Type: Bug >Affects Versions: 0.8.0 >Reporter: Thejas M Nair >Assignee: Thejas M Nair > Fix For: 0.8.0 > > Attachments: PIG-1616.1.patch > > > 'union onschema' creates a merged schema based on the input schemas. It does > that in the queryparser, and at that stage the udf return type used is the > default return type. The actual return type for the udf is determined later > in the TypeCheckingVisitor using EvalFunc.getArgsToFuncMapping(). > 'union onschema' should use the final type for its input relation to create > the merged schema. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1617) 'group all' should always use one reducer
[ https://issues.apache.org/jira/browse/PIG-1617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thejas M Nair updated PIG-1617: --- Status: Resolved (was: Patch Available) Hadoop Flags: [Reviewed] Resolution: Fixed Patch committed to trunk and 0.8 branch. > 'group all' should always use one reducer > - > > Key: PIG-1617 > URL: https://issues.apache.org/jira/browse/PIG-1617 > Project: Pig > Issue Type: Improvement >Affects Versions: 0.8.0 >Reporter: Thejas M Nair >Assignee: Thejas M Nair > Fix For: 0.8.0 > > Attachments: PIG-1617.1.patch > > > 'group all' sends all rows to a single reducer; it does not make sense to > spawn more than one reducer for it. But if a higher value of parallelism is > specified, or if the input is large enough that the changes in PIG-1249 result > in a larger value being set, additional reducers are spawned that don't > do anything useful. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1617) 'group all' should always use one reducer
[ https://issues.apache.org/jira/browse/PIG-1617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12912655#action_12912655 ] Olga Natkovich commented on PIG-1617: - Looks good. +1 > 'group all' should always use one reducer > - > > Key: PIG-1617 > URL: https://issues.apache.org/jira/browse/PIG-1617 > Project: Pig > Issue Type: Improvement >Affects Versions: 0.8.0 >Reporter: Thejas M Nair >Assignee: Thejas M Nair > Fix For: 0.8.0 > > Attachments: PIG-1617.1.patch > > > 'group all' sends all rows to a single reducer; it does not make sense to > spawn more than one reducer for it. But if a higher value of parallelism is > specified, or if the input is large enough that the changes in PIG-1249 result > in a larger value being set, additional reducers are spawned that don't > do anything useful. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-348) -j command line option doesn't work
[ https://issues.apache.org/jira/browse/PIG-348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Corinne Chandel updated PIG-348: Status: Resolved (was: Patch Available) Resolution: Fixed Nothing to update in docs. Closing. > -j command line option doesn't work > --- > > Key: PIG-348 > URL: https://issues.apache.org/jira/browse/PIG-348 > Project: Pig > Issue Type: Improvement > Components: documentation >Reporter: Amir Youssefi >Assignee: Corinne Chandel > Fix For: 0.8.0 > > Attachments: PIG-348.path, PIG-348_1.patch > > > According to: > $ pig --help > ... > -j, -jar jarfile load jarfile > ... > yet > $pig -j my.jar > doesn't work in place of: > register my.jar > in Pig script. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1112) FLATTEN eliminates the alias
[ https://issues.apache.org/jira/browse/PIG-1112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12912624#action_12912624 ] Alan Gates commented on PIG-1112: - In the example above, the user specified that he expects two fields to come out of the flatten of ladder. This seems equivalent to saying A = load 'ladder' as (third, second). So I propose that when users give field names (and possibly types) in an AS that is attached to a flatten Pig takes that to be the schema of the flattened data. > FLATTEN eliminates the alias > > > Key: PIG-1112 > URL: https://issues.apache.org/jira/browse/PIG-1112 > Project: Pig > Issue Type: Bug >Reporter: Ankur >Assignee: Alan Gates > Fix For: 0.9.0 > > > If schema for a field of type 'bag' is partially defined then FLATTEN() > incorrectly eliminates the field and throws an error. > Consider the following example:- > A = LOAD 'sample' using PigStorage() as (first:chararray, second:chararray, > ladder:bag{}); > B = FOREACH A GENERATE first,FLATTEN(ladder) as third,second; > > C = GROUP B by (first,third); > This throws the error > ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1000: Error during parsing. > Invalid alias: third in {first: chararray,second: chararray} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
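Alan's proposal can be sketched in plain Python (illustrative only; the function name and dict-based rows are not Pig internals): the field names given in an AS attached to a FLATTEN are taken as the schema of the flattened data.

```python
def flatten_with_as(rows, bag_field, as_names):
    """Flatten row[bag_field] (a bag of tuples) into named fields per the
    user's AS clause, like:
        B = FOREACH A GENERATE first, FLATTEN(ladder) AS third, second;
    """
    for row in rows:
        for tup in row[bag_field]:
            out = {k: v for k, v in row.items() if k != bag_field}
            # The AS names become the schema of the flattened fields:
            out.update(zip(as_names, tup))
            yield out

# A bag whose inner schema is undeclared; the AS clause names its field.
A = [{"first": "f1", "second": "s1", "ladder": [("t1",), ("t2",)]}]
B = list(flatten_with_as(A, "ladder", ["third"]))
print(sorted(B[0]))  # ['first', 'second', 'third']
```

With the AS names applied this way, a later GROUP B BY (first, third) can resolve the alias 'third' instead of failing as in the report.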
[jira] Updated: (PIG-1616) 'union onschema' does not create output with correct schema when udfs are involved
[ https://issues.apache.org/jira/browse/PIG-1616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thejas M Nair updated PIG-1616: --- Status: Patch Available (was: Open) > 'union onschema' does not create output with correct schema when udfs are > involved > -- > > Key: PIG-1616 > URL: https://issues.apache.org/jira/browse/PIG-1616 > Project: Pig > Issue Type: Bug >Affects Versions: 0.8.0 >Reporter: Thejas M Nair >Assignee: Thejas M Nair > Fix For: 0.8.0 > > Attachments: PIG-1616.1.patch > > > 'union onschema' creates a merged schema based on the input schemas. It does > that in the queryparser, and at that stage the udf return type used is the > default return type. The actual return type for the udf is determined later > in the TypeCheckingVisitor using EvalFunc.getArgsToFuncMapping(). > 'union onschema' should use the final type for its input relation to create > the merged schema. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1616) 'union onschema' does not create output with correct schema when udfs are involved
[ https://issues.apache.org/jira/browse/PIG-1616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thejas M Nair updated PIG-1616: --- Attachment: PIG-1616.1.patch PIG-1616.1.patch - calls LogicalPlanValidationExecutor.validate() to set the actual types, before the merged schema for 'union onschema' is created. Passes unit tests and test-patch. Ready for review. > 'union onschema' does not create output with correct schema when udfs are > involved > -- > > Key: PIG-1616 > URL: https://issues.apache.org/jira/browse/PIG-1616 > Project: Pig > Issue Type: Bug >Affects Versions: 0.8.0 >Reporter: Thejas M Nair >Assignee: Thejas M Nair > Fix For: 0.8.0 > > Attachments: PIG-1616.1.patch > > > 'union onschema' creates a merged schema based on the input schemas. It does > that in the queryparser, and at that stage the udf return type used is the > default return type. The actual return type for the udf is determined later > in the TypeCheckingVisitor using EvalFunc.getArgsToFuncMapping(). > 'union onschema' should use the final type for its input relation to create > the merged schema. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1617) 'group all' should always use one reducer
[ https://issues.apache.org/jira/browse/PIG-1617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thejas M Nair updated PIG-1617: --- Status: Patch Available (was: Open) > 'group all' should always use one reducer > - > > Key: PIG-1617 > URL: https://issues.apache.org/jira/browse/PIG-1617 > Project: Pig > Issue Type: Improvement >Affects Versions: 0.8.0 >Reporter: Thejas M Nair >Assignee: Thejas M Nair > Fix For: 0.8.0 > > Attachments: PIG-1617.1.patch > > > 'group all' sends all rows to a single reducer; it does not make sense to > spawn more than one reducer for it. But if a higher value of parallelism is > specified, or if the input is large enough that the changes in PIG-1249 result > in a larger value being set, additional reducers are spawned that don't > do anything useful. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1617) 'group all' should always use one reducer
[ https://issues.apache.org/jira/browse/PIG-1617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thejas M Nair updated PIG-1617: --- Attachment: PIG-1617.1.patch PIG-1617.1.patch - Patch sets the parallelism of LOCogroup to 1 for a group on a constant (including 'group all'). Passes unit tests and test-patch. Ready for review. > 'group all' should always use one reducer > - > > Key: PIG-1617 > URL: https://issues.apache.org/jira/browse/PIG-1617 > Project: Pig > Issue Type: Improvement >Affects Versions: 0.8.0 >Reporter: Thejas M Nair >Assignee: Thejas M Nair > Fix For: 0.8.0 > > Attachments: PIG-1617.1.patch > > > 'group all' sends all rows to a single reducer; it does not make sense to > spawn more than one reducer for it. But if a higher value of parallelism is > specified, or if the input is large enough that the changes in PIG-1249 result > in a larger value being set, additional reducers are spawned that don't > do anything useful. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
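The rule the patch description states, clamping parallelism to 1 when grouping on a constant, can be sketched as follows (a simplified Python model, not Pig's LOCogroup code; the expression encoding is hypothetical):

```python
def effective_parallelism(group_keys, requested_parallelism):
    """Return 1 when every group key is a constant expression (which
    covers GROUP ... ALL, a group on the constant 'all'), since all
    rows land on one reducer anyway and extra reducers do no work."""
    if group_keys and all(k["kind"] == "const" for k in group_keys):
        return 1
    return requested_parallelism

# 'group all' groups on a constant: one reducer regardless of PARALLEL.
print(effective_parallelism([{"kind": "const"}], 10))   # 1
# A real column key keeps the requested parallelism.
print(effective_parallelism([{"kind": "column"}], 10))  # 10
```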