[jira] Updated: (PIG-1229) allow pig to write output into a JDBC db

2010-08-04 Thread Ankur (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ankur updated PIG-1229:
---

Attachment: jira-1229-final.test-fix.patch

Here is my understanding of what happens:

1. The main thread in the JVM executing the test initializes MiniDFSCluster, 
MiniMRCluster and the HSQLDB server, each in a different thread.
2. The test's setUp() method is then executed to create the table 'ttt' to which 
data will be written by DBStorage() in the test.
3. Pig statements are then executed that spawn the M/R job as a separate process, 
which tries to get a connection to the database and create a PreparedStatement 
for table 'ttt'. This sometimes fails because the DB thread does NOT get a 
chance to fully persist the table information, and the exception is thrown from 
the map tasks, as noted by Ashutosh.

The fix is to add a 5-second sleep in the setUp() method to give the DB a chance 
to persist the table information. This alleviates the problem and the test 
passes across repeated runs.

Note that the ideal fix would have been to busy-wait for table creation to 
complete, but I don't see a method in HSQLDB to do that.
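
For illustration, a minimal sketch of the setUp() change (the table/column 
names, the dbUrl field and the JUnit style here are assumptions, not the 
actual test code):

{code}
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

// Sketch only: create the table the test writes to, then pause so the
// HSQLDB server thread can fully persist the table definition before the
// map tasks try to prepare statements against it.
@Override
protected void setUp() throws Exception {
    Connection con = DriverManager.getConnection(dbUrl, "sa", "");
    try {
        Statement st = con.createStatement();
        st.executeUpdate("CREATE TABLE ttt (name VARCHAR(32), cnt INT)");
        st.close();
    } finally {
        con.close();
    }
    Thread.sleep(5000);   // 5 sec grace period, as described above
}
{code}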

 allow pig to write output into a JDBC db
 

 Key: PIG-1229
 URL: https://issues.apache.org/jira/browse/PIG-1229
 Project: Pig
  Issue Type: New Feature
  Components: impl
Reporter: Ian Holsman
Assignee: Ankur
Priority: Minor
 Fix For: 0.8.0

 Attachments: jira-1229-final.patch, jira-1229-final.test-fix.patch, 
 jira-1229-v2.patch, jira-1229-v3.patch, pig-1229.2.patch, pig-1229.patch


 UDF to store data into a DB

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1178) LogicalPlan and Optimizer are too complex and hard to work with

2010-08-04 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-1178:


Attachment: PIG-1178-5.patch

 LogicalPlan and Optimizer are too complex and hard to work with
 ---

 Key: PIG-1178
 URL: https://issues.apache.org/jira/browse/PIG-1178
 Project: Pig
  Issue Type: Improvement
Reporter: Alan Gates
Assignee: Daniel Dai
 Fix For: 0.8.0

 Attachments: expressions-2.patch, expressions.patch, lp.patch, 
 lp.patch, PIG-1178-4.patch, PIG-1178-5.patch, pig_1178.patch, pig_1178.patch, 
 PIG_1178.patch, pig_1178_2.patch, pig_1178_3.2.patch, pig_1178_3.3.patch, 
 pig_1178_3.4.patch, pig_1178_3.patch


 The current implementation of the logical plan and the logical optimizer in 
 Pig has proven to not be easily extensible. Developer feedback has indicated 
 that adding new rules to the optimizer is quite burdensome. In addition, the 
 logical plan has been an area of numerous bugs, many of which have been 
 difficult to fix. Developers also feel that the logical plan is difficult to 
 understand and maintain. The root cause for these issues is that a number of 
 design decisions that were made as part of the 0.2 rewrite of the front end 
 have now proven to be sub-optimal. The heart of this proposal is to revisit a 
 number of those proposals and rebuild the logical plan with a simpler design 
 that will make it much easier to maintain the logical plan as well as extend 
 the logical optimizer. 
 See http://wiki.apache.org/pig/PigLogicalPlanOptimizerRewrite for full 
 details.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1178) LogicalPlan and Optimizer are too complex and hard to work with

2010-08-04 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-1178:


Status: Open  (was: Patch Available)

 LogicalPlan and Optimizer are too complex and hard to work with
 ---

 Key: PIG-1178
 URL: https://issues.apache.org/jira/browse/PIG-1178
 Project: Pig
  Issue Type: Improvement
Reporter: Alan Gates
Assignee: Daniel Dai
 Fix For: 0.8.0

 Attachments: expressions-2.patch, expressions.patch, lp.patch, 
 lp.patch, PIG-1178-4.patch, PIG-1178-5.patch, pig_1178.patch, pig_1178.patch, 
 PIG_1178.patch, pig_1178_2.patch, pig_1178_3.2.patch, pig_1178_3.3.patch, 
 pig_1178_3.4.patch, pig_1178_3.patch


 The current implementation of the logical plan and the logical optimizer in 
 Pig has proven to not be easily extensible. Developer feedback has indicated 
 that adding new rules to the optimizer is quite burdensome. In addition, the 
 logical plan has been an area of numerous bugs, many of which have been 
 difficult to fix. Developers also feel that the logical plan is difficult to 
 understand and maintain. The root cause for these issues is that a number of 
 design decisions that were made as part of the 0.2 rewrite of the front end 
 have now proven to be sub-optimal. The heart of this proposal is to revisit a 
 number of those proposals and rebuild the logical plan with a simpler design 
 that will make it much easier to maintain the logical plan as well as extend 
 the logical optimizer. 
 See http://wiki.apache.org/pig/PigLogicalPlanOptimizerRewrite for full 
 details.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1178) LogicalPlan and Optimizer are too complex and hard to work with

2010-08-04 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-1178:


Status: Patch Available  (was: Open)

 LogicalPlan and Optimizer are too complex and hard to work with
 ---

 Key: PIG-1178
 URL: https://issues.apache.org/jira/browse/PIG-1178
 Project: Pig
  Issue Type: Improvement
Reporter: Alan Gates
Assignee: Daniel Dai
 Fix For: 0.8.0

 Attachments: expressions-2.patch, expressions.patch, lp.patch, 
 lp.patch, PIG-1178-4.patch, PIG-1178-5.patch, pig_1178.patch, pig_1178.patch, 
 PIG_1178.patch, pig_1178_2.patch, pig_1178_3.2.patch, pig_1178_3.3.patch, 
 pig_1178_3.4.patch, pig_1178_3.patch


 The current implementation of the logical plan and the logical optimizer in 
 Pig has proven to not be easily extensible. Developer feedback has indicated 
 that adding new rules to the optimizer is quite burdensome. In addition, the 
 logical plan has been an area of numerous bugs, many of which have been 
 difficult to fix. Developers also feel that the logical plan is difficult to 
 understand and maintain. The root cause for these issues is that a number of 
 design decisions that were made as part of the 0.2 rewrite of the front end 
 have now proven to be sub-optimal. The heart of this proposal is to revisit a 
 number of those proposals and rebuild the logical plan with a simpler design 
 that will make it much easier to maintain the logical plan as well as extend 
 the logical optimizer. 
 See http://wiki.apache.org/pig/PigLogicalPlanOptimizerRewrite for full 
 details.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1178) LogicalPlan and Optimizer are too complex and hard to work with

2010-08-04 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12895165#action_12895165
 ] 

Daniel Dai commented on PIG-1178:
-

Did some restructuring and bug fixing. Also moved the package from experimental 
to newplan.

 LogicalPlan and Optimizer are too complex and hard to work with
 ---

 Key: PIG-1178
 URL: https://issues.apache.org/jira/browse/PIG-1178
 Project: Pig
  Issue Type: Improvement
Reporter: Alan Gates
Assignee: Daniel Dai
 Fix For: 0.8.0

 Attachments: expressions-2.patch, expressions.patch, lp.patch, 
 lp.patch, PIG-1178-4.patch, PIG-1178-5.patch, pig_1178.patch, pig_1178.patch, 
 PIG_1178.patch, pig_1178_2.patch, pig_1178_3.2.patch, pig_1178_3.3.patch, 
 pig_1178_3.4.patch, pig_1178_3.patch


 The current implementation of the logical plan and the logical optimizer in 
 Pig has proven to not be easily extensible. Developer feedback has indicated 
 that adding new rules to the optimizer is quite burdensome. In addition, the 
 logical plan has been an area of numerous bugs, many of which have been 
 difficult to fix. Developers also feel that the logical plan is difficult to 
 understand and maintain. The root cause for these issues is that a number of 
 design decisions that were made as part of the 0.2 rewrite of the front end 
 have now proven to be sub-optimal. The heart of this proposal is to revisit a 
 number of those proposals and rebuild the logical plan with a simpler design 
 that will make it much easier to maintain the logical plan as well as extend 
 the logical optimizer. 
 See http://wiki.apache.org/pig/PigLogicalPlanOptimizerRewrite for full 
 details.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1229) allow pig to write output into a JDBC db

2010-08-04 Thread Ankur (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ankur updated PIG-1229:
---

Attachment: (was: jira-1229-final.test-fix.patch)

 allow pig to write output into a JDBC db
 

 Key: PIG-1229
 URL: https://issues.apache.org/jira/browse/PIG-1229
 Project: Pig
  Issue Type: New Feature
  Components: impl
Reporter: Ian Holsman
Assignee: Ankur
Priority: Minor
 Fix For: 0.8.0

 Attachments: jira-1229-final.patch, jira-1229-v2.patch, 
 jira-1229-v3.patch, pig-1229.2.patch, pig-1229.patch


 UDF to store data into a DB

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1229) allow pig to write output into a JDBC db

2010-08-04 Thread Ankur (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ankur updated PIG-1229:
---

Attachment: jira-1229-final.test-fix.patch

Aaron,
Autocommit() was not the issue. The problem was the use of the jdbc:hsqldb:file: 
URL in the STORE function. Replacing it with jdbc:hsqldb:hsql://localhost/dbname 
solved the issue. Attaching the updated patch with the test case modification.
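
For reference, a minimal sketch of the two connection modes (the user, password 
and database path here are assumptions, not values from the patch):

{code}
import java.sql.Connection;
import java.sql.DriverManager;

public class HsqldbUrlExample {
    public static void main(String[] args) throws Exception {
        Class.forName("org.hsqldb.jdbcDriver");

        // In-process file mode: the database files are owned by the JVM that
        // opened them, so a map task running in a separate process cannot
        // share the same database.
        // Connection c = DriverManager.getConnection("jdbc:hsqldb:file:/tmp/dbname", "sa", "");

        // Server mode: the test JVM and the map tasks all talk to the running
        // HSQLDB server over a socket, so concurrent access works.
        Connection c = DriverManager.getConnection(
                "jdbc:hsqldb:hsql://localhost/dbname", "sa", "");
        System.out.println("connected: " + !c.isClosed());
        c.close();
    }
}
{code}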

Really appreciate your help here. Thanks a lot :-)

 allow pig to write output into a JDBC db
 

 Key: PIG-1229
 URL: https://issues.apache.org/jira/browse/PIG-1229
 Project: Pig
  Issue Type: New Feature
  Components: impl
Reporter: Ian Holsman
Assignee: Ankur
Priority: Minor
 Fix For: 0.8.0

 Attachments: jira-1229-final.patch, jira-1229-final.test-fix.patch, 
 jira-1229-v2.patch, jira-1229-v3.patch, pig-1229.2.patch, pig-1229.patch


 UDF to store data into a DB

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1461) support union operation that merges based on column names

2010-08-04 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12895212#action_12895212
 ] 

Hadoop QA commented on PIG-1461:


-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12451175/PIG-1461.1.patch
  against trunk revision 981984.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 6 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs warnings.

-1 release audit.  The applied patch generated 407 release audit warnings 
(more than the trunk's current 405 warnings).

+1 core tests.  The patch passed core unit tests.

-1 contrib tests.  The patch failed contrib unit tests.

Test results: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/372/testReport/
Release audit warnings: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/372/artifact/trunk/patchprocess/releaseAuditDiffWarnings.txt
Findbugs warnings: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/372/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/372/console

This message is automatically generated.

 support union operation that merges based on column names
 -

 Key: PIG-1461
 URL: https://issues.apache.org/jira/browse/PIG-1461
 Project: Pig
  Issue Type: New Feature
  Components: impl
Affects Versions: 0.8.0
Reporter: Thejas M Nair
Assignee: Thejas M Nair
 Fix For: 0.8.0

 Attachments: PIG-1461.1.patch, PIG-1461.patch


 When the data has schema, it often makes sense to union on column names in 
 the schema rather than the position of the columns. 
 The behavior of the existing union operator should remain backward compatible.
 This feature can be supported using either a new operator or by extending union 
 to support a 'using' clause. I am thinking of having a new operator called 
 either unionschema or merge. Does anybody have any other suggestions for the 
 syntax?
 example -
 L1 = load 'x' as (a,b);
 L2 = load 'y' as (b,c);
 U = unionschema L1, L2;
 describe U;
 U: {a:bytearray, b:bytearray, c:bytearray}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1527) No need to deserialize UDFContext on the client side

2010-08-04 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12895310#action_12895310
 ] 

Hadoop QA commented on PIG-1527:


-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12451181/PIG-1527.patch
  against trunk revision 981984.

+1 @author.  The patch does not contain any @author tags.

-1 tests included.  The patch doesn't appear to include any new or modified 
tests.
Please justify why no tests are needed for this patch.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs warnings.

-1 release audit.  The applied patch generated 406 release audit warnings 
(more than the trunk's current 405 warnings).

+1 core tests.  The patch passed core unit tests.

-1 contrib tests.  The patch failed contrib unit tests.

Test results: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/373/testReport/
Release audit warnings: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/373/artifact/trunk/patchprocess/releaseAuditDiffWarnings.txt
Findbugs warnings: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/373/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/373/console

This message is automatically generated.

 No need to deserialize UDFContext on the client side
 

 Key: PIG-1527
 URL: https://issues.apache.org/jira/browse/PIG-1527
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Richard Ding
Assignee: Richard Ding
 Fix For: 0.8.0

 Attachments: PIG-1527.patch




-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1404) PigUnit - Pig script testing simplified.

2010-08-04 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12895318#action_12895318
 ] 

Ashutosh Chauhan commented on PIG-1404:
---

bq. 3. (This one is for other pig developers) Is Piggybank the right place for 
this or should we put it under test? I think this will be really useful for Pig 
users in setting up automated tests of their Pig Latin scripts. Should we 
support it outright rather than put it in piggybank and risk having it go 
unmaintained?

I think it deserves to be put in under test. Having written a few end-to-end test 
cases for Pig in JUnit, I can see it's really useful for Pig itself. The 
usefulness for Pig users is pretty obvious.

 PigUnit - Pig script testing simplified. 
 -

 Key: PIG-1404
 URL: https://issues.apache.org/jira/browse/PIG-1404
 Project: Pig
  Issue Type: New Feature
Reporter: Romain Rigaux
Assignee: Romain Rigaux
 Fix For: 0.8.0

 Attachments: commons-lang-2.4.jar, PIG-1404-2.patch, 
 PIG-1404-3-doc.patch, PIG-1404-3.patch, PIG-1404-4-doc.patch, 
 PIG-1404-4.patch, PIG-1404.patch


 The goal is to provide a simple xUnit framework that enables our Pig scripts 
 to be easily:
   - unit tested
   - regression tested
   - quickly prototyped
 No cluster setup is required.
 For example:
 TestCase
 {code}
 @Test
 public void testTop3Queries() {
   String[] args = {
       "n=3",
   };
   test = new PigTest("top_queries.pig", args);
   String[] input = {
       "yahoo\t10",
       "twitter\t7",
       "facebook\t10",
       "yahoo\t15",
       "facebook\t5",

   };
   String[] output = {
       "(yahoo,25L)",
       "(facebook,15L)",
       "(twitter,7L)",
   };
   test.assertOutput("data", input, "queries_limit", output);
 }
 {code}
 top_queries.pig
 {code}
 data =
 LOAD '$input'
 AS (query:CHARARRAY, count:INT);
  
 ... 
 
 queries_sum = 
 FOREACH queries_group 
 GENERATE 
 group AS query, 
 SUM(queries.count) AS count;
 
 ...
 
 queries_limit = LIMIT queries_ordered $n;
 STORE queries_limit INTO '$output';
 {code}
 There are 3 modes:
 * LOCAL (if the pigunit.exectype.local property is present)
 * MAPREDUCE (uses the cluster specified in the classpath, same as 
 HADOOP_CONF_DIR)
 ** automatic mini cluster (the default; the HADOOP_CONF_DIR to have in 
 the classpath will be: ~/pigtest/conf)
 ** pointing to an existing cluster (if the pigunit.exectype.cluster property 
 is present)
 For now, it would be nice to see how this idea could be integrated in 
 Piggybank and if PigParser/PigServer could improve their interfaces in order 
 to make PigUnit simple.
 Other components based on PigUnit could be built later:
   - standalone MiniCluster
   - notion of workspaces for each test
   - standalone utility that reads test configuration and generates a test 
 report...
 It is a first prototype, open to suggestions, and it can definitely benefit 
 from feedback.
 How to test, in pig_trunk:
 {code}
 Apply patch
 $pig_trunk> ant compile-test
 $pig_trunk> ant
 $pig_trunk/contrib/piggybank/java> ant test -Dtest.timeout=99
 {code}
 (it takes 15 min in MAPREDUCE minicluster, tests will need to be split in the 
 future between 'unit' and 'integration')
 Many examples are in:
 {code}
 contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/pigunit/TestPigTest.java
 {code}
 When used standalone, do not forget to add commons-lang-2.4.jar and the 
 HADOOP_CONF_DIR of your cluster to your CLASSPATH.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1295) Binary comparator for secondary sort

2010-08-04 Thread Gianmarco De Francisci Morales (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gianmarco De Francisci Morales updated PIG-1295:


Attachment: PIG-1295_0.12.patch

Ok, first working integration.
Modified PigTupleRawComparatorNew to use the raw comparators via TupleFactory.
Created a new class PigSecondaryKeyComparatorNew that should replace the old 
one. This one uses the raw comparators.
Modified JobControlCompiler to use the new comparators.

Moved the null/index semantics outside the raw comparators and into the 
wrappers.

Modified BinSedesTupleComparator to correctly handle sort order. The sort order 
is applied to the first call that compares tuples. In case we are doing a 
secondary sort, the sort orders are propagated one level further (because we 
have a nested tuple with the keys, and we need to apply the sort orders to the 
content of the outermost tuple).
The code is not the cleanest possible, but TestPigTupleRawComparator and 
TestSecondarySort pass.

TODO:
Implement the logic for PIG-927.
I plan to create a new interface (TupleRawComparator) and add a method to check 
whether a field of type NULL was encountered during the comparison. This 
interface will be used instead of the plain RawComparator to hold the reference 
to our raw comparators.
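
A rough sketch of that interface (the method name and javadoc are assumptions 
based on the plan above, not a final API):

{code}
import org.apache.hadoop.io.RawComparator;
import org.apache.pig.data.Tuple;

/**
 * Sketch of the proposed TupleRawComparator.
 */
public interface TupleRawComparator extends RawComparator<Tuple> {
    /**
     * Reports whether a field of type NULL was encountered during the last
     * byte-level compare(), so callers can fall back to the PIG-927 handling.
     */
    boolean hasComparedTupleNull();
}
{code}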

Write a speed test.
Is there something already available that can be used to measure the speed 
improvement? The inputs for the unit tests are of course too small.

 Binary comparator for secondary sort
 

 Key: PIG-1295
 URL: https://issues.apache.org/jira/browse/PIG-1295
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.7.0
Reporter: Daniel Dai
Assignee: Gianmarco De Francisci Morales
 Fix For: 0.8.0

 Attachments: PIG-1295_0.1.patch, PIG-1295_0.10.patch, 
 PIG-1295_0.11.patch, PIG-1295_0.12.patch, PIG-1295_0.2.patch, 
 PIG-1295_0.3.patch, PIG-1295_0.4.patch, PIG-1295_0.5.patch, 
 PIG-1295_0.6.patch, PIG-1295_0.7.patch, PIG-1295_0.8.patch, PIG-1295_0.9.patch


 When the Hadoop framework does the sorting, it will try to use a binary version 
 of the comparator if one is available. The benefit of a binary comparator is 
 that we do not need to instantiate the objects before we compare. We see a ~30% 
 speedup after we switch to a binary comparator. Currently, Pig uses a binary 
 comparator in the following cases:
 1. When the semantics of the order don't matter. For example, in distinct, we 
 need to do a sort in order to filter out duplicate values; however, we do not 
 care how the comparator sorts keys. Group by also shares this characteristic. 
 In this case, we rely on Hadoop's default binary comparator.
 2. The semantics of the order matter, but the key is of a simple type. In this 
 case, we have implementations for simple types, such as integer, long, float, 
 chararray, databytearray, string.
 However, if the key is a tuple and the sort semantics matter, we do not have 
 a binary comparator implementation. This especially matters when we switch to 
 using secondary sort. In secondary sort, we convert the inner sort of a nested 
 foreach into the secondary key and rely on Hadoop to sort on both the main key 
 and the secondary key. The sorting key becomes a two-item tuple. Since the 
 secondary key is the sorting key of the nested foreach, the sorting semantics 
 matter. It turns out we do not have a binary comparator once we use secondary 
 sort, and we see a significant slowdown.
 A binary comparator for tuples should be doable once we understand the binary 
 structure of the serialized tuple. We can focus on the most common use case 
 first, which is a group by followed by a nested sort. In this case, we will 
 use secondary sort. The semantics of the first key do not matter, but the 
 semantics of the secondary key do. We need to identify the boundary between 
 the main key and the secondary key in the binary tuple buffer without 
 instantiating the tuple itself. Then, if the first keys are equal, we use a 
 binary comparator to compare the secondary keys. The secondary key can also be 
 a complex data type, but for the first step, we focus on a simple secondary 
 key, which is the most common use case.
 We mark this issue as a candidate project for the Google Summer of Code 2010 
 program. 
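
To make the byte-level comparison idea concrete, here is a toy raw comparator 
over an invented key layout ([int length][main key bytes][int length][secondary 
key bytes]). Pig's real BinSedes tuple serialization is different; this only 
illustrates the "compare main key bytes, then secondary key bytes, honor the 
nested sort order" technique using Hadoop's WritableComparator helpers:

{code}
import org.apache.hadoop.io.RawComparator;
import org.apache.hadoop.io.WritableComparator;

public class CompoundKeyRawComparator implements RawComparator<byte[]> {
    private final boolean secondaryDescending;

    public CompoundKeyRawComparator(boolean secondaryDescending) {
        this.secondaryDescending = secondaryDescending;
    }

    @Override
    public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        // Main key: [4-byte length][bytes], compared without deserializing.
        int mainLen1 = WritableComparator.readInt(b1, s1);
        int mainLen2 = WritableComparator.readInt(b2, s2);
        int cmp = WritableComparator.compareBytes(b1, s1 + 4, mainLen1,
                                                  b2, s2 + 4, mainLen2);
        if (cmp != 0) {
            return cmp;              // main keys differ, no need to look further
        }
        // Secondary key sits right after the main key: [4-byte length][bytes].
        int secOff1 = s1 + 4 + mainLen1 + 4;
        int secOff2 = s2 + 4 + mainLen2 + 4;
        int secLen1 = WritableComparator.readInt(b1, secOff1 - 4);
        int secLen2 = WritableComparator.readInt(b2, secOff2 - 4);
        cmp = WritableComparator.compareBytes(b1, secOff1, secLen1,
                                              b2, secOff2, secLen2);
        return secondaryDescending ? -cmp : cmp; // honor nested sort order
    }

    @Override
    public int compare(byte[] o1, byte[] o2) {
        return compare(o1, 0, o1.length, o2, 0, o2.length);
    }
}
{code}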

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1496) Mandatory rule ImplicitSplitInserter

2010-08-04 Thread Yan Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1496:
--

Attachment: PIG-1496.patch

Added more comments in the code per the reviewer's comment.

 Mandatory rule ImplicitSplitInserter
 

 Key: PIG-1496
 URL: https://issues.apache.org/jira/browse/PIG-1496
 Project: Pig
  Issue Type: Sub-task
  Components: impl
Affects Versions: 0.8.0
Reporter: Daniel Dai
Assignee: Yan Zhou
 Fix For: 0.8.0

 Attachments: PIG-1496.patch


 Need to migrate ImplicitSplitInserter to new logical optimizer.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1496) Mandatory rule ImplicitSplitInserter

2010-08-04 Thread Yan Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1496:
--

Attachment: (was: PIG-1496.patch)

 Mandatory rule ImplicitSplitInserter
 

 Key: PIG-1496
 URL: https://issues.apache.org/jira/browse/PIG-1496
 Project: Pig
  Issue Type: Sub-task
  Components: impl
Affects Versions: 0.8.0
Reporter: Daniel Dai
Assignee: Yan Zhou
 Fix For: 0.8.0

 Attachments: PIG-1496.patch


 Need to migrate ImplicitSplitInserter to new logical optimizer.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1496) Mandatory rule ImplicitSplitInserter

2010-08-04 Thread Yan Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1496:
--

Status: Patch Available  (was: Open)

 Mandatory rule ImplicitSplitInserter
 

 Key: PIG-1496
 URL: https://issues.apache.org/jira/browse/PIG-1496
 Project: Pig
  Issue Type: Sub-task
  Components: impl
Affects Versions: 0.8.0
Reporter: Daniel Dai
Assignee: Yan Zhou
 Fix For: 0.8.0

 Attachments: PIG-1496.patch, PIG-1496.patch


 Need to migrate ImplicitSplitInserter to new logical optimizer.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1496) Mandatory rule ImplicitSplitInserter

2010-08-04 Thread Yan Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1496:
--

Attachment: PIG-1496.patch

 Mandatory rule ImplicitSplitInserter
 

 Key: PIG-1496
 URL: https://issues.apache.org/jira/browse/PIG-1496
 Project: Pig
  Issue Type: Sub-task
  Components: impl
Affects Versions: 0.8.0
Reporter: Daniel Dai
Assignee: Yan Zhou
 Fix For: 0.8.0

 Attachments: PIG-1496.patch, PIG-1496.patch


 Need to migrate ImplicitSplitInserter to new logical optimizer.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1534) Code discovering UDFs in the script has a bug in a order by case

2010-08-04 Thread Pradeep Kamath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pradeep Kamath updated PIG-1534:


  Status: Patch Available  (was: Open)
Assignee: Pradeep Kamath

 Code discovering UDFs in the script has a bug in a order by case
 

 Key: PIG-1534
 URL: https://issues.apache.org/jira/browse/PIG-1534
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Pradeep Kamath
Assignee: Pradeep Kamath
 Fix For: 0.8.0

 Attachments: PIG-1534.patch


 Consider the following command line:
 {noformat}
 java -cp /tmp/svncheckout/pig.jar:udf.jar:clusterdir org.apache.pig.Main -e 
 "a = load 'studenttab' using udf.MyPigStorage(); b = order a by $0; dump b;"
 {noformat}
 Notice there is no "register udf.jar"; instead, udf.jar (which contains 
 udf.MyPigStorage) is in the classpath. Pig handles this case by shipping 
 udf.jar to the backend. However, the above script with order by triggers the 
 bug with the following error message:
  ERROR 2997: Unable to recreate exception from backed error: 
 java.lang.RuntimeException: could not instantiate 
 'org.apache.pig.impl.builtin.RandomSampleLoader' with arguments 
 '[udf.MyPigStorage, 100]'

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1534) Code discovering UDFs in the script has a bug in a order by case

2010-08-04 Thread Pradeep Kamath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pradeep Kamath updated PIG-1534:


Attachment: PIG-1534.patch

The patch fixes SampleOptimizer to add the loadFunc FuncSpecs to the MapReduce 
operators after optimization - this fixes the above order by error.

Here are results from running the test-patch target locally
[exec] -1 overall.
 [exec]
 [exec] +1 @author.  The patch does not contain any @author tags.
 [exec]
 [exec] +1 tests included.  The patch appears to include 3 new or 
modified tests.
 [exec]
 [exec] -1 javadoc.  The javadoc tool appears to have generated 1 
warning messages.
 [exec]
 [exec] +1 javac.  The applied patch does not increase the total number 
of javac compiler warnings.
 [exec]
 [exec] +1 findbugs.  The patch does not introduce any new Findbugs 
warnings.
 [exec]
 [exec] +1 release audit.  The applied patch does not increase the 
total number of release audit warnings.
 [exec]

The javadoc warning is present on trunk and not related to this patch:
{noformat}
...
 [javadoc] Standard Doclet version 1.6.0_01
  [javadoc] Building tree for all the packages and classes...
  [javadoc] 
/tmp/svncheckout/src/org/apache/pig/newplan/logical/expression/ProjectExpression.java:192:
 warning - @param argument currentOp is not a parameter name.
  [javadoc] Building index for all the packages and classes...
...
{noformat}
Will run unit tests locally and update with results.

 Code discovering UDFs in the script has a bug in a order by case
 

 Key: PIG-1534
 URL: https://issues.apache.org/jira/browse/PIG-1534
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Pradeep Kamath
 Fix For: 0.8.0

 Attachments: PIG-1534.patch


 Consider the following command line:
 {noformat}
 java -cp /tmp/svncheckout/pig.jar:udf.jar:clusterdir org.apache.pig.Main -e 
 "a = load 'studenttab' using udf.MyPigStorage(); b = order a by $0; dump b;"
 {noformat}
 Notice there is no "register udf.jar"; instead, udf.jar (which contains 
 udf.MyPigStorage) is in the classpath. Pig handles this case by shipping 
 udf.jar to the backend. However, the above script with order by triggers the 
 bug with the following error message:
  ERROR 2997: Unable to recreate exception from backed error: 
 java.lang.RuntimeException: could not instantiate 
 'org.apache.pig.impl.builtin.RandomSampleLoader' with arguments 
 '[udf.MyPigStorage, 100]'

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1461) support union operation that merges based on column names

2010-08-04 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12895383#action_12895383
 ] 

Olga Natkovich commented on PIG-1461:
-

The patch looks good. A couple of comments:

(1) Looks like there is a typo in the code that loads data for testing:
w.println("5\tdef\t3\t{(2,a),(2,b)}]"); - contains an extra ] at the end
(2) This is not related to the patch but to the documentation above. Please add 
info that UNION supports 2 or more inputs.
(3) In mergeSchemasByAlias, I think it is safer to make a copy of the schema 
rather than just assigning it for the corner case of 1 schema (see the sketch 
after this list).
(4) Need to add a comment about the inner bag schema to 
mergeFieldSchemaFirstLevelSameAlias.
(5) General comment on schema merging - we have completely different code paths 
for position- vs. alias-based merge. I am worried that we will have subtly 
different semantics either now or later.
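
Regarding (3), a tiny sketch of the defensive copy (method and variable names 
are illustrative, the patch's mergeSchemasByAlias may look different, and this 
assumes Schema's copy constructor):

{code}
import java.util.List;
import org.apache.pig.impl.logicalLayer.schema.Schema;

public class SchemaMergeSketch {
    static Schema mergeSchemasByAlias(List<Schema> schemas) {
        if (schemas.size() == 1) {
            // Copy rather than hand back the caller's Schema object, so later
            // mutations of the merged schema don't leak into the input schema.
            return new Schema(schemas.get(0));
        }
        // ... merging logic for two or more schemas goes here ...
        return null;
    }
}
{code}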

 support union operation that merges based on column names
 -

 Key: PIG-1461
 URL: https://issues.apache.org/jira/browse/PIG-1461
 Project: Pig
  Issue Type: New Feature
  Components: impl
Affects Versions: 0.8.0
Reporter: Thejas M Nair
Assignee: Thejas M Nair
 Fix For: 0.8.0

 Attachments: PIG-1461.1.patch, PIG-1461.patch


 When the data has schema, it often makes sense to union on column names in 
 the schema rather than the position of the columns. 
 The behavior of the existing union operator should remain backward compatible.
 This feature can be supported using either a new operator or by extending union 
 to support a 'using' clause. I am thinking of having a new operator called 
 either unionschema or merge. Does anybody have any other suggestions for the 
 syntax?
 example -
 L1 = load 'x' as (a,b);
 L2 = load 'y' as (b,c);
 U = unionschema L1, L2;
 describe U;
 U: {a:bytearray, b:bytearray, c:bytearray}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1434) Allow casting relations to scalars

2010-08-04 Thread Aniket Mokashi (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aniket Mokashi updated PIG-1434:


Attachment: ScalarImplFinaleRebase.patch

Attaching rebased version of the patch...

 Allow casting relations to scalars
 --

 Key: PIG-1434
 URL: https://issues.apache.org/jira/browse/PIG-1434
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Aniket Mokashi
 Fix For: 0.8.0

 Attachments: scalarImpl.patch, ScalarImpl1.patch, ScalarImpl5.patch, 
 ScalarImplFinale.patch, ScalarImplFinale1.patch, ScalarImplFinaleRebase.patch


 This jira is to implement a simplified version of the functionality described 
 in https://issues.apache.org/jira/browse/PIG-801.
 The proposal is to allow casting relations to scalar types in foreach.
 Example:
 A = load 'data' as (x, y, z);
 B = group A all;
 C = foreach B generate COUNT(A);
 .
 X = 
 Y = foreach X generate $1/(long) C;
 Couple of additional comments:
 (1) You can only cast relations including a single value or an error will be 
 reported
 (2) Name resolution is needed since relation X might have field named C in 
 which case that field takes precedence.
 (3) Y will look for C closest to it.
 Implementation thoughts:
 The idea is to store C into a file and then convert it into scalar via a UDF. 
 I believe we already have a UDF that Ben Reed contributed for this purpose. 
 Most of the work would be to update the logical plan to
 (1) Store C
 (2) convert the cast to the UDF
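
As a rough illustration of step (2), a scalar-reading UDF could look something 
like the sketch below. This is only an assumption about the approach, not the 
UDF Ben Reed contributed, and it assumes the stored relation holds a single 
long value in a plain text file:

{code}
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Reads back the single-row, single-column relation that was stored to a
// file and returns it as a scalar.
public class ReadScalarLong extends EvalFunc<Long> {
    @Override
    public Long exec(Tuple input) throws IOException {
        String file = (String) input.get(0);          // path written by the STORE
        FileSystem fs = FileSystem.get(new Configuration());
        BufferedReader in = new BufferedReader(
                new InputStreamReader(fs.open(new Path(file))));
        try {
            String line = in.readLine();              // the single value
            return line == null ? null : Long.valueOf(line.trim());
        } finally {
            in.close();
        }
    }
}
{code}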

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1536) use same logic for merging inner schemas in default union and union onschema

2010-08-04 Thread Thejas M Nair (JIRA)
use same logic for merging inner schemas in default union and union onschema


 Key: PIG-1536
 URL: https://issues.apache.org/jira/browse/PIG-1536
 Project: Pig
  Issue Type: Task
Reporter: Thejas M Nair
 Fix For: 0.9.0


We should consider using the same logic for merging inner schemas in the two 
different types of union. 

In the case of 'default union', it merges the two inner schemas of bags/tuples 
by position if the number of fields is the same and the corresponding types are 
compatible. 

In the case of 'union onschema', it considers tuples/bags with different inner 
schemas to be incompatible types.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1536) use same logic for merging inner schemas in default union and union onschema

2010-08-04 Thread Thejas M Nair (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12895410#action_12895410
 ] 

Thejas M Nair commented on PIG-1536:



The way 'default union' deals with columns of different but compatible types in 
the same position is not right. It creates a merged schema by choosing a merged 
type, but no cast happens to convert the rows to this type.
e.g. -

{code}
grunt> l1 = load '/tmp/f1' as (a : chararray, t (a : int, c : long) );
grunt> l2 = load '/tmp/f1' as (a : chararray, t (a : int, b : int) ); 
grunt> u = union l1, l2;  
grunt> describe u;
u: {a: chararray,t: (a: int,c: long)}

-- in the result of u, only the rows originating from l1 will correspond to the 
-- schema shown in describe.

MapReduce node 1-206
Map Plan
u: Store(fakefile:org.apache.pig.builtin.PigStorage) - 1-203
|
|---u: Union[bag] - 1-202
|
|---l1: New For Each(false,false)[bag] - 1-195
|   |   |
|   |   Cast[chararray] - 1-192
|   |   |
|   |   |---Project[bytearray][0] - 1-191
|   |   |
|   |   Cast[tuple:(int,long)] - 1-194
|   |   |
|   |   |---Project[bytearray][1] - 1-193
|   |
|   |---l1: Load(/tmp/f1:org.apache.pig.builtin.PigStorage) - 1-190
|
|---l2: New For Each(false,false)[bag] - 1-201
|   |
|   Cast[chararray] - 1-198
|   |
|   |---Project[bytearray][0] - 1-197
|   |
|   Cast[tuple:(int,int)] - 1-200
|   |
|   |---Project[bytearray][1] - 1-199
|
|---l2: Load(/tmp/f1:org.apache.pig.builtin.PigStorage) - 1-196
Global sort: false


{code}

 use same logic for merging inner schemas in default union and union 
 onschema
 

 Key: PIG-1536
 URL: https://issues.apache.org/jira/browse/PIG-1536
 Project: Pig
  Issue Type: Task
Reporter: Thejas M Nair
 Fix For: 0.9.0


 We should consider using the same logic for merging inner schemas in the two 
 different types of union. 
 In the case of 'default union', it merges the two inner schemas of bags/tuples 
 by position if the number of fields is the same and the corresponding types 
 are compatible. 
 In the case of 'union onschema', it considers tuples/bags with different 
 inner schemas to be incompatible types.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1461) support union operation that merges based on column names

2010-08-04 Thread Thejas M Nair (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12895414#action_12895414
 ] 

Thejas M Nair commented on PIG-1461:


Regarding (5), there are some differences in the way the schema merge is done in 
the two cases. I have created PIG-1536 to discuss/address this.
I will make changes to address the other comments.

 support union operation that merges based on column names
 -

 Key: PIG-1461
 URL: https://issues.apache.org/jira/browse/PIG-1461
 Project: Pig
  Issue Type: New Feature
  Components: impl
Affects Versions: 0.8.0
Reporter: Thejas M Nair
Assignee: Thejas M Nair
 Fix For: 0.8.0

 Attachments: PIG-1461.1.patch, PIG-1461.patch


 When the data has schema, it often makes sense to union on column names in 
 the schema rather than the position of the columns. 
 The behavior of the existing union operator should remain backward compatible.
 This feature can be supported using either a new operator or by extending union 
 to support a 'using' clause. I am thinking of having a new operator called 
 either unionschema or merge. Does anybody have any other suggestions for the 
 syntax?
 example -
 L1 = load 'x' as (a,b);
 L2 = load 'y' as (b,c);
 U = unionschema L1, L2;
 describe U;
 U: {a:bytearray, b:bytearray, c:bytearray}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-346) Grunt (help) commands

2010-08-04 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-346:
---


We also need to make sure to cover commands that are implemented in PigServer. 
This comes from PIG-523, which I will close as a duplicate of this bug.

 Grunt (help) commands 
 --

 Key: PIG-346
 URL: https://issues.apache.org/jira/browse/PIG-346
 Project: Pig
  Issue Type: Bug
Reporter: Corinne Chandel
Assignee: Olga Natkovich
 Fix For: 0.8.0


 I think there are 22 grunt commands and 2 different lists of the 
 commands can be displayed.
 I. Grunt commands displayed with grunt> help
 (1) put the 22 grunt commands in alphabetical order
 (2) fix the double entry for cd ... "cd <path>" and "cd <dir>"; keep "cd <path>"
 (3) fix the notation for "set <key> <value>" ... should be "set <key> 'value'"
 (4) add explain
 (5) add illustrate
 (6) add help
 II. Grunt commands displayed with grunt> asdf
 The asdf is a mistake and generates the msg "Was expecting one of:" and a list 
 of grunt commands.
 (1) put the 22 grunt commands in alphabetical order
 (2) add define
 (3) add du
 
 22 Grunt commands in alphabetical order:
 cat <src>
 cd <path>
 copyFromLocal <localsrc> <dst>
 copyToLocal <src> <localdst>
 cp <src> <dst>
 define <functionAlias> <functionSpec>
 describe <alias>
 dump <alias>
 du <path>
 explain
 help
 illustrate
 kill <job_id>
 ls <path>
 mkdir <path>
 mv <src> <dst>
 pwd
 quit
 register <udfJar>
 rm <src>
 set <key> 'value'
 store <alias> into <filename> [using <functionSpec>]

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (PIG-523) help in grunt should show all commands

2010-08-04 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich resolved PIG-523.


Resolution: Duplicate

I moved this into PIG-346 

 help in grunt should show all commands
 --

 Key: PIG-523
 URL: https://issues.apache.org/jira/browse/PIG-523
 Project: Pig
  Issue Type: Bug
Reporter: Olga Natkovich
Assignee: Olga Natkovich
Priority: Minor
 Fix For: 0.8.0


 Currently, it only shows commands directly supported by the grunt parser and 
 not commands supported by the Pig parser.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1461) support union operation that merges based on column names

2010-08-04 Thread Thejas M Nair (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12895430#action_12895430
 ] 

Thejas M Nair commented on PIG-1461:


Regarding documentation for UNION ONSCHEMA -
As Olga mentioned, like the default union, 'union onschema' also supports 2 or 
more inputs.

 support union operation that merges based on column names
 -

 Key: PIG-1461
 URL: https://issues.apache.org/jira/browse/PIG-1461
 Project: Pig
  Issue Type: New Feature
  Components: impl
Affects Versions: 0.8.0
Reporter: Thejas M Nair
Assignee: Thejas M Nair
 Fix For: 0.8.0

 Attachments: PIG-1461.1.patch, PIG-1461.patch


 When the data has schema, it often makes sense to union on column names in 
 the schema rather than the position of the columns. 
 The behavior of the existing union operator should remain backward compatible.
 This feature can be supported using either a new operator or by extending union 
 to support a 'using' clause. I am thinking of having a new operator called 
 either unionschema or merge. Does anybody have any other suggestions for the 
 syntax?
 example -
 L1 = load 'x' as (a,b);
 L2 = load 'y' as (b,c);
 U = unionschema L1, L2;
 describe U;
 U: {a:bytearray, b:bytearray, c:bytearray}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1434) Allow casting relations to scalars

2010-08-04 Thread Aniket Mokashi (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aniket Mokashi updated PIG-1434:


Status: Patch Available  (was: Open)

 Allow casting relations to scalars
 --

 Key: PIG-1434
 URL: https://issues.apache.org/jira/browse/PIG-1434
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Aniket Mokashi
 Fix For: 0.8.0

 Attachments: scalarImpl.patch, ScalarImpl1.patch, ScalarImpl5.patch, 
 ScalarImplFinale.patch, ScalarImplFinale1.patch, ScalarImplFinaleRebase.patch


 This jira is to implement a simplified version of the functionality described 
 in https://issues.apache.org/jira/browse/PIG-801.
 The proposal is to allow casting relations to scalar types in foreach.
 Example:
 A = load 'data' as (x, y, z);
 B = group A all;
 C = foreach B generate COUNT(A);
 .
 X = 
 Y = foreach X generate $1/(long) C;
 Couple of additional comments:
 (1) You can only cast relations including a single value or an error will be 
 reported
 (2) Name resolution is needed since relation X might have field named C in 
 which case that field takes precedence.
 (3) Y will look for C closest to it.
 Implementation thoughts:
 The idea is to store C into a file and then convert it into scalar via a UDF. 
 I believe we already have a UDF that Ben Reed contributed for this purpose. 
 Most of the work would be to update the logical plan to
 (1) Store C
 (2) convert the cast to the UDF

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1434) Allow casting relations to scalars

2010-08-04 Thread Aniket Mokashi (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aniket Mokashi updated PIG-1434:


Status: Open  (was: Patch Available)

 Allow casting relations to scalars
 --

 Key: PIG-1434
 URL: https://issues.apache.org/jira/browse/PIG-1434
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Aniket Mokashi
 Fix For: 0.8.0

 Attachments: scalarImpl.patch, ScalarImpl1.patch, ScalarImpl5.patch, 
 ScalarImplFinale.patch, ScalarImplFinale1.patch, ScalarImplFinaleRebase.patch


 This jira is to implement a simplified version of the functionality described 
 in https://issues.apache.org/jira/browse/PIG-801.
 The proposal is to allow casting relations to scalar types in foreach.
 Example:
 A = load 'data' as (x, y, z);
 B = group A all;
 C = foreach B generate COUNT(A);
 .
 X = 
 Y = foreach X generate $1/(long) C;
 Couple of additional comments:
 (1) You can only cast relations including a single value or an error will be 
 reported
 (2) Name resolution is needed since relation X might have field named C in 
 which case that field takes precedence.
 (3) Y will look for C closest to it.
 Implementation thoughts:
 The idea is to store C into a file and then convert it into scalar via a UDF. 
 I believe we already have a UDF that Ben Reed contributed for this purpose. 
 Most of the work would be to update the logical plan to
 (1) Store C
 (2) convert the cast to the UDF

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1461) support union operation that merges based on column names

2010-08-04 Thread Thejas M Nair (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thejas M Nair updated PIG-1461:
---

Attachment: PIG-1461.2.patch

Patch with changes as suggested in code review.

 support union operation that merges based on column names
 -

 Key: PIG-1461
 URL: https://issues.apache.org/jira/browse/PIG-1461
 Project: Pig
  Issue Type: New Feature
  Components: impl
Affects Versions: 0.8.0
Reporter: Thejas M Nair
Assignee: Thejas M Nair
 Fix For: 0.8.0

 Attachments: PIG-1461.1.patch, PIG-1461.2.patch, PIG-1461.patch


 When the data has schema, it often makes sense to union on column names in 
 the schema rather than the position of the columns. 
 The behavior of the existing union operator should remain backward compatible.
 This feature can be supported using either a new operator or by extending union 
 to support a 'using' clause. I am thinking of having a new operator called 
 either unionschema or merge. Does anybody have any other suggestions for the 
 syntax?
 example -
 L1 = load 'x' as (a,b);
 L2 = load 'y' as (b,c);
 U = unionschema L1, L2;
 describe U;
 U: {a:bytearray, b:bytearray, c:bytearray}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1461) support union operation that merges based on column names

2010-08-04 Thread Thejas M Nair (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thejas M Nair updated PIG-1461:
---

  Status: Resolved  (was: Patch Available)
Hadoop Flags: [Reviewed]
  Resolution: Fixed

Patch committed to trunk.

 support union operation that merges based on column names
 -

 Key: PIG-1461
 URL: https://issues.apache.org/jira/browse/PIG-1461
 Project: Pig
  Issue Type: New Feature
  Components: impl
Affects Versions: 0.8.0
Reporter: Thejas M Nair
Assignee: Thejas M Nair
 Fix For: 0.8.0

 Attachments: PIG-1461.1.patch, PIG-1461.2.patch, PIG-1461.patch


 When the data has schema, it often makes sense to union on column names in 
 the schema rather than the position of the columns. 
 The behavior of the existing union operator should remain backward compatible.
 This feature can be supported using either a new operator or by extending union 
 to support a 'using' clause. I am thinking of having a new operator called 
 either unionschema or merge. Does anybody have any other suggestions for the 
 syntax?
 example -
 L1 = load 'x' as (a,b);
 L2 = load 'y' as (b,c);
 U = unionschema L1, L2;
 describe U;
 U: {a:bytearray, b:bytearray, c:bytearray}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1434) Allow casting relations to scalars

2010-08-04 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1434:
--

  Status: Resolved  (was: Patch Available)
Hadoop Flags: [Reviewed]
  Resolution: Fixed
Tags: documentation

The patch is committed to trunk. Thanks Aniket for contributing this feature.

 Allow casting relations to scalars
 --

 Key: PIG-1434
 URL: https://issues.apache.org/jira/browse/PIG-1434
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Aniket Mokashi
 Fix For: 0.8.0

 Attachments: scalarImpl.patch, ScalarImpl1.patch, ScalarImpl5.patch, 
 ScalarImplFinale.patch, ScalarImplFinale1.patch, ScalarImplFinaleRebase.patch


 This jira is to implement a simplified version of the functionality described 
 in https://issues.apache.org/jira/browse/PIG-801.
 The proposal is to allow casting relations to scalar types in foreach.
 Example:
 A = load 'data' as (x, y, z);
 B = group A all;
 C = foreach B generate COUNT(A);
 .
 X = 
 Y = foreach X generate $1/(long) C;
 Couple of additional comments:
 (1) You can only cast relations including a single value or an error will be 
 reported
 (2) Name resolution is needed since relation X might have field named C in 
 which case that field takes precedence.
 (3) Y will look for C closest to it.
 Implementation thoughts:
 The idea is to store C into a file and then convert it into scalar via a UDF. 
 I believe we already have a UDF that Ben Reed contributed for this purpose. 
 Most of the work would be to update the logical plan to
 (1) Store C
 (2) convert the cast to the UDF

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1199) help includes obsolete options

2010-08-04 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12895460#action_12895460
 ] 

Hadoop QA commented on PIG-1199:


-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12451182/PIG-1199.patch
  against trunk revision 981984.

+1 @author.  The patch does not contain any @author tags.

-1 tests included.  The patch doesn't appear to include any new or modified 
tests.
Please justify why no tests are needed for this patch.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs warnings.

-1 release audit.  The applied patch generated 406 release audit warnings 
(more than the trunk's current 405 warnings).

-1 core tests.  The patch failed core unit tests.

-1 contrib tests.  The patch failed contrib unit tests.

Test results: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/374/testReport/
Release audit warnings: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/374/artifact/trunk/patchprocess/releaseAuditDiffWarnings.txt
Findbugs warnings: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/374/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/374/console

This message is automatically generated.

 help includes obsolete options
 --

 Key: PIG-1199
 URL: https://issues.apache.org/jira/browse/PIG-1199
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.6.0
Reporter: Olga Natkovich
Assignee: Olga Natkovich
 Fix For: 0.8.0

 Attachments: PIG-1199.patch


 This is confusing to users

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1199) help includes obsolete options

2010-08-04 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12895462#action_12895462
 ] 

Olga Natkovich commented on PIG-1199:
-

The patch just changes the help message, which I tested manually - hence no new 
tests. The release audit warning is in an html file. The tests are failing for 
unrelated reasons. I ran test-commit, and I think it should be sufficient for 
this patch since it is not touching any real code.



 help includes obsolete options
 --

 Key: PIG-1199
 URL: https://issues.apache.org/jira/browse/PIG-1199
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.6.0
Reporter: Olga Natkovich
Assignee: Olga Natkovich
 Fix For: 0.8.0

 Attachments: PIG-1199.patch


 This is confusing to users

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1178) LogicalPlan and Optimizer are too complex and hard to work with

2010-08-04 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12895463#action_12895463
 ] 

Hadoop QA commented on PIG-1178:


-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12451203/PIG-1178-5.patch
  against trunk revision 982423.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 91 new or modified tests.

-1 patch.  The patch command could not apply the patch.

Console output: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/375/console

This message is automatically generated.

 LogicalPlan and Optimizer are too complex and hard to work with
 ---

 Key: PIG-1178
 URL: https://issues.apache.org/jira/browse/PIG-1178
 Project: Pig
  Issue Type: Improvement
Reporter: Alan Gates
Assignee: Daniel Dai
 Fix For: 0.8.0

 Attachments: expressions-2.patch, expressions.patch, lp.patch, 
 lp.patch, PIG-1178-4.patch, PIG-1178-5.patch, pig_1178.patch, pig_1178.patch, 
 PIG_1178.patch, pig_1178_2.patch, pig_1178_3.2.patch, pig_1178_3.3.patch, 
 pig_1178_3.4.patch, pig_1178_3.patch


 The current implementation of the logical plan and the logical optimizer in 
 Pig has proven to not be easily extensible. Developer feedback has indicated 
 that adding new rules to the optimizer is quite burdensome. In addition, the 
 logical plan has been an area of numerous bugs, many of which have been 
 difficult to fix. Developers also feel that the logical plan is difficult to 
 understand and maintain. The root cause for these issues is that a number of 
 design decisions that were made as part of the 0.2 rewrite of the front end 
 have now proven to be sub-optimal. The heart of this proposal is to revisit a 
 number of those proposals and rebuild the logical plan with a simpler design 
 that will make it much easier to maintain the logical plan as well as extend 
 the logical optimizer. 
 See http://wiki.apache.org/pig/PigLogicalPlanOptimizerRewrite for full 
 details.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (PIG-347) Pig (help) Commands

2010-08-04 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich resolved PIG-347.


Resolution: Fixed

 Pig (help) Commands
 ---

 Key: PIG-347
 URL: https://issues.apache.org/jira/browse/PIG-347
 Project: Pig
  Issue Type: Bug
Reporter: Corinne Chandel
Assignee: Olga Natkovich
Priority: Minor
 Fix For: 0.8.0


 Pig help can be specified 2 ways: $pig -help and $pig -h
 I. $pig -help (seen by external/internal users)
 (1) fix
 -c, -cluster clustername, kryptonite is default 
  remove kryptonite is default
 (2) change 
 -x, -exectype local|mapreduce, mapreduce is default 
  change mapdreduce to hadoop (maintain backward compatibility)
 II. $pig -h (seen by internal users only)
 (1) fix typos
 -l, --latest   use latest, untested, unsupported version of pig.jar instaed 
 of relased, tested, supported version.
instead of released 
 (2) fix
 -c, -cluster clustername, kryptonite is default 
  remove kryptonite is default 
 (same as above)
 (3) change:  -x, -exectype local|mapreduce, mapreduce is default ... 
  change mapdreduce to hadoop (maintain backward compatibility)
 (same as above)
  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-347) Pig (help) Commands

2010-08-04 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12895473#action_12895473
 ] 

Olga Natkovich commented on PIG-347:


(1) has been done for a while.
(2) We don't support a 'hadoop' value. We use the values mapred or mapreduce, and I am 
not sure we should change that now.
(3) Part II is internal to Yahoo.

Closing this bug.

 Pig (help) Commands
 ---

 Key: PIG-347
 URL: https://issues.apache.org/jira/browse/PIG-347
 Project: Pig
  Issue Type: Bug
Reporter: Corinne Chandel
Assignee: Olga Natkovich
Priority: Minor
 Fix For: 0.8.0


 Pig help can be specified 2 ways: $pig -help and $pig -h
 I. $pig -help (seen by external/internal users)
 (1) fix
 -c, -cluster clustername, kryptonite is default 
  remove kryptonite is default
 (2) change 
 -x, -exectype local|mapreduce, mapreduce is default 
  change mapdreduce to hadoop (maintain backward compatibility)
 II. $pig -h (seen by internal users only)
 (1) fix typos
 -l, --latest   use latest, untested, unsupported version of pig.jar instaed 
 of relased, tested, supported version.
instead of released 
 (2) fix
 -c, -cluster clustername, kryptonite is default 
  remove kryptonite is default 
 (same as above)
 (3) change:  -x, -exectype local|mapreduce, mapreduce is default ... 
  change mapdreduce to hadoop (maintain backward compatibility)
 (same as above)
  

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1199) help includes obsolete options

2010-08-04 Thread Thejas M Nair (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12895476#action_12895476
 ] 

Thejas M Nair commented on PIG-1199:


+1.
We should change the statement about pig.cachedbag.memusage - Note that this 
memory is shared across all large bags used by the application.
InternalDistinctBag and InternalSortedBag are not aware of the actual number 
of bags they need to share the space with. The constructor argument for the 
number of bags is passed as 3 in all cases (distinct udf, PODistinct, POSort).
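
For illustration, a minimal Java sketch of the budgeting scheme described above, assuming a 
bag simply divides the pig.cachedbag.memusage fraction of the heap by a hard-coded bag count; 
the class and method names are hypothetical, not Pig's actual InternalDistinctBag or 
InternalSortedBag code.

{code}
// Hypothetical sketch (not Pig's implementation) of how a spill-to-disk bag
// could derive its in-memory budget when the number of bags sharing the
// pig.cachedbag.memusage fraction is passed as a fixed constructor argument.
public class CachedBagBudgetSketch {
    private final float memUsageFraction; // fraction of the heap all cached bags may use together
    private final int assumedBagCount;    // bags assumed to share it; 3 per the comment above

    public CachedBagBudgetSketch(float memUsageFraction, int assumedBagCount) {
        this.memUsageFraction = memUsageFraction;
        this.assumedBagCount = assumedBagCount;
    }

    /** Bytes this single bag may hold in memory before spilling. */
    public long perBagLimitBytes() {
        long maxHeap = Runtime.getRuntime().maxMemory();
        return (long) (maxHeap * memUsageFraction) / assumedBagCount;
    }

    public static void main(String[] args) {
        // Example values only: a 0.1 memusage fraction shared by 3 bags.
        System.out.println(new CachedBagBudgetSketch(0.1f, 3).perBagLimitBytes());
    }
}
{code}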


 help includes obsolete options
 --

 Key: PIG-1199
 URL: https://issues.apache.org/jira/browse/PIG-1199
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.6.0
Reporter: Olga Natkovich
Assignee: Olga Natkovich
 Fix For: 0.8.0

 Attachments: PIG-1199.patch


 This is confusing to users

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1199) help includes obsolete options

2010-08-04 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12895480#action_12895480
 ] 

Olga Natkovich commented on PIG-1199:
-

Thanks, Thejas. I am going to leave the statement as is until we actually 
figure out what it should be. This is also what our documentation states, so we 
can update both places at once as needed.

I will commit the patch once I get review from Corinne.

 help includes obsolete options
 --

 Key: PIG-1199
 URL: https://issues.apache.org/jira/browse/PIG-1199
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.6.0
Reporter: Olga Natkovich
Assignee: Olga Natkovich
 Fix For: 0.8.0

 Attachments: PIG-1199.patch


 This is confusing to users

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-346) Grunt (help) commands

2010-08-04 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-346:
---

Status: Patch Available  (was: Open)

 Grunt (help) commands 
 --

 Key: PIG-346
 URL: https://issues.apache.org/jira/browse/PIG-346
 Project: Pig
  Issue Type: Bug
Reporter: Corinne Chandel
Assignee: Olga Natkovich
 Fix For: 0.8.0

 Attachments: PIG-346.patch


 I think there are 22 grunt commands  and 2 different lists of the 
 commands can be displayed.
 I. Grunt commands displayed with grunt help
 (1) put 22 grunt commands in alphabetical order
 (2) fix double entry for cd ... cd path and cd dir  keep cd path
 (3) fix notation for set key value ... set key 'value'
 (4) add explain
 (5) add illustrate
 (6) add help
 II. Grunt commands displayed with grunt asdf 
 The asdf is a mistake and generates the msg "Was expecting one of:" and a list of 
 grunt commands
 (1) put 22 grunt commands in alphabetical order
 (2) add define
 (3) add du
 
 22 Grunt commands in alphabetical order:
 cat src
 cd path
 copyFromLocal localsrc dst
 copyToLocal src localdst
 cp src dst
 define functionAlias functionSpec
 describe alias
 dump alias
 du path
 explain
 help
 illustrate
 kill job_id
 ls path
 mkdir path
 mv src dst
 pwd
 quit
 register udfJar
 rm src
 set key 'value'
 store alias into filename [using functionSpec]

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-346) Grunt (help) commands

2010-08-04 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-346:
---

Attachment: PIG-346.patch

 Grunt (help) commands 
 --

 Key: PIG-346
 URL: https://issues.apache.org/jira/browse/PIG-346
 Project: Pig
  Issue Type: Bug
Reporter: Corinne Chandel
Assignee: Olga Natkovich
 Fix For: 0.8.0

 Attachments: PIG-346.patch


 I think there are 22 grunt commands  and 2 different lists of the 
 commands can be displayed.
 I. Grunt commands displayed with grunt help
 (1) put 22 grunt commands in alphabetical order
 (2) fix double entry for cd ... cd path and cd dir  keep cd path
 (3) fix notation for set key value ... set key 'value'
 (4) add explain
 (5) add illustrate
 (6) add help
 II. Grunt commands displayed with grunt asdf 
 The asdf is a mistake and generates the msg "Was expecting one of:" and a list of 
 grunt commands
 (1) put 22 grunt commands in alphabetical order
 (2) add define
 (3) add du
 
 22 Grunt commands in alphabetical order:
 cat src
 cd path
 copyFromLocal localsrc dst
 copyToLocal src localdst
 cp src dst
 define functionAlias functionSpec
 describe alias
 dump alias
 du path
 explain
 help
 illustrate
 kill job_id
 ls path
 mkdir path
 mv src dst
 pwd
 quit
 register udfJar
 rm src
 set key 'value'
 store alias into filename [using functionSpec]

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-346) Grunt (help) commands

2010-08-04 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12895483#action_12895483
 ] 

Olga Natkovich commented on PIG-346:


I have made the following changes:

(1) Removed deprecated file system commands
(2) Organized the information the same way it is organized in the documentation
(3) Added more detailed information (used info from docs)
(4) Made sure that all commands are covered

 Grunt (help) commands 
 --

 Key: PIG-346
 URL: https://issues.apache.org/jira/browse/PIG-346
 Project: Pig
  Issue Type: Bug
Reporter: Corinne Chandel
Assignee: Olga Natkovich
 Fix For: 0.8.0

 Attachments: PIG-346.patch


 I think there are 22 grunt commands  and 2 different lists of the 
 commands can be displayed.
 I. Grunt commands displayed with grunt help
 (1) put 22 grunt commands in alphabetical order
 (2) fix double entry for cd ... cd path and cd dir  keep cd path
 (3) fix notation for set key value ... set key 'value'
 (4) add explain
 (5) add illustrate
 (6) add help
 II. Grunt commands displayed with grunt asdf 
 The asdf is a mistake and generates the msg "Was expecting one of:" and a list of 
 grunt commands
 (1) put 22 grunt commands in alphabetical order
 (2) add define
 (3) add du
 
 22 Grunt commands in alphabetical order:
 cat src
 cd path
 copyFromLocal localsrc dst
 copyToLocal src localdst
 cp src dst
 define functionAlias functionSpec
 describe alias
 dump alias
 du path
 explain
 help
 illustrate
 kill job_id
 ls path
 mkdir path
 mv src dst
 pwd
 quit
 register udfJar
 rm src
 set key 'value'
 store alias into filename [using functionSpec]

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1150) VAR() Variance UDF

2010-08-04 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12895487#action_12895487
 ] 

Olga Natkovich commented on PIG-1150:
-

Dmitry, the patch is missing unit tests. Once you add them, I will commit it.

 VAR() Variance UDF
 --

 Key: PIG-1150
 URL: https://issues.apache.org/jira/browse/PIG-1150
 Project: Pig
  Issue Type: New Feature
Affects Versions: 0.5.0
 Environment: UDF, written in Pig 0.5 contrib/
Reporter: Russell Jurney
Assignee: Dmitriy V. Ryaboy
 Fix For: 0.8.0

 Attachments: var.patch


 I've implemented a UDF in Pig 0.5 that implements Algebraic and calculates 
 variance in a distributed manner, based on the AVG() builtin.  It works by 
 calculating the count, sum and sum of squares, as described here: 
 http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Parallel_algorithm
 Is this a worthwhile contribution?  Taking the square root of this value 
 using the contrib SQRT() function gives Standard Deviation, which is missing 
 from Pig.
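
 For illustration, a minimal Java sketch of the count/sum/sum-of-squares combination the 
 description refers to (class and method names are made up, not the attached patch's API; 
 note the naive sum-of-squares formula can lose precision for large values):

{code}
// Partial aggregate for an algebraic variance: each partition tracks
// (count, sum, sum of squares); partials merge by addition and the final
// variance is sumSq/n - (sum/n)^2.
public final class VarianceSketch {
    long count;
    double sum;
    double sumSq;

    void add(double x) { count++; sum += x; sumSq += x * x; }

    VarianceSketch merge(VarianceSketch other) {
        VarianceSketch m = new VarianceSketch();
        m.count = count + other.count;
        m.sum = sum + other.sum;
        m.sumSq = sumSq + other.sumSq;
        return m;
    }

    double variance() {                      // population variance
        double mean = sum / count;
        return sumSq / count - mean * mean;
    }

    public static void main(String[] args) {
        VarianceSketch a = new VarianceSketch();
        VarianceSketch b = new VarianceSketch();
        for (double x : new double[] {1, 2, 3}) a.add(x);
        for (double x : new double[] {4, 5}) b.add(x);
        // Variance of {1,2,3,4,5} computed from the two merged partials: prints 2.0
        System.out.println(a.merge(b).variance());
    }
}
{code}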

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1371) Pig should handle deep casting of complex types

2010-08-04 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-1371:


Fix Version/s: (was: 0.8.0)

It does not look like we will have time to do this in 0.8.0

 Pig should handle deep casting of complex types 
 

 Key: PIG-1371
 URL: https://issues.apache.org/jira/browse/PIG-1371
 Project: Pig
  Issue Type: Bug
Reporter: Pradeep Kamath
Assignee: Richard Ding
 Attachments: PIG-1371-partial.patch


 Consider input data in BinStorage format which has a field of bag type - 
 bg:{t:(i:int)}. In the load statement if the schema specified has the type 
 for this field specified as bg:{t:(c:chararray)}, the current behavior is 
 that Pig thinks of the field to be of type specified in the load statement 
 (bg:{t:(c:chararray)}) but no deep cast from bag of int (the real data) to 
 bag of chararray (the user specified schema) is made.
 There are two issues currently:
 1) The TypeCastInserter only considers the byte 'type' between the 
 loader-presented schema and the user-specified schema to decide whether to introduce a 
 cast or not. In the above case, since both schemas have the type bag, no cast 
 is inserted. This check has to be extended to consider the full FieldSchema 
 (with inner subschema) in order to decide whether a cast is needed (see the 
 sketch after this list).
 2) POCast should be changed to handle casting a complex type to the type 
 specified in the user-supplied FieldSchema. Here there is one issue to be 
 considered - if the user specified the cast type to be bg:{t:(i:int, j:int)} 
 and the real data had only one field, what should the result of the cast be:
  * A bag with two fields - the int field and a null? - In this approach pig 
 is assuming the lone field in the data is the first field which might be 
 incorrect if it in fact is the second field.
  * A null bag to indicate that the bag is of unknown value - this is the one 
 I personally prefer
  * The cast throws an IncompatibleCastException
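
 As a rough illustration of the full-FieldSchema comparison suggested in (1) above, here 
 is a hedged Java sketch; FieldSpec is a made-up stand-in, not Pig's actual FieldSchema 
 class, and the byte codes in main are example values only.

{code}
import java.util.Collections;
import java.util.List;

// Illustrative only: a cast is needed whenever the declared type differs from
// the loader's type at any nesting level, not only at the outermost byte type.
public final class DeepCastCheckSketch {
    static final class FieldSpec {
        final byte type;               // e.g. BAG, TUPLE, INT, CHARARRAY
        final List<FieldSpec> inner;   // sub-schema; empty for scalar types

        FieldSpec(byte type, List<FieldSpec> inner) {
            this.type = type;
            this.inner = inner == null ? Collections.<FieldSpec>emptyList() : inner;
        }
    }

    /** True if loading as 'declared' requires a cast from the loader's 'actual' schema. */
    static boolean needsCast(FieldSpec declared, FieldSpec actual) {
        if (declared.type != actual.type) return true;
        if (declared.inner.size() != actual.inner.size()) return true;
        for (int i = 0; i < declared.inner.size(); i++) {
            if (needsCast(declared.inner.get(i), actual.inner.get(i))) return true;
        }
        return false;
    }

    static FieldSpec bagOf(FieldSpec field) {
        FieldSpec tuple = new FieldSpec((byte) 2, Collections.singletonList(field)); // "tuple"
        return new FieldSpec((byte) 3, Collections.singletonList(tuple));            // "bag"
    }

    public static void main(String[] args) {
        // bg:{t:(int)} loaded as bg:{t:(chararray)} -> a deep cast is needed.
        FieldSpec actual = bagOf(new FieldSpec((byte) 0, null));    // 0 stands in for int
        FieldSpec declared = bagOf(new FieldSpec((byte) 1, null));  // 1 stands in for chararray
        System.out.println(needsCast(declared, actual));            // prints true
    }
}
{code}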

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1526) HiveColumnarLoader Partitioning Support

2010-08-04 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-1526:


Status: Resolved  (was: Patch Available)
Resolution: Fixed

patch committed to the trunk. Thanks Gerrit!

 HiveColumnarLoader Partitioning Support
 ---

 Key: PIG-1526
 URL: https://issues.apache.org/jira/browse/PIG-1526
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.8.0
Reporter: Gerrit Jansen van Vuuren
Assignee: Gerrit Jansen van Vuuren
Priority: Minor
 Fix For: 0.8.0

 Attachments: PIG-1526-2.patch, PIG-1526.patch


 I've made a lot of improvements to the HiveColumnarLoader:
 - Added support for LoadMetadata and data path partitioning 
 - Improved and simplified column loading
 Data Path Partitioning:
 Hive stores partitions as folders, like 
 /mytable/partition1=[value]/partition2=[value]. That is, the table mytable 
 contains 2 partitions [partition1, partition2].
 The HiveColumnarLoader will scan the input path /mytable and add to the 
 PigSchema the columns partition1 and partition2. 
 These columns can then be used in filtering. 
 For example: We've got year,month,day,hour partitions in our data uploads.
 So a table might look like mytable/year=2010/month=02/day=01.
 Loading with the HiveColumnarLoader allows our pig scripts to filter by date 
 using the standard pig Filter operator.
 I've added 2 classes for this:
 - PathPartitioner
 - PathPartitionHelper
 These classes are not Hive-dependent and could be used by any other loader 
 that wants to support partitioning; they also help with implementing the 
 LoadMetadata interface.
 For this reason I thought it best to put them into the package 
 org.apache.pig.piggybank.storage.partition.
 What would be nice in the future is to have PigStorage also use these 2 
 classes to provide automatic path partitioning support. 
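
 For illustration, a minimal Java sketch of the key=value path parsing described above; 
 this is only an assumption of how such parsing could look, not the PathPartitioner class 
 in the attached patch.

{code}
import java.util.LinkedHashMap;
import java.util.Map;

// Extracts Hive-style partition keys/values from a directory path so they can
// be exposed as extra columns (illustrative only).
public final class PathPartitionSketch {
    /** Parses a path such as "/mytable/year=2010/month=02/day=01/part-00000". */
    public static Map<String, String> parse(String path) {
        Map<String, String> partitions = new LinkedHashMap<String, String>();
        for (String segment : path.split("/")) {
            int eq = segment.indexOf('=');
            if (eq > 0) {
                partitions.put(segment.substring(0, eq), segment.substring(eq + 1));
            }
        }
        return partitions;
    }

    public static void main(String[] args) {
        // Prints {year=2010, month=02, day=01}
        System.out.println(parse("/mytable/year=2010/month=02/day=01/part-00000"));
    }
}
{code}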

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1533) Compression codec should be a per-store property

2010-08-04 Thread Richard Ding (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12895496#action_12895496
 ] 

Richard Ding commented on PIG-1533:
---

Locally ran and passed core tests. 

 Compression codec should be a per-store property
 

 Key: PIG-1533
 URL: https://issues.apache.org/jira/browse/PIG-1533
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Richard Ding
Assignee: Richard Ding
 Fix For: 0.8.0

 Attachments: PIG-1533.patch


 The following script with multi-query optimization
 {code}
 a = load 'input';
 store a into 'outout.bz2';
 store a into 'outout2'
 {code}
 generates two .bz files, while only one of them should be compressed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1386) UDF to extend functionalities of MaxTupleBy1stField

2010-08-04 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-1386:


Status: Resolved  (was: Patch Available)
Resolution: Fixed

patch committed. Thanks hc busy!

 UDF to extend functionalities of MaxTupleBy1stField
 ---

 Key: PIG-1386
 URL: https://issues.apache.org/jira/browse/PIG-1386
 Project: Pig
  Issue Type: New Feature
  Components: tools
Affects Versions: 0.6.0
Reporter: hc busy
Assignee: hc busy
 Fix For: 0.8.0

 Attachments: PIG-1386-trunk.patch


 Based on this conversation:
 totally, go for it, it'd be pretty straightforward to add this
 functionality.
 On Tue, Apr 20, 2010 at 6:45 PM, hc busy hc.b...@gmail.com wrote:
  Hey, while we're on the subject, and I have your attention, can we
  re-factor
  the UDF MaxTupleByFirstField to take constructor?
 
  *define customMaxTuple ExtremalTupleByNthField(n, 'min');*
  *G = group T by id;*
  *M = foreach T generate customMaxTuple(T);
  *
 
  Where n is the nth field, and the second parameter allows us to specify
  min, max, median,  etc...
 
  Does this seem like something useful to everyone?
 
 
 
  On Tue, Apr 20, 2010 at 6:34 PM, hc busy hc.b...@gmail.com wrote:
 
   What about making them part of the language using symbols?
  
   instead of
  
   foreach T generate Tuple($0, $1, $2), Bag($3, $4, $5), $6, $7;
  
   have language support
  
   foreach T generate ($0, $1, $2), {$3, $4, $5}, $6, $7;
  
   or even:
  
   foreach T generate ($0, $1, $2), {$3, $4, $5}, [$6#$7, $8#$9], $10, $11;
  
  
   Is there reason not to do the second or third other than being more
   complicated?
  
   Certainly I'd volunteer to put the top implementation in to the util
   package and submit them for builtin's, but the latter syntactic candies
   seems more natural..
  
  
  
   On Tue, Apr 20, 2010 at 5:24 PM, Alan Gates ga...@yahoo-inc.com wrote:
  
   The grouping package in piggybank is left over from back when Pig
  allowed
   users to define grouping functions (0.1).  Functions like these should
  go in
   evaluation.util.
  
   However, I'd consider putting these in builtin (in main Pig) instead.
These are things everyone asks for and they seem like a reasonable
  addition
   to the core engine.  This will be more of a burden to write (as we'll
  hold
   them to a higher standard) but of more use to people as well.
  
   Alan.
  
  
   On Apr 19, 2010, at 12:53 PM, hc busy wrote:
  
Sometimes I wonder... I mean, somebody went to the trouble of making a
   path
   called
  
   org.apache.pig.piggybank.grouping
  
   (where it seems like this code belong), but didn't check in any java
  code
   into that package.
  
  
   Any comment about where to put this kind of utility classes?
  
  
  
   On Mon, Apr 19, 2010 at 12:07 PM, Andrey S oct...@gmail.com wrote:
  
2010/4/19 hc busy hc.b...@gmail.com
  
That's just the way it is right now, you can't make bags or tuples
   directly... Maybe we should have some UDF's in piggybank for these:
  
   toBag()
   toTuple(); --which is kinda like exec(Tuple in){return in;}
   TupleToBag(); --sometimes you need it this way for some reason.
  
  
Ok, I'll place my current code here; maybe later I'll make a patch (if such 
   an implementation is acceptable, of course).
  
   import org.apache.pig.EvalFunc;
   import org.apache.pig.data.BagFactory;
   import org.apache.pig.data.DataBag;
   import org.apache.pig.data.Tuple;
   import org.apache.pig.data.TupleFactory;
  
   import java.io.IOException;
  
   /**
    * Converts any sequence of fields into a bag of tuples, each holding the
    * specified number of fields.
    * Schema: count:int, fld1 [, fld2, fld3, fld4... ].
    * Output: count=2, then { (fld1, fld2), (fld3, fld4) ... }
    *
    * @author astepachev
    */
   public class ToBag extends EvalFunc<DataBag> {
       public BagFactory bagFactory;
       public TupleFactory tupleFactory;
  
       public ToBag() {
           bagFactory = BagFactory.getInstance();
           tupleFactory = TupleFactory.getInstance();
       }
  
       @Override
       public DataBag exec(Tuple input) throws IOException {
           if (input.isNull())
               return null;
           final DataBag bag = bagFactory.newDefaultBag();
           // The first field holds how many fields to pack into each output tuple.
           final Integer counter = (Integer) input.get(0);
           if (counter == null)
               return null;
           Tuple tuple = tupleFactory.newTuple();
           // Walk the remaining fields, starting a new tuple every 'counter' fields.
           for (int i = 0; i < input.size() - 1; i++) {
               if (i % counter == 0) {
                   tuple = tupleFactory.newTuple();
                   bag.add(tuple);
               }
               tuple.append(input.get(i + 1));
           }
           return bag;
       }
   }
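
   A quick usage sketch for the ToBag class above, driving it directly from Java rather 
   than from a Pig script (assumes pig.jar on the classpath; the expected output is 
   shown in a comment):

   import java.util.Arrays;
   import org.apache.pig.data.DataBag;
   import org.apache.pig.data.Tuple;
   import org.apache.pig.data.TupleFactory;

   public class ToBagDemo {
       public static void main(String[] args) throws Exception {
           // count=2 followed by four fields -> two 2-field tuples in the bag.
           Tuple input = TupleFactory.getInstance()
                   .newTuple(Arrays.<Object>asList(2, "a", "b", "c", "d"));
           DataBag bag = new ToBag().exec(input);
           System.out.println(bag);   // {(a,b),(c,d)}
       }
   }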
  
   import org.apache.pig.ExecType;
   import org.apache.pig.PigServer;
   import org.junit.Before;
   import org.junit.Test;
 

[jira] Updated: (PIG-1537) Column pruner causes wrong results when using both Custom Store UDF and PigStorage

2010-08-04 Thread Viraj Bhat (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Bhat updated PIG-1537:


Description: 
I have a script which is of this pattern and it uses 2 StoreFunc's:

{code}
register loader.jar
register piggy-bank/java/build/storage.jar;
%DEFAULT OUTPUTDIR /user/viraj/prunecol/

ss_sc_0 = LOAD '/data/click/20100707/0' USING Loader() AS (a, b, c);

ss_sc_filtered_0 = FILTER ss_sc_0 BY
a#'id' matches '1.*' OR
a#'id' matches '2.*' OR
a#'id' matches '3.*' OR
a#'id' matches '4.*';

ss_sc_1 = LOAD '/data/click/20100707/1' USING Loader() AS (a, b, c);

ss_sc_filtered_1 = FILTER ss_sc_1 BY
a#'id' matches '65.*' OR
a#'id' matches '466.*' OR
a#'id' matches '043.*' OR
a#'id' matches '044.*' OR
a#'id' matches '0650.*' OR
a#'id' matches '001.*';

ss_sc_all = UNION ss_sc_filtered_0,ss_sc_filtered_1;

ss_sc_all_proj = FOREACH ss_sc_all GENERATE
a#'query' as query,
a#'testid' as testid,
a#'timestamp' as timestamp,
a,
b,
c;

ss_sc_all_ord = ORDER ss_sc_all_proj BY query,testid,timestamp PARALLEL 10;

ss_sc_all_map = FOREACH ss_sc_all_ord  GENERATE a, b, c;

STORE ss_sc_all_map INTO '$OUTPUTDIR/data/20100707' using Storage();

ss_sc_all_map_count = group ss_sc_all_map all;

count = FOREACH ss_sc_all_map_count GENERATE 'record_count' as 
record_count,COUNT($1);

STORE count INTO '$OUTPUTDIR/count/20100707' using PigStorage('\u0009');
{code}

I run this script using:

a) java -cp pig0.7.jar script.pig
b) java -cp pig0.7.jar -t PruneColumns script.pig

What I observe is that the alias count produces the same number of records, 
but ss_sc_all_map has different sizes when run with the above 2 options.

Is this due to the fact that there are 2 StoreFunc's used?

Viraj

  was:
I have a script which is of this pattern and it uses 2 StoreFunc's:
{code}

register loader.jar
register piggy-bank/java/build/storage.jar;
%DEFAULT OUTPUTDIR /user/viraj/prunecol/

ss_sc_0 = LOAD '/data/click/20100707/0' USING Loader() AS (a, b, c);

ss_sc_filtered_0 = FILTER ss_sc_0 BY
a#'id' matches '1.*' OR
a#'id' matches '2.*' OR
a#'id' matches '3.*' OR
a#'id' matches '4.*';

ss_sc_1 = LOAD '/data/click/20100707/1' USING Loader() AS (a, b, c);

ss_sc_filtered_1 = FILTER ss_sc_1 BY
a#'id' matches '65.*' OR
a#'id' matches '466.*' OR
a#'id' matches '043.*' OR
a#'id' matches '044.*' OR
a#'id' matches '0650.*' OR
a#'id' matches '001.*';

ss_sc_all = UNION ss_sc_filtered_0,ss_sc_filtered_1;

ss_sc_all_proj = FOREACH ss_sc_all GENERATE
a#'query' as query,
a#'testid' as testid,
a#'timestamp' as timestamp,
a,
b,
c;

ss_sc_all_ord = ORDER ss_sc_all_proj BY query,testid,timestamp PARALLEL 10;

ss_sc_all_map = FOREACH ss_sc_all_ord  GENERATE a, b, c;

STORE ss_sc_all_map INTO '$OUTPUTDIR/data/20100707' using Storage();

ss_sc_all_map_count = group ss_sc_all_map all;

count = FOREACH ss_sc_all_map_count GENERATE 'record_count' as 
record_count,COUNT($1);

STORE count INTO '$OUTPUTDIR/count/20100707' using PigStorage('\u0009');


I run this script using:

a) java -cp pig0.7.jar script.pig
b) java -cp pig0.7.jar -t PruneColumns script.pig

What I observe is that the alias count produces the same number of records, 
but ss_sc_all_map has different sizes when run with the above 2 options.

Is this due to the fact that there are 2 StoreFunc's used?

Viraj


 Column pruner causes wrong results when using both Custom Store UDF and 
 PigStorage
 --

 Key: PIG-1537
 URL: https://issues.apache.org/jira/browse/PIG-1537
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Viraj Bhat

 I have a script which is of this pattern and it uses 2 StoreFunc's:
 {code}
 register loader.jar
 register piggy-bank/java/build/storage.jar;
 %DEFAULT OUTPUTDIR /user/viraj/prunecol/
 ss_sc_0 = LOAD '/data/click/20100707/0' USING Loader() AS (a, b, c);
 ss_sc_filtered_0 = FILTER ss_sc_0 BY
 a#'id' matches '1.*' OR
 a#'id' matches '2.*' OR
 a#'id' matches '3.*' OR
 a#'id' matches '4.*';
 ss_sc_1 = LOAD 

[jira] Created: (PIG-1537) Column pruner causes wrong results when using both Custom Store UDF and PigStorage

2010-08-04 Thread Viraj Bhat (JIRA)
Column pruner causes wrong results when using both Custom Store UDF and 
PigStorage
--

 Key: PIG-1537
 URL: https://issues.apache.org/jira/browse/PIG-1537
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Viraj Bhat


I have a script which is of this pattern and it uses 2 StoreFunc's:
{code}

register loader.jar
register piggy-bank/java/build/storage.jar;
%DEFAULT OUTPUTDIR /user/viraj/prunecol/

ss_sc_0 = LOAD '/data/click/20100707/0' USING Loader() AS (a, b, c);

ss_sc_filtered_0 = FILTER ss_sc_0 BY
a#'id' matches '1.*' OR
a#'id' matches '2.*' OR
a#'id' matches '3.*' OR
a#'id' matches '4.*';

ss_sc_1 = LOAD '/data/click/20100707/1' USING Loader() AS (a, b, c);

ss_sc_filtered_1 = FILTER ss_sc_1 BY
a#'id' matches '65.*' OR
a#'id' matches '466.*' OR
a#'id' matches '043.*' OR
a#'id' matches '044.*' OR
a#'id' matches '0650.*' OR
a#'id' matches '001.*';

ss_sc_all = UNION ss_sc_filtered_0,ss_sc_filtered_1;

ss_sc_all_proj = FOREACH ss_sc_all GENERATE
a#'query' as query,
a#'testid' as testid,
a#'timestamp' as timestamp,
a,
b,
c;

ss_sc_all_ord = ORDER ss_sc_all_proj BY query,testid,timestamp PARALLEL 10;

ss_sc_all_map = FOREACH ss_sc_all_ord  GENERATE a, b, c;

STORE ss_sc_all_map INTO '$OUTPUTDIR/data/20100707' using Storage();

ss_sc_all_map_count = group ss_sc_all_map all;

count = FOREACH ss_sc_all_map_count GENERATE 'record_count' as 
record_count,COUNT($1);

STORE count INTO '$OUTPUTDIR/count/20100707' using PigStorage('\u0009');


I run this script using:

a) java -cp pig0.7.jar script.pig
b) java -cp pig0.7.jar -t PruneColumns script.pig

What I observe is that the alias count produces the same number of records, 
but ss_sc_all_map has different sizes when run with the above 2 options.

Is this due to the fact that there are 2 StoreFunc's used?

Viraj

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1534) Code discovering UDFs in the script has a bug in a order by case

2010-08-04 Thread Pradeep Kamath (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12895522#action_12895522
 ] 

Pradeep Kamath commented on PIG-1534:
-

Ran all unit tests - TestScriptUDF fails but the failure is unrelated to the 
change in this patch and the failure occurs even with a fresh svn checkout.

Patch is ready for review.

 Code discovering UDFs in the script has a bug in a order by case
 

 Key: PIG-1534
 URL: https://issues.apache.org/jira/browse/PIG-1534
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Pradeep Kamath
Assignee: Pradeep Kamath
 Fix For: 0.8.0

 Attachments: PIG-1534.patch


 Consider the following commandline:
 {noformat}
 java -cp /tmp/svncheckout/pig.jar:udf.jar:clusterdir org.apache.pig.Main -e 
 a = load 'studenttab' using udf.MyPigStorage(); b = order a by $0; dump b;
 {noformat}
 Notice there is no register udf.jar; instead, udf.jar (which contains 
 udf.MyPigStorage) is on the classpath. Pig handles this case by shipping 
 udf.jar to the backend. However, the above script with an order by triggers the 
 bug with the following error message:
  ERROR 2997: Unable to recreate exception from backed error: 
 java.lang.RuntimeException: could not instantiate 
 'org.apache.pig.impl.builtin.RandomSampleLoader' with arguments 
 '[udf.MyPigStorage, 100]'
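
 For context, a small hedged Java sketch of why udf.jar must reach the backend in the 
 order-by case: the sampling stage instantiates the wrapped loader reflectively by class 
 name (as the error message suggests), so a class like udf.MyPigStorage must be loadable 
 in the task JVM. The code below is illustrative only, not Pig's RandomSampleLoader.

{code}
public final class ReflectiveLoaderSketch {
    /** Instantiates a loader by fully qualified class name, as a sampling
     *  wrapper conceptually does; this throws if the class is not on the
     *  classpath of the JVM running it. */
    static Object instantiate(String loaderClassName) throws Exception {
        Class<?> clazz = Class.forName(loaderClassName);
        return clazz.getDeclaredConstructor().newInstance();
    }

    public static void main(String[] args) throws Exception {
        // A JDK class is used here so the sketch runs anywhere; on the Pig
        // backend the name would be something like "udf.MyPigStorage".
        System.out.println(instantiate("java.util.ArrayList"));
    }
}
{code}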

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.