[jira] Updated: (PIG-965) PERFORMANCE: optimize common case in matches (PORegex)

2009-12-14 Thread Ankit Modi (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ankit Modi updated PIG-965:
---

Attachment: (was: poregex2.patch)

 PERFORMANCE: optimize common case in matches (PORegex)
 --

 Key: PIG-965
 URL: https://issues.apache.org/jira/browse/PIG-965
 Project: Pig
  Issue Type: Improvement
  Components: impl
Reporter: Thejas M Nair
Assignee: Ankit Modi
 Attachments: automaton.jar, poregex2.patch


 Some frequently seen use cases of the 'matches' comparison operator have the
 following properties -
 1. The rhs is a constant string, e.g. c1 matches 'abc%'
 2. Regexes that look for a matching prefix, suffix, etc. are very common,
 e.g. 'abc%', '%abc', '%abc%'
 To optimize for these common cases, PORegex.java can be changed to -
 1. Compile the pattern (rhs of matches) once and re-use it if the pattern
 string has not changed.
 2. Use string comparisons for simple common regexes (in 2 above).
 The implementation of the Hive LIKE clause uses similar optimizations.
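
 The two optimizations above can be sketched roughly as follows. This is not
 the code in poregex2.patch: the class SimpleMatcher and its method names are
 hypothetical, and the '%' in the examples is assumed to stand for the regex
 wildcard '.*'. Because '.' does not match line terminators by default, the
 sketch only takes the string-comparison path when the input has none.

```java
import java.util.regex.Pattern;

/** Hypothetical sketch; names are illustrative and not taken from Pig. */
final class SimpleMatcher {
    // Characters that make a pattern fragment something other than a plain literal.
    private static final String META = "\\[](){}.*+?^$|";

    private String lastRegex;     // regex string seen on the previous slow-path call
    private Pattern lastPattern;  // compiled form of lastRegex

    /** Full-match semantics, like String.matches(regex). */
    boolean matches(String input, String regex) {
        int from = 0;
        int to = regex.length();
        boolean anyPrefix = regex.startsWith(".*");
        if (anyPrefix) from = 2;
        boolean anySuffix = to - from >= 2 && regex.endsWith(".*");
        if (anySuffix) to -= 2;
        String literal = regex.substring(from, to);

        // Fast path: plain string comparison for literal, prefix, suffix, contains.
        if (isPlainLiteral(literal) && !hasLineBreak(input)) {
            if (anyPrefix && anySuffix) return input.contains(literal);
            if (anyPrefix) return input.endsWith(literal);
            if (anySuffix) return input.startsWith(literal);
            return input.equals(literal);
        }

        // Slow path: recompile only when the pattern string has changed.
        if (!regex.equals(lastRegex)) {
            lastPattern = Pattern.compile(regex);
            lastRegex = regex;
        }
        return lastPattern.matcher(input).matches();
    }

    private static boolean isPlainLiteral(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (META.indexOf(s.charAt(i)) >= 0) return false;
        }
        return true;
    }

    private static boolean hasLineBreak(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '\n' || c == '\r' || c == 0x85 || c == 0x2028 || c == 0x2029) return true;
        }
        return false;
    }
}
```

 With a constant rhs, every call after the first either takes the string
 comparison path or reuses the cached Pattern, so the pattern is compiled at
 most once.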

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-965) PERFORMANCE: optimize common case in matches (PORegex)

2009-12-14 Thread Ankit Modi (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ankit Modi updated PIG-965:
---

Status: Open  (was: Patch Available)




[jira] Updated: (PIG-965) PERFORMANCE: optimize common case in matches (PORegex)

2009-12-14 Thread Ankit Modi (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ankit Modi updated PIG-965:
---

Attachment: poregex2.patch




[jira] Updated: (PIG-965) PERFORMANCE: optimize common case in matches (PORegex)

2009-12-14 Thread Ankit Modi (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ankit Modi updated PIG-965:
---

Status: Patch Available  (was: Open)

I have included changes suggested by Thejas.




[jira] Commented: (PIG-965) PERFORMANCE: optimize common case in matches (PORegex)

2009-12-14 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12790104#action_12790104
 ] 

Hadoop QA commented on PIG-965:
---

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12427913/poregex2.patch
  against trunk revision 889870.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 3 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

-1 findbugs.  The patch appears to cause Findbugs to fail.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

-1 core tests.  The patch failed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/121/testReport/
Console output: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/121/console

This message is automatically generated.




[jira] Updated: (PIG-973) type resolution inconsistency

2009-12-14 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-973:
-

Attachment: PIG-973.patch

 type resolution inconsistency
 -

 Key: PIG-973
 URL: https://issues.apache.org/jira/browse/PIG-973
 Project: Pig
  Issue Type: Bug
Reporter: Olga Natkovich
Assignee: Richard Ding
 Attachments: PIG-973.patch


 This script works:
 A = load 'test' using PigStorage(':') as (name: chararray, age: int, gpa: float);
 B = group A by age;
 C = foreach B {
     D = filter A by gpa > 2.5;
     E = order A by name;
     F = A.age;
     describe F;
     G = distinct F;
     generate group, COUNT(D), MAX (E.name), MIN(G.$0);
 }
 dump C;
 This one produces an error:
 A = load 'test' using PigStorage(':') as (name: chararray, age: int, gpa: float);
 B = group A by age;
 C = foreach B {
     D = filter A by gpa > 2.5;
     E = order A by name;
     F = A.age;
     G = distinct F;
     generate group, COUNT(D), MAX (E.name), MIN(G);
 }
 dump C;
 Notice the difference in how MIN is passed the data.




[jira] Commented: (PIG-1148) Move splitable logic from pig latin to InputFormat

2009-12-14 Thread Pradeep Kamath (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12790273#action_12790273
 ] 

Pradeep Kamath commented on PIG-1148:
-

Hi Jeff,
  With the new load/store redesign 
(http://wiki.apache.org/pig/LoadStoreRedesignProposal), wouldn't this be 
achieved implicitly, since the splits used by Pig will be the ones returned from 
the InputFormat associated with the Loader? The plan was to remove SPLIT by 
'file' from the language, since with the new load/store design it will not be 
possible to support this from Pig. So there will be no splitable logic left 
with that approach. Were you thinking of some other way to support split by 
file?

 Move splitable logic from pig latin to InputFormat
 --

 Key: PIG-1148
 URL: https://issues.apache.org/jira/browse/PIG-1148
 Project: Pig
  Issue Type: Sub-task
Reporter: Jeff Zhang
Assignee: Jeff Zhang






[jira] Commented: (PIG-1090) Update sources to reflect recent changes in load-store interfaces

2009-12-14 Thread Richard Ding (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12790334#action_12790334
 ] 

Richard Ding commented on PIG-1090:
---

The problem is that the getters/setters for the internal members are also 
defined for ResourceSchema. I felt that we should choose one way to access the 
internal members.  

 Update sources to reflect recent changes in load-store interfaces
 -

 Key: PIG-1090
 URL: https://issues.apache.org/jira/browse/PIG-1090
 Project: Pig
  Issue Type: Sub-task
Reporter: Pradeep Kamath
Assignee: Pradeep Kamath
 Attachments: PIG-1090-2.patch, PIG-1090-3.patch, PIG-1090.patch


 There have been some changes (as recorded in the Changes Section, Nov 2 2009 
 sub section of http://wiki.apache.org/pig/LoadStoreRedesignProposal) in the 
 load/store interfaces - this jira is to track the task of making those 
 changes under src. Changes under test will be addressed in a different jira.




[jira] Commented: (PIG-1090) Update sources to reflect recent changes in load-store interfaces

2009-12-14 Thread Dmitriy V. Ryaboy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12790356#action_12790356
 ] 

Dmitriy V. Ryaboy commented on PIG-1090:


Richard, 
I added the getters/setters so that ResourceSchema can be treated as a POJO, 
and standard serialization tools can easily interact with it, in PIG-760. 
Alan said in PIG-760 that he is fine with adding getters and setters, but feels 
strongly that direct access to these members should still be allowed, for 
simplicity's sake.
I'm fine with the visibility being either way, as long as the getters/setters 
stay in (although perhaps protected would be a better choice than private).





[jira] Commented: (PIG-1110) Handle compressed file formats -- Gz, BZip with the new proposal

2009-12-14 Thread Richard Ding (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12790362#action_12790362
 ] 

Richard Ding commented on PIG-1110:
---

Hi Jeff, I think it's a good idea to ask users to specify their intention in 
the PigStorage constructor (instead of using file extensions). The issue with 
this approach, however, is that the arguments to PigStorage constructors can 
only be Strings, so Pig determines the meanings of the arguments by their 
positions. Therefore we want to consider carefully what other arguments may 
need to be added to the constructor in the future and what their positions 
should be.

As for forcing users to add .bz2 as the extension of the output files, this is 
actually necessary since Hadoop LineRecordReader (used internally by 
PigStorage) finds the relevant compression codec for the given file based on 
its filename suffix. So for now users must specify .bz2 as the extension of the 
output files if they want to store the files as BZip files.
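
The suffix-based lookup described above can be illustrated with a small 
stand-alone sketch. It does not call Hadoop; the class CodecBySuffix, the 
method codecFor and the suffix table are hypothetical stand-ins for whatever 
Hadoop's own codec lookup does.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical stand-in for a suffix-to-codec lookup; not Hadoop code. */
final class CodecBySuffix {
    // Assumed suffix table for illustration only.
    private static final Map<String, String> CODECS = new LinkedHashMap<>();
    static {
        CODECS.put(".gz", "gzip");
        CODECS.put(".bz2", "bzip2");
    }

    /** Returns a codec name, or null when no known suffix matches. */
    static String codecFor(String fileName) {
        for (Map.Entry<String, String> e : CODECS.entrySet()) {
            if (fileName.endsWith(e.getKey())) return e.getValue();
        }
        return null;
    }
}
```

Under this scheme an output named part-00000.bz2 is recognized as BZip data, 
while the same bytes stored as part-00000 would be treated as plain text, 
which is why the extension matters.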

 Handle compressed file formats -- Gz, BZip with the new proposal
 

 Key: PIG-1110
 URL: https://issues.apache.org/jira/browse/PIG-1110
 Project: Pig
  Issue Type: Sub-task
Reporter: Richard Ding
Assignee: Richard Ding
 Attachments: PIG-1110.patch, PIG_1110_Jeff.patch







[jira] Created: (PIG-1149) Allow instantiation of SampleLoaders with parametrized LoadFuncs

2009-12-14 Thread Dmitriy V. Ryaboy (JIRA)
Allow instantiation of SampleLoaders with parametrized LoadFuncs


 Key: PIG-1149
 URL: https://issues.apache.org/jira/browse/PIG-1149
 Project: Pig
  Issue Type: Bug
Reporter: Dmitriy V. Ryaboy
Assignee: Dmitriy V. Ryaboy
Priority: Minor
 Fix For: 0.7.0


Currently, it is not possible to instantiate a SampleLoader with something like 
PigStorage(':').  We should allow passing parameters to the loaders being 
sampled.




[jira] Commented: (PIG-1090) Update sources to reflect recent changes in load-store interfaces

2009-12-14 Thread Pradeep Kamath (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12790387#action_12790387
 ] 

Pradeep Kamath commented on PIG-1090:
-

I committed the latest patch - thanks Richard! Looks like I committed it while 
the above discussion was still on - if things need to be changed, please attach 
a small patch for the same and I can commit it - if we decide to keep things 
the way they are now that's fine.




[jira] Created: (PIG-1150) VAR() Variance UDF

2009-12-14 Thread Russell Jurney (JIRA)
VAR() Variance UDF
--

 Key: PIG-1150
 URL: https://issues.apache.org/jira/browse/PIG-1150
 Project: Pig
  Issue Type: New Feature
Affects Versions: 0.5.0
 Environment: UDF, written in Pig 0.5 contrib/
Reporter: Russell Jurney
 Fix For: 0.5.0


I've implemented a UDF in Pig 0.5 that implements Algebraic and calculates 
variance in a distributed manner, based on the AVG() builtin.  It works by 
calculating the count, sum and sum of squares, as described here: 
http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Parallel_algorithm

Is this a worthwhile contribution?  Taking the square root of this value using 
the contrib SQRT() function gives Standard Deviation, which is missing from Pig.
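
A minimal Java sketch of the count / sum / sum-of-squares approach described 
above, not the UDF itself: the class VarianceState and its methods are 
hypothetical, and population variance (dividing by the count) is assumed. 
Each partition accumulates its own state and the states are merged, which is 
what makes the computation algebraic.

```java
/** Hypothetical partial aggregate holding count, sum and sum of squares. */
final class VarianceState {
    long count;
    double sum;
    double sumSq;

    /** Fold one value into this partial state. */
    void add(double x) {
        count++;
        sum += x;
        sumSq += x * x;
    }

    /** Combine with a partial state computed on another partition. */
    void merge(VarianceState other) {
        count += other.count;
        sum += other.sum;
        sumSq += other.sumSq;
    }

    /** Population variance: mean of squares minus square of the mean. */
    double variance() {
        if (count == 0) return Double.NaN;
        double mean = sum / count;
        return sumSq / count - mean * mean;
    }

    /** Standard deviation is the square root of the variance. */
    double stdDev() {
        return Math.sqrt(variance());
    }
}
```

This formula can lose precision when the mean is large relative to the 
spread of the values, because two nearly equal numbers are subtracted.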




[jira] Commented: (PIG-973) type resolution inconsistency

2009-12-14 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12790402#action_12790402
 ] 

Hadoop QA commented on PIG-973:
---

+1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12427943/PIG-973.patch
  against trunk revision 889870.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 7 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

+1 core tests.  The patch passed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/122/testReport/
Findbugs warnings: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/122/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/122/console

This message is automatically generated.




[jira] Commented: (PIG-1150) VAR() Variance UDF

2009-12-14 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12790403#action_12790403
 ] 

Olga Natkovich commented on PIG-1150:
-

Yes, it is definitely worthwhile to contribute!




[jira] Updated: (PIG-1150) VAR() Variance UDF

2009-12-14 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-1150:


Fix Version/s: (was: 0.5.0)
   0.7.0

Updating the fix version since it will go into the future version and will not 
be backported




[jira] Updated: (PIG-1151) Data Conversion + Arithmetic UDFs

2009-12-14 Thread sam rash (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sam rash updated PIG-1151:
--

Priority: Minor  (was: Major)




[jira] Created: (PIG-1151) Data Conversion + Arithmetic UDFs

2009-12-14 Thread sam rash (JIRA)
Data Conversion + Arithmetic UDFs
-

 Key: PIG-1151
 URL: https://issues.apache.org/jira/browse/PIG-1151
 Project: Pig
  Issue Type: New Feature
Affects Versions: 0.5.0
Reporter: sam rash


I would like to offer up some very simple data UDFs I have that wrap JodaTime 
(apache 2.0 license, http://joda-time.sourceforge.net/license.html) and operate 
on ISO8601 date strings.
(for piggybank).  Please advise if these are appropriate.

1. Date Arithmetic

takes an input string: 

2009-01-01T13:43:33.000Z
(and partial ones such as 2009-01-02)

and a timespan (as millis or as string shorthand)

returns an ISO8601 string that adjusts the input date by the specified timespan

DatePlus(long timeMs); // + or - number works, is the # of millis
DatePlus(String timespan); //10m = 10 minutes, 1h = 1 hour, 1172 ms, etc
DateMinus(String timespan); //propose explicit minus when using string 
shorthand for time periods

2. Date Comparison (when you don't have full strings that you can use string 
compare with):

DateIsBefore(String dateString); //true if lhs is before rhs
DateIsAfter(String dateString); //true if lhs is after rhs

3. date trunc functions:

takes partial ISO8601 strings and truncates to:

toMinute(String dateString);
toHour(String dateString);
toDay(String dateString);
toWeek(String dateString);
toMonth(String dateString);
toYear(String dateString);

if any/all are helpful, I'm happy to contribute to pig
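
A rough Java sketch of how the three groups of functions above might behave. 
It is not the proposed UDF code: it uses java.time rather than JodaTime so 
that it is self-contained, the class DateSketch and its method names are 
hypothetical, the shorthand grammar in parseSpan is an assumption, and only 
full timestamps (not partial strings such as 2009-01-02) are handled.

```java
import java.time.Duration;
import java.time.Instant;
import java.time.temporal.ChronoUnit;

/** Hypothetical sketch of the proposed date helpers; names are illustrative. */
final class DateSketch {
    /** Shift an ISO 8601 timestamp by a signed number of milliseconds. */
    static Instant datePlus(String isoDate, long millis) {
        return Instant.parse(isoDate).plusMillis(millis);
    }

    /** Shift an ISO 8601 timestamp by a shorthand timespan such as "10m". */
    static Instant datePlus(String isoDate, String timespan) {
        return Instant.parse(isoDate).plus(parseSpan(timespan));
    }

    /** Parse shorthand such as "10m", "1h" or "1172 ms" (assumed grammar). */
    static Duration parseSpan(String span) {
        String s = span.trim();
        if (s.endsWith("ms")) {
            return Duration.ofMillis(Long.parseLong(s.substring(0, s.length() - 2).trim()));
        }
        long n = Long.parseLong(s.substring(0, s.length() - 1).trim());
        switch (s.charAt(s.length() - 1)) {
            case 's': return Duration.ofSeconds(n);
            case 'm': return Duration.ofMinutes(n);
            case 'h': return Duration.ofHours(n);
            case 'd': return Duration.ofDays(n);
            default: throw new IllegalArgumentException("unknown unit in " + span);
        }
    }

    /** True if lhs is strictly before rhs. */
    static boolean isBefore(String lhs, String rhs) {
        return Instant.parse(lhs).isBefore(Instant.parse(rhs));
    }

    /** Truncate a timestamp to the start of its hour (UTC). */
    static Instant toHour(String isoDate) {
        return Instant.parse(isoDate).truncatedTo(ChronoUnit.HOURS);
    }
}
```

Comparing parsed instants rather than raw strings is what allows the 
comparison functions to work when the two inputs are not formatted 
identically.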




[jira] Updated: (PIG-1106) FR join should not spill

2009-12-14 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-1106:


Resolution: Fixed
Status: Resolved  (was: Patch Available)

patch committed. Thanks, Ankit!

 FR join should not spill
 

 Key: PIG-1106
 URL: https://issues.apache.org/jira/browse/PIG-1106
 Project: Pig
  Issue Type: Bug
Reporter: Olga Natkovich
Assignee: Ankit Modi
 Fix For: 0.7.0

 Attachments: frjoin-nonspill.patch


 Currently, the values for the replicated side of the data are placed in a 
 spillable bag (POFRJoin near line 275). This does not make sense because the 
 whole point of the optimization is that the data on one side fits into 
 memory. We already have a non-spillable bag implemented 
 (NonSpillableDataBag.java) and we need to change FRJoin code to use it. And 
 of course need to do lots of testing to make sure that we don't spill but die 
 instead when we run out of memory




[jira] Updated: (PIG-1016) Reading in map data seems broken

2009-12-14 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-1016:


Fix Version/s: (was: 0.5.0)
   0.7.0

 Reading in map data seems broken
 

 Key: PIG-1016
 URL: https://issues.apache.org/jira/browse/PIG-1016
 Project: Pig
  Issue Type: Improvement
  Components: data
Affects Versions: 0.4.0
Reporter: hc busy
 Fix For: 0.7.0

 Attachments: PIG-1016.patch


 Hi, I'm trying to load a map that has a tuple for a value. The read fails in 
 0.4.0 because of a misconfiguration in the parser, whereas almost all of the 
 documentation states that the value of a map can be any type.
 I've attached a patch that allows us to read in complex objects as values, as 
 documented. I've done simple verification of loading in maps with tuple/map 
 values and writing them back out using LOAD and STORE. All seems to work fine.




[jira] Updated: (PIG-1054) Pig Site - updates for 5.0

2009-12-14 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-1054:


Resolution: Fixed
Status: Resolved  (was: Patch Available)

The changes have already been applied.

 Pig Site - updates for 5.0
 --

 Key: PIG-1054
 URL: https://issues.apache.org/jira/browse/PIG-1054
 Project: Pig
  Issue Type: Task
  Components: documentation
Affects Versions: 0.5.0
Reporter: Corinne Chandel
Priority: Blocker
 Fix For: 0.5.0

 Attachments: pig-1054.patch


 Pig Site - updates for 5.0
  remove broken link
  update formatting for headers




[jira] Updated: (PIG-1143) Poisson Sample Loader should compute the number of samples required only once

2009-12-14 Thread Sriranjan Manjunath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sriranjan Manjunath updated PIG-1143:
-

Attachment: PIG_1143.patch

 Poisson Sample Loader should compute the number of samples required only once
 -

 Key: PIG-1143
 URL: https://issues.apache.org/jira/browse/PIG-1143
 Project: Pig
  Issue Type: Bug
Reporter: Sriranjan Manjunath
Assignee: Sriranjan Manjunath
 Attachments: PIG_1143.patch


 The current poisson sampler forces each of the maps to compute the sample 
 number. This is redundant and causes issues when a large directory is 
 specified in the join. The sampler should be changed to calculate the sample 
 count only once and this information should be shared with the remaining 
 mappers.




[jira] Updated: (PIG-1143) Poisson Sample Loader should compute the number of samples required only once

2009-12-14 Thread Sriranjan Manjunath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sriranjan Manjunath updated PIG-1143:
-

Status: Patch Available  (was: Open)




[jira] Updated: (PIG-1082) Modify Comparator to work with a typed textual Storage

2009-12-14 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-1082:


Fix Version/s: (was: 0.0.0)

 Modify Comparator to work with a typed textual Storage
 --

 Key: PIG-1082
 URL: https://issues.apache.org/jira/browse/PIG-1082
 Project: Pig
  Issue Type: Sub-task
Affects Versions: 0.4.0
Reporter: hc busy
 Attachments: PIG-1082.patch

   Original Estimate: 5h
  Remaining Estimate: 5h

 See parent bug. This ticket is for just the comparator change, which needs to 
 be made in order for the nested data structures to sort right




[jira] Commented: (PIG-973) type resolution inconsistency

2009-12-14 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12790494#action_12790494
 ] 

Olga Natkovich commented on PIG-973:


+1 on the changes. I will be committing the patch shortly

 type resolution inconsistency
 -

 Key: PIG-973
 URL: https://issues.apache.org/jira/browse/PIG-973
 Project: Pig
  Issue Type: Bug
Reporter: Olga Natkovich
Assignee: Richard Ding
 Attachments: PIG-973.patch


 This script works:
 A = load 'test' using PigStorage(':') as (name: chararray, age: int, gpa: float);
 B = group A by age;
 C = foreach B {
     D = filter A by gpa  2.5;
     E = order A by name;
     F = A.age;
     describe F;
     G = distinct F;
     generate group, COUNT(D), MAX (E.name), MIN(G.$0);}
 dump C;
 This one produces an error:
 A = load 'test' using PigStorage(':') as (name: chararray, age: int, gpa: float);
 B = group A by age;
 C = foreach B {
     D = filter A by gpa  2.5;
     E = order A by name;
     F = A.age;
     G = distinct F;
     generate group, COUNT(D), MAX (E.name), MIN(G);}
 dump C;
 Notice the difference in how MIN is passed the data.




[jira] Updated: (PIG-1151) Date Conversion + Arithmetic UDFs

2009-12-14 Thread sam rash (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sam rash updated PIG-1151:
--

Summary: Date Conversion + Arithmetic UDFs  (was: Data Conversion + 
Arithmetic UDFs)

 Date Conversion + Arithmetic UDFs
 -

 Key: PIG-1151
 URL: https://issues.apache.org/jira/browse/PIG-1151
 Project: Pig
  Issue Type: New Feature
Affects Versions: 0.5.0
Reporter: sam rash
Priority: Minor

 I would like to offer up some very simple date UDFs I have that wrap JodaTime 
 (Apache 2.0 license, http://joda-time.sourceforge.net/license.html) and 
 operate on ISO8601 date strings (for piggybank). Please advise if these are 
 appropriate.
 1. Date arithmetic
 Takes an input string:
 2009-01-01T13:43:33.000Z
 (and partial ones such as 2009-01-02)
 and a timespan (as millis or as string shorthand), and returns an ISO8601 
 string that adjusts the input date by the specified timespan.
 DatePlus(long timeMs); // + or - number works, is the # of millis
 DatePlus(String timespan); // 10m = 10 minutes, 1h = 1 hour, 1172 ms, etc
 DateMinus(String timespan); // propose explicit minus when using string shorthand for time periods
 2. Date comparison (when you don't have full strings that you can use string 
 compare with):
 DateIsBefore(String dateString); // true if lhs is before rhs
 DateIsAfter(String dateString); // true if lhs is after rhs
 3. Date trunc functions
 Take partial ISO8601 strings and truncate to:
 toMinute(String dateString);
 toHour(String dateString);
 toDay(String dateString);
 toWeek(String dateString);
 toMonth(String dateString);
 toYear(String dateString);
 If any or all of these are helpful, I'm happy to contribute them to Pig.
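
 To make the proposal concrete, here is a minimal sketch of three of the 
 functions. It is only an illustration under assumptions: it uses java.time 
 instead of JodaTime so that it is self-contained, the class IsoDates and the 
 two-argument signatures are hypothetical (a real UDF would read the date 
 from its input tuple), and output is always rendered in UTC with 
 milliseconds.

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.time.temporal.ChronoUnit;

// Hypothetical helper class; not part of Pig or piggybank.
class IsoDates {
    // Fixed output shape, e.g. 2009-01-01T13:43:33.000Z
    private static final DateTimeFormatter OUT =
        DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss.SSS'Z'")
                         .withZone(ZoneOffset.UTC);

    // Accepts a full timestamp or a date-only string such as 2009-01-02.
    static Instant parse(String s) {
        if (s.length() == 10) {
            return LocalDate.parse(s).atStartOfDay(ZoneOffset.UTC).toInstant();
        }
        return Instant.parse(s);
    }

    // Date arithmetic: a positive or negative number of milliseconds.
    static String datePlus(String date, long millis) {
        return OUT.format(parse(date).plusMillis(millis));
    }

    // Date comparison that also works on partial strings.
    static boolean dateIsBefore(String lhs, String rhs) {
        return parse(lhs).isBefore(parse(rhs));
    }

    // Truncation to the start of the hour.
    static String toHour(String date) {
        return OUT.format(parse(date).truncatedTo(ChronoUnit.HOURS));
    }
}
```

 For example, IsoDates.datePlus("2009-01-01T13:43:33.000Z", 600000L) moves 
 the timestamp forward by ten minutes.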




[jira] Updated: (PIG-1075) Error in Cogroup when key fields types don't match

2009-12-14 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1075:
--

Attachment: PIG-1075.patch

This patch moves the error up to the parser and gives a better error message 
for a cogroup statement with incompatible group types:

Caused by: org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1110: Cogroup column 1 has incompatible types: chararray versus int
    at org.apache.pig.impl.logicalLayer.LOCogroup.getTupleGroupBySchema(LOCogroup.java:499)
    at org.apache.pig.impl.logicalLayer.LOCogroup.getSchema(LOCogroup.java:325)
    at org.apache.pig.impl.logicalLayer.parser.QueryParser.Parse(QueryParser.java:779)
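
The kind of early check described above can be sketched as follows. This is 
not the code in the patch: the class CogroupKeyCheck and the method 
firstMismatch are hypothetical, types are represented as plain strings, the 
column index is assumed to be zero-based, and strict equality is used even 
though Pig may allow some implicit casts between key types.

```java
import java.util.List;

// Hypothetical sketch of an up-front key type comparison.
class CogroupKeyCheck {
    // Returns null when the key types line up, otherwise a message that
    // names the first mismatching column.
    static String firstMismatch(List<String> left, List<String> right) {
        if (left.size() != right.size()) {
            return "Cogroup inputs have different numbers of key columns: "
                + left.size() + " versus " + right.size();
        }
        for (int i = 0; i < left.size(); i++) {
            if (!left.get(i).equals(right.get(i))) {
                return "Cogroup column " + i + " has incompatible types: "
                    + left.get(i) + " versus " + right.get(i);
            }
        }
        return null;
    }
}
```

Reporting the column and both types at this point lets the user find the 
mismatch without reading a type-checker stack trace.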

 Error in Cogroup when key fields types don't match
 --

 Key: PIG-1075
 URL: https://issues.apache.org/jira/browse/PIG-1075
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.5.0
Reporter: Ankur
 Attachments: PIG-1075.patch


 When Cogrouping 2 relations on multiple key fields, pig throws an error if 
 the corresponding types don't match. 
 Consider the following script:-
 A = LOAD 'data' USING PigStorage() as (a:chararray, b:int, c:int);
 B = LOAD 'data' USING PigStorage() as (a:chararray, b:chararray, c:int);
 C = CoGROUP A BY (a,b,c), B BY (a,b,c);
 D = FOREACH C GENERATE FLATTEN(A), FLATTEN(B);
 describe D;
 dump D;
 The complete stack trace of the error thrown is
 Pig Stack Trace
 ---
 ERROR 1051: Cannot cast to Unknown
 org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1001: Unable to describe schema for alias D
     at org.apache.pig.PigServer.dumpSchema(PigServer.java:436)
     at org.apache.pig.tools.grunt.GruntParser.processDescribe(GruntParser.java:233)
     at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:253)
     at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:168)
     at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:144)
     at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:89)
     at org.apache.pig.Main.main(Main.java:397)
 Caused by: org.apache.pig.impl.plan.PlanValidationException: ERROR 0: An unexpected exception caused the validation to stop
     at org.apache.pig.impl.plan.PlanValidator.validateSkipCollectException(PlanValidator.java:104)
     at org.apache.pig.impl.logicalLayer.validators.TypeCheckingValidator.validate(TypeCheckingValidator.java:40)
     at org.apache.pig.impl.logicalLayer.validators.TypeCheckingValidator.validate(TypeCheckingValidator.java:30)
     at org.apache.pig.impl.logicalLayer.validators.LogicalPlanValidationExecutor.validate(LogicalPlanValidationExecutor.java:83)
     at org.apache.pig.PigServer.compileLp(PigServer.java:821)
     at org.apache.pig.PigServer.dumpSchema(PigServer.java:428)
     ... 6 more
 Caused by: org.apache.pig.impl.logicalLayer.validators.TypeCheckerException: ERROR 1060: Cannot resolve COGroup output schema
     at org.apache.pig.impl.logicalLayer.validators.TypeCheckingVisitor.visit(TypeCheckingVisitor.java:2463)
     at org.apache.pig.impl.logicalLayer.LOCogroup.visit(LOCogroup.java:372)
     at org.apache.pig.impl.logicalLayer.LOCogroup.visit(LOCogroup.java:45)
     at org.apache.pig.impl.plan.DependencyOrderWalker.walk(DependencyOrderWalker.java:69)
     at org.apache.pig.impl.plan.PlanVisitor.visit(PlanVisitor.java:51)
     at org.apache.pig.impl.plan.PlanValidator.validateSkipCollectException(PlanValidator.java:101)
     ... 11 more
 Caused by: org.apache.pig.impl.logicalLayer.validators.TypeCheckerException: ERROR 1051: Cannot cast to Unknown
     at org.apache.pig.impl.logicalLayer.validators.TypeCheckingVisitor.insertAtomicCastForCOGroupInnerPlan(TypeCheckingVisitor.java:2552)
     at org.apache.pig.impl.logicalLayer.validators.TypeCheckingVisitor.visit(TypeCheckingVisitor.java:2451)
     ... 16 more
 The error message does not help the user identify the issue, especially if 
 the Pig script is large and complex.




[jira] Assigned: (PIG-1075) Error in Cogroup when key fields types don't match

2009-12-14 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding reassigned PIG-1075:
-

Assignee: Richard Ding

 Error in Cogroup when key fields types don't match
 --

 Key: PIG-1075
 URL: https://issues.apache.org/jira/browse/PIG-1075
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.5.0
Reporter: Ankur
Assignee: Richard Ding
 Attachments: PIG-1075.patch





[jira] Updated: (PIG-1075) Error in Cogroup when key fields types don't match

2009-12-14 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1075:
--

Status: Patch Available  (was: Open)

 Error in Cogroup when key fields types don't match
 --

 Key: PIG-1075
 URL: https://issues.apache.org/jira/browse/PIG-1075
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.5.0
Reporter: Ankur
Assignee: Richard Ding
 Attachments: PIG-1075.patch





[jira] Updated: (PIG-973) type resolution inconsistency

2009-12-14 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-973:
---

Resolution: Fixed
Status: Resolved  (was: Patch Available)

patch committed. thanks, Richard

 type resolution inconsistency
 -

 Key: PIG-973
 URL: https://issues.apache.org/jira/browse/PIG-973
 Project: Pig
  Issue Type: Bug
Reporter: Olga Natkovich
Assignee: Richard Ding
 Attachments: PIG-973.patch





[jira] Updated: (PIG-973) type resolution inconsistency

2009-12-14 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-973:
---

Fix Version/s: 0.7.0

 type resolution inconsistency
 -

 Key: PIG-973
 URL: https://issues.apache.org/jira/browse/PIG-973
 Project: Pig
  Issue Type: Bug
Reporter: Olga Natkovich
Assignee: Richard Ding
 Fix For: 0.7.0

 Attachments: PIG-973.patch





[jira] Updated: (PIG-1144) set default_parallelism construct does not set the number of reducers correctly

2009-12-14 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-1144:


Status: Open  (was: Patch Available)

 set default_parallelism construct does not set the number of reducers 
 correctly
 ---

 Key: PIG-1144
 URL: https://issues.apache.org/jira/browse/PIG-1144
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.6.0
 Environment: Hadoop 20 cluster with multi-node installation
Reporter: Viraj Bhat
Assignee: Daniel Dai
 Fix For: 0.7.0

 Attachments: brokenparallel.out, genericscript_broken_parallel.pig, 
 PIG-1144-1.patch, PIG-1144-2.patch, PIG-1144-3.patch


 Hi all,
  I have a Pig script where I set the parallelism using the following set 
 construct: set default_parallel 100. I modified MRPrinter.java to print out 
 the parallelism:
 {code}
 ...
 public void visitMROp(MapReduceOper mr)
 mStream.println("MapReduce node " + mr.getOperatorKey().toString() + " Parallelism " + mr.getRequestedParallelism());
 ...
 {code}
 When I run an explain on the script, I see that the last job, which does the 
 actual sort, runs as a single-reducer job. This can be corrected by adding 
 the PARALLEL keyword to the ORDER BY.
 Attaching the script and the explain output.
 Viraj




[jira] Commented: (PIG-1110) Handle compressed file formats -- Gz, BZip with the new proposal

2009-12-14 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12790529#action_12790529
 ] 

Jeff Zhang commented on PIG-1110:
-

Response to Richard,

1. If you are worried about the API compatibility of PigStorage(), since 
PigStorage() is the default LoadFunc of Pig, another option is to provide a 
separate LoadFunc that supports compression, for example a new LoadFunc such 
as Bz2PigStorage(). 

2. The file name in the Store statement is actually the folder name, not the 
file name; we will get part-0.bz2 under this folder. part-0.bz2 is the 
real file that is consumed by Hadoop. Hadoop checks the file name rather 
than the folder name to determine the compression codec.
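
The second point can be illustrated with a small stand-alone sketch. It does 
not call or reproduce Hadoop's own codec lookup; the class CodecByName, the 
method codecFor and the returned labels are hypothetical, and only the .bz2 
and .gz extensions are assumed.

```java
// Hypothetical illustration: the codec is picked from the last path
// component (the file), never from the enclosing folder name.
class CodecByName {
    static String codecFor(String path) {
        String name = path.substring(path.lastIndexOf('/') + 1);
        if (name.endsWith(".bz2")) {
            return "bzip2";
        }
        if (name.endsWith(".gz")) {
            return "gzip";
        }
        return "none";
    }
}
```

Under this rule a folder named out.bz2 holding a file part-0 is treated as 
uncompressed, while out/part-0.bz2 is treated as bzip2.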



 Handle compressed file formats -- Gz, BZip with the new proposal
 

 Key: PIG-1110
 URL: https://issues.apache.org/jira/browse/PIG-1110
 Project: Pig
  Issue Type: Sub-task
Reporter: Richard Ding
Assignee: Richard Ding
 Attachments: PIG-1110.patch, PIG_1110_Jeff.patch







[jira] Updated: (PIG-965) PERFORMANCE: optimize common case in matches (PORegex)

2009-12-14 Thread Ankit Modi (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ankit Modi updated PIG-965:
---

Status: Open  (was: Patch Available)

 PERFORMANCE: optimize common case in matches (PORegex)
 --

 Key: PIG-965
 URL: https://issues.apache.org/jira/browse/PIG-965
 Project: Pig
  Issue Type: Improvement
  Components: impl
Reporter: Thejas M Nair
Assignee: Ankit Modi
 Attachments: automaton.jar, poregex2.patch


 Some frequently seen use cases of the 'matches' comparison operator have the 
 following properties -
 1. The rhs is a constant string, e.g. c1 matches 'abc%' 
 2. Regexes that look for a matching prefix, suffix, etc. are very common, 
 e.g. 'abc%', '%abc', '%abc%' 
 To optimize for these common cases, PORegex.java can be changed to -
 1. Compile the pattern (rhs of matches) and re-use it if the pattern string 
 has not changed. 
 2. Use string comparisons for simple common regexes (in 2 above).
 The implementation of the Hive LIKE clause uses similar optimizations.
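
 The two optimizations listed above could be sketched roughly as follows. 
 This is a minimal illustration, not the PORegex patch: the class 
 CachedMatcher, its fields and the compileCount counter are hypothetical, and 
 it assumes the simple cases are written as Java regexes of the form abc.*, 
 .*abc and .*abc.* whose literal part is purely alphanumeric.

```java
import java.util.regex.Pattern;

// Hypothetical sketch; not the actual PORegex implementation.
class CachedMatcher {
    private enum Kind { PREFIX, SUFFIX, CONTAINS, REGEX }

    private String lastRegex;   // pattern string seen on the previous call
    private Kind kind;          // how lastRegex is evaluated
    private String literal;     // literal part for the string-comparison kinds
    private Pattern compiled;   // compiled pattern for the general case
    int compileCount = 0;       // exposed only so the caching can be observed

    boolean matches(String input, String regex) {
        if (!regex.equals(lastRegex)) {
            prepare(regex);     // classify or compile only when the pattern changes
        }
        if (kind == Kind.REGEX) {
            return compiled.matcher(input).matches();
        }
        // '.' does not match line terminators by default, so none of the
        // simple forms can match an input that contains one.
        if (hasLineTerminator(input)) {
            return false;
        }
        switch (kind) {
            case PREFIX: return input.startsWith(literal);
            case SUFFIX: return input.endsWith(literal);
            default:     return input.contains(literal);
        }
    }

    private void prepare(String regex) {
        lastRegex = regex;
        boolean lead = regex.startsWith(".*");
        boolean trail = regex.endsWith(".*");
        int start = lead ? 2 : 0;
        int end = trail ? regex.length() - 2 : regex.length();
        if ((lead || trail) && start < end && isPlain(regex.substring(start, end))) {
            literal = regex.substring(start, end);
            kind = (lead && trail) ? Kind.CONTAINS : (lead ? Kind.SUFFIX : Kind.PREFIX);
        } else {
            compiled = Pattern.compile(regex);
            compileCount++;
            kind = Kind.REGEX;
        }
    }

    // True when every character is a letter or digit, i.e. no regex syntax.
    private static boolean isPlain(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (!Character.isLetterOrDigit(s.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    private static boolean hasLineTerminator(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '\n' || c == '\r' || c == 0x85 || c == 0x2028 || c == 0x2029) {
                return true;
            }
        }
        return false;
    }
}
```

 Anything that is not one of the three simple forms falls back to 
 java.util.regex, so the fast path only changes how a match is computed, not 
 what matches.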




[jira] Commented: (PIG-965) PERFORMANCE: optimize common case in matches (PORegex)

2009-12-14 Thread Ankit Modi (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12790545#action_12790545
 ] 

Ankit Modi commented on PIG-965:


* NonConstantRegex - I did not think of equals. But I added a length check 
first, as it can detect a change in length faster and, to the best of my 
knowledge, it is a getter method. And yes, as you mentioned, equals will check 
for the same object and instanceof, which is not useful in our case.

* The numbers published above use dk.brics.automaton.RunAutomaton. Do you 
want me to publish numbers for a larger set of regexes?

I'll create a patch for the rest of the comments.

 PERFORMANCE: optimize common case in matches (PORegex)
 --

 Key: PIG-965
 URL: https://issues.apache.org/jira/browse/PIG-965
 Project: Pig
  Issue Type: Improvement
  Components: impl
Reporter: Thejas M Nair
Assignee: Ankit Modi
 Attachments: automaton.jar, poregex2.patch





[jira] Updated: (PIG-1149) Allow instantiation of SampleLoaders with parametrized LoadFuncs

2009-12-14 Thread Dmitriy V. Ryaboy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy V. Ryaboy updated PIG-1149:
---

Attachment: pig_1149.patch

 Allow instantiation of SampleLoaders with parametrized LoadFuncs
 

 Key: PIG-1149
 URL: https://issues.apache.org/jira/browse/PIG-1149
 Project: Pig
  Issue Type: Bug
Reporter: Dmitriy V. Ryaboy
Assignee: Dmitriy V. Ryaboy
Priority: Minor
 Fix For: 0.7.0

 Attachments: pig_1149.patch


 Currently, it is not possible to instantiate a SampleLoader with something 
 like PigStorage(':').  We should allow passing parameters to the loaders 
 being sampled.




[jira] Updated: (PIG-1149) Allow instantiation of SampleLoaders with parametrized LoadFuncs

2009-12-14 Thread Dmitriy V. Ryaboy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy V. Ryaboy updated PIG-1149:
---

Status: Patch Available  (was: Open)

Because the string is parsed a few times along the way, three backslashes 
need to precede the escaped quote in PigLatin, which means six backslashes when 
expressing PigLatin as a string in Java. 
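
The doubling can be checked with a few lines of Java. This sketch only 
demonstrates the backslash count; the class BackslashDemo and its members are 
hypothetical and no Pig parsing is involved.

```java
// Hypothetical demo of how Java source escaping halves the backslashes.
class BackslashDemo {
    // Java source: six backslashes followed by a single quote. The compiled
    // string holds three backslashes and the quote.
    static final String ESCAPED_QUOTE = "\\\\\\'";

    // Counts the backslash characters actually present in the string.
    static long backslashes(String s) {
        return s.chars().filter(c -> c == '\\').count();
    }
}
```

Here BackslashDemo.ESCAPED_QUOTE has four characters in total: the three 
backslashes that reach the PigLatin text, plus the quote.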

 Allow instantiation of SampleLoaders with parametrized LoadFuncs
 

 Key: PIG-1149
 URL: https://issues.apache.org/jira/browse/PIG-1149
 Project: Pig
  Issue Type: Bug
Reporter: Dmitriy V. Ryaboy
Assignee: Dmitriy V. Ryaboy
Priority: Minor
 Fix For: 0.7.0

 Attachments: pig_1149.patch





[jira] Commented: (PIG-1143) Poisson Sample Loader should compute the number of samples required only once

2009-12-14 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12790566#action_12790566
 ] 

Hadoop QA commented on PIG-1143:


+1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12427980/PIG_1143.patch
  against trunk revision 890553.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 6 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

+1 core tests.  The patch passed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/123/testReport/
Findbugs warnings: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/123/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/123/console

This message is automatically generated.

 Poisson Sample Loader should compute the number of samples required only once
 -

 Key: PIG-1143
 URL: https://issues.apache.org/jira/browse/PIG-1143
 Project: Pig
  Issue Type: Bug
Reporter: Sriranjan Manjunath
Assignee: Sriranjan Manjunath
 Attachments: PIG_1143.patch





[jira] Created: (PIG-1152) bincond operator throws parser error

2009-12-14 Thread Ankur (JIRA)
bincond operator throws parser error


 Key: PIG-1152
 URL: https://issues.apache.org/jira/browse/PIG-1152
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.6.0
Reporter: Ankur


The bincond operator throws a parser error when the true condition contains a 
constant bag with 1 tuple containing a single int field with a negative value. 

Here is the script to reproduce the issue:

A = load 'A' as (s: chararray, x: int, y: int);
B = group A by s;
C = foreach B generate group, flatten(((COUNT(A)  1L) ? {(-1)} : A.x));
dump C;

