[jira] Updated: (PIG-965) PERFORMANCE: optimize common case in matches (PORegex)
[ https://issues.apache.org/jira/browse/PIG-965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankit Modi updated PIG-965: --- Attachment: (was: poregex2.patch) PERFORMANCE: optimize common case in matches (PORegex) -- Key: PIG-965 URL: https://issues.apache.org/jira/browse/PIG-965 Project: Pig Issue Type: Improvement Components: impl Reporter: Thejas M Nair Assignee: Ankit Modi Attachments: automaton.jar, poregex2.patch Some frequently seen use cases of the 'matches' comparison operator have the following properties: 1. The rhs is a constant string, e.g. c1 matches 'abc%'. 2. Regexes that look for a matching prefix, suffix, etc. are very common, e.g. 'abc%', '%abc', '%abc%'. To optimize for these common cases, PORegex.java can be changed to: 1. Compile the pattern (rhs of matches) once and re-use it if the pattern string has not changed. 2. Use string comparisons for the simple common regexes (in 2 above). The implementation of the Hive LIKE clause uses similar optimizations. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
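The two optimizations proposed in PIG-965 can be sketched roughly as follows. This is not the code in poregex2.patch; the class name CachedMatcher and all of its members are hypothetical. It assumes the simple patterns are written in java.util.regex syntax (abc.*, .*abc, .*abc.*) rather than with the % shown in the issue text, and it compiles the fallback pattern with Pattern.DOTALL so that ".*" behaves the same way on both paths.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper illustrating the idea; not the actual PORegex code. */
public class CachedMatcher {
    // Recognizes a plain literal with an optional ".*" on either side.
    // The literal may only contain characters that have no regex meaning.
    private static final Pattern SIMPLE =
            Pattern.compile("(\\.\\*)?([A-Za-z0-9 _-]+)(\\.\\*)?");

    private String lastPattern;  // pattern string seen on the previous call
    private Pattern compiled;    // compiled form, used only on the fallback path
    private String literal;      // literal part when the fast path applies, else null
    private boolean anyBefore;   // pattern starts with ".*"
    private boolean anyAfter;    // pattern ends with ".*"

    public boolean matches(String input, String pattern) {
        if (!pattern.equals(lastPattern)) {
            prepare(pattern);    // redo the analysis only when the pattern string changes
        }
        if (literal == null) {
            return compiled.matcher(input).matches();
        }
        // Fast path: plain string comparisons instead of the regex engine.
        if (anyBefore && anyAfter) return input.contains(literal);
        if (anyBefore) return input.endsWith(literal);
        if (anyAfter) return input.startsWith(literal);
        return input.equals(literal);
    }

    private void prepare(String pattern) {
        lastPattern = pattern;
        Matcher m = SIMPLE.matcher(pattern);
        if (m.matches()) {
            anyBefore = m.group(1) != null;
            literal = m.group(2);
            anyAfter = m.group(3) != null;
            compiled = null;
        } else {
            literal = null;
            // Assumption: DOTALL keeps ".*" consistent with the fast path,
            // which ignores line terminators.
            compiled = Pattern.compile(pattern, Pattern.DOTALL);
        }
    }
}
```

Because the constant rhs is the common case, the analysis in prepare() would normally run once per operator instance, and every later tuple pays only for a string comparison or a match against the already compiled pattern.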
[jira] Updated: (PIG-965) PERFORMANCE: optimize common case in matches (PORegex)
[ https://issues.apache.org/jira/browse/PIG-965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankit Modi updated PIG-965: --- Status: Open (was: Patch Available) PERFORMANCE: optimize common case in matches (PORegex) -- Key: PIG-965 URL: https://issues.apache.org/jira/browse/PIG-965 Project: Pig Issue Type: Improvement Components: impl Reporter: Thejas M Nair Assignee: Ankit Modi Attachments: automaton.jar, poregex2.patch
[jira] Updated: (PIG-965) PERFORMANCE: optimize common case in matches (PORegex)
[ https://issues.apache.org/jira/browse/PIG-965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankit Modi updated PIG-965: --- Attachment: poregex2.patch PERFORMANCE: optimize common case in matches (PORegex) -- Key: PIG-965 URL: https://issues.apache.org/jira/browse/PIG-965 Project: Pig Issue Type: Improvement Components: impl Reporter: Thejas M Nair Assignee: Ankit Modi Attachments: automaton.jar, poregex2.patch
[jira] Updated: (PIG-965) PERFORMANCE: optimize common case in matches (PORegex)
[ https://issues.apache.org/jira/browse/PIG-965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankit Modi updated PIG-965: --- Status: Patch Available (was: Open) I have included changes suggested by Thejas. PERFORMANCE: optimize common case in matches (PORegex) -- Key: PIG-965 URL: https://issues.apache.org/jira/browse/PIG-965 Project: Pig Issue Type: Improvement Components: impl Reporter: Thejas M Nair Assignee: Ankit Modi Attachments: automaton.jar, poregex2.patch
[jira] Commented: (PIG-965) PERFORMANCE: optimize common case in matches (PORegex)
[ https://issues.apache.org/jira/browse/PIG-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12790104#action_12790104 ] Hadoop QA commented on PIG-965: --- -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12427913/poregex2.patch against trunk revision 889870. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. -1 findbugs. The patch appears to cause Findbugs to fail. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/121/testReport/ Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/121/console This message is automatically generated. PERFORMANCE: optimize common case in matches (PORegex) -- Key: PIG-965 URL: https://issues.apache.org/jira/browse/PIG-965 Project: Pig Issue Type: Improvement Components: impl Reporter: Thejas M Nair Assignee: Ankit Modi Attachments: automaton.jar, poregex2.patch
[jira] Updated: (PIG-973) type resolution inconsistency
[ https://issues.apache.org/jira/browse/PIG-973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Ding updated PIG-973: - Attachment: PIG-973.patch type resolution inconsistency - Key: PIG-973 URL: https://issues.apache.org/jira/browse/PIG-973 Project: Pig Issue Type: Bug Reporter: Olga Natkovich Assignee: Richard Ding Attachments: PIG-973.patch This script works: A = load 'test' using PigStorage(':') as (name: chararray, age: int, gpa: float); B = group A by age; C = foreach B { D = filter A by gpa 2.5; E = order A by name; F = A.age; describe F; G = distinct F; generate group, COUNT(D), MAX (E.name), MIN(G.$0);} dump C; This one produces an error: A = load 'test' using PigStorage(':') as (name: chararray, age: int, gpa: float); B = group A by age; C = foreach B { D = filter A by gpa 2.5; E = order A by name; F = A.age; G = distinct F; generate group, COUNT(D), MAX (E.name), MIN(G);} dump C; Notice the difference in how MIN is passed the data. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1148) Move splitable logic from pig latin to InputFormat
[ https://issues.apache.org/jira/browse/PIG-1148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12790273#action_12790273 ] Pradeep Kamath commented on PIG-1148: - Hi Jeff, With the new load-store redesign (http://wiki.apache.org/pig/LoadStoreRedesignProposal), wouldn't this be achieved implicitly, since the splits used by Pig will be the ones returned from the InputFormat associated with the Loader? The plan was to remove SPLIT by 'file' from the language, since with the new load-store design it will not be possible to support this from Pig. So there will be no splitable logic left with that approach. Were you thinking of some other way to support split by file? Move splitable logic from pig latin to InputFormat -- Key: PIG-1148 URL: https://issues.apache.org/jira/browse/PIG-1148 Project: Pig Issue Type: Sub-task Reporter: Jeff Zhang Assignee: Jeff Zhang
[jira] Commented: (PIG-1090) Update sources to reflect recent changes in load-store interfaces
[ https://issues.apache.org/jira/browse/PIG-1090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12790334#action_12790334 ] Richard Ding commented on PIG-1090: --- The problem is that the getters/setters for the internal members are also defined for ResourceSchema. I felt that we should choose one way to access the internal members. Update sources to reflect recent changes in load-store interfaces - Key: PIG-1090 URL: https://issues.apache.org/jira/browse/PIG-1090 Project: Pig Issue Type: Sub-task Reporter: Pradeep Kamath Assignee: Pradeep Kamath Attachments: PIG-1090-2.patch, PIG-1090-3.patch, PIG-1090.patch There have been some changes (as recorded in the Changes Section, Nov 2 2009 sub section of http://wiki.apache.org/pig/LoadStoreRedesignProposal) in the load/store interfaces. This jira is to track the task of making those changes under src. Changes under test will be addressed in a different jira.
[jira] Commented: (PIG-1090) Update sources to reflect recent changes in load-store interfaces
[ https://issues.apache.org/jira/browse/PIG-1090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12790356#action_12790356 ] Dmitriy V. Ryaboy commented on PIG-1090: Richard, I added the getters/setters so that ResourceSchema can be treated as a POJO, and standard serialization tools can easily interact with it, in PIG-760. Alan said in PIG-760 that he is fine with adding getters and setters, but feels strongly that direct access to these members should still be allowed, for simplicity's sake. I'm fine with the visibility being either way, as long as the getters/setters stay in (although perhaps protected would be a better choice than private). Update sources to reflect recent changes in load-store interfaces - Key: PIG-1090 URL: https://issues.apache.org/jira/browse/PIG-1090 Project: Pig Issue Type: Sub-task Reporter: Pradeep Kamath Assignee: Pradeep Kamath Attachments: PIG-1090-2.patch, PIG-1090-3.patch, PIG-1090.patch
[jira] Commented: (PIG-1110) Handle compressed file formats -- Gz, BZip with the new proposal
[ https://issues.apache.org/jira/browse/PIG-1110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12790362#action_12790362 ] Richard Ding commented on PIG-1110: --- Hi Jeff, I think it's a good idea to ask users to specify their intention in the PigStorage constructor (instead of using file extensions). The issue with this approach, however, is that the arguments to PigStorage constructors can only be Strings, so Pig determines the meanings of the arguments by their positions. Therefore we want to consider carefully what other arguments will need to be added to the constructor in the future and what their positions will be. As for forcing users to add .bz2 as the extension of the output files, this is actually necessary since Hadoop LineRecordReader (used internally by PigStorage) finds the relevant compression codec for the given file based on its filename suffix. So for now users must specify .bz2 as the extension of the output files if they want to store the files as BZip files. Handle compressed file formats -- Gz, BZip with the new proposal Key: PIG-1110 URL: https://issues.apache.org/jira/browse/PIG-1110 Project: Pig Issue Type: Sub-task Reporter: Richard Ding Assignee: Richard Ding Attachments: PIG-1110.patch, PIG_1110_Jeff.patch
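To illustrate why the output extension matters in the comment above, here is a minimal sketch of a suffix-based codec lookup. It does not call Hadoop; the class SuffixCodecLookup and the method codecFor are hypothetical stand-ins for the lookup that the reader performs, and the table lists only the two formats named in the issue title.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical stand-in for a filename-suffix based codec lookup. */
public class SuffixCodecLookup {
    // Suffix -> codec class name. Only Gz and BZip are listed because
    // those are the formats the issue is about.
    private static final Map<String, String> CODEC_BY_SUFFIX = new LinkedHashMap<>();
    static {
        CODEC_BY_SUFFIX.put(".gz", "org.apache.hadoop.io.compress.GzipCodec");
        CODEC_BY_SUFFIX.put(".bz2", "org.apache.hadoop.io.compress.BZip2Codec");
    }

    /** Returns the codec class name for the file, or null if it is read as plain text. */
    public static String codecFor(String fileName) {
        for (Map.Entry<String, String> e : CODEC_BY_SUFFIX.entrySet()) {
            if (fileName.endsWith(e.getKey())) {
                return e.getValue();
            }
        }
        return null;
    }
}
```

With a lookup of this kind, a BZip-compressed file stored without the .bz2 suffix would be read back as uncompressed text, which is the reason the extension is required for now.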
[jira] Created: (PIG-1149) Allow instantiation of SampleLoaders with parametrized LoadFuncs
Allow instantiation of SampleLoaders with parametrized LoadFuncs Key: PIG-1149 URL: https://issues.apache.org/jira/browse/PIG-1149 Project: Pig Issue Type: Bug Reporter: Dmitriy V. Ryaboy Assignee: Dmitriy V. Ryaboy Priority: Minor Fix For: 0.7.0 Currently, it is not possible to instantiate a SampleLoader with something like PigStorage(':'). We should allow passing parameters to the loaders being sampled. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1090) Update sources to reflect recent changes in load-store interfaces
[ https://issues.apache.org/jira/browse/PIG-1090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12790387#action_12790387 ] Pradeep Kamath commented on PIG-1090: - I committed the latest patch - thanks Richard! Looks like I committed it while the above discussion was still on - if things need to be changed, please attach a small patch for the same and I can commit it - if we decide to keep things the way they are now that's fine. Update sources to reflect recent changes in load-store interfaces - Key: PIG-1090 URL: https://issues.apache.org/jira/browse/PIG-1090 Project: Pig Issue Type: Sub-task Reporter: Pradeep Kamath Assignee: Pradeep Kamath Attachments: PIG-1090-2.patch, PIG-1090-3.patch, PIG-1090.patch
[jira] Created: (PIG-1150) VAR() Variance UDF
VAR() Variance UDF -- Key: PIG-1150 URL: https://issues.apache.org/jira/browse/PIG-1150 Project: Pig Issue Type: New Feature Affects Versions: 0.5.0 Environment: UDF, written in Pig 0.5 contrib/ Reporter: Russell Jurney Fix For: 0.5.0 I've implemented a UDF in Pig 0.5 that implements Algebraic and calculates variance in a distributed manner, based on the AVG() builtin. It works by calculating the count, sum and sum of squares, as described here: http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Parallel_algorithm Is this a worthwhile contribution? Taking the square root of this value using the contrib SQRT() function gives Standard Deviation, which is missing from Pig. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
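The count / sum / sum-of-squares approach described above can be sketched as a small combinable accumulator. This is not the submitted UDF and does not use Pig's Algebraic interface; the class VariancePartial and its methods are hypothetical, and it computes the population variance (dividing by n), which is an assumption about what the UDF returns.

```java
/** Hypothetical partial state for a distributed variance computation. */
public class VariancePartial {
    private long count;    // number of values seen
    private double sum;    // sum of the values
    private double sumSq;  // sum of the squared values

    /** Adds one value to this partial state (what each mapper would do per record). */
    public void add(double x) {
        count++;
        sum += x;
        sumSq += x * x;
    }

    /** Folds another partial state into this one (the combine step). */
    public VariancePartial merge(VariancePartial other) {
        count += other.count;
        sum += other.sum;
        sumSq += other.sumSq;
        return this;
    }

    /** Population variance: E[x^2] - (E[x])^2. Returns NaN when no values were added. */
    public double variance() {
        if (count == 0) {
            return Double.NaN;
        }
        double mean = sum / count;
        return sumSq / count - mean * mean;
    }

    /** Standard deviation as the square root of the variance. */
    public double stdDev() {
        return Math.sqrt(variance());
    }
}
```

The three running totals are all that has to travel between stages, which is what makes the computation fit an algebraic (initial / intermediate / final) shape. This textbook formula can lose precision when the mean is large relative to the spread of the data.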
[jira] Commented: (PIG-973) type resolution inconsistency
[ https://issues.apache.org/jira/browse/PIG-973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12790402#action_12790402 ] Hadoop QA commented on PIG-973: --- +1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12427943/PIG-973.patch against trunk revision 889870. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 7 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/122/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/122/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/122/console This message is automatically generated. 
[jira] Commented: (PIG-1150) VAR() Variance UDF
[ https://issues.apache.org/jira/browse/PIG-1150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12790403#action_12790403 ] Olga Natkovich commented on PIG-1150: - Yes, it is definitely worthwhile to contribute! VAR() Variance UDF -- Key: PIG-1150 URL: https://issues.apache.org/jira/browse/PIG-1150 Project: Pig Issue Type: New Feature Affects Versions: 0.5.0 Environment: UDF, written in Pig 0.5 contrib/ Reporter: Russell Jurney Fix For: 0.7.0
[jira] Updated: (PIG-1150) VAR() Variance UDF
[ https://issues.apache.org/jira/browse/PIG-1150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich updated PIG-1150: Fix Version/s: (was: 0.5.0) 0.7.0 Updating the fix version since it will go into the future version and will not be backported. VAR() Variance UDF -- Key: PIG-1150 URL: https://issues.apache.org/jira/browse/PIG-1150 Project: Pig Issue Type: New Feature Affects Versions: 0.5.0 Environment: UDF, written in Pig 0.5 contrib/ Reporter: Russell Jurney Fix For: 0.7.0
[jira] Updated: (PIG-1151) Data Conversion + Arithmetic UDFs
[ https://issues.apache.org/jira/browse/PIG-1151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sam rash updated PIG-1151: -- Priority: Minor (was: Major) Data Conversion + Arithmetic UDFs - Key: PIG-1151 URL: https://issues.apache.org/jira/browse/PIG-1151 Project: Pig Issue Type: New Feature Affects Versions: 0.5.0 Reporter: sam rash Priority: Minor I would like to offer up some very simple data UDFs I have that wrap JodaTime (Apache 2.0 license, http://joda-time.sourceforge.net/license.html) and operate on ISO8601 date strings (for piggybank). Please advise if these are appropriate. 1. Date Arithmetic: takes an input string, 2009-01-01T13:43:33.000Z (and partial ones such as 2009-01-02), and a timespan (as millis or as string shorthand), and returns an ISO8601 string that adjusts the input date by the specified timespan. DatePlus(long timeMs); // + or - number works, is the # of millis DatePlus(String timespan); //10m = 10 minutes, 1h = 1 hour, 1172 ms, etc DateMinus(String timespan); //propose explicit minus when using string shorthand for time periods 2. Date Comparison (when you don't have full strings that you can use string compare with): DateIsBefore(String dateString); //true if lhs is before rhs DateIsAfter(String dateString); //true if lhs is after rhs 3. Date trunc functions: take partial ISO8601 strings and truncate to: toMinute(String dateString); toHour(String dateString); toDay(String dateString); toWeek(String dateString); toMonth(String dateString); toYear(String dateString); If any/all are helpful, I'm happy to contribute to Pig.
[jira] Created: (PIG-1151) Data Conversion + Arithmetic UDFs
Data Conversion + Arithmetic UDFs - Key: PIG-1151 URL: https://issues.apache.org/jira/browse/PIG-1151 Project: Pig Issue Type: New Feature Affects Versions: 0.5.0 Reporter: sam rash I would like to offer up some very simple data UDFs I have that wrap JodaTime (Apache 2.0 license, http://joda-time.sourceforge.net/license.html) and operate on ISO8601 date strings (for piggybank). Please advise if these are appropriate. 1. Date Arithmetic: takes an input string, 2009-01-01T13:43:33.000Z (and partial ones such as 2009-01-02), and a timespan (as millis or as string shorthand), and returns an ISO8601 string that adjusts the input date by the specified timespan. DatePlus(long timeMs); // + or - number works, is the # of millis DatePlus(String timespan); //10m = 10 minutes, 1h = 1 hour, 1172 ms, etc DateMinus(String timespan); //propose explicit minus when using string shorthand for time periods 2. Date Comparison (when you don't have full strings that you can use string compare with): DateIsBefore(String dateString); //true if lhs is before rhs DateIsAfter(String dateString); //true if lhs is after rhs 3. Date trunc functions: take partial ISO8601 strings and truncate to: toMinute(String dateString); toHour(String dateString); toDay(String dateString); toWeek(String dateString); toMonth(String dateString); toYear(String dateString); If any/all are helpful, I'm happy to contribute to Pig.
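The three groups of functions proposed above can be illustrated with the JDK's java.time package. Note that this swaps in java.time for JodaTime, which the proposed UDFs actually wrap, and it is not the contributed code. The class IsoDateSketch and its two-argument method shapes are hypothetical, and the sketch handles only full ISO8601 instants, not the partial strings the proposal also covers.

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;

/** Hypothetical illustration of ISO8601 date helpers using java.time (not JodaTime). */
public class IsoDateSketch {

    /** Date arithmetic: shifts the date by a signed number of milliseconds. */
    public static String datePlus(String isoDate, long millis) {
        return Instant.parse(isoDate).plusMillis(millis).toString();
    }

    /** Date comparison: true if lhs is strictly before rhs. */
    public static boolean dateIsBefore(String lhs, String rhs) {
        return Instant.parse(lhs).isBefore(Instant.parse(rhs));
    }

    /** Date truncation: drops everything below the hour (in UTC). */
    public static String toHour(String isoDate) {
        return Instant.parse(isoDate).truncatedTo(ChronoUnit.HOURS).toString();
    }
}
```

For example, datePlus("2009-01-01T13:43:33.000Z", 600000L) shifts the sample date from the proposal forward by ten minutes, and toHour on the same input yields the start of the 13:00 hour.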
[jira] Updated: (PIG-1106) FR join should not spill
[ https://issues.apache.org/jira/browse/PIG-1106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich updated PIG-1106: Resolution: Fixed Status: Resolved (was: Patch Available) patch committed. Thanks, Ankit! FR join should not spill Key: PIG-1106 URL: https://issues.apache.org/jira/browse/PIG-1106 Project: Pig Issue Type: Bug Reporter: Olga Natkovich Assignee: Ankit Modi Fix For: 0.7.0 Attachments: frjoin-nonspill.patch Currently, the values for the replicated side of the data are placed in a spillable bag (POFRJoin near line 275). This does not make sense because the whole point of the optimization is that the data on one side fits into memory. We already have a non-spillable bag implemented (NonSpillableDataBag.java) and we need to change FRJoin code to use it. And of course need to do lots of testing to make sure that we don't spill but die instead when we run out of memory -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1016) Reading in map data seems broken
[ https://issues.apache.org/jira/browse/PIG-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich updated PIG-1016: Fix Version/s: (was: 0.5.0) 0.7.0 Reading in map data seems broken Key: PIG-1016 URL: https://issues.apache.org/jira/browse/PIG-1016 Project: Pig Issue Type: Improvement Components: data Affects Versions: 0.4.0 Reporter: hc busy Fix For: 0.7.0 Attachments: PIG-1016.patch Hi, I'm trying to load a map that has a tuple for a value. The read fails in 0.4.0 because of a misconfiguration in the parser, whereas almost all documentation states that the value of a map can be of any type. I've attached a patch that allows us to read in complex objects as values, as documented. I've done simple verification of loading in maps with tuple/map values and writing them back out using LOAD and STORE. All seems to work fine.
[jira] Updated: (PIG-1054) Pig Site - updates for 5.0
[ https://issues.apache.org/jira/browse/PIG-1054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich updated PIG-1054: Resolution: Fixed Status: Resolved (was: Patch Available) The changes have already been applied. Pig Site - updates for 5.0 -- Key: PIG-1054 URL: https://issues.apache.org/jira/browse/PIG-1054 Project: Pig Issue Type: Task Components: documentation Affects Versions: 0.5.0 Reporter: Corinne Chandel Priority: Blocker Fix For: 0.5.0 Attachments: pig-1054.patch Pig Site - updates for 5.0: remove broken link; update formatting for headers.
[jira] Updated: (PIG-1143) Poisson Sample Loader should compute the number of samples required only once
[ https://issues.apache.org/jira/browse/PIG-1143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sriranjan Manjunath updated PIG-1143: - Attachment: PIG_1143.patch Poisson Sample Loader should compute the number of samples required only once - Key: PIG-1143 URL: https://issues.apache.org/jira/browse/PIG-1143 Project: Pig Issue Type: Bug Reporter: Sriranjan Manjunath Assignee: Sriranjan Manjunath Attachments: PIG_1143.patch The current poisson sampler forces each of the maps to compute the sample number. This is redundant and causes issues when a large directory is specified in the join. The sampler should be changed to calculate the sample count only once and this information should be shared with the remaining mappers. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1143) Poisson Sample Loader should compute the number of samples required only once
[ https://issues.apache.org/jira/browse/PIG-1143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sriranjan Manjunath updated PIG-1143: - Status: Patch Available (was: Open) Poisson Sample Loader should compute the number of samples required only once - Key: PIG-1143 URL: https://issues.apache.org/jira/browse/PIG-1143 Project: Pig Issue Type: Bug Reporter: Sriranjan Manjunath Assignee: Sriranjan Manjunath Attachments: PIG_1143.patch
[jira] Updated: (PIG-1082) Modify Comparator to work with a typed textual Storage
[ https://issues.apache.org/jira/browse/PIG-1082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich updated PIG-1082: Fix Version/s: (was: 0.0.0) Modify Comparator to work with a typed textual Storage -- Key: PIG-1082 URL: https://issues.apache.org/jira/browse/PIG-1082 Project: Pig Issue Type: Sub-task Affects Versions: 0.4.0 Reporter: hc busy Attachments: PIG-1082.patch Original Estimate: 5h Remaining Estimate: 5h See parent bug. This ticket is for just the comparator change, which needs to be made in order for the nested data structures to sort right -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-973) type resolution inconsistency
[ https://issues.apache.org/jira/browse/PIG-973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12790494#action_12790494 ] Olga Natkovich commented on PIG-973: +1 on the changes. I will be committing the patch shortly. type resolution inconsistency - Key: PIG-973 URL: https://issues.apache.org/jira/browse/PIG-973 Project: Pig Issue Type: Bug Reporter: Olga Natkovich Assignee: Richard Ding Attachments: PIG-973.patch
[jira] Updated: (PIG-1151) Date Conversion + Arithmetic UDFs
[ https://issues.apache.org/jira/browse/PIG-1151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sam rash updated PIG-1151: -- Summary: Date Conversion + Arithmetic UDFs (was: Data Conversion + Arithmetic UDFs) Date Conversion + Arithmetic UDFs - Key: PIG-1151 URL: https://issues.apache.org/jira/browse/PIG-1151 Project: Pig Issue Type: New Feature Affects Versions: 0.5.0 Reporter: sam rash Priority: Minor
[jira] Updated: (PIG-1075) Error in Cogroup when key fields types don't match
[ https://issues.apache.org/jira/browse/PIG-1075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Richard Ding updated PIG-1075:
Attachment: PIG-1075.patch
This patch moves the error up to the parser and gives a better error message for a cogroup statement with incompatible group types:
Caused by: org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1110: Cogroup column 1 has incompatible types: chararray versus int
at org.apache.pig.impl.logicalLayer.LOCogroup.getTupleGroupBySchema(LOCogroup.java:499)
at org.apache.pig.impl.logicalLayer.LOCogroup.getSchema(LOCogroup.java:325)
at org.apache.pig.impl.logicalLayer.parser.QueryParser.Parse(QueryParser.java:779)
Error in Cogroup when key fields types don't match
Key: PIG-1075 URL: https://issues.apache.org/jira/browse/PIG-1075 Project: Pig Issue Type: Bug Affects Versions: 0.5.0 Reporter: Ankur Attachments: PIG-1075.patch
When Cogrouping 2 relations on multiple key fields, pig throws an error if the corresponding types don't match. Consider the following script:
A = LOAD 'data' USING PigStorage() as (a:chararray, b:int, c:int);
B = LOAD 'data' USING PigStorage() as (a:chararray, b:chararray, c:int);
C = CoGROUP A BY (a,b,c), B BY (a,b,c);
D = FOREACH C GENERATE FLATTEN(A), FLATTEN(B);
describe D;
dump D;
The complete stack trace of the error thrown is
Pig Stack Trace
---
ERROR 1051: Cannot cast to Unknown
org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1001: Unable to describe schema for alias D
at org.apache.pig.PigServer.dumpSchema(PigServer.java:436)
at org.apache.pig.tools.grunt.GruntParser.processDescribe(GruntParser.java:233)
at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:253)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:168)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:144)
at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:89)
at org.apache.pig.Main.main(Main.java:397)
Caused by: org.apache.pig.impl.plan.PlanValidationException: ERROR 0: An unexpected exception caused the validation to stop
at org.apache.pig.impl.plan.PlanValidator.validateSkipCollectException(PlanValidator.java:104)
at org.apache.pig.impl.logicalLayer.validators.TypeCheckingValidator.validate(TypeCheckingValidator.java:40)
at org.apache.pig.impl.logicalLayer.validators.TypeCheckingValidator.validate(TypeCheckingValidator.java:30)
at org.apache.pig.impl.logicalLayer.validators.LogicalPlanValidationExecutor.validate(LogicalPlanValidationExecutor.java:83)
at org.apache.pig.PigServer.compileLp(PigServer.java:821)
at org.apache.pig.PigServer.dumpSchema(PigServer.java:428)
... 6 more
Caused by: org.apache.pig.impl.logicalLayer.validators.TypeCheckerException: ERROR 1060: Cannot resolve COGroup output schema
at org.apache.pig.impl.logicalLayer.validators.TypeCheckingVisitor.visit(TypeCheckingVisitor.java:2463)
at org.apache.pig.impl.logicalLayer.LOCogroup.visit(LOCogroup.java:372)
at org.apache.pig.impl.logicalLayer.LOCogroup.visit(LOCogroup.java:45)
at org.apache.pig.impl.plan.DependencyOrderWalker.walk(DependencyOrderWalker.java:69)
at org.apache.pig.impl.plan.PlanVisitor.visit(PlanVisitor.java:51)
at org.apache.pig.impl.plan.PlanValidator.validateSkipCollectException(PlanValidator.java:101)
... 11 more
Caused by: org.apache.pig.impl.logicalLayer.validators.TypeCheckerException: ERROR 1051: Cannot cast to Unknown
at org.apache.pig.impl.logicalLayer.validators.TypeCheckingVisitor.insertAtomicCastForCOGroupInnerPlan(TypeCheckingVisitor.java:2552)
at org.apache.pig.impl.logicalLayer.validators.TypeCheckingVisitor.visit(TypeCheckingVisitor.java:2451)
... 16 more
The error message does not help the user in identifying the issue clearly, especially if the pig script is large and complex.
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
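To make the idea of an early, column-level check concrete, here is a minimal sketch. It is not the code in PIG-1075.patch: the class and method names are hypothetical, types are modelled as plain strings, strict equality stands in for Pig's real type-compatibility rules (which this sketch does not attempt to model), and the order in which the two types are reported is an arbitrary choice.

```java
import java.util.Arrays;
import java.util.List;

public class CogroupKeyCheckSketch {
    /**
     * Compares the key types of every input against the first input, column by column.
     * Returns null when they all agree, otherwise a message naming the first mismatch.
     */
    public static String firstIncompatibility(List<List<String>> keyTypes) {
        List<String> first = keyTypes.get(0);
        for (int input = 1; input < keyTypes.size(); input++) {
            List<String> other = keyTypes.get(input);
            if (other.size() != first.size()) {
                return "Cogroup inputs have different numbers of key columns";
            }
            for (int col = 0; col < first.size(); col++) {
                // Strict equality is a simplification; it ignores implicit casts.
                if (!first.get(col).equals(other.get(col))) {
                    return "Cogroup column " + col + " has incompatible types: "
                            + first.get(col) + " versus " + other.get(col);
                }
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Key types of relations A and B from the script in the issue.
        List<String> a = Arrays.asList("chararray", "int", "int");
        List<String> b = Arrays.asList("chararray", "chararray", "int");
        System.out.println(firstIncompatibility(Arrays.asList(a, b)));
    }
}
```

Reporting the column index at the point where the key schemas are first compared is what lets the user find the offending field without reading a type-checker stack trace.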
[jira] Assigned: (PIG-1075) Error in Cogroup when key fields types don't match
[ https://issues.apache.org/jira/browse/PIG-1075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Ding reassigned PIG-1075: Assignee: Richard Ding Error in Cogroup when key fields types don't match Key: PIG-1075 URL: https://issues.apache.org/jira/browse/PIG-1075 Project: Pig Issue Type: Bug Affects Versions: 0.5.0 Reporter: Ankur Assignee: Richard Ding Attachments: PIG-1075.patch
[jira] Updated: (PIG-1075) Error in Cogroup when key fields types don't match
[ https://issues.apache.org/jira/browse/PIG-1075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Ding updated PIG-1075: Status: Patch Available (was: Open) Error in Cogroup when key fields types don't match Key: PIG-1075 URL: https://issues.apache.org/jira/browse/PIG-1075 Project: Pig Issue Type: Bug Affects Versions: 0.5.0 Reporter: Ankur Assignee: Richard Ding Attachments: PIG-1075.patch
[jira] Updated: (PIG-973) type resolution inconsistency
[ https://issues.apache.org/jira/browse/PIG-973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich updated PIG-973: --- Resolution: Fixed Status: Resolved (was: Patch Available) patch committed. thanks, Richard type resolution inconsistency - Key: PIG-973 URL: https://issues.apache.org/jira/browse/PIG-973 Project: Pig Issue Type: Bug Reporter: Olga Natkovich Assignee: Richard Ding Attachments: PIG-973.patch This script works: A = load 'test' using PigStorage(':') as (name: chararray, age: int, gpa: float); B = group A by age; C = foreach B { D = filter A by gpa 2.5; E = order A by name; F = A.age; describe F; G = distinct F; generate group, COUNT(D), MAX (E.name), MIN(G.$0);} dump C; This one produces an error: A = load 'test' using PigStorage(':') as (name: chararray, age: int, gpa: float); B = group A by age; C = foreach B { D = filter A by gpa 2.5; E = order A by name; F = A.age; G = distinct F; generate group, COUNT(D), MAX (E.name), MIN(G);} dump C; Notice the difference in how MIN is passed the data. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-973) type resolution inconsistency
[ https://issues.apache.org/jira/browse/PIG-973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olga Natkovich updated PIG-973: --- Fix Version/s: 0.7.0 type resolution inconsistency - Key: PIG-973 URL: https://issues.apache.org/jira/browse/PIG-973 Project: Pig Issue Type: Bug Reporter: Olga Natkovich Assignee: Richard Ding Fix For: 0.7.0 Attachments: PIG-973.patch This script works: A = load 'test' using PigStorage(':') as (name: chararray, age: int, gpa: float); B = group A by age; C = foreach B { D = filter A by gpa 2.5; E = order A by name; F = A.age; describe F; G = distinct F; generate group, COUNT(D), MAX (E.name), MIN(G.$0);} dump C; This one produces an error: A = load 'test' using PigStorage(':') as (name: chararray, age: int, gpa: float); B = group A by age; C = foreach B { D = filter A by gpa 2.5; E = order A by name; F = A.age; G = distinct F; generate group, COUNT(D), MAX (E.name), MIN(G);} dump C; Notice the difference in how MIN is passed the data. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1144) set default_parallelism construct does not set the number of reducers correctly
[ https://issues.apache.org/jira/browse/PIG-1144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Olga Natkovich updated PIG-1144:
Status: Open (was: Patch Available)
set default_parallelism construct does not set the number of reducers correctly
Key: PIG-1144 URL: https://issues.apache.org/jira/browse/PIG-1144 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.6.0 Environment: Hadoop 20 cluster with multi-node installation Reporter: Viraj Bhat Assignee: Daniel Dai Fix For: 0.7.0 Attachments: brokenparallel.out, genericscript_broken_parallel.pig, PIG-1144-1.patch, PIG-1144-2.patch, PIG-1144-3.patch
Hi all,
I have a Pig script where I set the parallelism using the following set construct: set default_parallel 100. I modified MRPrinter.java to print out the parallelism:
{code}
...
public void visitMROp(MapReduceOper mr) {
    // Print the operator key together with the parallelism requested for this job.
    mStream.println("MapReduce node " + mr.getOperatorKey().toString() + " Parallelism " + mr.getRequestedParallelism());
    ...
}
{code}
When I run an explain on the script, I see that the last job, which does the actual sort, runs as a single-reducer job. This can be corrected by adding the PARALLEL keyword to the ORDER BY statement. Attaching the script and the explain output.
Viraj
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1110) Handle compressed file formats -- Gz, BZip with the new proposal
[ https://issues.apache.org/jira/browse/PIG-1110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12790529#action_12790529 ]
Jeff Zhang commented on PIG-1110:
Response to Richard:
1. If you are worried about the API compatibility of PigStorage(), since PigStorage() is the default LoadFunc of Pig, another option is to provide a separate LoadFunc that supports compression, for example a new LoadFunc such as Bz2PigStorage().
2. The file name in the Store statement is actually the folder name, not the file name; we will get part-0.bz2 under this folder. part-0.bz2 is the real file that is consumed by Hadoop, and Hadoop checks the file name rather than the folder name to determine the compression codec.
Handle compressed file formats -- Gz, BZip with the new proposal
Key: PIG-1110 URL: https://issues.apache.org/jira/browse/PIG-1110 Project: Pig Issue Type: Sub-task Reporter: Richard Ding Assignee: Richard Ding Attachments: PIG-1110.patch, PIG_1110_Jeff.patch
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-965) PERFORMANCE: optimize common case in matches (PORegex)
[ https://issues.apache.org/jira/browse/PIG-965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankit Modi updated PIG-965: --- Status: Open (was: Patch Available) PERFORMANCE: optimize common case in matches (PORegex) -- Key: PIG-965 URL: https://issues.apache.org/jira/browse/PIG-965 Project: Pig Issue Type: Improvement Components: impl Reporter: Thejas M Nair Assignee: Ankit Modi Attachments: automaton.jar, poregex2.patch Some frequently seen use cases of 'matches' comparison operator have follow properties - 1. The rhs is a constant string . eg c1 matches 'abc%' 2. Regexes such that look for matching prefix , suffix etc are very common. eg - abc%', %abc, '%abc%' To optimize for these common cases , PORegex.java can be changed to - 1. Compile the pattern (rhs of matches) re-use it if the pattern string has not changed. 2. Use string comparisons for simple common regexes (in 2 above). The implementation of Hive like clause uses similar optimizations. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-965) PERFORMANCE: optimize common case in matches (PORegex)
[ https://issues.apache.org/jira/browse/PIG-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12790545#action_12790545 ]
Ankit Modi commented on PIG-965:
* NonConstantRegex - I did not think of equals. But I added a length check first, as it can detect a change in length faster and, to the best of my knowledge, it is a getter method. And yes, as you mentioned, equals will check for the same object and instanceof, which is not useful in our case.
* The numbers published above were obtained using dk.brics.automaton.RunAutomaton. Do you want me to publish numbers for a larger set of regexes?
I'll create a patch for the rest of the comments.
PERFORMANCE: optimize common case in matches (PORegex)
Key: PIG-965 URL: https://issues.apache.org/jira/browse/PIG-965 Project: Pig Issue Type: Improvement Components: impl Reporter: Thejas M Nair Assignee: Ankit Modi Attachments: automaton.jar, poregex2.patch
Some frequently seen use cases of the 'matches' comparison operator have the following properties:
1. The rhs is a constant string, e.g. c1 matches 'abc%'
2. Regexes that look for a matching prefix, suffix, etc. are very common, e.g. 'abc%', '%abc', '%abc%'
To optimize for these common cases, PORegex.java can be changed to:
1. Compile the pattern (rhs of matches) and re-use it if the pattern string has not changed.
2. Use string comparisons for simple common regexes (in 2 above).
The implementation of the Hive like clause uses similar optimizations.
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
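The two optimizations listed in the issue can be sketched as follows. This is not the code in poregex2.patch and it uses java.util.regex rather than dk.brics.automaton; the class and method names are hypothetical. The issue text writes the wildcard as %, and the sketch assumes the Java regex equivalent .* instead. Because . does not match line terminators by default, inputs containing them are sent to the regex engine so that the fast path never changes the result.

```java
import java.util.regex.Pattern;

public class CachedMatcherSketch {
    private String lastPattern; // pattern string seen on the previous call
    private Pattern compiled;   // compiled form of lastPattern

    /** Full-string regex match that recompiles only when the pattern string changes. */
    public boolean matches(String input, String pattern) {
        if (!pattern.equals(lastPattern)) {
            compiled = Pattern.compile(pattern);
            lastPattern = pattern;
        }
        return compiled.matcher(input).matches();
    }

    /** True when s contains no regex metacharacters and can be compared literally. */
    static boolean isLiteral(String s) {
        for (int i = 0; i < s.length(); i++) {
            if ("\\[](){}.*+?^$|".indexOf(s.charAt(i)) >= 0) {
                return false;
            }
        }
        return true;
    }

    /** True when s contains a character that '.' refuses to match by default. */
    static boolean hasLineTerminator(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '\n' || c == '\r' || c == 0x85 || c == 0x2028 || c == 0x2029) {
                return true;
            }
        }
        return false;
    }

    /** String-comparison fast path for lit, lit.*, .*lit and .*lit.*; regex otherwise. */
    public boolean fastMatches(String input, String pattern) {
        boolean anyPrefix = pattern.startsWith(".*");
        boolean anySuffix = pattern.endsWith(".*") && pattern.length() >= (anyPrefix ? 4 : 2);
        String core = pattern.substring(anyPrefix ? 2 : 0, pattern.length() - (anySuffix ? 2 : 0));
        if (!isLiteral(core) || hasLineTerminator(input)) {
            return matches(input, pattern); // general case: cached compiled pattern
        }
        if (anyPrefix && anySuffix) return input.contains(core);
        if (anyPrefix) return input.endsWith(core);
        if (anySuffix) return input.startsWith(core);
        return input.equals(core);
    }

    public static void main(String[] args) {
        CachedMatcherSketch m = new CachedMatcherSketch();
        System.out.println(m.fastMatches("abcdef", "abc.*"));
        System.out.println(m.fastMatches("xxabc", ".*abc"));
        System.out.println(m.fastMatches("a1", "a\\d"));
    }
}
```

The cache here compares pattern strings with equals; the length-first check discussed in the comment above would be a further micro-optimization of that comparison.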
[jira] Updated: (PIG-1149) Allow instantiation of SampleLoaders with parametrized LoadFuncs
[ https://issues.apache.org/jira/browse/PIG-1149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-1149: --- Attachment: pig_1149.patch Allow instantiation of SampleLoaders with parametrized LoadFuncs Key: PIG-1149 URL: https://issues.apache.org/jira/browse/PIG-1149 Project: Pig Issue Type: Bug Reporter: Dmitriy V. Ryaboy Assignee: Dmitriy V. Ryaboy Priority: Minor Fix For: 0.7.0 Attachments: pig_1149.patch Currently, it is not possible to instantiate a SampleLoader with something like PigStorage(':'). We should allow passing parameters to the loaders being sampled. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1149) Allow instantiation of SampleLoaders with parametrized LoadFuncs
[ https://issues.apache.org/jira/browse/PIG-1149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-1149: --- Status: Patch Available (was: Open) Due to the string being parsed a few times along the way, three backslashes need to precede the escaped quote in PigLatin. Which means six backslashes when expressing PigLatin as a string in Java. Allow instantiation of SampleLoaders with parametrized LoadFuncs Key: PIG-1149 URL: https://issues.apache.org/jira/browse/PIG-1149 Project: Pig Issue Type: Bug Reporter: Dmitriy V. Ryaboy Assignee: Dmitriy V. Ryaboy Priority: Minor Fix For: 0.7.0 Attachments: pig_1149.patch Currently, it is not possible to instantiate a SampleLoader with something like PigStorage(':'). We should allow passing parameters to the loaders being sampled. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
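The backslash counting described in the comment can be checked with a small Java sketch. The PigStorage(':') fragment comes from the issue description; how it is embedded in a full sample-loader statement is not shown in this thread, so only the fragment is illustrated, and the class and method names are hypothetical.

```java
public class EscapedQuoteSketch {
    /** The PigLatin fragment with three backslashes before each escaped quote. */
    public static String pigLatinFragment() {
        // Six backslashes in Java source become three backslash characters at runtime.
        return "PigStorage(\\\\\\':\\\\\\')";
    }

    /** Counts how many times c occurs in s. */
    public static int count(String s, char c) {
        int n = 0;
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) == c) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        String fragment = pigLatinFragment();
        System.out.println(fragment);
        System.out.println(count(fragment, '\\') + " backslashes at runtime");
    }
}
```

The Java compiler halves the backslashes, which is why the source needs six to leave three for the later PigLatin parsing steps mentioned in the comment.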
[jira] Commented: (PIG-1143) Poisson Sample Loader should compute the number of samples required only once
[ https://issues.apache.org/jira/browse/PIG-1143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12790566#action_12790566 ]
Hadoop QA commented on PIG-1143:
+1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12427980/PIG_1143.patch against trunk revision 890553.
+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 6 new or modified tests.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 findbugs. The patch does not introduce any new Findbugs warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
+1 core tests. The patch passed core unit tests.
+1 contrib tests. The patch passed contrib unit tests.
Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/123/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/123/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/123/console
This message is automatically generated.
Poisson Sample Loader should compute the number of samples required only once
Key: PIG-1143 URL: https://issues.apache.org/jira/browse/PIG-1143 Project: Pig Issue Type: Bug Reporter: Sriranjan Manjunath Assignee: Sriranjan Manjunath Attachments: PIG_1143.patch
The current poisson sampler forces each of the maps to compute the sample number. This is redundant and causes issues when a large directory is specified in the join. The sampler should be changed to calculate the sample count only once, and this information should be shared with the remaining mappers.
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-1152) bincond operator throws parser error
bincond operator throws parser error
Key: PIG-1152 URL: https://issues.apache.org/jira/browse/PIG-1152 Project: Pig Issue Type: Bug Affects Versions: 0.6.0 Reporter: Ankur
The bincond operator throws a parser error when the true branch contains a constant bag with one tuple containing a single int field with a negative value. Here is the script to reproduce the issue:
A = load 'A' as (s: chararray, x: int, y: int);
B = group A by s;
C = foreach B generate group, flatten(((COUNT(A) 1L) ? {(-1)} : A.x));
dump C;
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.