[jira] Updated: (PIG-732) Utility UDFs
[ https://issues.apache.org/jira/browse/PIG-732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ankur updated PIG-732:
----------------------

    Attachment: udf.v5.patch

Fixed a minor issue in the test case that was causing a test failure; the fix is in the latest upload, udf.v5.patch. Also changed TopN to Top. Should be good to go now.

Utility UDFs
------------

    Key: PIG-732
    URL: https://issues.apache.org/jira/browse/PIG-732
    Project: Pig
    Issue Type: New Feature
    Reporter: Ankur
    Priority: Minor
    Attachments: udf.v1.patch, udf.v2.patch, udf.v3.patch, udf.v4.patch, udf.v5.patch

Two utility UDFs and their respective test cases.
1. TopN - Accepts the number of tuples (N) to retain in the output, the field number (type long) to use for comparison, and a sorted or unsorted bag of tuples. It outputs a bag containing the top N tuples.
2. SearchQuery - Accepts an encoded URL from any of the 4 search engines (Yahoo, Google, AOL, Live) and extracts and normalizes the search query present in it.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
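For readers who want a feel for what the two PIG-732 UDFs do, here is a minimal Python sketch of the behaviour described in the issue. It is not the code from the patch: the function names `top_n` and `search_query`, the per-engine query parameter names in `QUERY_PARAMS`, and the normalization rule (lowercase, collapse whitespace) are all assumptions made for illustration.

```python
import heapq
import re
from urllib.parse import urlparse, parse_qs


def top_n(n, field, bag):
    """Return up to n tuples with the largest value at index `field`, largest first.

    The bag may be sorted or unsorted; a heap avoids sorting the whole bag.
    """
    return heapq.nlargest(n, bag, key=lambda t: t[field])


# Assumed query-parameter name per engine; the real UDF may use other rules.
QUERY_PARAMS = {"yahoo": "p", "google": "q", "aol": "query", "live": "q"}


def search_query(url):
    """Extract and normalize the search query from a search-engine URL, or None."""
    parsed = urlparse(url)
    host_labels = parsed.netloc.lower().split(".")
    # parse_qs percent-decodes values and turns '+' into spaces.
    params = parse_qs(parsed.query)
    for engine, param in QUERY_PARAMS.items():
        if engine in host_labels and param in params:
            raw = params[param][0]
            # Assumed normalization: collapse whitespace runs and lowercase.
            return re.sub(r"\s+", " ", raw).strip().lower()
    return None
```

Calling `top_n(2, 1, bag)` keeps the two tuples with the largest second field, and `search_query` returns `None` for hosts it does not recognize rather than guessing.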
[jira] Updated: (PIG-789) coupling load and store in script no longer works
[ https://issues.apache.org/jira/browse/PIG-789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gunther Hagleitner updated PIG-789:
----------------------------------

    Attachment: dump_bug.patch

Both dump (openIterator) and illustrate (getExamples) show this problem. dump_bug.patch contains a fix; the patch is for the trunk.

coupling load and store in script no longer works
-------------------------------------------------

    Key: PIG-789
    URL: https://issues.apache.org/jira/browse/PIG-789
    Project: Pig
    Issue Type: Bug
    Components: impl
    Affects Versions: 0.3.0
    Reporter: Alan Gates
    Assignee: Gunther Hagleitner
    Attachments: dump_bug.patch

Many users' Pig scripts do something like this:

    a = load '/user/pig/tests/data/singlefile/studenttab10k' as (name, age, gpa);
    c = filter a by age < 500;
    e = group c by (name, age);
    f = foreach e generate group, COUNT($1);
    store f into 'bla';
    f1 = load 'bla';
    g = order f1 by $1;
    dump g;

With the inclusion of the multi-query phase2 patch this appears to no longer work. You get an error:

    2009-04-28 18:24:50,776 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2100: hdfs://wilbur11.labs.corp.sp1.yahoo.com/user/gates/bla does not exist.

We shouldn't be checking for bla's existence here because it will be created eventually by the script.
[jira] Commented: (PIG-741) Add LIMIT as a statement that works in nested FOREACH
[ https://issues.apache.org/jira/browse/PIG-741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12704672#action_12704672 ]

Alan Gates commented on PIG-741:
--------------------------------

Since limit distributes rather nicely, I'd very much like it to use the combiner. But after looking at the code for a bit I realized I could wait for the work Santosh is doing on the optimizer and use that (see PIG-697) or rewrite a bunch of that code myself. I decided it was better to check in a limited version of limit (hah) now and get the combiner functionality in a month or two. Glad to hear it will work for you now.

Add LIMIT as a statement that works in nested FOREACH
-----------------------------------------------------

    Key: PIG-741
    URL: https://issues.apache.org/jira/browse/PIG-741
    Project: Pig
    Issue Type: New Feature
    Reporter: David Ciemiewicz
    Assignee: Alan Gates
    Fix For: 0.3.0
    Attachments: PIG-741.patch

I'd like to compute the top 10 results in each group. The natural way to express this in Pig would be:

{code}
A = load '...' using PigStorage() as (
    date: int,
    count: int,
    url: chararray
);
B = group A by ( date );
C = foreach B {
    D = order A by count desc;
    E = limit D 10;
    generate FLATTEN(E);
};
dump C;
{code}

Yeah, I could write a UDF / PiggyBank function to take the top n results. But since LIMIT already exists as a statement, it seems like it should also work in the nested foreach context.

Example workaround code.

{code}
C = foreach B {
    D = order A by count desc;
    E = util.TOP(D, 10);
    generate FLATTEN(E);
};
dump C;
{code}
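To make the intended semantics of the nested FOREACH in PIG-741 concrete, the following Python sketch mimics the group / order / limit / flatten pipeline on in-memory rows. It only illustrates what the script is meant to compute; it says nothing about how Pig executes it. The function name `top_per_group` is hypothetical, and groups are emitted in sorted key order purely to make the output deterministic.

```python
from collections import defaultdict


def top_per_group(rows, limit=10):
    """Rows are (date, count, url) tuples; return the top `limit` rows per date."""
    # B = group A by date
    groups = defaultdict(list)
    for row in rows:
        groups[row[0]].append(row)

    flattened = []
    for date in sorted(groups):  # fixed group order, for determinism only
        # D = order A by count desc
        ordered = sorted(groups[date], key=lambda r: r[1], reverse=True)
        # E = limit D <limit>; generate FLATTEN(E)
        flattened.extend(ordered[:limit])
    return flattened
```

With `limit=10` this matches the request in the issue: at most ten rows per date, highest counts first.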
[jira] Commented: (PIG-741) Add LIMIT as a statement that works in nested FOREACH
[ https://issues.apache.org/jira/browse/PIG-741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12704685#action_12704685 ]

Olga Natkovich commented on PIG-741:
------------------------------------

+1 on the patch with one question: is there a reason why tests were only added for local mode?
[jira] Commented: (PIG-741) Add LIMIT as a statement that works in nested FOREACH
[ https://issues.apache.org/jira/browse/PIG-741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12704775#action_12704775 ]

Alan Gates commented on PIG-741:
--------------------------------

I only added tests for local mode because inner operators are executed in local mode one way or another, so I didn't think there was a need to test it in the map reduce case as well.
[jira] Created: (PIG-793) Improving memory efficiency of Tuple implementation
Improving memory efficiency of Tuple implementation
---------------------------------------------------

    Key: PIG-793
    URL: https://issues.apache.org/jira/browse/PIG-793
    Project: Pig
    Issue Type: Improvement
    Reporter: Olga Natkovich

Currently, our tuple is a real pig and uses a lot of extra memory. There are several places where we can improve memory efficiency:

(1) Laying out memory for the fields rather than using Java objects, since each object for a numeric field takes 16 bytes.
(2) For the cases where we know the schema, using Java arrays rather than ArrayList.

There might be more.
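The first idea in PIG-793, laying fields out in a flat buffer driven by a known schema instead of holding one object per field, can be illustrated outside Java. The sketch below uses Python's `struct` module as an analogy only; it is not the proposed Pig change, and the schema (two ints and a double) and the helper names are made up for the example.

```python
import struct

# Hypothetical schema (id: int, age: int, gpa: double).
# '<' selects little-endian with no padding: 4 + 4 + 8 bytes per tuple.
SCHEMA = struct.Struct("<iid")


def pack_tuple(fields):
    """Store all fields of one tuple in a single contiguous bytes object."""
    return SCHEMA.pack(*fields)


def unpack_tuple(buf):
    """Recover the field values from the packed layout."""
    return SCHEMA.unpack(buf)
```

Here the whole three-field tuple occupies `SCHEMA.size` bytes of payload, with the field types recorded once in the schema rather than once per value.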
[jira] Updated: (PIG-732) Utility UDFs
[ https://issues.apache.org/jira/browse/PIG-732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Olga Natkovich updated PIG-732:
------------------------------

    Resolution: Fixed
    Status: Resolved (was: Patch Available)

Patch committed. Thanks, Ankur, for contributing.
[jira] Created: (PIG-794) Use Avro serialization in Pig
Use Avro serialization in Pig
-----------------------------

    Key: PIG-794
    URL: https://issues.apache.org/jira/browse/PIG-794
    Project: Pig
    Issue Type: Improvement
    Components: impl
    Affects Versions: 0.2.0
    Reporter: Rakesh Setty

We would like to use Avro serialization in Pig to pass data between MR jobs instead of the current BinStorage. Attached is an implementation of AvroBinStorage which performs significantly better compared to BinStorage on our benchmarks.
[jira] Updated: (PIG-795) Command that selects a random sample of the rows, similar to LIMIT
[ https://issues.apache.org/jira/browse/PIG-795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eric Gaudet updated PIG-795:
---------------------------

    Attachment: sample2.diff

This patch implements the SAMPLE command. It basically adds a random sample mode to the LIMIT class. The syntax is like LIMIT: a = SAMPLE x, where x is an integer and 0 <= x <= 100. Each row will be selected if rand() < (x/100).

Example:

    a = LOAD 'mybigdata'
    b = SAMPLE 5

... will select 5% of the data.

Command that selects a random sample of the rows, similar to LIMIT
------------------------------------------------------------------

    Key: PIG-795
    URL: https://issues.apache.org/jira/browse/PIG-795
    Project: Pig
    Issue Type: New Feature
    Components: impl
    Reporter: Eric Gaudet
    Priority: Trivial
    Attachments: sample2.diff

When working with very large data sets (imagine that!), running a Pig script can take time. It may be useful to run on a small subset of the data in some situations (e.g. debugging/testing, or to get fast results even if less accurate). The command LIMIT N selects the first N rows of the data, but these are not necessarily randomized. A command SAMPLE X would retain each row only with probability X%.

Note: it is possible to implement this feature with FILTER BY and a UDF, but so is LIMIT, and LIMIT is built-in.
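The selection rule quoted in PIG-795, keep a row when rand() < (x/100), is plain per-row Bernoulli sampling. A minimal Python sketch of that rule follows; the function name `sample` and the optional `rng` argument (added so a run can be made reproducible) are assumptions for illustration and are not part of the patch.

```python
import random


def sample(rows, x, rng=None):
    """Keep each row independently with probability x percent (0 <= x <= 100)."""
    if not 0 <= x <= 100:
        raise ValueError("x must be between 0 and 100")
    if rng is None:
        rng = random.Random()
    # random() returns a float in [0.0, 1.0), so x=0 keeps nothing and x=100 keeps everything.
    return [row for row in rows if rng.random() < x / 100]
```

Because each row is decided independently, the output size is only approximately x% of the input, and the input order is preserved.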
[jira] Updated: (PIG-794) Use Avro serialization in Pig
[ https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rakesh Setty updated PIG-794:
----------------------------

    Attachment: AvroBinStorage.patch

Patch file