[jira] Updated: (PIG-732) Utility UDFs

2009-04-30 Thread Ankur (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ankur updated PIG-732:
--

Attachment: udf.v5.patch

Minor issue in test case, causing test failure. Fixed in latest upload - 
udf.v5.patch. Also changed TopN to Top. Should be good to go now.

 Utility UDFs 
 -

 Key: PIG-732
 URL: https://issues.apache.org/jira/browse/PIG-732
 Project: Pig
  Issue Type: New Feature
Reporter: Ankur
Priority: Minor
 Attachments: udf.v1.patch, udf.v2.patch, udf.v3.patch, udf.v4.patch, 
 udf.v5.patch


 Two utility UDFs and their respective test cases.
 1. TopN - Accepts the number of tuples (N) to retain in the output, the field 
 number (type long) to use for comparison, and a sorted or unsorted bag of 
 tuples. It outputs a bag containing the top N tuples.
 2. SearchQuery - Accepts an encoded URL from any of the 4 search engines 
 (Yahoo, Google, AOL, Live) and extracts and normalizes the search query 
 present in it.
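
 For illustration only, the TopN behaviour described above can be sketched in 
 Python. This is not the code in the attached patches: the function name 
 top_n, its argument order, and the use of a heap are assumptions made for 
 the example.

```python
import heapq
from operator import itemgetter


def top_n(n, field, bag):
    """Return the n tuples of `bag` with the largest value at index `field`.

    Hypothetical helper: the name and signature are illustrative only.
    """
    # heapq.nlargest tracks at most n candidates at a time, so the input
    # bag does not have to be sorted beforehand.
    return heapq.nlargest(n, bag, key=itemgetter(field))


bag = [("a", 3), ("b", 10), ("c", 7), ("d", 1)]
print(top_n(2, 1, bag))
```

 Because the heap never holds more than N candidates, the same approach works 
 whether or not the incoming bag is already sorted.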

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-789) coupling load and store in script no longer works

2009-04-30 Thread Gunther Hagleitner (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gunther Hagleitner updated PIG-789:
---

Attachment: dump_bug.patch

Both dump (openIterator) and illustrate (getExamples) show this problem. 
dump_bug.patch contains a fix; the patch is against trunk.

 coupling load and store in script no longer works
 -

 Key: PIG-789
 URL: https://issues.apache.org/jira/browse/PIG-789
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.3.0
Reporter: Alan Gates
Assignee: Gunther Hagleitner
 Attachments: dump_bug.patch


 Many users' Pig scripts do something like this:
 a = load '/user/pig/tests/data/singlefile/studenttab10k' as (name, age, gpa);
 c = filter a by age < 500;
 e = group c by (name, age);
 f = foreach e generate group, COUNT($1);
 store f into 'bla';
 f1 = load 'bla';
 g = order f1 by $1;
 dump g;
 With the inclusion of the multi-query phase2 patch this appears to no longer 
 work.  You get an error:
 2009-04-28 18:24:50,776 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
 2100: hdfs://wilbur11.labs.corp.sp1.yahoo.com/user/gates/bla does not exist.
 We shouldn't be checking for bla's existence here because it will be created 
 eventually by the script.
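
 The check the reporter is asking for can be sketched in Python: only load 
 paths that no earlier store in the same script produces need to exist up 
 front. This is a simplified model, not the logic in dump_bug.patch; the 
 (operation, path) representation and the function name are assumptions.

```python
def paths_requiring_existence(statements):
    """Return the load paths that must already exist before the script runs.

    `statements` is a hypothetical, simplified view of a script: a list of
    ("load" | "store", path) pairs in script order.
    """
    produced = set()  # paths written by an earlier store in this script
    required = []
    for op, path in statements:
        if op == "store":
            produced.add(path)
        elif op == "load" and path not in produced:
            # Nothing in the script creates this path, so it must pre-exist.
            required.append(path)
    return required


script = [
    ("load", "/user/pig/tests/data/singlefile/studenttab10k"),
    ("store", "bla"),
    ("load", "bla"),
]
print(paths_requiring_existence(script))
```

 In this model 'bla' is skipped by the existence check because the script 
 itself stores into it before loading it.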




[jira] Commented: (PIG-741) Add LIMIT as a statement that works in nested FOREACH

2009-04-30 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12704672#action_12704672
 ] 

Alan Gates commented on PIG-741:


Since limit distributes rather nicely, I'd very much like it to use the 
combiner.  But after looking at the code for a bit I realized I could wait for 
the work Santosh is doing on the optimizer and use that (see PIG-697) or 
rewrite a bunch of that code myself.  I decided it was better to check in a 
limited version of limit (hah) now and get the combiner functionality in a 
month or two.  Glad to hear it will work for you now.
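
A small Python sketch of why limit distributes: applying the limit to each 
partition first (the combiner's role) and then once more to the combined 
result gives the same rows as applying it once to everything. The function 
names and the list-of-partitions model are assumptions for illustration; 
this is not Pig's combiner code.

```python
from itertools import chain, islice


def limit(rows, n):
    """Keep the first n rows of an iterable."""
    return list(islice(rows, n))


def limit_with_combiner(partitions, n):
    """Apply limit per partition, then once more over the partial results."""
    # Combiner step: no partition can contribute more than n rows to the
    # final answer, so anything beyond n can be dropped early.
    partial = [limit(p, n) for p in partitions]
    # Final step: limit the concatenation of the partial results.
    return limit(chain.from_iterable(partial), n)


partitions = [[1, 2, 3], [4, 5], [6]]
print(limit_with_combiner(partitions, 4))
```

The early per-partition cut is what would reduce the data shipped to the 
final step.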

 Add LIMIT as a statement that works in nested FOREACH
 -

 Key: PIG-741
 URL: https://issues.apache.org/jira/browse/PIG-741
 Project: Pig
  Issue Type: New Feature
Reporter: David Ciemiewicz
Assignee: Alan Gates
 Fix For: 0.3.0

 Attachments: PIG-741.patch


 I'd like to compute the top 10 results in each group.
 The natural way to express this in Pig would be:
 {code}
 A = load '...' using PigStorage() as (
 date: int,
 count: int,
 url: chararray
 );
 B = group A by ( date );
 C = foreach B {
 D = order A by count desc;
 E = limit D 10;
 generate
 FLATTEN(E);
 };
 dump C;
 {code}
 Yeah, I could write a UDF / PiggyBank function to take the top n results. But 
 since LIMIT already exists as a statement, it seems like it should also work 
 in the nested foreach context.
 Example workaround code.
 {code}
 C = foreach B {
 D = order A by count desc;
 E = util.TOP(D, 10);
 generate
 FLATTEN(E);
 };
 dump C;
 {code}
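
 As a reference for the intended semantics, the nested order-then-limit 
 above can be modelled in plain Python. The function name is an assumption, 
 and the groups are emitted in sorted key order only to make the example 
 deterministic.

```python
from collections import defaultdict


def top_per_group(rows, n):
    """For each date, keep the n rows with the highest count.

    `rows` are (date, count, url) tuples, mirroring the schema in the
    script above. Illustrative model only.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[row[0]].append(row)  # B = group A by (date)

    result = []
    for date in sorted(groups):  # sorted only for a deterministic example
        ordered = sorted(groups[date], key=lambda r: r[1], reverse=True)  # D
        result.extend(ordered[:n])  # E = limit D n, then FLATTEN(E)
    return result


rows = [
    (20090430, 5, "a"),
    (20090430, 9, "b"),
    (20090429, 1, "c"),
    (20090430, 7, "d"),
]
print(top_per_group(rows, 2))
```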




[jira] Commented: (PIG-741) Add LIMIT as a statement that works in nested FOREACH

2009-04-30 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12704685#action_12704685
 ] 

Olga Natkovich commented on PIG-741:


+1 on the patch with one question: is there a reason why tests were only added 
for local mode?

 Add LIMIT as a statement that works in nested FOREACH
 -

 Key: PIG-741
 URL: https://issues.apache.org/jira/browse/PIG-741
 Project: Pig
  Issue Type: New Feature
Reporter: David Ciemiewicz
Assignee: Alan Gates
 Fix For: 0.3.0

 Attachments: PIG-741.patch






[jira] Commented: (PIG-741) Add LIMIT as a statement that works in nested FOREACH

2009-04-30 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12704775#action_12704775
 ] 

Alan Gates commented on PIG-741:


I only added tests for local mode because inner operators are executed in local 
mode one way or another, so I didn't think there was a need to test it in the 
map reduce case as well.

 Add LIMIT as a statement that works in nested FOREACH
 -

 Key: PIG-741
 URL: https://issues.apache.org/jira/browse/PIG-741
 Project: Pig
  Issue Type: New Feature
Reporter: David Ciemiewicz
Assignee: Alan Gates
 Fix For: 0.3.0

 Attachments: PIG-741.patch






[jira] Created: (PIG-793) Improving memory efficiency of Tuple implementation

2009-04-30 Thread Olga Natkovich (JIRA)
Improving memory efficiency of Tuple implementation
---

 Key: PIG-793
 URL: https://issues.apache.org/jira/browse/PIG-793
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich


Currently, our tuple is a real pig and uses a lot of extra memory. 

There are several places where we can improve memory efficiency:

(1) Laying out memory for the fields directly rather than using Java objects, 
since each object for a numeric field takes 16 bytes.
(2) Using Java arrays rather than ArrayList in the cases where we know the 
schema.

There might be more.
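
The general trade-off in (1) can be illustrated in Python by comparing one 
heap object per number against a single packed buffer. The byte counts are 
those of the Python runtime the sketch runs on, not the Java figures quoted 
above, and none of the names come from Pig.

```python
import sys
from array import array

values = list(range(1000, 2000))

# Boxed layout: a list of references plus one heap object per number.
boxed = sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values)

# Packed layout: one contiguous buffer of 8-byte signed integers.
packed_values = array("q", values)
packed = sys.getsizeof(packed_values)

print("boxed bytes:", boxed)
print("packed bytes:", packed)
```

The packed form stores the same numbers without per-object overhead, which 
is the kind of saving the proposal is after.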




[jira] Updated: (PIG-732) Utility UDFs

2009-04-30 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-732:
---

Resolution: Fixed
Status: Resolved  (was: Patch Available)

Patch committed. Thanks, Ankur, for contributing.

 Utility UDFs 
 -

 Key: PIG-732
 URL: https://issues.apache.org/jira/browse/PIG-732
 Project: Pig
  Issue Type: New Feature
Reporter: Ankur
Priority: Minor
 Attachments: udf.v1.patch, udf.v2.patch, udf.v3.patch, udf.v4.patch, 
 udf.v5.patch


 Two utility UDFs and their respective test cases.
 1. TopN - Accepts the number of tuples (N) to retain in the output, the field 
 number (type long) to use for comparison, and a sorted or unsorted bag of 
 tuples. It outputs a bag containing the top N tuples.
 2. SearchQuery - Accepts an encoded URL from any of the 4 search engines 
 (Yahoo, Google, AOL, Live) and extracts and normalizes the search query 
 present in it.
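
 A rough Python sketch of what the SearchQuery description implies: pick the 
 query-string parameter for the search engine, decode it, and normalize it. 
 The host-to-parameter mapping and the normalization rule (lower-casing and 
 collapsing whitespace) are assumptions for illustration and are not taken 
 from the committed UDF.

```python
from urllib.parse import parse_qs, urlparse

# Assumed mapping from a host-name fragment to the query-string parameter
# that carries the search terms. Illustrative only.
QUERY_PARAMS = {"google.": "q", "yahoo.": "p", "aol.": "query", "live.": "q"}


def search_query(url):
    """Return the normalized search terms from `url`, or None if not found."""
    parts = urlparse(url)
    host = parts.netloc.lower()
    for fragment, param in QUERY_PARAMS.items():
        if fragment in host:
            # parse_qs decodes percent-escapes and turns '+' into a space.
            values = parse_qs(parts.query).get(param)
            if not values:
                return None
            # Assumed normalization: lower-case and collapse whitespace.
            return " ".join(values[0].lower().split())
    return None


print(search_query("http://www.google.com/search?q=Apache+Pig%20%20UDF&hl=en"))
```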




[jira] Created: (PIG-794) Use Avro serialization in Pig

2009-04-30 Thread Rakesh Setty (JIRA)
Use Avro serialization in Pig
-

 Key: PIG-794
 URL: https://issues.apache.org/jira/browse/PIG-794
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.2.0
Reporter: Rakesh Setty


We would like to use Avro serialization in Pig to pass data between MR jobs 
instead of the current BinStorage. Attached is an implementation of 
AvroBinStorage which performs significantly better compared to BinStorage on 
our benchmarks.




[jira] Updated: (PIG-795) Command that selects a random sample of the rows, similar to LIMIT

2009-04-30 Thread Eric Gaudet (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Gaudet updated PIG-795:


Attachment: sample2.diff

This patch implements the SAMPLE command. It basically adds a random sample 
mode to the LIMIT class. 

The syntax is like LIMIT: a = SAMPLE x, where x is an integer and 0<=x<=100. 
Each row will be selected if rand()<(x/100).

Example:

a = LOAD 'mybigdata'
b = SAMPLE 5
...

will select 5% of the data.
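
A minimal Python sketch of the per-row rule: each row is kept independently 
when a uniform random draw falls below x/100, so roughly x% of the rows 
survive. This is not the code in sample2.diff; the function name and the 
optional rng argument are assumptions added to make the example testable.

```python
import random


def sample(rows, x, rng=random):
    """Keep each row independently with probability x percent.

    Hypothetical helper; `rng` may be a seeded random.Random instance
    when reproducible output is wanted.
    """
    if not 0 <= x <= 100:
        raise ValueError("x must be between 0 and 100")
    # rng.random() is uniform on [0.0, 1.0), so x=0 keeps nothing and
    # x=100 keeps everything.
    return [row for row in rows if rng.random() < x / 100]


rows = list(range(1000))
kept = sample(rows, 5, random.Random(42))
print(len(kept), "of", len(rows), "rows kept")
```

Note that the number of rows kept varies from run to run; only the expected 
fraction is fixed.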



 Command that selects a random sample of the rows, similar to LIMIT
 --

 Key: PIG-795
 URL: https://issues.apache.org/jira/browse/PIG-795
 Project: Pig
  Issue Type: New Feature
  Components: impl
Reporter: Eric Gaudet
Priority: Trivial
 Attachments: sample2.diff


 When working with very large data sets (imagine that!), running a Pig script 
 can take time. It may be useful to run on a small subset of the data in some 
 situations (e.g., debugging/testing, or to get fast but less accurate 
 results). 
 The command LIMIT N selects the first N rows of the data, but these are not 
 necessarily randomized. A command SAMPLE X would retain each row only with 
 probability X%.
 Note: it is possible to implement this feature with FILTER BY and a UDF, but 
 so is LIMIT, and LIMIT is built-in.




[jira] Updated: (PIG-794) Use Avro serialization in Pig

2009-04-30 Thread Rakesh Setty (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rakesh Setty updated PIG-794:
-

Attachment: AvroBinStorage.patch

Patch file

 Use Avro serialization in Pig
 -

 Key: PIG-794
 URL: https://issues.apache.org/jira/browse/PIG-794
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.2.0
Reporter: Rakesh Setty
 Attachments: AvroBinStorage.patch


 We would like to use Avro serialization in Pig to pass data between MR jobs 
 instead of the current BinStorage. Attached is an implementation of 
 AvroBinStorage which performs significantly better compared to BinStorage on 
 our benchmarks.
