date:20090331

[jira] Created: (PIG-741) Add LIMIT as a statement that works in nested FOREACH

2009-03-31 Thread David Ciemiewicz (JIRA)

Add LIMIT as a statement that works in nested FOREACH
-

        Key: PIG-741
        URL: https://issues.apache.org/jira/browse/PIG-741
    Project: Pig
 Issue Type: New Feature
   Reporter: David Ciemiewicz


I'd like to compute the top 10 results in each group.

The natural way to express this in Pig would be:

{code}
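-- Load (date, count, url) records.
A = load '...' using PigStorage() as (
    date: int,
    count: int,
    url: chararray
);

-- Group the records by date.
B = group A by ( date );

-- For each date, keep the 10 rows with the highest count.
C = foreach B {
    D = order A by count desc;
    E = limit D 10;    -- requested feature: LIMIT inside a nested FOREACH
    generate
        FLATTEN(E);
};

dump C;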
{code}

Yeah, I could write a UDF / PiggyBank function to take the top n results. But 
since LIMIT already exists as a statement, it seems like it should also work in 
the nested foreach context.

Example workaround code.

{code}
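-- B is the grouped relation from the example above.
C = foreach B {
    D = order A by count desc;
    E = util.TOP(D, 10);    -- a user-written UDF takes the place of LIMIT
    generate
        FLATTEN(E);
};

dump C;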
{code}
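
To make the requested semantics concrete, here is a minimal Python sketch of what ORDER followed by LIMIT inside the nested FOREACH is expected to compute. It is not Pig code and says nothing about how Pig would implement the feature; the function name `top_n_per_group` and the sample rows are hypothetical.

```python
from collections import defaultdict
from operator import itemgetter


def top_n_per_group(rows, key, order_by, n):
    """Group rows by `key`, sort each group by `order_by` descending, keep n rows."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)              # B = group A by date
    result = []
    # Group keys are sorted here only to make the output deterministic;
    # this sketch makes no claim about the order Pig emits groups in.
    for group_key in sorted(groups):
        ordered = sorted(groups[group_key],
                         key=itemgetter(order_by),
                         reverse=True)            # D = order A by count desc
        result.extend(ordered[:n])                # E = limit D n; FLATTEN(E)
    return result


# Hypothetical sample data with the (date, count, url) schema from the issue.
sample_rows = [
    {"date": 20090330, "count": 5, "url": "a.example"},
    {"date": 20090330, "count": 9, "url": "b.example"},
    {"date": 20090330, "count": 7, "url": "c.example"},
    {"date": 20090331, "count": 3, "url": "d.example"},
    {"date": 20090331, "count": 8, "url": "e.example"},
]

top_two = top_n_per_group(sample_rows, "date", "count", 2)
```

With `n = 2`, each date keeps only its two highest-count rows, which is the per-group behaviour the issue asks LIMIT to provide.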

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PIG-729) Use of default parallelism

2009-03-31 Thread Milind Bhandarkar (JIRA)


[ https://issues.apache.org/jira/browse/PIG-729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12694144#action_12694144 ]

Milind Bhandarkar commented on PIG-729:
---

+1 for option 3. Make the parallel keyword mandatory on all statements that 
require it.

To elaborate:

Option 1. There can be no default that satisfies the majority.
Option 2. Unless it is an error that terminates execution, messages are usually 
ignored.
Option 3. Making the parallel keyword mandatory increases awareness of its 
relation to the number of reducers and the number of part files.

 Use of default parallelism
 --

              Key: PIG-729
              URL: https://issues.apache.org/jira/browse/PIG-729
          Project: Pig
       Issue Type: Bug
       Components: impl
 Affects Versions: 0.2.1
      Environment: Hadoop 0.20
         Reporter: Santhosh Srinivasan
          Fix For: 0.2.1


 Currently, if the user does not specify the number of reduce slots using the 
 parallel keyword, Pig lets Hadoop decide on the default number of reducers. 
 This model worked well with dynamically allocated clusters using HOD and for 
 static clusters where the default number of reduce slots was explicitly set. 
 With Hadoop 0.20, a single static cluster will be shared amongst a number of 
 queues. As a result, a common scenario is to end up with the default number of 
 reducers set to one (1).
 When users migrate to Hadoop 0.20, they might see a dramatic change in the 
 performance of their queries if they had not used the parallel keyword to 
 specify the number of reducers. In order to mitigate such circumstances, Pig 
 can support one of the following:
 1. Specify a default parallelism for the entire script.
 This option will allow users to use the same parallelism for all operators 
 that do not have the explicit parallel keyword. This will ensure that the 
 scripts utilize more reducers than the default of one reducer. On the down 
 side, due to data transformations, operations that are performed towards the 
 end of the script will usually need a smaller number of reducers than the 
 operators that appear at the beginning of the script.
 2. Display a warning message for each reduce-side operator that does not use 
 the explicit parallel keyword. Proceed with the execution.
 3. Display an error message indicating the operator that does not use the 
 explicit parallel keyword. Stop the execution.
 Other suggestions/thoughts/solutions are welcome.
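
The difference between options 2 and 3 can be sketched as a small check over a script's statements. The Python below is an illustrative assumption only, not Pig's parser or its actual behaviour: the helper names `missing_parallel` and `check`, the example script, and its input path are all hypothetical, and splitting on `;` is a deliberate simplification that mishandles nested FOREACH blocks and semicolons inside string literals.

```python
import re

# Operators that run on the reduce side and accept a PARALLEL clause.
REDUCE_SIDE = ("group", "cogroup", "join", "order", "distinct", "cross")


def missing_parallel(script):
    """Return statements that use a reduce-side operator without PARALLEL."""
    flagged = []
    for statement in script.split(";"):
        text = " ".join(statement.split())        # collapse whitespace
        match = re.match(r"\w+\s*=\s*(\w+)\b", text)
        if not match:
            continue                              # not an "alias = op ..." statement
        operator = match.group(1).lower()
        has_parallel = re.search(r"\bparallel\s+\d+", text, re.IGNORECASE)
        if operator in REDUCE_SIDE and not has_parallel:
            flagged.append(text)
    return flagged


def check(script, strict):
    """strict=False mimics option 2 (warn and proceed); strict=True mimics option 3 (stop)."""
    flagged = missing_parallel(script)
    if flagged and strict:
        raise ValueError("PARALLEL clause required on: " + "; ".join(flagged))
    return ["WARNING: no PARALLEL clause on: " + s for s in flagged]


# Hypothetical script: the GROUP lacks PARALLEL, the ORDER specifies it.
example_script = """
A = load 'input' as (date: int, count: int, url: chararray);
B = group A by date;
C = order A by count desc parallel 10;
"""

warnings = check(example_script, strict=False)
```

In the non-strict mode the script still runs and only a warning is collected for the GROUP statement; in the strict mode the same statement stops execution.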

