[Pig Wiki] Update of "PigUserCookbook" by AdilAijaz

Apache Wiki Thu, 23 Oct 2008 10:42:57 -0700

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.


The following page has been changed by AdilAijaz:
http://wiki.apache.org/pig/PigUserCookbook

------------------------------------------------------------------------------
  
  The following is a list of tips that people have discovered for making their 
Pig queries run faster.  Please feel free to add any tips you have.
  
- '''Project Early and Often'''
+ ''' Project Early and Often '''
  
  Pig does not (yet) determine when a field is no longer needed and drop the 
field from the row.  For example, say you have a query like:
  
@@ -75, +75 @@

  significant.  In one test where the key was null 7% of the time and the data 
was spread across 200 reducers, we saw about a 10x speedup in the query by 
adding the early
  filters.
  
+ 
+ '''Prefer DISTINCT over GROUP BY - GENERATE'''
+ 
+ When it comes to extracting the unique values from a column in a relation, 
one of two approaches can be used:
+ 
+ ''Using GROUP BY - GENERATE''
+ 
+ {{{
+ A = load 'myfile' as (t, u, v);
+ B = foreach A generate u;
+ C = group B by u;
+ D = foreach C generate group as uniquekey;
+ dump D; 
+ }}}
+ 
+ ''Using DISTINCT''
+ 
+ {{{
+ A = load 'myfile' as (t, u, v);
+ B = foreach A generate u;
+ C = distinct B;
+ dump C; 
+ }}}
+ 
+ In Pig 1.x, DISTINCT is just GROUP BY/PROJECT under the hood. In Pig 2.0 
(types branch) it is no longer implemented that way, and it is much faster and 
more efficient (up to 20x faster in the Pig team's tests, depending on your key 
cardinality). Therefore, the use of DISTINCT is recommended over GROUP BY - GENERATE. 
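
To make the equivalence of the two approaches concrete, here is a minimal Python sketch (not Pig code, and not a description of how Pig implements either operator). The sample rows and the function names `unique_via_group` and `unique_via_distinct` are made up for illustration. It only shows that both approaches return the same unique keys, and that the grouping route builds a bag per key which is then thrown away.

```python
# Hypothetical sample rows standing in for 'myfile' with fields (t, u, v).
rows = [
    ("t1", "a", 1),
    ("t2", "b", 2),
    ("t3", "a", 3),
    ("t4", "c", 4),
    ("t5", "b", 5),
]


def unique_via_group(rows):
    """Mimics: project u, group by u, then keep only the group key."""
    projected = [u for (_t, u, _v) in rows]
    groups = {}
    for u in projected:
        # Every matching value is collected into a bag per key,
        # even though only the key is used afterwards.
        groups.setdefault(u, []).append(u)
    return sorted(groups.keys())


def unique_via_distinct(rows):
    """Mimics: project u, then take the distinct values directly."""
    projected = [u for (_t, u, _v) in rows]
    return sorted(set(projected))


if __name__ == "__main__":
    print(unique_via_group(rows))
    print(unique_via_distinct(rows))
```

Both functions return the same list of keys; the difference is only in the intermediate work, which is the overhead the tip above is about.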
+
