[Pig Wiki] Update of ExampleGenerator by ShubhamChopra

2008-10-20 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Pig Wiki for change 
notification.

The following page has been changed by ShubhamChopra:
http://wiki.apache.org/pig/ExampleGenerator

--
  
  }}}
  
+ 
+ === Illustrate for Pig 2.0 ===
+ Illustrate is now also a part of Pig 2.0. The following are not currently 
supported and are on the roadmap:
+  * LIMIT
+  * SPLIT (both implicit and explicit)
+  * Nested FOREACH
+  * MAP data type
+ 


[Pig Wiki] Update of ExampleGenerator by ShubhamChopra

2008-04-24 Thread Apache Wiki

The following page has been changed by ShubhamChopra:
http://wiki.apache.org/pig/ExampleGenerator

--
  Amy bbc.com 20071205
  Fred stanford.edu 20071206
  }}}
- A grunt session might look something like this (Note the use of alias while 
loading data. ExampleGenerator needs you to provide aliases) :
+ A grunt session might look something like this (Note the use of schemas while 
loading data. ExampleGenerator needs you to provide aliases) :
  {{{
  grunt> visits = load 'visits.txt' as (user, url, timestamp);
  grunt> recent_visits = filter visits by timestamp >= '20071201';


[Pig Wiki] Update of ExampleGenerator by ShubhamChopra

2008-04-14 Thread Apache Wiki

The following page has been changed by ShubhamChopra:
http://wiki.apache.org/pig/ExampleGenerator

New page:
ILLUSTRATE Command : 

Illustrate is a new addition to Pig that helps users debug their Pig scripts.

The idea is to select a few example data items and illustrate how they are
transformed by the sequence of Pig commands in the user's program. The
ExampleGenerator algorithm can select an appropriate and concise set of example
data items automatically. It does a better job than random sampling would:
with random sampling, selective operations such as filters or joins can
eliminate all the sampled data items, giving you empty results, which are of no
help in debugging.

This ILLUSTRATE functionality saves people from having to test their Pig
programs on large data sets, which has a long turnaround time and wastes system
resources. The algorithm uses the local execution operators (it does not run
on Hadoop), so it can generate illustrative example data in near real time for
the user.

Usage :
The Illustrate command can be used in the following way:

Say the input file is 'visits.txt', containing the following data:
{{{
Amy cnn.com 20070218
Fred harvard.edu 20071204
Amy bbc.com 20071205
Fred stanford.edu 20071206
}}}
A grunt session might look something like this :
{{{
grunt> visits = load 'visits.txt' as (user, url, timestamp);
grunt> recent_visits = filter visits by timestamp >= '20071201';
grunt> user_visits = group recent_visits by user;
grunt> num_user_visits = foreach user_visits generate group, COUNT(recent_visits);
grunt> illustrate num_user_visits;
}}}
This triggers the ExampleGenerator, which displays examples something
like this:
{{{
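--------------------------------------------
| visits | user | url          | timestamp |
--------------------------------------------
|        | Fred | harvard.edu  | 20071204  |
|        | Fred | stanford.edu | 20071206  |
|        | Amy  | cnn.com      | 20070218  |
--------------------------------------------
---------------------------------------------------
| recent_visits | user | url          | timestamp |
---------------------------------------------------
|               | Fred | harvard.edu  | 20071204  |
|               | Fred | stanford.edu | 20071206  |
---------------------------------------------------
-----------------------------------------------------------------------------------------
| user_visits | group | recent_visits: (user, url, timestamp)                           |
-----------------------------------------------------------------------------------------
|             | Fred  | {(Fred, harvard.edu, 20071204), (Fred, stanford.edu, 20071206)} |
-----------------------------------------------------------------------------------------
------------------------------------
| num_user_visits | group | count1 |
------------------------------------
|                 | Fred  | 2      |
------------------------------------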
}}}
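
The point about selective operations can be tried on the same input file by
tightening the filter so that only one of the four rows survives. This is a
sketch, not a recorded session: the cutoff date and the alias latest_visits are
made up for illustration, and the fields in visits.txt are assumed to be
tab-separated (the default delimiter for load).
{{{
-- Hypothetical variation of the session above; latest_visits is an
-- illustrative alias, and '20071206' is an arbitrary, more selective cutoff.
grunt> visits = load 'visits.txt' as (user, url, timestamp);
grunt> latest_visits = filter visits by timestamp >= '20071206';
grunt> illustrate latest_visits;
}}}
Only the (Fred, stanford.edu, 20071206) row passes this filter, so a random
sample of two of the four input rows would miss it half the time and leave
latest_visits empty. ExampleGenerator is meant to avoid that by choosing input
rows that give each step something to show.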


[Pig Wiki] Update of ExampleGenerator by ShubhamChopra

2008-04-14 Thread Apache Wiki

The following page has been changed by ShubhamChopra:
http://wiki.apache.org/pig/ExampleGenerator

--
+ == Illustrate ==
- ILLUSTRATE Command : 
- 
  Illustrate is a new addition to Pig that helps users debug their Pig scripts.
  
  The idea is to select a few example data items and illustrate how they are
transformed by the sequence of Pig commands in the user's program. The
ExampleGenerator algorithm can select an appropriate and concise set of example
data items automatically. It does a better job than random sampling would:
with random sampling, selective operations such as filters or joins can
eliminate all the sampled data items, giving you empty results, which are of no
help in debugging.
  
  This ILLUSTRATE functionality saves people from having to test their Pig
programs on large data sets, which has a long turnaround time and wastes system
resources. The algorithm uses the local execution operators (it does not run
on Hadoop), so it can generate illustrative example data in near real time for
the user.
  
- Usage :
+ === Usage ===
  The Illustrate command can be used in the following way:
  
  Say the input file is 'visits.txt', containing the following data: