[Pig Wiki] Update of ExampleGenerator by ShubhamChopra

2008-04-14 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Pig Wiki for change 
notification.

The following page has been changed by ShubhamChopra:
http://wiki.apache.org/pig/ExampleGenerator

New page:
ILLUSTRATE Command:

ILLUSTRATE is a new addition to Pig that helps users debug their Pig scripts.

The idea is to select a few example data items and illustrate how they are 
transformed by the sequence of Pig commands in the user's program. The 
ExampleGenerator algorithm selects an appropriate and concise set of example 
data items automatically. It does a better job than random sampling would: 
random sampling suffers from the drawback that selective operations such as 
filters or joins can eliminate all of the sampled data items, giving you empty 
intermediate results that are of no help in debugging.
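The random-sampling drawback can be seen in a small, self-contained simulation. This is a hypothetical sketch (the `Visit` record and both sampling routines are illustrations, not Pig APIs); the filter mirrors the one in the usage example, intended as `timestamp >= '20071201'`:

```java
import java.util.List;
import java.util.stream.Collectors;

public class SamplingDemo {
    // Hypothetical stand-in for one row of visits.txt; not a Pig type.
    record Visit(String user, String url, String timestamp) {}

    static final List<Visit> VISITS = List.of(
            new Visit("Amy", "cnn.com", "20070218"),
            new Visit("Fred", "harvard.edu", "20071204"),
            new Visit("Amy", "bbc.com", "20071205"),
            new Visit("Fred", "stanford.edu", "20071206"));

    // The selective operation from the example script:
    //   filter visits by timestamp >= '20071201'
    static boolean recent(Visit v) {
        return v.timestamp().compareTo("20071201") >= 0;
    }

    // An unlucky random sample: both draws happen to predate the cutoff,
    // so the downstream filter leaves nothing to show the user.
    static List<Visit> unluckyRandomSample() {
        return List.of(VISITS.get(0), VISITS.get(0));
    }

    // Filter-aware selection: keep only candidates that survive the operator,
    // so the illustrated output is never empty when any row qualifies.
    static List<Visit> filterAwareSample(int n) {
        return VISITS.stream()
                .filter(SamplingDemo::recent)
                .limit(n)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        long surviveRandom = unluckyRandomSample().stream()
                .filter(SamplingDemo::recent).count();
        long surviveAware = filterAwareSample(2).stream()
                .filter(SamplingDemo::recent).count();
        System.out.println("random sample survivors: " + surviveRandom);  // 0
        System.out.println("filter-aware survivors:  " + surviveAware);   // 2
    }
}
```
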

This ILLUSTRATE functionality saves users from having to test their Pig 
programs on large data sets, which has a long turnaround time and wastes system 
resources. The algorithm uses the Local execution operators (it does not run 
on Hadoop), so it can generate illustrative example data in near real time for 
the user.

Usage:
The ILLUSTRATE command can be used in the following way.

Say the input file is 'visits.txt', containing the following tab-delimited data:
{{{
Amy     cnn.com         20070218
Fred    harvard.edu     20071204
Amy     bbc.com         20071205
Fred    stanford.edu    20071206
}}}
A grunt session might look something like this:
{{{
grunt> visits = load 'visits.txt' as (user, url, timestamp);
grunt> recent_visits = filter visits by timestamp >= '20071201';
grunt> user_visits = group recent_visits by user;
grunt> num_user_visits = foreach user_visits generate group, COUNT(recent_visits);
grunt> illustrate num_user_visits;
}}}
This triggers the ExampleGenerator, which displays examples something like 
this:
{{{
-------------------------------------------------
| visits        | user | url          | timestamp |
-------------------------------------------------
|               | Fred | harvard.edu  | 20071204  |
|               | Fred | stanford.edu | 20071206  |
|               | Amy  | cnn.com      | 20070218  |
-------------------------------------------------

---------------------------------------------------
| recent_visits | user | url          | timestamp |
---------------------------------------------------
|               | Fred | harvard.edu  | 20071204  |
|               | Fred | stanford.edu | 20071206  |
---------------------------------------------------

----------------------------------------------------------------------------------------
| user_visits | group | recent_visits: (user, url, timestamp)                           |
----------------------------------------------------------------------------------------
|             | Fred  | {(Fred, harvard.edu, 20071204), (Fred, stanford.edu, 20071206)} |
----------------------------------------------------------------------------------------

------------------------------------
| num_user_visits | group | count1 |
------------------------------------
|                 | Fred  | 2      |
------------------------------------
}}}
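The same dataflow (filter, group, count) can be sketched in plain Java as a sanity check on the tables above. All names here are illustrative stand-ins, not Pig APIs; feeding the pipeline the three example rows that ILLUSTRATE selected reproduces the final `num_user_visits` table (Fred, 2):

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class VisitsPipeline {
    // One row of visits.txt (illustrative stand-in, not a Pig type).
    record Visit(String user, String url, String timestamp) {}

    // num_user_visits: filter by timestamp, group by user, count per group.
    static Map<String, Long> numUserVisits(List<Visit> visits, String cutoff) {
        return visits.stream()
                .filter(v -> v.timestamp().compareTo(cutoff) >= 0)  // recent_visits
                .collect(Collectors.groupingBy(Visit::user,         // user_visits
                        Collectors.counting()));                    // COUNT(recent_visits)
    }

    public static void main(String[] args) {
        // The three example rows shown in the illustrated `visits` table.
        List<Visit> examples = List.of(
                new Visit("Fred", "harvard.edu", "20071204"),
                new Visit("Fred", "stanford.edu", "20071206"),
                new Visit("Amy", "cnn.com", "20070218"));
        System.out.println(numUserVisits(examples, "20071201"));  // {Fred=2}
    }
}
```
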


svn commit: r647997 - in /incubator/pig/trunk: ./ src/org/apache/pig/ src/org/apache/pig/backend/executionengine/ src/org/apache/pig/backend/hadoop/executionengine/mapreduceExec/ src/org/apache/pig/ba

2008-04-14 Thread gates
Author: gates
Date: Mon Apr 14 14:04:05 2008
New Revision: 647997

URL: http://svn.apache.org/viewvc?rev=647997&view=rev
Log:
PIG-188: Fix mismatches between pig slicer changes and new streaming feature.

Modified:
incubator/pig/trunk/CHANGES.txt
incubator/pig/trunk/src/org/apache/pig/Slice.java
incubator/pig/trunk/src/org/apache/pig/backend/executionengine/PigSlice.java

incubator/pig/trunk/src/org/apache/pig/backend/hadoop/executionengine/mapreduceExec/SliceWrapper.java

incubator/pig/trunk/src/org/apache/pig/backend/hadoop/streaming/HadoopExecutableManager.java
incubator/pig/trunk/test/org/apache/pig/test/RangeSlicer.java

Modified: incubator/pig/trunk/CHANGES.txt
URL: 
http://svn.apache.org/viewvc/incubator/pig/trunk/CHANGES.txt?rev=647997&r1=647996&r2=647997&view=diff
==============================================================================
--- incubator/pig/trunk/CHANGES.txt (original)
+++ incubator/pig/trunk/CHANGES.txt Mon Apr 14 14:04:05 2008
@@ -228,3 +228,6 @@
1k caused pig to freeze. (kali via gates)
 
PIG-204: Repair broken input splits (acmurthy via gates).
+
+   PIG-188: Fix mismatches between pig slicer changes and new streaming
+   feature (acmurthy via gates).

Modified: incubator/pig/trunk/src/org/apache/pig/Slice.java
URL: 
http://svn.apache.org/viewvc/incubator/pig/trunk/src/org/apache/pig/Slice.java?rev=647997&r1=647996&r2=647997&view=diff
==============================================================================
--- incubator/pig/trunk/src/org/apache/pig/Slice.java (original)
+++ incubator/pig/trunk/src/org/apache/pig/Slice.java Mon Apr 14 14:04:05 2008
@@ -41,6 +41,11 @@
 void init(DataStorage store) throws IOException;
 
 /**
+ * Returns the offset from which data in this Slice will be processed.
+ */
+long getStart();
+
+/**
  * Returns the length in bytes of all of the data that will be processed by
  * this Slice.
  * <p>
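
The hunk above adds `getStart()` alongside the existing `getLength()` in the `Slice` interface. A minimal sketch of an implementation follows; `SimpleSlice` is a simplified hypothetical mirror of the real `org.apache.pig.Slice` (which also declares `init()`, `getLocations()`, and other methods), and `RangeSlice` is analogous to the byte range that `PigSlice` stores:

```java
public class SliceSketch {
    // Simplified, hypothetical mirror of org.apache.pig.Slice after r647997;
    // the real interface has additional methods not shown here.
    interface SimpleSlice {
        long getStart();   // offset from which data in this slice is processed
        long getLength();  // length in bytes of the data this slice covers
    }

    // Byte-range slice over one file, analogous to what PigSlice stores.
    static class RangeSlice implements SimpleSlice {
        private final String file;
        private final long start;
        private final long length;

        RangeSlice(String file, long start, long length) {
            this.file = file;
            this.start = start;
            this.length = length;
        }

        public long getStart() { return start; }
        public long getLength() { return length; }
        public String getFile() { return file; }
    }

    public static void main(String[] args) {
        // With getStart() available, a wrapper can report the full byte range,
        // much as SliceWrapper does when setting map.input.start/length below.
        SimpleSlice s = new RangeSlice("visits.txt", 1024L, 4096L);
        System.out.println("processing bytes " + s.getStart() + " to "
                + (s.getStart() + s.getLength()));  // processing bytes 1024 to 5120
    }
}
```
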

Modified: 
incubator/pig/trunk/src/org/apache/pig/backend/executionengine/PigSlice.java
URL: 
http://svn.apache.org/viewvc/incubator/pig/trunk/src/org/apache/pig/backend/executionengine/PigSlice.java?rev=647997&r1=647996&r2=647997&view=diff
==============================================================================
--- 
incubator/pig/trunk/src/org/apache/pig/backend/executionengine/PigSlice.java 
(original)
+++ 
incubator/pig/trunk/src/org/apache/pig/backend/executionengine/PigSlice.java 
Mon Apr 14 14:04:05 2008
@@ -48,6 +48,10 @@
 return new String[] { file };
 }
 
+public long getStart() {
+return start;
+}
+
 public long getLength() {
 return length;
 }

Modified: 
incubator/pig/trunk/src/org/apache/pig/backend/hadoop/executionengine/mapreduceExec/SliceWrapper.java
URL: 
http://svn.apache.org/viewvc/incubator/pig/trunk/src/org/apache/pig/backend/hadoop/executionengine/mapreduceExec/SliceWrapper.java?rev=647997&r1=647996&r2=647997&view=diff
==============================================================================
--- 
incubator/pig/trunk/src/org/apache/pig/backend/hadoop/executionengine/mapreduceExec/SliceWrapper.java
 (original)
+++ 
incubator/pig/trunk/src/org/apache/pig/backend/hadoop/executionengine/mapreduceExec/SliceWrapper.java
 Mon Apr 14 14:04:05 2008
@@ -92,6 +92,15 @@
 DataStorage store = new 
HDataStorage(ConfigurationUtil.toProperties(job));
store.setActiveContainer(store.asContainer("/user/" + job.getUser()));
 wrapped.init(store);
+
+// Mimic org.apache.hadoop.mapred.FileSplit if feasible...
+String[] locations = wrapped.getLocations();
+if (locations.length > 0) {
+job.set("map.input.file", locations[0]);
+job.setLong("map.input.start", wrapped.getStart());
+job.setLong("map.input.length", wrapped.getLength());
+}
+
return new RecordReader<Text, Tuple>() {
 
 public void close() throws IOException {

Modified: 
incubator/pig/trunk/src/org/apache/pig/backend/hadoop/streaming/HadoopExecutableManager.java
URL: 
http://svn.apache.org/viewvc/incubator/pig/trunk/src/org/apache/pig/backend/hadoop/streaming/HadoopExecutableManager.java?rev=647997&r1=647996&r2=647997&view=diff
==============================================================================
--- 
incubator/pig/trunk/src/org/apache/pig/backend/hadoop/streaming/HadoopExecutableManager.java
 (original)
+++ 
incubator/pig/trunk/src/org/apache/pig/backend/hadoop/streaming/HadoopExecutableManager.java
 Mon Apr 14 14:04:05 2008
@@ -187,12 +187,13 @@
 }
 processError("\nCommand: " + sb.toString());
 processError("\nStart time: " + new Date(System.currentTimeMillis()));
-processError("\nInput-split file: " + job.get("map.input.file"));
-processError("\nInput-split start-offset: " + 
-job.getLong("map.input.start", -1));
-processError("\nInput-split length: " + 
-