[Pig Wiki] Update of ExampleGenerator by ShubhamChopra
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Pig Wiki for change notification.

The following page has been changed by ShubhamChopra:
http://wiki.apache.org/pig/ExampleGenerator

New page:
ILLUSTRATE Command :

Illustrate is a new addition to Pig that helps users debug their Pig scripts. The idea is to select a few example data items and illustrate how they are transformed by the sequence of Pig commands in the user's program. The ExampleGenerator algorithm selects an appropriate and concise set of example data items automatically. It does a better job than random sampling: with random sampling, selective operations such as filters or joins can eliminate every sampled data item, leaving empty results that are of no help in debugging. ILLUSTRATE spares people from testing their Pig programs on large data sets, which has a long turnaround time and wastes system resources. The algorithm uses the local execution operators (it does not run on Hadoop), so that it can generate illustrative example data in near-real-time for the user.
Usage :

The illustrate command can be used in the following way. Say the input file is 'visits.txt', containing the following data:
{{{
Amy     cnn.com         20070218
Fred    harvard.edu     20071204
Amy     bbc.com         20071205
Fred    stanford.edu    20071206
}}}

A grunt session might look something like this:
{{{
grunt> visits = load 'visits.txt' as (user, url, timestamp);
grunt> recent_visits = filter visits by timestamp >= '20071201';
grunt> user_visits = group recent_visits by user;
grunt> num_user_visits = foreach user_visits generate group, COUNT(recent_visits);
grunt> illustrate num_user_visits
}}}

This triggers the ExampleGenerator, which displays examples something like this:
{{{
--------------------------------------------------
| visits | user | url          | timestamp |
--------------------------------------------------
|        | Fred | harvard.edu  | 20071204  |
|        | Fred | stanford.edu | 20071206  |
|        | Amy  | cnn.com      | 20070218  |
--------------------------------------------------
| recent_visits | user | url          | timestamp |
--------------------------------------------------
|               | Fred | harvard.edu  | 20071204  |
|               | Fred | stanford.edu | 20071206  |
--------------------------------------------------
| user_visits | group | recent_visits: (user, url, timestamp )                          |
--------------------------------------------------
|             | Fred  | {(Fred, harvard.edu, 20071204), (Fred, stanford.edu, 20071206)} |
--------------------------------------------------
| num_user_visits | group | count1 |
--------------------------------------------------
|                 | Fred  | 2      |
--------------------------------------------------
}}}
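To make the contrast with random sampling concrete, here is a minimal Java sketch that replays the filter, group and count steps of the session above over in-memory rows. It is not Pig code and it is not the ExampleGenerator algorithm; the class `IllustrateSketch`, the `Visit` type and both helper methods are hypothetical names introduced only for illustration. It also assumes the filter keeps rows whose timestamp is on or after 20071201, which is what the example output implies.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not part of Pig or the ExampleGenerator.
class IllustrateSketch {

    // One row of visits.txt: (user, url, timestamp).
    static final class Visit {
        final String user;
        final String url;
        final String timestamp;

        Visit(String user, String url, String timestamp) {
            this.user = user;
            this.url = url;
            this.timestamp = timestamp;
        }
    }

    // Stand-in for: recent_visits = filter visits by timestamp >= '20071201';
    // Timestamps are fixed-width yyyymmdd strings, so string order is date order.
    static List<Visit> filterRecent(List<Visit> visits) {
        List<Visit> recent = new ArrayList<>();
        for (Visit v : visits) {
            if (v.timestamp.compareTo("20071201") >= 0) {
                recent.add(v);
            }
        }
        return recent;
    }

    // Stand-in for the group-by-user step followed by COUNT per group.
    static Map<String, Integer> countByUser(List<Visit> recent) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (Visit v : recent) {
            counts.merge(v.user, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        Visit amyOld = new Visit("Amy", "cnn.com", "20070218");
        Visit fred1 = new Visit("Fred", "harvard.edu", "20071204");
        Visit fred2 = new Visit("Fred", "stanford.edu", "20071206");

        // A sample that happens to hold only the old row: the filter drops everything.
        System.out.println(countByUser(filterRecent(Arrays.asList(amyOld))));

        // The rows shown in the example output: every operator has something to show.
        System.out.println(countByUser(filterRecent(Arrays.asList(fred1, fred2, amyOld))));
    }
}
```

Running `main` prints `{}` for the unlucky sample and `{Fred=2}` for the chosen rows. The second set also keeps one row that fails the filter, so the effect of the filter itself stays visible, which matches the three `visits` rows in the example output.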
[Pig Wiki] Update of ExampleGenerator by ShubhamChopra
The following page has been changed by ShubhamChopra:
http://wiki.apache.org/pig/ExampleGenerator

------------------------------------------------------------------------------
+ == Illustrate ==
- ILLUSTRATE Command :
- 
  Illustrate is a new addition to pig that helps users debug their pig scripts. The idea is to select a few example data items, and illustrate how they are transformed by the sequence of Pig commands in the user's program. The ExampleGenerator algorithm can select an appropriate and concise set of example data items automatically. It does a better job than random sampling would do; for example, random sampling suffers from the drawback that selective operations such as filters or joins can eliminate all the sampled data items, giving you empty results which is of no help in debugging. This ILLUSTRATE functionality will avoid people having to test their Pig programs on large data sets, which has a long turnaround time and wastes system resources. The algorithm uses the Local execution operators (it does not run on hadoop), so as to generate illustrative example data in near-real-time for the user.
  
- Usage :
+ === Usage ===
  Illustrate command can be used in the following way:
  Say the input file is 'visits.txt' containing the following data :
svn commit: r647997 - in /incubator/pig/trunk: ./ src/org/apache/pig/ src/org/apache/pig/backend/executionengine/ src/org/apache/pig/backend/hadoop/executionengine/mapreduceExec/ src/org/apache/pig/ba
Author: gates
Date: Mon Apr 14 14:04:05 2008
New Revision: 647997

URL: http://svn.apache.org/viewvc?rev=647997&view=rev
Log:
PIG-188: Fix mismatches between pig slicer changes and new streaming feature.

Modified:
    incubator/pig/trunk/CHANGES.txt
    incubator/pig/trunk/src/org/apache/pig/Slice.java
    incubator/pig/trunk/src/org/apache/pig/backend/executionengine/PigSlice.java
    incubator/pig/trunk/src/org/apache/pig/backend/hadoop/executionengine/mapreduceExec/SliceWrapper.java
    incubator/pig/trunk/src/org/apache/pig/backend/hadoop/streaming/HadoopExecutableManager.java
    incubator/pig/trunk/test/org/apache/pig/test/RangeSlicer.java

Modified: incubator/pig/trunk/CHANGES.txt
URL: http://svn.apache.org/viewvc/incubator/pig/trunk/CHANGES.txt?rev=647997&r1=647996&r2=647997&view=diff
==============================================================================
--- incubator/pig/trunk/CHANGES.txt (original)
+++ incubator/pig/trunk/CHANGES.txt Mon Apr 14 14:04:05 2008
@@ -228,3 +228,6 @@
     1k caused pig to freeze. (kali via gates)
 
     PIG-204: Repair broken input splits (acmurthy via gates).
+
+    PIG-188: Fix mismatches between pig slicer changes and new streaming
+    feature (acmurthy via gates).

Modified: incubator/pig/trunk/src/org/apache/pig/Slice.java
URL: http://svn.apache.org/viewvc/incubator/pig/trunk/src/org/apache/pig/Slice.java?rev=647997&r1=647996&r2=647997&view=diff
==============================================================================
--- incubator/pig/trunk/src/org/apache/pig/Slice.java (original)
+++ incubator/pig/trunk/src/org/apache/pig/Slice.java Mon Apr 14 14:04:05 2008
@@ -41,6 +41,11 @@
     void init(DataStorage store) throws IOException;
 
     /**
+     * Returns the offset from which data in this Slice will be processed.
+     */
+    long getStart();
+
+    /**
      * Returns the length in bytes of all of the data that will be processed by
      * this Slice.
      * <p>

Modified: incubator/pig/trunk/src/org/apache/pig/backend/executionengine/PigSlice.java
URL: http://svn.apache.org/viewvc/incubator/pig/trunk/src/org/apache/pig/backend/executionengine/PigSlice.java?rev=647997&r1=647996&r2=647997&view=diff
==============================================================================
--- incubator/pig/trunk/src/org/apache/pig/backend/executionengine/PigSlice.java (original)
+++ incubator/pig/trunk/src/org/apache/pig/backend/executionengine/PigSlice.java Mon Apr 14 14:04:05 2008
@@ -48,6 +48,10 @@
         return new String[] { file };
     }
 
+    public long getStart() {
+        return start;
+    }
+
     public long getLength() {
         return length;
     }

Modified: incubator/pig/trunk/src/org/apache/pig/backend/hadoop/executionengine/mapreduceExec/SliceWrapper.java
URL: http://svn.apache.org/viewvc/incubator/pig/trunk/src/org/apache/pig/backend/hadoop/executionengine/mapreduceExec/SliceWrapper.java?rev=647997&r1=647996&r2=647997&view=diff
==============================================================================
--- incubator/pig/trunk/src/org/apache/pig/backend/hadoop/executionengine/mapreduceExec/SliceWrapper.java (original)
+++ incubator/pig/trunk/src/org/apache/pig/backend/hadoop/executionengine/mapreduceExec/SliceWrapper.java Mon Apr 14 14:04:05 2008
@@ -92,6 +92,15 @@
         DataStorage store = new HDataStorage(ConfigurationUtil.toProperties(job));
         store.setActiveContainer(store.asContainer("/user/" + job.getUser()));
         wrapped.init(store);
+
+        // Mimic org.apache.hadoop.mapred.FileSplit if feasible...
+        String[] locations = wrapped.getLocations();
+        if (locations.length > 0) {
+            job.set("map.input.file", locations[0]);
+            job.setLong("map.input.start", wrapped.getStart());
+            job.setLong("map.input.length", wrapped.getLength());
+        }
+
         return new RecordReader<Text, Tuple>() {
 
             public void close() throws IOException {

Modified: incubator/pig/trunk/src/org/apache/pig/backend/hadoop/streaming/HadoopExecutableManager.java
URL: http://svn.apache.org/viewvc/incubator/pig/trunk/src/org/apache/pig/backend/hadoop/streaming/HadoopExecutableManager.java?rev=647997&r1=647996&r2=647997&view=diff
==============================================================================
--- incubator/pig/trunk/src/org/apache/pig/backend/hadoop/streaming/HadoopExecutableManager.java (original)
+++ incubator/pig/trunk/src/org/apache/pig/backend/hadoop/streaming/HadoopExecutableManager.java Mon Apr 14 14:04:05 2008
@@ -187,12 +187,13 @@
             }
             processError("\nCommand: " + sb.toString());
             processError("\nStart time: " + new Date(System.currentTimeMillis()));
-            processError("\nInput-split file: " + job.get("map.input.file"));
-            processError("\nInput-split start-offset: " +
-                         job.getLong("map.input.start", -1));
-            processError("\nInput-split length: " +
-