I have a large data set (> 2 TB) and tried to scan just 100 records from it.

a = load '/usr/largedata/' using PigStorage(',');
b = limit a 100;
dump b;

>>>>
2011-09-11 21:56:34,262 [main] INFO  org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: LIMIT
2011-09-11 21:56:34,414 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false
2011-09-11 21:56:34,483 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1
2011-09-11 21:56:34,484 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1
>>>>

This ends up launching an MR job with 20,000+ map tasks and a single reducer.

Is it possible for Pig to recognize such cases and actually read only 100
rows, rather than scanning the entire data set and emitting 100 rows?

This is on Pig 0.9.
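
For now, the only workaround I can see is to point the load at a single
part file instead of the whole directory, so only that file's splits get
scanned (the part file name below is just an example):

-- load a single part file (hypothetical name) rather than the full directory
a = load '/usr/largedata/part-00000' using PigStorage(',');
b = limit a 100;
dump b;

That keeps the map count down, but it obviously isn't a general answer.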

-- 
~Rajesh.B
