Normally Pig 0.8 just combines the small files
<http://pig.apache.org/docs/r0.8.0/cookbook.html#Combine+Small+Input+Files>
into bigger ones, so you should not lose any records.

You might be filtering out or limiting some records in your script. Try a
script with just a LOAD and a STORE, and check that the output is the same as
the input data.
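
A minimal sketch of that check in Pig Latin. The input glob and output
directory are hypothetical placeholders, and TextLoader is only an assumption
about the log format; use whatever loader your real script uses.

```pig
-- Hypothetical paths: replace with your real input glob and a scratch output dir.
raw = LOAD '/logs/2011-02/*' USING TextLoader() AS (line:chararray);

-- Copy the data straight through, with no FILTER or LIMIT in between.
STORE raw INTO '/tmp/pig_copy_check';

-- Count the loaded records so the total can be compared with the raw input.
grouped = GROUP raw ALL;
total   = FOREACH grouped GENERATE COUNT(raw);
DUMP total;
```

If the count matches the number of lines in the input files, the combining
step is not dropping records and the difference comes from the script itself.
If it does not match, the cookbook page linked above describes the
pig.splitCombination property, which should let you rerun with combining
turned off for comparison.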

Romain

On Sat, Feb 26, 2011 at 7:25 PM, Charles Gonçalves <[email protected]> wrote:

> I tried to process a large number of small files in Pig and ran into a
> strange problem.
>
> 2011-02-27 00:00:58,746 [Thread-15] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : *43458*
> 2011-02-27 00:00:58,755 [Thread-15] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : *43458*
> 2011-02-27 00:01:14,173 [Thread-15] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : *329*
>
> When the script finishes, the result covers only a subset of the input
> files. These are logs from a whole month, but the results are only from
> day 21.
>
>
> Maybe I'm missing something.
> Any ideas?
>
> --
> *Charles Ferreira Gonçalves *
> http://homepages.dcc.ufmg.br/~charles/
> UFMG - ICEx - Dcc
> Cel.: 55 31 87741485
> Tel.:  55 31 34741485
> Lab.: 55 31 34095840
>
