Combine input splits should be able to handle compressed files. Pig creates a separate RecordReader for each file within one input split, so gzip concatenation should not be an issue. I am not sure what happened with your script. If possible, give us more information (script, UDF, data, version).
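As an aside, the per-file-reader behavior described above can be modeled roughly like this. This is an illustrative Python sketch, not Pig's actual code: the point is only that a combined split keeps a list of files and opens a fresh decompressing reader for each one, so the compressed bytes are never concatenated into a single stream.

```python
# Illustrative model of a combined input split: one fresh reader per
# file, so each gzip file is decompressed independently.
import gzip
import io


def read_combined_split(split_files):
    """Yield records from every file in a combined split, one reader per file."""
    for raw in split_files:
        # A new decompressing reader is created for each file in the split;
        # the files are never byte-concatenated into one gzip stream.
        with gzip.open(io.BytesIO(raw), "rt") as reader:
            for line in reader:
                yield line.rstrip("\n")


# Two tiny "log files", each independently gzip-compressed.
files = [gzip.compress(b"day-01 100\n"), gzip.compress(b"day-02 200\n")]
records = list(read_combined_split(files))
print(records)  # ['day-01 100', 'day-02 200']
```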

Daniel


On 02/28/2011 05:40 PM, Charles Gonçalves wrote:
Guys,

The amount of data in the source dir:
hdfs://hydra1:57810/user/cdh-hadoop/mscdata/201010_raw  22567369111

What I did was:
I ran with all 43458 logs, and the counters are:

                        Map             Reduce          Total
FILE_BYTES_READ         253,905,706     372,708,857     626,614,563
HDFS_BYTES_READ         2,553,123,734   0               2,553,123,734
FILE_BYTES_WRITTEN      619,877,917     372,708,857     992,586,774
HDFS_BYTES_WRITTEN      0               535             535


I did a manual merge of the files and ran again on the resulting 336 files (the merge of all those files).
The job hasn't finished yet, and the counters so far are:

                        Map             Reduce          Total
FILE_BYTES_READ         21,054,970,818  0               21,054,970,818
HDFS_BYTES_READ         16,772,063,486  0               16,772,063,486
FILE_BYTES_WRITTEN      39,797,038,008  10,404,287,551  50,201,325,559



I think the problem could be in the combination of the input files.
Is the combination class aware of compression?
Because *all my files are compressed*.
Maybe the class performs a concatenation and we hit the Hadoop limitation on concatenated gzip files.
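For what it's worth, the pitfall behind that guess can be reproduced outside Hadoop. This is a minimal Python sketch of what goes wrong with concatenated gzip when the reader only handles a single gzip member; whether Pig's split combiner actually concatenates the compressed bytes is exactly the open question here.

```python
# The concatenated-gzip pitfall: a decompressor that handles only a
# single gzip member silently stops at the first one, which is the
# classic way records "disappear" from concatenated .gz inputs.
import gzip
import zlib

part1 = gzip.compress(b"2010-10-01 traffic\n")
part2 = gzip.compress(b"2010-10-21 traffic\n")
concatenated = part1 + part2  # two gzip members back to back

# A multi-member-aware reader sees everything:
assert gzip.decompress(concatenated) == b"2010-10-01 traffic\n2010-10-21 traffic\n"

# A single-member decompressor (wbits=31 => exactly one gzip stream)
# stops after the first member; the rest lands in unused_data, undecoded.
d = zlib.decompressobj(wbits=31)
first_only = d.decompress(concatenated)
print(first_only)              # b'2010-10-01 traffic\n'
print(len(d.unused_data) > 0)  # True: the second member was never decoded
```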

On Mon, Feb 28, 2011 at 8:47 PM, Charles Gonçalves <[email protected]> wrote:



    On Mon, Feb 28, 2011 at 7:39 PM, Thejas M Nair
    <[email protected]> wrote:

        Hi Charles,
        Which load function are you using ?

    I'm using a user-defined load function ..

        Is it the default (PigStorage)?

    Nope ...

        In the hadoop counters for the job in the jobtracker ui, do
        you see the expected number of input records being read?

    Is it possible to see the counters in the history interface on
    the JobTracker?
    I will run the jobs again to compare the counters, but my guess
    is probably not!

        -Thejas




        On 2/28/11 10:57 AM, "Charles Gonçalves" <[email protected]> wrote:

            I'm not using any filtering in the script;
            I just want to see the total traffic per day across all the logs.

            If I combine 1000 log files into one and run the script on
            those files, I get the correct answer for those logs.
            But when I run with all *43458* log files, I get
            incorrect output.
            The correct result would be a histogram for each day of
            2010-10, but the result contains only data from 2010-10-21.
            And if I process all the logs with an awk script, I get the
            correct answer.


            On Mon, Feb 28, 2011 at 3:29 PM, Daniel Dai
            <[email protected]> wrote:

            > Not sure if I get your question. In 0.8, Pig combines
            > small files into one map, so it is possible you get
            > fewer output files.

            This is not the problem.
            But thanks anyway!

            > If that is your concern, you can try to disable split
            > combination using "-Dpig.splitCombination=false"
            >
            > Daniel
            >
            >
            > Charles Gonçalves wrote:
            >
            >> I tried to process a large number of small files in Pig
            >> and I got a strange problem.
            >>
            >> 2011-02-27 00:00:58,746 [Thread-15] INFO
            >>  org.apache.hadoop.mapreduce.lib.input.FileInputFormat
            >>  - Total input paths to process : *43458*
            >> 2011-02-27 00:00:58,755 [Thread-15] INFO
            >>  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil
            >>  - Total input paths to process : *43458*
            >> 2011-02-27 00:01:14,173 [Thread-15] INFO
            >>  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil
            >>  - Total input paths (combined) to process : *329*
            >>
            >> When the script finishes processing, the result covers
            >> only a subset of the input files.
            >> These are logs from a whole month, but the results are
            >> just from day 21.
            >>
            >>
            >> Maybe I'm missing something.
            >> Any ideas?
            >>
            >>
            >>
            >
            >


            --
            *Charles Ferreira Gonçalves *
            http://homepages.dcc.ufmg.br/~charles/
            UFMG - ICEx - Dcc
            Cel.: 55 31 87741485
            Tel.:  55 31 34741485
            Lab.: 55 31 34095840









