Hi, I want to add a row/line number to the data I read from multiple CSVs. 
However, I want the running number to reflect the line number *per input file*, 
not overall.

I am happy to write a Python UDF for this. So far I have in the UDF:

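    # --- Python file udf.py ---
    lineNum = 0  # module-level counter, shared by every call to makeData

    @outputSchema("lnum:int, f1:chararray")
    def makeData(line):
        global lineNum
        lineNum += 1
        # no type is declared in the load, so line arrives as a byte array
        return lineNum, line.tostring()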

which is called from Pig:

    --- Pig file use-udf.pig ---
    register 'udf.py' using jython as udfs;

    data = load 'datadir' using TextLoader() as line;
    udfified = foreach data generate udfs.makeData(line);

    dump udfified;

This approach works, *but* the running number keeps increasing across all the 
files in the directory "datadir". That is *not* what I want! I need the row 
number to start at 1 for each file in datadir. Maybe I can reset the lineNum 
variable per input file?
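
To make the "reset per file" idea concrete, here is a rough, untested-in-Pig sketch of what I have in mind. It assumes the UDF can be handed the source file name next to each line, for example by loading with PigStorage and its '-tagFile' option instead of TextLoader, using a delimiter that never occurs in the data so the whole line stays in one field. The name makeDataPerFile, the extra fname argument and the three-field schema are my own inventions for illustration, and the small outputSchema fallback is only there so the file can also be run outside Pig.

```python
# Sketch only. Assumed Pig side (not verified):
#   data = load 'datadir' using PigStorage('\t', '-tagFile')
#          as (fname:chararray, line:chararray);
#   numbered = foreach data generate udfs.makeDataPerFile(fname, line);

try:
    outputSchema  # normally provided by Pig when the script is registered
except NameError:
    # Fallback so the module can be imported in a plain Python interpreter.
    def outputSchema(schema):
        def decorator(func):
            return func
        return decorator

lineNum = 0
currentFile = None  # file name seen on the previous call


@outputSchema("fname:chararray, lnum:int, f1:chararray")
def makeDataPerFile(fname, line):
    """Number lines starting at 1, restarting when the file name changes."""
    global lineNum, currentFile
    if fname != currentFile:
        # First record from a different file: restart the count.
        currentFile = fname
        lineNum = 0
    lineNum += 1
    return fname, lineNum, line
```

I suspect this can only be right if all records of a file reach the same task in their original order. If a large file is split across several mappers, each split would presumably start counting from 1 again.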

Any idea how to achieve this? Either with plain Pig or with Python UDFs?

Many thanks, Leo
