Hi, I want to add a row/line number to the data I read from multiple CSVs.
However, I want the running number to reflect the line number *per input file*, not
overall.
I am happy to write a Python UDF for this. So far I have in the UDF:
--- Python file udf.py ---
lineNum = 0

@outputSchema("lnum:int, f1:chararray")
def makeData(line):
    global lineNum
    lineNum += 1
    return lineNum, line.tostring()
which is called from Pig:
--- Pig file use-udf.pig ---
register 'udf.py' using jython as udfs;
data = load 'datadir' using TextLoader() as line;
udfified = foreach data generate udfs.makeData(line);
dump udfified;
This approach works, *but* the running number keeps increasing across the
files in the directory "datadir". That is *not* what I want! I need the row
number to start at 1 for each file in datadir. Maybe I can reset the lineNum
variable per input file?
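
To make the idea concrete, here is a rough sketch of the reset I have in mind.
It assumes the input file name could somehow be passed to the UDF as an extra
argument, which I do not know how to get from TextLoader. The names
makeDataPerFile, fileName and lastFile are made up for illustration, and line
is assumed to arrive as a plain string. The try/except at the top is only there
so the file can also be run outside Pig, where the outputSchema decorator is
not predefined.

```python
# Fallback for running outside Pig, where outputSchema is not predefined.
try:
    outputSchema
except NameError:
    def outputSchema(schema):
        def wrap(func):
            return func
        return wrap

lineNum = 0      # running number within the current file
lastFile = None  # hypothetical: name of the file the previous line came from

@outputSchema("lnum:int, f1:chararray")
def makeDataPerFile(fileName, line):
    # Assumption: fileName identifies the input file of this line.
    global lineNum, lastFile
    if fileName != lastFile:
        # A new file has started, so restart the counter.
        lastFile = fileName
        lineNum = 0
    lineNum += 1
    return lineNum, line
```

Even then, this would presumably only give correct numbers if all lines of a
file are processed in order by the same task. A file split across several map
tasks would restart the counter in each of them.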
Any idea how to achieve this? Either with plain Pig or with Python UDFs?
Many thanks, Leo