Hi! Probably these can help:
http://pig.apache.org/docs/r0.11.1/basic.html#rank
http://pig.apache.org/docs/r0.11.1/func.html#pigstorage (look for -tagsource)
I've never tried this, but probably you could group by tagsource and then apply RANK.

Ruslan

On Fri, Aug 16, 2013 at 6:17 AM, Leo <[email protected]> wrote:
> Hi, I want to add a row/line number to the data I read from multiple CSVs.
> However I want the running number reflect the line number *per input file*,
> not overall.
>
> I am happy to write a Python UDF for this. So far I have in the UDF:
>
> --- Python file udf.py ---
> lineNum = 0
>
> @outputSchema("lnum:int, f1:chararray")
> def makeData(line):
>     global lineNum
>     lineNum += 1
>     return lineNum, line.tostring()
>
> which is called from Pig:
>
> --- Pig file use-udf.pig ---
> register 'udf.py' using jython as udfs;
>
> data = load 'datadir' using TextLoader() as line;
> udfified = foreach data generate udfs.makeData(line);
>
> dump udfified;
>
> This approach works, *but* the running number increases over multiple
> files in the directory "datadir". That is *not* what I want! I need the row
> number starting with 1 for each file in datadir. Maybe I can reset the
> lineNum variable per input file?
>
> Any idea how to achieve this? Either with plain Pig or with Python UDFs?
>
> Many thanks, Leo
