That's an interesting approach! I'm not sure, though, whether RANK is supported as a nested FOREACH operator. If it is, this approach would work, but the documentation doesn't list RANK among the supported nested FOREACH operators:
http://pig.apache.org/docs/r0.11.1/basic.html#foreach

On Wed, Aug 21, 2013 at 11:03 AM, Ruslan Al-Fakikh <[email protected]> wrote:

> Hi!
>
> Probably these can help:
> http://pig.apache.org/docs/r0.11.1/basic.html#rank
> http://pig.apache.org/docs/r0.11.1/func.html#pigstorage (look for
> -tagsource)
>
> I've never tried this, but probably you could group by tagsource and then
> apply RANK
>
> Ruslan
>
>
> On Fri, Aug 16, 2013 at 6:17 AM, Leo <[email protected]> wrote:
>
> > Hi, I want to add a row/line number to the data I read from multiple CSVs.
> > However I want the running number reflect the line number *per input file*,
> > not overall.
> >
> > I am happy to write a Python UDF for this. So far I have in the UDF:
> >
> > --- Python file udf.py ---
> > lineNum = 0
> >
> > @outputSchema("lnum:int, f1:chararray")
> > def makeData(line):
> >     global lineNum
> >     lineNum += 1
> >     return lineNum, line.tostring()
> >
> > which is called from Pig:
> >
> > --- Pig file use-udf.pig ---
> > register 'udf.py' using jython as udfs;
> >
> > data = load 'datadir' using TextLoader() as line;
> > udfified = foreach data generate udfs.makeData(line);
> >
> > dump udfified;
> >
> > This approach works, *but* the running number increases over multiple
> > files in the directory "datadir". That is *not* what I want! I need the row
> > number starting with 1 for each file in datadir. Maybe I can reset the
> > lineNum variable per input file?
> >
> > Any idea how to achieve this? Either with plain Pig or with Python UDFs?
> >
> > Many thanks, Leo
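
If nested RANK turns out not to be available, the "reset lineNum per input file" idea from the original post could be sketched roughly as below. This is only a sketch under assumptions, not something I have run inside Pig: it assumes the source file name reaches the UDF as an extra argument (for example from the -tagsource option mentioned above), and the function name numberLine and the _counters dictionary are made up for illustration. The schema string is copied from the original udf.py.

```python
# Hypothetical variant of udf.py: one counter per source file name.
# Assumption: Pig passes the file name (e.g. from -tagsource) as 'fname'
# and both arguments arrive as strings.

try:
    outputSchema  # provided by Pig when the script is registered as a UDF
except NameError:
    def outputSchema(schema):
        # No-op stand-in so the file can also be run outside Pig.
        def decorate(func):
            return func
        return decorate

_counters = {}  # file name -> last line number handed out


@outputSchema("lnum:int, f1:chararray")
def numberLine(fname, line):
    """Return (n, line), where n restarts at 1 for every new fname."""
    n = _counters.get(fname, 0) + 1
    _counters[fname] = n
    return n, line
```

One caveat: the counters live in a single interpreter, so the numbers are only true line numbers if each file is read from start to end, in order, by one task. A file that is split across several map tasks would start again at 1 in each split.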
