Re: dev How can I add a row number per input file to the data

Pradeep Gollakota Wed, 21 Aug 2013 08:34:39 -0700

That's an interesting approach! However, I'm not sure whether RANK is
supported as a nested foreach operator. The documentation doesn't list it
as one. If it is supported, then this approach would work.


http://pig.apache.org/docs/r0.11.1/basic.html#foreach
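
If RANK turns out not to work inside a nested foreach, another option might
be to keep the Python UDF and reset the counter whenever the source file
changes. Below is a minimal, untested-on-a-cluster sketch. It assumes the
data is loaded with PigStorage(',', '-tagsource'), so that the file name
arrives as the first field and can be passed to the UDF. The file name
udf_per_file.py and the function name lineNumber are made up for
illustration. It also assumes that all records of one file reach the same
task one after another; a file that is split across several map tasks would
restart its numbering in each split.

```python
# --- Python file udf_per_file.py (hypothetical name) ---
try:
    outputSchema              # supplied by Pig when registered "using jython"
except NameError:             # fallback so the file also runs outside Pig
    def outputSchema(schema):
        return lambda func: func

_last_source = None   # file name seen on the previous record
_line_num = 0         # running count within the current file

@outputSchema("lnum:int")
def lineNumber(source):
    """Return a counter that restarts at 1 whenever `source` changes."""
    global _last_source, _line_num
    if source != _last_source:
        # A new input file has started: remember it and reset the counter.
        _last_source = source
        _line_num = 0
    _line_num += 1
    return _line_num
```

From Pig, this would be called with the tagsource field as its argument and
generated alongside the remaining fields of each record.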


On Wed, Aug 21, 2013 at 11:03 AM, Ruslan Al-Fakikh <[email protected]> wrote:

> Hi!
>
> Probably these can help:
> http://pig.apache.org/docs/r0.11.1/basic.html#rank
> http://pig.apache.org/docs/r0.11.1/func.html#pigstorage (look for
> -tagsource)
>
> I've never tried this, but you could probably group by tagsource and then
> apply RANK.
>
> Ruslan
>
>
> On Fri, Aug 16, 2013 at 6:17 AM, Leo <[email protected]> wrote:
>
> > Hi, I want to add a row/line number to the data I read from multiple
> > CSVs. However, I want the running number to reflect the line number
> > *per input file*, not overall.
> >
> > I am happy to write a Python UDF for this. So far I have in the UDF:
> >
> >     --- Python file udf.py ---
> >     lineNum = 0
> >
> >     @outputSchema("lnum:int, f1:chararray")
> >     def makeData(line):
> >         global lineNum
> >         lineNum += 1
> >         return lineNum, line.tostring()
> >
> > which is called from Pig:
> >
> >     --- Pig file use-udf.pig ---
> >     register 'udf.py' using jython as udfs;
> >
> >     data = load 'datadir' using TextLoader() as line;
> >     udfified = foreach data generate udfs.makeData(line);
> >
> >     dump udfified;
> >
> > This approach works, *but* the running number increases over multiple
> > files in the directory "datadir". That is *not* what I want! I need the
> > row number to start at 1 for each file in datadir. Maybe I can reset
> > the lineNum variable per input file?
> >
> > Any idea how to achieve this? Either with plain Pig or with Python UDFs?
> >
> > Many thanks, Leo
> >
>
