I'm not aware of any native Pig command that can do this, so you'll have
to implement a UDF. My implementation would look as follows:

A = load 'data' as (id: int, seg_num: int, text: chararray);
B = group A by id;
C = foreach B {
    D = order A by seg_num; -- assuming that data is not sorted by seg_num
    generate group, CONCAT_UDF(D); -- after GROUP, the key field is named 'group'
};
dump C;

Within the CONCAT_UDF implementation, the input is a DataBag whose tuples
are already sorted by seg_num, so you can iterate over it, append each
tuple's text field to a StringBuilder, and return the resulting string.
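
For illustration, here is a minimal sketch of the same logic written as a
Jython UDF rather than the Java/StringBuilder version described above. The
function name concat_udf, the space separator, and the (id, seg_num, text)
tuple layout are assumptions taken from the example script, and I am
assuming Pig makes the outputSchema decorator available to scripts
registered "using jython"; the fallback only exists so the file can be run
outside Pig.

```python
# Sketch of CONCAT_UDF as a Jython UDF; names here are illustrative.
try:
    outputSchema  # assumed to be provided by Pig for "using jython" scripts
except NameError:
    # Fallback so the file can also be run and tested outside Pig.
    def outputSchema(schema):
        def wrap(func):
            return func
        return wrap


@outputSchema("text:chararray")
def concat_udf(bag):
    """Join the text field of each (id, seg_num, text) tuple, ordered by seg_num."""
    if bag is None:
        return None
    # Re-sort defensively; this is harmless if the bag is already ordered.
    ordered = sorted(bag, key=lambda t: t[1])
    # Assumption: segments are separated by a single space; null text is skipped.
    return " ".join(t[2] for t in ordered if t[2] is not None)
```

If this were saved as concat_udf.py (a hypothetical file name), it could be
registered with "register 'concat_udf.py' using jython as udfs;" and called
as udfs.concat_udf(D) in place of CONCAT_UDF(D).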

Hope this helps.


On Sun, Jul 14, 2013 at 10:39 AM, Shahab Yunus <[email protected]> wrote:

> At least I am not aware of a Pig command which can do this. You can start
> by grouping on 'id', and then try flattening the 'text' field. But then
> you run into the issue that you have lost the sorting order ('seg_no')
> which is required to construct a meaningful sentence. Here I think you need
> a UDF where you pass both 'seg_no' and 'text' and do the work.
>
> I can think of doing some convoluted processing like concatenating the
> 'seg_no' and 'text' fields as one and then grouping on 'id' and then
> sorting on the new concatenated field within the group. But once you've
> done that, you will have to split the combined field back apart again. And
> doing all this might not help either. The main thing here is that, as far
> as I know, you cannot impose sort order in a bag or while flattening a
> group in one row. I would be interested to know if this is possible through
> native Pig.
>
> Regards,
> Shahab
>
>
> On Sat, Jul 13, 2013 at 9:45 PM, Karthik Natarajan <
> [email protected]> wrote:
>
> > Hi,
> >
> > I'm new to Pig. I have a file that contains the contents of documents. The
> > problem is that the contents are not in one line of the file. The file is
> > actually an export of a database table. Below is an example of the table:
> >
> > id seg_no  text
> > -- -----  -----
> > 1  0      This is
> > 1  1      a
> > 1  2      test for
> > 1  3      Hello
> > 1  4      World!
> > 2  0      Test
> > 2  1      number
> > 2  2      two.
> >
> >
> > How do I get an output like this:
> >
> > id  text
> > --  ----
> > 1   This is a test for Hello World!
> > 2   Test number two.
> >
> >
> > I can do this in SQL, but I want to try it using Hadoop and Pig. I'm not
> > sure how to concatenate values of a column within a group. I'm wondering if
> > Pig's built-in functions can handle this or if I have to create a UDF. I'm
> > thinking I need to create a UDF, but am not sure how to go about this. Any
> > help/advice would be appreciated.
> >
> > Thanks.
> >
>
