Hi,
Thanks so much, it worked for me. Could you please explain, symbol by symbol,
what the ([^<]*) and \\n\\s* parts of the expression below mean?

(RegexExtractAll(revision,'<id>([^<]*)</id>\\n\\s*<revision>\\n\\s*<id>([^>]*)</id>\\n\\s*<username>([^>]*)</username>\\n\\s*</revision>')
)

Thanks
Krishnan
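
For readers following the thread, a minimal sketch of what those two pieces do, assuming the pattern is interpreted with Java regular-expression syntax (java.util.regex). ([^<]*) is a capture group holding zero or more characters that are not '<', i.e. the text up to the next tag. \\n\\s* is a newline followed by any amount of whitespace (the indentation before the next tag); the backslashes are doubled so that the regex engine receives \n and \s after the string literal is unescaped. The class name RegexPieces, the helper extract and the sample string are made up for illustration, and find() is a choice of this sketch, not necessarily what RegexExtractAll does internally.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexPieces {
    // Same pattern as in the Pig script. In a Java string literal "\\n" and
    // "\\s" reach the regex engine as \n (newline) and \s (whitespace).
    static final Pattern REVISION = Pattern.compile(
        "<id>([^<]*)</id>\\n\\s*<revision>\\n\\s*<id>([^>]*)</id>\\n\\s*"
        + "<username>([^>]*)</username>\\n\\s*</revision>");

    // Returns the three captured groups (page id, revision id, username),
    // or null if the pattern does not occur in the input.
    static String[] extract(String xml) {
        Matcher m = REVISION.matcher(xml);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        // Hypothetical sample: one page id followed by a single revision.
        String sample = "<id>1</id>\n<revision>\n     <id>34</id>\n"
            + "     <username>user1</username>\n</revision>";
        String[] parts = extract(sample);
        System.out.println(String.join(" ", parts));
    }
}
```

Note that the second and third groups use [^>]* rather than [^<]*. They still capture only the tag text, because the engine backtracks until the closing tag matches, but [^<]* states the intent more directly.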

On Thu, May 17, 2012 at 7:08 AM, Francisco Javier Gonzalez Garcia <
[email protected]> wrote:

> This is an example that extracts one revision per page (handling several
> revisions per page is more complex, but it's possible):
>
> REGISTER /usr/lib/pig/contrib/piggybank/java/piggybank.jar;
> DEFINE XMLLoader org.apache.pig.piggybank.storage.XMLLoader();
> DEFINE RegexExtractAll
> org.apache.pig.piggybank.evaluation.string.RegexExtractAll();
>
> revisionXML = LOAD 'Revision.xml' USING XMLLoader('page') AS
> (revision:chararray);
>
> rev = FOREACH revisionXML GENERATE FLATTEN(
>   RegexExtractAll(revision,'<id>([^<]*)</id>\\n\\s*<revision>\\n\\s*<id>([^>]*)</id>\\n\\s*<username>([^>]*)</username>\\n\\s*</revision>')
> ) AS (
>   page: chararray,
>   id_revision: chararray,
>   username: chararray
> );
>
>
> dump rev;
>
>
>
> 2012/5/17, Herbert Mühlburger <[email protected]>:
> > Hi list,
> >
> > I would like to parse the following XML-File using Pig:
> >
> > <page>
> >    <id>1</id>
> > <revision>
> >      <id>1</id>
> >      <username>muehlburger</username>
> > </revision>
> > <revision>
> >      <id>2</id>
> >      <username>muehlburger</username>
> > </revision>
> > <revision>
> >      <id>3</id>
> >      <username>user1</username>
> > </revision>
> > ...
> > <revision>
> >      <id>34334398</id>
> >      <username>muehlburger</username>
> > </revision>
> > </page>
> > <page>
> >    <id>2</id>
> > <revision>
> >      <id>343434</id>
> >      <username>muehlburger</username>
> > </revision>
> > <revision>
> >      <id>25343232</id>
> >      <username>muehlburger</username>
> > </revision>
> > <revision>
> >      <id>43434333</id>
> >      <username>user2</username>
> > </revision>
> > ...
> > <revision>
> >      <id>5409589854</id>
> >      <username>user5</username>
> > </revision>
> > </page>
> > ...
> >
> > I would like to produce the following kind of csv output:
> >
> > page_id revision_id username
> > 1 1 muehlburger
> > 1 2 muehlburger
> > 1 3 user1
> > 1 34334398 muehlburger
> > 2 343434 muehlburger
> > 2 25343232 muehlburger
> > 2 43434333 user2
> > 2 5409589854 user5
> >
> > How can I accomplish this using Pig?
> >
> > Thank you very much for your help!
> >
> > Kind regards,
> > Herbert
> > --
> > =================================================================
> > Herbert Muehlburger  Software Development and Business Management
> >                                      Graz University of Technology
> > www.muehlburger.at                   www.twitter.com/hmuehlburger
> > =================================================================
> >
>
