Hi, these symbols are part of Java's regular-expression syntax, which is documented for the java.util.regex.Pattern class: http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html
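
In short: [^<] is a negated character class (any single character except '<'), * means "zero or more", and the parentheses make it a capturing group, so ([^<]*) captures the text up to the next tag. \n matches a newline and \s* matches zero or more whitespace characters, i.e. the line break and indentation between two tags. The backslashes are doubled because the pattern is written inside a string literal.

Here is a minimal sketch in plain Java that runs the same expression outside Pig. The class name RevisionRegexDemo, the extract helper and the SAMPLE input are made up for illustration, and it uses Matcher.find(), which searches anywhere in the input; that is an assumption of this sketch and may not be how the RegexExtractAll UDF applies the pattern.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RevisionRegexDemo {
    // Same expression as in the Pig script. Each backslash is doubled in the
    // string literal, so "\\n\\s*" reaches the regex engine as \n\s*.
    public static final Pattern PAGE = Pattern.compile(
        "<id>([^<]*)</id>\\n\\s*<revision>\\n\\s*<id>([^>]*)</id>\\n\\s*"
        + "<username>([^>]*)</username>\\n\\s*</revision>");

    // Hypothetical sample: one page with a single revision, indented with spaces.
    public static final String SAMPLE =
        "<page>\n"
        + "  <id>1</id>\n"
        + "  <revision>\n"
        + "    <id>2</id>\n"
        + "    <username>muehlburger</username>\n"
        + "  </revision>\n"
        + "</page>";

    // Returns {page id, revision id, username}, or null if the pattern is not found.
    public static String[] extract(String xml) {
        Matcher m = PAGE.matcher(xml);
        if (!m.find()) {
            return null;
        }
        // group(1) is the first pair of parentheses, group(2) the second, and so on.
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        String[] fields = extract(SAMPLE);
        System.out.println(fields == null ? "no match" : String.join("\t", fields));
    }
}
```

Note that the second and third groups use [^>]* instead of [^<]*. That still captures only the text between the tags here, because the closing tag must follow the group, so the engine backtracks to just before it.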

2012/5/18 krishnan N <[email protected]>

> Hi,
> Thanks so much. It worked for me, but can you please explain the ([^<]*)
> and \\n\\s* parts, symbol by symbol, in the expression below?
>
> (RegexExtractAll(revision,'<id>([^<]*)</id>\\n\\s*<revision>\\n\\s*<id>([^>]*)</id>\\n\\s*<username>([^>]*)</username>\\n\\s*</revision>'))
>
> Thanks
> Krishnan
>
> On Thu, May 17, 2012 at 7:08 AM, Francisco Javier Gonzalez Garcia <
> [email protected]> wrote:
>
> > This is an example that extracts one revision per page (handling
> > several revisions is more complex, but possible):
> >
> > REGISTER /usr/lib/pig/contrib/piggybank/java/piggybank.jar;
> > DEFINE XMLLoader org.apache.pig.piggybank.storage.XMLLoader();
> > DEFINE RegexExtractAll
> >     org.apache.pig.piggybank.evaluation.string.RegexExtractAll();
> >
> > revisionXML = LOAD 'Revision.xml' USING XMLLoader('page') AS
> >     (revision:chararray);
> >
> > rev = FOREACH revisionXML GENERATE FLATTEN
> >     (RegexExtractAll(revision,'<id>([^<]*)</id>\\n\\s*<revision>\\n\\s*<id>([^>]*)</id>\\n\\s*<username>([^>]*)</username>\\n\\s*</revision>'))
> >     AS
> >     (
> >         page: chararray,
> >         id_revision: chararray,
> >         username: chararray
> >     );
> >
> > dump rev;
> >
> > 2012/5/17, Herbert Mühlburger <[email protected]>:
> > > Hi list,
> > >
> > > I would like to parse the following XML file using Pig:
> > >
> > > <page>
> > >   <id>1</id>
> > >   <revision>
> > >     <id>1</id>
> > >     <username>muehlburger</username>
> > >   </revision>
> > >   <revision>
> > >     <id>2</id>
> > >     <username>muehlburger</username>
> > >   </revision>
> > >   <revision>
> > >     <id>3</id>
> > >     <username>user1</username>
> > >   </revision>
> > >   ...
> > >   <revision>
> > >     <id>34334398</id>
> > >     <username>muehlburger</username>
> > >   </revision>
> > > </page>
> > > <page>
> > >   <id>2</id>
> > >   <revision>
> > >     <id>343434</id>
> > >     <username>muehlburger</username>
> > >   </revision>
> > >   <revision>
> > >     <id>25343232</id>
> > >     <username>muehlburger</username>
> > >   </revision>
> > >   <revision>
> > >     <id>43434333</id>
> > >     <username>user2</username>
> > >   </revision>
> > >   ...
> > >   <revision>
> > >     <id>5409589854</id>
> > >     <username>user5</username>
> > >   </revision>
> > > </page>
> > > ...
> > >
> > > I would like to produce the following kind of CSV output:
> > >
> > > page_id  revision_id  username
> > > 1        1            muehlburger
> > > 1        2            muehlburger
> > > 1        3            user1
> > > 1        34334398     muehlburger
> > > 2        343434       muehlburger
> > > 2        25343232     muehlburger
> > > 2        43434333     user2
> > > 2        5409589854   user5
> > >
> > > How can I accomplish this using Pig?
> > >
> > > Thank you very much for your help!
> > >
> > > Kind regards,
> > > Herbert
> > > --
> > > =================================================================
> > > Herbert Muehlburger Software Development and Business Management
> > > Graz University of Technology
> > > www.muehlburger.at www.twitter.com/hmuehlburger
> > > =================================================================
