I'm using 0.8.0. You're right, now I see that I'm actually doing something weird:
I'm using TextInputFormat to read my XML file line by line and construct the object that holds the XML element. Shouldn't the XML element <file/>, which spans several lines, actually be my record, rather than a single line?

Will

-----Original Message-----
From: Daniel Dai [mailto:[email protected]]
Sent: Tuesday, March 22, 2011 6:24 PM
To: [email protected]
Subject: Re: Preserve newlines in field

Which Pig version are you using? If you are using Pig 0.7/0.8, line parsing is handled by Hadoop's TextInputFormat. You need to override the behavior of TextInputFormat in order to do that: derive a new TextInputFormat that preserves newline characters and return it from your LoadFunc's getInputFormat(). It is not trivial, but it is doable.

Daniel

On 03/22/2011 08:27 AM, Lai Will wrote:
> Hello,
>
> I'm currently encountering the following problem.
>
> I have an XML file that gets loaded using a custom LoadFunc.
>
> Boiled down, my XML file could look like:
> <files>
> <file>
> <id>
> 1
> </id>
> <text>
> This is a sample text that contains
> newlines, which should be preserved when parsing.
> </text>
> </file>
> <file> ...</file>
> <file> ...</file>
> ...
> </files>
>
> So the text does contain a newline (\r\n or \n does not matter).
> When parsing the XML I parse the contents of <text/> into a string and add it
> to the list that should be returned by the LoadFunc.
>
> The problem now is that whenever I dump, store or use the intermediate
> result in another UDF, e.g. with
>
> raw = LOAD 'data/files.xml' using org.my.MyCustomXMLLoader() AS (
> id:int , text: chararray);
> dump raw;
>
> or
>
> raw = LOAD 'data/files.xml' using org.my.MyCustomXMLLoader() AS (
> id:int , text: chararray);
> clean = FOREACH raw GENERATE id,
> org.my.MyCleaner(text) as clean_text;
>
> the newlines are completely stripped away:
>
> 1 This is a sample text that contains newlines,which should be
> preserved when parsing.
>
> Or, in the latter example, this causes MyCleaner() to fail.
>
> How can I preserve the newline in Pig?
>
> Best,
> Will
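
For illustration, here is a minimal sketch of the record-boundary idea discussed in this thread: treat each <file>...</file> element as one record, so the newlines inside it survive. It is plain Java with no Hadoop or Pig dependencies. The class name FileElementSplitter and its split() method are hypothetical and are not part of either library. In a real job this logic would presumably live in the RecordReader of a custom InputFormat returned from the LoadFunc's getInputFormat(), and it would also have to handle elements that straddle input split boundaries, which this sketch ignores.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Pig or Hadoop: cuts a document into one
// record per <file>...</file> element and keeps every character in between,
// including \n and \r\n.
public class FileElementSplitter {

    private static final String OPEN = "<file>";
    private static final String CLOSE = "</file>";

    public static List<String> split(String xml) {
        List<String> records = new ArrayList<>();
        int from = 0;
        while (true) {
            int start = xml.indexOf(OPEN, from);
            if (start < 0) {
                break; // no further <file> element
            }
            int end = xml.indexOf(CLOSE, start + OPEN.length());
            if (end < 0) {
                break; // unterminated element: ignored in this sketch
            }
            end += CLOSE.length();
            // The substring is copied verbatim, so embedded newlines are kept.
            records.add(xml.substring(start, end));
            from = end;
        }
        return records;
    }

    public static void main(String[] args) {
        String xml = "<files>\n<file>\n<id>\n1\n</id>\n<text>\nThis is a sample text that contains\n"
                + "newlines, which should be preserved when parsing.\n</text>\n</file>\n</files>\n";
        List<String> records = split(xml);
        System.out.println("records: " + records.size());
        System.out.println(records.get(0));
    }
}
```

Each returned string can then be handed to the XML parser as a whole, instead of parsing one physical line at a time.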
