I'm using 0.8.0.

You're right. I now see that I'm doing something odd:

I'm using TextInputFormat to read my XML file line by line and construct the 
object that holds the XML element.
Shouldn't the XML element <file/>, which spans several lines, be my 
record rather than a single line?

Will

-----Original Message-----
From: Daniel Dai [mailto:[email protected]] 
Sent: Tuesday, March 22, 2011 6:24 PM
To: [email protected]
Subject: Re: Preserve newlines in field

Which Pig version are you using? If you are using Pig 0.7/0.8, line parsing is 
handled by Hadoop's TextInputFormat, so you need to override its behavior. 
Derive a new TextInputFormat that preserves newline characters and return it 
from your LoadFunc's getInputFormat(). 
It is not trivial, but it is doable.
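
As a rough illustration of the record boundary such an input format would need, here is a minimal sketch in plain Java that treats each <file>...</file> element as one record and keeps the line breaks inside it. The class name XmlElementRecords and the method readRecords are hypothetical, not part of Pig or Hadoop. It assumes the element is never nested inside itself and that the opening tag carries no attributes. A real RecordReader would stream the input and handle split boundaries instead of buffering everything in memory.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: shows only the record-boundary logic, not a Hadoop InputFormat.
public class XmlElementRecords {

    /**
     * Returns every <tag>...</tag> element found in the input as one record.
     * Characters between the tags, including \n and \r\n, are kept unchanged.
     * Assumes no nesting of the element and no attributes on the opening tag.
     */
    public static List<String> readRecords(Reader in, String tag) throws IOException {
        // Buffer the whole input; acceptable for a sketch, not for large files.
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[4096];
        int n;
        while ((n = in.read(buf)) != -1) {
            sb.append(buf, 0, n);
        }

        String open = "<" + tag + ">";
        String close = "</" + tag + ">";
        List<String> records = new ArrayList<>();
        int pos = 0;
        while (true) {
            int start = sb.indexOf(open, pos);
            if (start < 0) {
                break; // no further elements
            }
            int end = sb.indexOf(close, start + open.length());
            if (end < 0) {
                break; // unterminated element: ignored in this sketch
            }
            end += close.length();
            records.add(sb.substring(start, end));
            pos = end;
        }
        return records;
    }

    public static void main(String[] args) throws IOException {
        String xml = "<files>\n<file>\n<id>\n1\n</id>\n<text>\nfirst line\nsecond line\n"
                + "</text>\n</file>\n</files>\n";
        for (String record : readRecords(new StringReader(xml), "file")) {
            // Show the newlines explicitly so it is visible that they survived.
            System.out.println("record: " + record.replace("\n", "\\n"));
        }
    }
}
```

With records delimited this way, the LoadFunc receives the whole element at once and can parse <id> and <text> from it without TextInputFormat having split the text at each newline first.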

Daniel

On 03/22/2011 08:27 AM, Lai Will wrote:
> Hello,
>
> I'm currently encountering the following problem.
>
> I have an XML file that gets loaded using a custom LoadFunc.
>
> Boiled down, my XML file could look like:
> <files>
> <file>
> <id>
>                  1
>                  </id>
>                  <text>
>                                  This is a sample text that contains 
> newlines, which should be preserved when parsing.
>                  </text>
> </file>
> <file>  ...</file>
> <file>  ...</file>
> ...
> </files>
>
> So the text does contain a newline (\r\n or \n does not matter).
> When parsing the XML I parse the contents of <text/> into a string and add it 
> to the list that should be returned by the LoadFunc.
>
> The problem now is that whenever I dump, store or use the intermediate 
> result in another UDF e.g. with
>
> raw = LOAD 'data/files.xml' using org.my.MyCustomXMLLoader() AS (id:int, text:chararray);
> dump raw;
>
> or
>
> raw = LOAD 'data/files.xml' using org.my.MyCustomXMLLoader() AS (id:int, text:chararray);
> clean = FOREACH raw GENERATE id, org.my.MyCleaner(text) as clean_text;
>
> The newlines are completely stripped away:
>
> 1              This is a sample text that contains newlines,which should be 
> preserved when parsing.
>
> Or, in the latter example, this causes MyCleaner() to fail.
>
> How can I preserve the newline in Pig?
>
> Best,
> Will
>
>
>
