Which Pig version are you using? If you are using Pig 0.7/0.8, line parsing is handled by Hadoop's TextInputFormat, which splits the input on newlines, so you need to override that behavior. Derive a new InputFormat from TextInputFormat that preserves newline characters, and return it from your LoadFunc's getInputFormat(). It is not trivial, but it is doable.
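
To illustrate the idea only, here is a minimal plain-Java sketch of record splitting that ends a record at a closing tag instead of at each newline, so newlines inside a record are kept. It deliberately uses no Hadoop or Pig classes. The class name RecordSplitSketch, the method splitRecords and the choice of "</file>" as the delimiter are assumptions made for this example. In a real loader, logic along these lines would have to live in the RecordReader of the custom InputFormat that getInputFormat() returns.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not Pig or Hadoop API: shows delimiter-based record
// splitting that keeps the newlines found inside each record.
public class RecordSplitSketch {

    // Cuts the input after every occurrence of endTag. Whitespace between
    // records is trimmed; newlines inside a record are left untouched.
    // Any trailing text that has no endTag is dropped.
    public static List<String> splitRecords(String input, String endTag) {
        List<String> records = new ArrayList<>();
        int from = 0;
        while (true) {
            int end = input.indexOf(endTag, from);
            if (end < 0) {
                break; // no further complete record
            }
            int stop = end + endTag.length();
            records.add(input.substring(from, stop).trim());
            from = stop;
        }
        return records;
    }

    public static void main(String[] args) {
        String xml = "<file><text>line one\nline two</text></file>\n"
                   + "<file><text>x</text></file>\n";
        List<String> records = splitRecords(xml, "</file>");
        System.out.println("records: " + records.size());
        System.out.println("first record keeps its newline: "
                + records.get(0).contains("\n"));
    }
}
```

A line-based reader would hand the loader "line one" and "line two" as two separate records. Splitting on the closing tag keeps them together, which is the behavior the custom InputFormat needs to provide.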

Daniel

On 03/22/2011 08:27 AM, Lai Will wrote:
Hello,

I'm currently encountering the following problem.

I have an XML file that gets loaded using a custom LoadFunc.

Boiled down, my XML file could look like:
<files>
<file>
<id>
                 1
                 </id>
                 <text>
                                 This is a sample text that contains newlines,
which should be preserved when parsing.
                 </text>
</file>
<file>  ...</file>
<file>  ...</file>
...
</files>

So the text does contain a newline (\r\n or \n does not matter).
When parsing the XML I parse the contents of <text/> into a string and add it 
to the list that should be returned by the LoadFunc.

The problem now is that whenever I dump, store or use the intermediate result 
in another UDF e.g. with

raw = LOAD 'data/files.xml' using org.my.MyCustomXMLLoader() AS ( id:int , 
text: chararray);
dump raw;

or

raw = LOAD 'data/files.xml' using org.my.MyCustomXMLLoader() AS ( id:int , 
text: chararray);
clean = FOREACH raw GENERATE id, org.my.MyCleaner(text) as clean_text;

The newlines are completely stripped away:

1              This is a sample text that contains newlines,which should be 
preserved when parsing.

Or, in the latter example, this causes MyCleaner() to fail.

How can I preserve the newline in Pig?

Best,
Will



