Which Pig version are you using? In Pig 0.7/0.8, line parsing is
handled by Hadoop's TextInputFormat, which splits the input into
records at newline characters. To keep the newlines you need to
override that behavior: derive a new TextInputFormat that preserves
newline characters and return it from your LoadFunc's
getInputFormat(). It is not trivial, but it is doable.
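As a rough illustration of the record-grouping part only, here is a
minimal sketch in plain Java with no Hadoop or Pig dependency. It
assumes each record is delimited by <file> and </file> tags that sit on
line boundaries, as in your sample. The class and method names are
hypothetical. In a real loader, logic along these lines would have to
live in the record reader supplied by your custom input format, and
input splits that cut through a record would also need handling, which
this sketch ignores.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: groups input lines into one record per <file> element. */
public class MultiLineRecordSketch {

    // Assumed record delimiters, taken from the sample XML in the question.
    private static final String START_TAG = "<file>";
    private static final String END_TAG = "</file>";

    /**
     * Reads the input line by line and returns one string per <file>
     * element, keeping the line breaks inside the element as '\n'.
     */
    public static List<String> readRecords(Reader in) throws IOException {
        List<String> records = new ArrayList<>();
        BufferedReader reader = new BufferedReader(in);
        StringBuilder current = null; // non-null only while inside a <file> element
        String line;
        while ((line = reader.readLine()) != null) {
            if (current == null && line.contains(START_TAG)) {
                current = new StringBuilder();
            }
            if (current != null) {
                // readLine() drops the line terminator, so put a newline back.
                current.append(line).append('\n');
                if (line.contains(END_TAG)) {
                    records.add(current.toString());
                    current = null;
                }
            }
        }
        return records;
    }

    public static void main(String[] args) throws IOException {
        String xml = "<files>\n<file>\n<text>\nline one,\nline two.\n</text>\n</file>\n</files>\n";
        for (String record : readRecords(new StringReader(xml))) {
            System.out.print(record); // each record still spans several lines
        }
    }
}
```

Each returned string holds a whole <file> element with its internal
newlines, so the XML parsing in the loader would see the text exactly
as it appears in the file.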
Daniel
On 03/22/2011 08:27 AM, Lai Will wrote:
Hello,
I'm currently encountering the following problem.
I have an XML file that gets loaded using a custom LoadFunc.
Boiled down, my XML file could look like this:
<files>
<file>
<id>
1
</id>
<text>
This is a sample text that contains newlines,
which should be preserved when parsing.
</text>
</file>
<file> ...</file>
<file> ...</file>
...
</files>
So the text does contain a newline (\r\n or \n does not matter).
When parsing the XML I parse the contents of <text/> into a string and add it
to the list that should be returned by the LoadFunc.
The problem now is that whenever I dump, store or use the intermediate result
in another UDF e.g. with
raw = LOAD 'data/files.xml' using org.my.MyCustomXMLLoader() AS ( id:int ,
text: chararray);
dump raw;
or
raw = LOAD 'data/files.xml' using org.my.MyCustomXMLLoader() AS ( id:int ,
text: chararray);
clean = FOREACH raw GENERATE id, org.my.MyCleaner(text) as clean_text;
The newlines are completely stripped away:
1 This is a sample text that contains newlines,which should be
preserved when parsing.
Or, in the latter example, this causes MyCleaner() to fail.
How can I preserve the newline in Pig?
Best,
Will