OK, thanks. I'll definitely keep that in mind. I'm still using Pig 0.6.0 
because the old code base I'm working on uses it.

Ah, so does the behavior you mentioned also apply to 0.6.0? Then I would have 
no issue here :)

I was just thinking that, since different chunks get processed in different 
tasks that are shipped to different machines, reading past the end of a chunk 
would not be possible, as the next record might not be available on that 
machine. (So is there actually an extra DFS access performed in order to read 
past the chunk boundary?)

Will


----- Reply message -----
From: "Dmitriy Ryaboy" <[email protected]>
Date: Tue, Mar 1, 2011 22:05
Subject: Custom Slicer
To: "[email protected]" <[email protected]>
Cc: "Lai Will" <[email protected]>

Slicers are deprecated -- Pig now uses Hadoop InputFormats directly; you can 
read up on what those entail in the Hadoop documentation and books.

As far as dealing with partial records at the beginning and end of the slice, 
the normal pattern is to always read a full record even if it takes you past 
the configured range, and to ignore any partial record at the beginning of a 
slice (because the previous slice will pick it up as part of its read). So if 
I were to represent records as letters, and slice boundaries as dots, 
something like this:

aaabbb.bbccccdd.ddeee.eeee

Would be read in as follows:

Slice 1: aaabbbbb
Slice 2: (skips bb) ccccdddd
Slice 3: (skips dd) eeeeeee
Slice 4: (skips eeee) -- nothing --
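
The same pattern as a minimal Python sketch, just to make the rule concrete. 
This is a toy model, not Pig or Hadoop code: read_slice and read_all are 
made-up names, a "record" is assumed to be a run of identical letters, and the 
reader is assumed to be able to look one character back to tell whether a 
slice starts in the middle of a record.

```python
def read_slice(data, start, end):
    """Return the records owned by the slice [start, end) of data."""
    pos = start
    # Skip a partial record at the start of the slice: the previous
    # slice reads it in full. (Assumption: we can peek one char back.)
    if pos > 0:
        while pos < len(data) and data[pos] == data[pos - 1]:
            pos += 1
    records = []
    # A record belongs to this slice if it STARTS before `end`;
    # it is always read to completion, even past `end`.
    while pos < end:
        stop = pos
        while stop < len(data) and data[stop] == data[pos]:
            stop += 1
        records.append(data[pos:stop])
        pos = stop
    return records


def read_all(text):
    """Treat dots in text as slice boundaries and read every slice."""
    data = text.replace(".", "")
    results = []
    start = 0
    for part in text.split("."):
        end = start + len(part)
        results.append(read_slice(data, start, end))
        start = end
    return results


if __name__ == "__main__":
    slices = read_all("aaabbb.bbccccdd.ddeee.eeee")
    for i, recs in enumerate(slices, 1):
        print("Slice %d: %s" % (i, "".join(recs) or "-- nothing --"))
```

Every record ends up in exactly one slice, namely the one in which it starts, 
so nothing is dropped or read twice even though the boundaries cut through 
records.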

-D

On Tue, Mar 1, 2011 at 12:45 PM, Lai Will <[email protected]> wrote:
Hello,

The data I want to process is XML. It boils down to

<element>
               ...
</element>
<element>
               ...
</element>

According to what I read in the documentation, when loading the file using the 
default Slicer, I end up with block-sized chunks that will very likely contain 
partial <element>s at the beginning and at the end. I don't want to ignore 
those.
I want to slice at the element boundaries and have reasonably sized chunks 
(e.g. the largest chunk that is smaller than the block size and that contains 
only whole <element>s).

Unfortunately, the user documentation is not very helpful to me, so can anyone 
help me with that?

I found an XMLLoader in the Piggybank, but that does not solve my issue with 
slicing.

Best,
Will
