Slicers are deprecated -- Pig now uses Hadoop InputFormats directly; you can
read up on what those entail in the Hadoop documentation and books.

As far as dealing with partial records at the beginning and end of the
slice, the normal pattern is to always read a full record even if it takes
you past the configured range, and to ignore any partial records at the
beginning of a slice (because the previous slice will pick them up as part
of its read). So if I were to represent records as letters, and slice
boundaries as dots, something like this:

aaabbb.bbccccdd.ddeee.eeee

would be read in as follows:

Slice 1: aaabbbbb
Slice 2: (skips bb) ccccdddd
Slice 3: (skips dd) eeeeeee
Slice 4: (skips eeee) -- nothing --
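
In InputFormat terms, this pattern lives in your RecordReader. Here's a
minimal sketch against the new org.apache.hadoop.mapreduce API -- the
skipPartialRecord / readFullRecord hooks are hypothetical, format-specific
helpers you'd write yourself (for your XML, they'd scan forward to the next
<element> open tag and read through the matching close tag); only the
split/seek plumbing is real Hadoop API. Hadoop's own LineRecordReader does
exactly this for newline-delimited text.

import java.io.IOException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public abstract class BoundaryAwareRecordReader
    extends RecordReader<LongWritable, Text> {

  protected FSDataInputStream in;
  protected long start;  // first byte of this split
  protected long end;    // first byte past this split
  protected long pos;    // current position in the file

  protected LongWritable key = new LongWritable();
  protected Text value = new Text();

  @Override
  public void initialize(InputSplit split, TaskAttemptContext context)
      throws IOException, InterruptedException {
    FileSplit fileSplit = (FileSplit) split;
    start = fileSplit.getStart();
    end = start + fileSplit.getLength();
    Path file = fileSplit.getPath();
    FileSystem fs = file.getFileSystem(context.getConfiguration());
    in = fs.open(file);
    in.seek(start);
    pos = start;
    if (start != 0) {
      // Not the first split: any record straddling our start boundary
      // belongs to the previous split (its reader reads past its own end),
      // so skip forward to the next record boundary.
      pos += skipPartialRecord(in);
    }
  }

  @Override
  public boolean nextKeyValue() throws IOException, InterruptedException {
    // Only *start* a record while inside the split; a record that begins
    // before 'end' is read to completion even if that runs past 'end'.
    if (pos >= end) {
      return false;
    }
    key.set(pos);
    long consumed = readFullRecord(in, value);
    if (consumed <= 0) {
      return false;  // end of file
    }
    pos += consumed;
    return true;
  }

  /** Hypothetical hook: skip the partial record at the head of the split;
      returns the number of bytes skipped. */
  protected abstract long skipPartialRecord(FSDataInputStream in)
      throws IOException;

  /** Hypothetical hook: read one complete record into 'value'; returns
      bytes consumed, or -1 at end of file. */
  protected abstract long readFullRecord(FSDataInputStream in, Text value)
      throws IOException;

  @Override public LongWritable getCurrentKey() { return key; }
  @Override public Text getCurrentValue() { return value; }
  @Override public float getProgress() {
    return end == start ? 0f
        : Math.min(1f, (pos - start) / (float) (end - start));
  }
  @Override public void close() throws IOException {
    if (in != null) in.close();
  }
}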

-D

On Tue, Mar 1, 2011 at 12:45 PM, Lai Will <l...@student.ethz.ch> wrote:

> Hello,
>
> The data I want to process is XML. It boils down to
>
> <element>
>                ...
> </element>
> <element>
>                ...
> </element>
>
> According to what I read in the documentation, when loading the file using
> the default Slicer, I end up with block-sized chunks that will very likely
> contain partial <element>s at the beginning and at the end. I don't want to
> ignore those.
> I want to slice at the element boundaries and have reasonably sized
> chunks (e.g. the largest chunk that is smaller than the block size and
> contains only whole <element>s).
>
> Unfortunately the user documentation is not very helpful to me, so can
> anyone help me on that?
>
> I found an XMLLoader in the Piggybank, but that does not solve my issue
> with slicing.
>
> Best,
> Will
>
