Slicers are deprecated -- Pig now uses Hadoop InputFormats directly; you can read up on what those entail in the Hadoop documentation and books.
As far as dealing with partial records at the beginning and end of the slice, the normal pattern is to always read a full record even if doing so takes you past the configured range, and to skip any partial record at the beginning of a slice (because the previous slice will have picked it up as part of its read).

So if I were to represent records as letters, and slice boundaries as dots, something like this:

aaabbb.bbccccdd.ddeee.eeee

would be read in as follows:

Slice 1: aaabbbbb
Slice 2: (skips bb) ccccdddd
Slice 3: (skips dd) eeeeeee
Slice 4: (skips eeee) -- nothing --

-D

On Tue, Mar 1, 2011 at 12:45 PM, Lai Will <l...@student.ethz.ch> wrote:
> Hello,
>
> The data I want to process is XML. It boils down to
>
> <element>
> ...
> </element>
> <element>
> ...
> </element>
>
> According to what I read in the documentation, when loading the file using
> the default Slicer I end up with block-sized chunks that will very likely
> contain partial <element>s at the beginning and at the end. I don't want to
> ignore those.
> I want to slice at the element boundaries, and have reasonably sized
> chunks (e.g. the largest chunk that is smaller than the block size and
> contains only whole <element>s).
>
> Unfortunately the user documentation is not very helpful to me, so can
> anyone help me with that?
>
> I found an XMLLoader in the Piggybank, but that does not solve my issue
> with slicing.
>
> Best,
> Will
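
The skip-partial / read-past-the-end pattern above can be sketched as a small self-contained simulation (no Hadoop dependencies; the record format -- runs of a single letter -- and the split offsets are just the toy example above, not a real InputFormat):

```java
import java.util.ArrayList;
import java.util.List;

// Toy simulation of how a record reader handles split boundaries:
// skip the tail of a record begun in the previous split (unless the
// split starts at offset 0), and read past the end of the split to
// finish the last record that starts inside it.
public class SplitReaderDemo {

    // A "record" here is a maximal run of the same letter.
    static List<String> readSplit(String data, int start, int end) {
        List<String> records = new ArrayList<>();
        int pos = start;
        // Skip a partial record: the previous split already read it.
        if (start > 0 && data.charAt(start) == data.charAt(start - 1)) {
            char partial = data.charAt(start);
            while (pos < data.length() && data.charAt(pos) == partial) pos++;
        }
        // Read every record whose first byte lies inside [start, end),
        // even if the record itself extends past `end`.
        while (pos < end && pos < data.length()) {
            char c = data.charAt(pos);
            int recStart = pos;
            while (pos < data.length() && data.charAt(pos) == c) pos++;
            records.add(data.substring(recStart, pos));
        }
        return records;
    }

    public static void main(String[] args) {
        String data = "aaabbbbbccccddddeeeeeee";
        // Split boundaries at the dot positions of aaabbb.bbccccdd.ddeee.eeee
        int[] bounds = {0, 6, 14, 19, data.length()};
        for (int i = 0; i + 1 < bounds.length; i++) {
            System.out.println("Slice " + (i + 1) + ": "
                    + readSplit(data, bounds[i], bounds[i + 1]));
        }
    }
}
```

Running this prints [aaa, bbbbb], [cccc, dddd], [eeeeeee], and [] for the four slices, matching the breakdown above. A real XML RecordReader would do the analogous thing: from the split's start offset, scan forward to the next `<element>` open tag, and keep reading until the `</element>` that closes the last element begun before the split's end offset.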