If the expected return type of your loader is (String, String, String), you should put plain Strings into the tuple (no conversion to DataByteArray) and report your schema to Pig via an implementation of LoadMetadata.getSchema().
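A minimal sketch of that suggestion, assuming the Pig 0.8 loader API (helper names such as Utils.getSchemaFromString are from memory here, so check them against your Pig version; the elided parts of getNext() stay as in the original loader):

```java
// Sketch only: return plain Strings from getNext() and advertise the
// schema to Pig by also implementing LoadMetadata.
public class SegmentLoader extends FileInputLoadFunc implements LoadMetadata {

    @Override
    public Tuple getNext() throws IOException {
        // ... read the next key/value as before ...
        Tuple tuple = TupleFactory.getInstance().newTuple(3);
        tuple.set(0, url);      // plain String, no DataByteArray wrapper
        tuple.set(1, type);
        tuple.set(2, content);
        return tuple;
    }

    @Override
    public ResourceSchema getSchema(String location, Job job) throws IOException {
        // Tell Pig the three fields really are chararrays, so no
        // bytearray-to-chararray cast is attempted at runtime.
        return new ResourceSchema(Utils.getSchemaFromString(
                "url:chararray, type:chararray, content:chararray"));
    }

    // For a simple loader, getStatistics() and getPartitionKeys() can
    // return null, and setPartitionFilter() can be a no-op.
}
```

With the schema reported by the loader itself, the AS clause in the script no longer needs to re-declare the types.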
D

On Fri, Apr 22, 2011 at 5:30 PM, Steve Watt <[email protected]> wrote:
> Richard, if you're coming to OSCON or Hadoop Summit, please let me know
> so I can buy you a beer. Thanks for the help. This now works with the
> excite log using PigStorage().
>
> It is, however, still not working with my custom LoadFunc and data. For
> reference, I am using Pig 0.8. I have written a custom LoadFunc for
> Apache Nutch segments that reads in each page that is crawled and
> represents it as a tuple of (Url, ContentType, PageContent), as shown in
> the script below:
>
> webcrawl = load 'crawled/segments/20110404124435/content/part-00000/data'
>     using com.hp.demo.SegmentLoader() AS (url:chararray, type:chararray,
>     content:chararray);
> companies = FILTER webcrawl BY (INDEXOF(url, 'comp') >= 0);
> dump companies;
>
> This keeps failing with ERROR 1071: Cannot convert a
> generic_writablecomparable to a String. However, if I change the script
> to the following (remove the schema types and dump straight after the
> load), it works:
>
> webcrawl = load 'crawled/segments/20110404124435/content/part-00000/data'
>     using com.hp.demo.SegmentLoader() AS (url, type, content);
> dump webcrawl;
>
> Clearly, as soon as I inject types into the load schema it starts
> bombing. Can anyone tell me what I am doing wrong?
> I have attached my Nutch LoadFunc below for reference:
>
> public class SegmentLoader extends FileInputLoadFunc {
>
>     private SequenceFileRecordReader<WritableComparable, Content> reader;
>     protected static final Log LOG = LogFactory.getLog(SegmentLoader.class);
>
>     @Override
>     public void setLocation(String location, Job job) throws IOException {
>         FileInputFormat.setInputPaths(job, location);
>     }
>
>     @SuppressWarnings("unchecked")
>     @Override
>     public InputFormat getInputFormat() throws IOException {
>         return new SequenceFileInputFormat<WritableComparable, Content>();
>     }
>
>     @SuppressWarnings("unchecked")
>     @Override
>     public void prepareToRead(RecordReader reader, PigSplit split)
>             throws IOException {
>         this.reader = (SequenceFileRecordReader) reader;
>     }
>
>     @Override
>     public Tuple getNext() throws IOException {
>         try {
>             if (!reader.nextKeyValue()) {
>                 return null;
>             }
>             Content value = (Content) reader.getCurrentValue();
>             String url = value.getUrl();
>             String type = value.getContentType();
>             String content = value.getContent().toString();
>             Tuple tuple = TupleFactory.getInstance().newTuple(3);
>             tuple.set(0, new DataByteArray(url));
>             tuple.set(1, new DataByteArray(type));
>             tuple.set(2, new DataByteArray(content));
>             return tuple;
>         } catch (InterruptedException e) {
>             throw new ExecException(e);
>         }
>     }
> }
>
> On Fri, Apr 22, 2011 at 5:17 PM, Richard Ding <[email protected]> wrote:
>
> > raw = LOAD 'tutorial/excite.log' USING PigStorage('\t') AS (user, time,
> >     query:chararray);
> >
> > queries = FILTER raw BY (INDEXOF(query, 'yahoo') >= 0);
> > dump queries;
> >
> >
> > On 4/22/11 2:25 PM, "Steve Watt" <[email protected]> wrote:
> >
> > Hi Folks
> >
> > I've done a load of a dataset and I am attempting to filter out
> > unwanted records by checking that one of my tuple fields contains a
> > particular string. I've distilled this issue down to the sample
> > excite.log that ships with Pig for easy recreation.
> > I've read through the INDEXOF code and I think this should work (lots
> > of queries contain the word yahoo), but my queries dump always
> > contains zero records. Can anyone tell me what I am doing wrong?
> >
> > raw = LOAD 'tutorial/excite.log' USING PigStorage('\t') AS (user, time,
> >     query);
> > queries = FILTER raw BY (INDEXOF(query, 'yahoo') > 0);
> > dump queries;
> >
> > Regards
> > Steve Watt
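Note that Richard's corrected script changes two things: it types the field as chararray (an untyped field is a bytearray, on which INDEXOF returns null), and it uses `>= 0` instead of `> 0`. The second change matters because INDEXOF, like Java's String.indexOf underneath it, returns the 0-based position of the match, so `> 0` silently drops any query where "yahoo" is the very first token. Plain Java shows the boundary case (the sample queries are made up):

```java
public class IndexOfDemo {
    public static void main(String[] args) {
        String atStart  = "yahoo mail login";   // match at position 0
        String inMiddle = "mail yahoo login";   // match at position 5
        String absent   = "google maps";        // no match

        // indexOf returns the 0-based position, or -1 when not found.
        System.out.println(atStart.indexOf("yahoo"));   // 0
        System.out.println(inMiddle.indexOf("yahoo"));  // 5
        System.out.println(absent.indexOf("yahoo"));    // -1

        // A filter of INDEXOF(query, 'yahoo') > 0 would therefore drop
        // the first query even though it contains "yahoo"; >= 0 keeps it.
    }
}
```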
