Richard, if you're coming to OSCON or Hadoop Summit, please let me know so I
can buy you a beer. Thanks for the help. This now works with the excite log
using PigStorage().
However, it is still not working with my custom LoadFunc and data. For
reference, I am using Pig 0.8. I have written a custom LoadFunc for Apache
Nutch segments that reads in each crawled page and represents it as a
Tuple of (Url, ContentType, PageContent), as shown in the script below:
webcrawl = load 'crawled/segments/20110404124435/content/part-00000/data'
using com.hp.demo.SegmentLoader() AS (url:chararray, type:chararray,
content:chararray);
companies = FILTER webcrawl BY (INDEXOF(url,'comp') >= 0);
dump companies;
This keeps failing with ERROR 1071: Cannot convert a
generic_writablecomparable to a String. However, if I change the script to
the following (removing the schema types and dumping straight after the
load), it works:
webcrawl = load 'crawled/segments/20110404124435/content/part-00000/data'
using com.hp.demo.SegmentLoader() AS (url, type, content);
dump webcrawl;
Clearly, as soon as I inject types into the load schema, it starts bombing.
Can anyone tell me what I am doing wrong? I have attached my Nutch LoadFunc
below for reference:
import java.io.IOException;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileRecordReader;
import org.apache.nutch.protocol.Content;
import org.apache.pig.FileInputLoadFunc;
import org.apache.pig.backend.executionengine.ExecException;
import org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigSplit;
import org.apache.pig.data.DataByteArray;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

public class SegmentLoader extends FileInputLoadFunc {

    protected static final Log LOG = LogFactory.getLog(SegmentLoader.class);

    private SequenceFileRecordReader<WritableComparable, Content> reader;

    @Override
    public void setLocation(String location, Job job) throws IOException {
        FileInputFormat.setInputPaths(job, location);
    }

    @SuppressWarnings("unchecked")
    @Override
    public InputFormat getInputFormat() throws IOException {
        return new SequenceFileInputFormat<WritableComparable, Content>();
    }

    @SuppressWarnings("unchecked")
    @Override
    public void prepareToRead(RecordReader reader, PigSplit split)
            throws IOException {
        this.reader = (SequenceFileRecordReader) reader;
    }

    @Override
    public Tuple getNext() throws IOException {
        try {
            if (!reader.nextKeyValue()) {
                return null; // end of input
            }
            Content value = (Content) reader.getCurrentValue();
            String url = value.getUrl();
            String type = value.getContentType();
            String content = value.getContent().toString();

            // Emit the three fields as bytearrays: (url, type, content)
            Tuple tuple = TupleFactory.getInstance().newTuple(3);
            tuple.set(0, new DataByteArray(url));
            tuple.set(1, new DataByteArray(type));
            tuple.set(2, new DataByteArray(content));
            return tuple;
        } catch (InterruptedException e) {
            throw new ExecException(e);
        }
    }
}
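As an aside, for anyone who hits the same filter problem further down this
thread: Pig's INDEXOF follows java.lang.String.indexOf semantics, returning
-1 when the substring is absent and 0 when it matches at the very start of
the string, which is why the filters compare with >= 0 rather than > 0. A
quick standalone plain-Java illustration (no Pig needed):

```java
// Plain java.lang.String.indexOf, which Pig's INDEXOF mirrors:
// -1 means "not found", 0 means "match at the very start" --
// so a filter using "> 0" silently drops matches at position 0.
public class IndexOfDemo {
    public static void main(String[] args) {
        String q1 = "yahoo mail login"; // match at position 0
        String q2 = "www.yahoo.com";    // match at position 4
        String q3 = "google maps";      // no match

        System.out.println(q1.indexOf("yahoo")); // 0  -> dropped by "> 0"
        System.out.println(q2.indexOf("yahoo")); // 4  -> kept either way
        System.out.println(q3.indexOf("yahoo")); // -1 -> dropped either way
    }
}
```

With "> 0", any query that begins with "yahoo" would be filtered out even
though it matches, so ">= 0" is the correct test for "contains".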
On Fri, Apr 22, 2011 at 5:17 PM, Richard Ding <[email protected]> wrote:
> raw = LOAD 'tutorial/excite.log' USING PigStorage('\t') AS (user, time,
> query:chararray);
>
> queries = FILTER raw BY (INDEXOF(query,'yahoo') >= 0);
> dump queries;
>
>
> On 4/22/11 2:25 PM, "Steve Watt" <[email protected]> wrote:
>
> Hi Folks
>
> I've done a load of a dataset and I am attempting to filter out unwanted
> records by checking that one of my tuple fields contains a particular
> string. I've distilled this issue down to the sample excite.log that ships
> with Pig for easy recreation. I've read through the INDEXOF code and I
> think
> this should work (lots of queries that contain the word yahoo) but my
> queries dump always contains zero records. Can anyone tell me what I am
> doing wrong?
>
> raw = LOAD 'tutorial/excite.log' USING PigStorage('\t') AS (user, time,
> query);
> queries = FILTER raw BY (INDEXOF(query,'yahoo') > 0);
> dump queries;
>
> Regards
> Steve Watt
>
>