Hi everybody,
I have a 4-node Nutch+Hadoop+Solr stack that indexes a bunch of external
websites of, say, house sale ads.
Everything worked fine while I used only the default Nutch IndexingFilter,
but then I needed some customization to improve the quality of the search
results. So I developed a set of plugins (one for each site I need to index)
that add some custom fields to the index (say, house price, location, name of
the seller, and so on), extracting those data from the HTML of the parsed page.
Again, everything ran smoothly as long as the structure of the parsed pages
stayed unchanged. Unfortunately, some of the sites I want to index have
recently been restyled, and my troubles started: now the crawling, fetching,
merging, etc. all seem to complete without errors, but when Nutch invokes
LinkDb (just before the solrindexer) to prepare the data to be put into the
Solr database, it throws a lot of EOFExceptions, the indexing job fails, and
no document is added to Solr, even if just one of the plugins fails.
My questions are: where could the problem be, and how can I avoid the complete
failure of the indexing job? The plugin that parses the modified site should
fail "cleanly" without affecting the whole process.
This is the code of the indexing part of the plugin:
package it.company.searchengine.nutch.plugin.indexer.html.company;

import it.company.searchengine.nutch.plugin.parser.html.company.SiteURL1Parser;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.log4j.Logger;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.indexer.NutchDocument;
import org.apache.nutch.indexer.lucene.LuceneWriter;
import org.apache.nutch.indexer.lucene.LuceneWriter.INDEX;
import org.apache.nutch.indexer.lucene.LuceneWriter.STORE;
import org.apache.nutch.parse.Parse;

public class SiteURL1Indexer implements IndexingFilter {

    private static final Logger LOGGER = Logger.getLogger(SiteURL1Indexer.class);

    public static final String POSITION_KEY = "position";
    public static final String LOCATION_KEY = "location";
    public static final String COMPANY_KEY = "company";
    public static final String DESCRIPTION_KEY = "description";

    private Configuration conf;

    public void addIndexBackendOptions(Configuration conf) {
        LuceneWriter.addFieldOptions(POSITION_KEY, STORE.YES, INDEX.TOKENIZED, conf);
        LuceneWriter.addFieldOptions(LOCATION_KEY, STORE.YES, INDEX.TOKENIZED, conf);
        LuceneWriter.addFieldOptions(COMPANY_KEY, STORE.YES, INDEX.TOKENIZED, conf);
        LuceneWriter.addFieldOptions(DESCRIPTION_KEY, STORE.YES, INDEX.TOKENIZED, conf);
    }

    public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
            CrawlDatum datum, Inlinks inlinks) throws IndexingException {
        String position = parse.getData().getParseMeta().get(POSITION_KEY);
        String where = parse.getData().getParseMeta().get(LOCATION_KEY);
        String company = parse.getData().getParseMeta().get(COMPANY_KEY);
        String description = parse.getData().getParseMeta().get(DESCRIPTION_KEY);

        if (SiteURL1Parser.validateField(position)
                && SiteURL1Parser.validateField(where)
                && SiteURL1Parser.validateField(company)
                && SiteURL1Parser.validateField(description)) {
            LOGGER.debug("Adding position: [" + position + "] for URL: " + url);
            doc.add(POSITION_KEY, position);
            LOGGER.debug("Adding location: [" + where + "] for URL: " + url);
            doc.add(LOCATION_KEY, where);
            LOGGER.debug("Adding company: [" + company + "] for URL: " + url);
            doc.add(COMPANY_KEY, company);
            LOGGER.debug("Adding description: [" + description + "] for URL: " + url);
            doc.add(DESCRIPTION_KEY, description);
        }
        // If any field fails validation, the document is returned unchanged.
        return doc;
    }

    public Configuration getConf() {
        return this.conf;
    }

    public void setConf(Configuration conf) {
        this.conf = conf;
    }
}
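To show what I mean by failing "cleanly", here is a plain-Java sketch of the defensive pattern I have in mind, independent of the Nutch API (class and method names are made up for illustration): extract all the custom fields first, and only merge them into the document if everything succeeded, so one broken page just loses its custom fields instead of failing the job.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: defensive field extraction that degrades gracefully. */
public class SafeFieldExtractor {

    // Copies the given metadata keys into the document only if ALL of them are
    // present and non-empty; on a missing field or an unexpected runtime
    // exception the document is returned unchanged.
    public static Map<String, String> addCustomFields(
            Map<String, String> doc, Map<String, String> parseMeta, String... keys) {
        try {
            Map<String, String> extracted = new HashMap<>();
            for (String key : keys) {
                String value = parseMeta.get(key);
                if (value == null || value.trim().isEmpty()) {
                    return doc; // fail "cleanly": skip custom fields for this page
                }
                extracted.put(key, value);
            }
            doc.putAll(extracted); // all fields valid: merge them in one go
        } catch (RuntimeException e) {
            // In a real plugin: log the URL and the exception, then fall through.
            // The point is that one bad page must never abort the whole job.
        }
        return doc;
    }

    public static void main(String[] args) {
        Map<String, String> meta = new HashMap<>();
        meta.put("position", "engineer");
        meta.put("location", "Rome");

        Map<String, String> doc = new HashMap<>();
        // "company" is missing, so the document stays unchanged.
        addCustomFields(doc, meta, "position", "location", "company");
        System.out.println(doc.size()); // prints 0

        meta.put("company", "ACME");
        addCustomFields(doc, meta, "position", "location", "company");
        System.out.println(doc.size()); // prints 3
    }
}
```

In the real filter() the try/catch would wrap the getParseMeta() lookups and the doc.add() calls, returning the untouched NutchDocument from the catch block.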
I'm running Nutch 1.0. Yes, I know it's an old one, but I cannot afford the
migration to a newer version at the moment.
Thanks a lot for any hint.
S
----------------------------------
"Anyone proposing to run Windows on servers should be prepared to explain
what they know about servers that Google, Yahoo, and Amazon don't."
Paul Graham
"A mathematician is a device for turning coffee into theorems."
Paul Erdos (who obviously never met a sysadmin)