Hi everybody,
I have a 4-node Nutch+Hadoop+Solr stack that indexes a bunch of external
websites of, say, house sale ads.
Everything worked fine while I used only the default Nutch IndexingFilter,
but then I needed some customization to improve the quality of the search
results. So I developed a set of plugins (one for each site I need to index)
that add some custom fields to the index (say, house price, location, name of
the seller, and so on), extracting those data from the HTML of the parsed page.
Again, everything ran smoothly as long as the structure of the parsed pages
stayed unchanged. Unfortunately, some of the sites I want to index have
recently been restyled, and my troubles started: now the crawling, fetching,
merging, etc. all seem to complete without errors, but when Nutch invokes
LinkDb (just before the solrindexer) to prepare the data to be put into the
Solr database, it throws a lot of EOFExceptions, the indexing job fails, and
no document is added to Solr, even if just one of the plugins fails.
My questions are: where could the problem be, and how can I avoid the complete
failure of the indexing job? The plugin that parses the modified site should
fail "cleanly" without affecting the whole process.
This is the code of the indexing part of the plugin:
package it.company.searchengine.nutch.plugin.indexer.html.company;

import it.company.searchengine.nutch.plugin.parser.html.company.SiteURL1Parser;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.log4j.Logger;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.indexer.NutchDocument;
import org.apache.nutch.indexer.lucene.LuceneWriter;
import org.apache.nutch.indexer.lucene.LuceneWriter.INDEX;
import org.apache.nutch.indexer.lucene.LuceneWriter.STORE;
import org.apache.nutch.parse.Parse;

public class SiteURL1Indexer implements IndexingFilter {

    private static final Logger LOGGER = Logger.getLogger(SiteURL1Indexer.class);

    public static final String POSITION_KEY = "position";
    public static final String LOCATION_KEY = "location";
    public static final String COMPANY_KEY = "company";
    public static final String DESCRIPTION_KEY = "description";

    private Configuration conf;

    public void addIndexBackendOptions(Configuration conf) {
        LuceneWriter.addFieldOptions(POSITION_KEY, STORE.YES, INDEX.TOKENIZED, conf);
        LuceneWriter.addFieldOptions(LOCATION_KEY, STORE.YES, INDEX.TOKENIZED, conf);
        LuceneWriter.addFieldOptions(COMPANY_KEY, STORE.YES, INDEX.TOKENIZED, conf);
        LuceneWriter.addFieldOptions(DESCRIPTION_KEY, STORE.YES, INDEX.TOKENIZED, conf);
    }

    public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
            CrawlDatum datum, Inlinks inlinks) throws IndexingException {
        String position = parse.getData().getParseMeta().get(POSITION_KEY);
        String where = parse.getData().getParseMeta().get(LOCATION_KEY);
        String company = parse.getData().getParseMeta().get(COMPANY_KEY);
        String description = parse.getData().getParseMeta().get(DESCRIPTION_KEY);

        if (SiteURL1Parser.validateField(position)
                && SiteURL1Parser.validateField(where)
                && SiteURL1Parser.validateField(company)
                && SiteURL1Parser.validateField(description)) {
            LOGGER.debug("Adding position: [" + position + "] for URL: " + url);
            doc.add(POSITION_KEY, position);
            LOGGER.debug("Adding location: [" + where + "] for URL: " + url);
            doc.add(LOCATION_KEY, where);
            LOGGER.debug("Adding company: [" + company + "] for URL: " + url);
            doc.add(COMPANY_KEY, company);
            LOGGER.debug("Adding description: [" + description + "] for URL: " + url);
            doc.add(DESCRIPTION_KEY, description);
        }
        // If any field fails validation, the document is returned unchanged.
        return doc;
    }

    public Configuration getConf() {
        return this.conf;
    }

    public void setConf(Configuration conf) {
        this.conf = conf;
    }
}
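To show what I mean by failing "cleanly", here is a plain-Java sketch of the defensive pattern I have in mind, independent of the Nutch API (class and method names are made up for illustration): extract all the custom fields first, and only merge them into the document if everything succeeded, so one broken page just loses its custom fields instead of failing the job.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: defensive field extraction that degrades gracefully. */
public class SafeFieldExtractor {

    // Copies the given metadata keys into the document only if ALL of them are
    // present and non-empty; on a missing field or an unexpected runtime
    // exception the document is returned unchanged.
    public static Map<String, String> addCustomFields(
            Map<String, String> doc, Map<String, String> parseMeta, String... keys) {
        try {
            Map<String, String> extracted = new HashMap<>();
            for (String key : keys) {
                String value = parseMeta.get(key);
                if (value == null || value.trim().isEmpty()) {
                    return doc; // fail "cleanly": skip custom fields for this page
                }
                extracted.put(key, value);
            }
            doc.putAll(extracted); // all fields valid: merge them in one go
        } catch (RuntimeException e) {
            // In a real plugin: log the URL and the exception, then fall through.
            // The point is that one bad page must never abort the whole job.
        }
        return doc;
    }

    public static void main(String[] args) {
        Map<String, String> meta = new HashMap<>();
        meta.put("position", "engineer");
        meta.put("location", "Rome");

        Map<String, String> doc = new HashMap<>();
        // "company" is missing, so the document stays unchanged.
        addCustomFields(doc, meta, "position", "location", "company");
        System.out.println(doc.size()); // prints 0

        meta.put("company", "ACME");
        addCustomFields(doc, meta, "position", "location", "company");
        System.out.println(doc.size()); // prints 3
    }
}
```

In the real filter() the try/catch would wrap the getParseMeta() lookups and the doc.add() calls, returning the untouched NutchDocument from the catch block.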
I'm running Nutch 1.0. Yes, I know it's an old one, but I cannot afford the
migration to a newer version at the moment.
Thanks a lot for any hint.
S
----------------------------------
"Anyone proposing to run Windows on servers should be prepared to explain
what they know about servers that Google, Yahoo, and Amazon don't."
Paul Graham
"A mathematician is a device for turning coffee into theorems."
Paul Erdos (who obviously never met a sysadmin)