I'm not sure, but you could provide your stack trace. That would at least
make it easier.
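In the meantime, one way to keep a single broken site parser from sinking the whole indexing job is to wrap the per-document work in a catch-all and return the document unmodified when extraction fails, so the filter degrades gracefully instead of propagating an exception. Below is a minimal, Nutch-independent sketch of that pattern; the `Map`-based document/metadata and the `safeFilter` name are illustrative stand-ins, not the actual Nutch API:

```java
import java.util.HashMap;
import java.util.Map;

public class SafeFilterSketch {

    // Illustrative stand-in for NutchDocument: just a field map.
    static Map<String, String> safeFilter(Map<String, String> doc,
                                          Map<String, String> parseMeta) {
        try {
            String position = parseMeta.get("position");
            String location = parseMeta.get("location");
            // If the restyled page no longer yields these fields, bail out
            // quietly instead of letting an exception kill the whole job.
            if (position == null || location == null) {
                return doc; // skip enrichment, keep the document indexable
            }
            doc.put("position", position);
            doc.put("location", location);
            return doc;
        } catch (RuntimeException e) {
            // Log and swallow: one bad page must not fail the batch.
            System.err.println("Skipping custom fields: " + e.getMessage());
            return doc;
        }
    }

    public static void main(String[] args) {
        Map<String, String> doc = new HashMap<>();
        Map<String, String> goodMeta = new HashMap<>();
        goodMeta.put("position", "engineer");
        goodMeta.put("location", "Rome");
        safeFilter(doc, goodMeta);
        System.out.println(doc.get("position")); // engineer

        Map<String, String> doc2 = new HashMap<>();
        safeFilter(doc2, new HashMap<>()); // restyled page: no metadata
        System.out.println(doc2.isEmpty()); // true: unmodified, still indexed
    }
}
```

The same shape applies inside a real `IndexingFilter.filter()`: catch `RuntimeException` around the per-field work and return `doc` untouched, rather than letting the failure bubble up.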
On Thursday 30 June 2011 13:18:06 Stefano Cherchi wrote:
> Hi everybody,
>
> I have a 4-nodes nutch+hadoop+solr stack that indexes a bunch of external
> websites of, say, house sale ads.
>
> Everything worked fine while I used only the default Nutch IndexingFilter,
> but then I needed some customization to improve the quality of the search
> results.
>
> So I developed a set of plugins (one for each site I need to index) that
> extract site-specific data from the HTML of the parsed page and add some
> custom fields to the index (say house price, location, name of the seller
> and so on).
>
> Again, everything ran smoothly as long as the structure of the parsed pages
> remained unchanged. Unfortunately, some of the sites I index have recently
> been restyled, and that's where my troubles started: crawling, fetching,
> merging etc. all seem to complete without errors, but when Nutch invokes
> LinkDb (just before the solrindexer) to prepare the data for Solr, it
> throws a lot of EOFExceptions, the indexing job fails, and no document is
> added to Solr, even if just one of the plugins fails.
>
> My questions are: where could the problem be, and how can I avoid the
> complete failure of the indexing job? The plugin that parses the modified
> site should fail "cleanly" without affecting the whole process.
>
> This is the code of the indexing part of the plugin:
>
>
>
>
> package it.company.searchengine.nutch.plugin.indexer.html.company;
>
> import it.company.searchengine.nutch.plugin.parser.html.company.SiteURL1Parser;
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.io.Text;
> import org.apache.log4j.Logger;
> import org.apache.nutch.crawl.CrawlDatum;
> import org.apache.nutch.crawl.Inlinks;
> import org.apache.nutch.indexer.IndexingException;
> import org.apache.nutch.indexer.IndexingFilter;
> import org.apache.nutch.indexer.NutchDocument;
> import org.apache.nutch.indexer.lucene.LuceneWriter;
> import org.apache.nutch.indexer.lucene.LuceneWriter.INDEX;
> import org.apache.nutch.indexer.lucene.LuceneWriter.STORE;
> import org.apache.nutch.parse.Parse;
>
> public class SiteURL1Indexer implements IndexingFilter {
>
>     private static final Logger LOGGER = Logger.getLogger(SiteURL1Indexer.class);
>
>     public static final String POSITION_KEY = "position";
>     public static final String LOCATION_KEY = "location";
>     public static final String COMPANY_KEY = "company";
>     public static final String DESCRIPTION_KEY = "description";
>
>     private Configuration conf;
>
>     public void addIndexBackendOptions(Configuration conf) {
>         LuceneWriter.addFieldOptions(POSITION_KEY, STORE.YES, INDEX.TOKENIZED, conf);
>         LuceneWriter.addFieldOptions(LOCATION_KEY, STORE.YES, INDEX.TOKENIZED, conf);
>         LuceneWriter.addFieldOptions(COMPANY_KEY, STORE.YES, INDEX.TOKENIZED, conf);
>         LuceneWriter.addFieldOptions(DESCRIPTION_KEY, STORE.YES, INDEX.TOKENIZED, conf);
>     }
>
>     public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
>             CrawlDatum datum, Inlinks inlinks) throws IndexingException {
>
>         String position = parse.getData().getParseMeta().get(POSITION_KEY);
>         String where = parse.getData().getParseMeta().get(LOCATION_KEY);
>         String company = parse.getData().getParseMeta().get(COMPANY_KEY);
>         String description = parse.getData().getParseMeta().get(DESCRIPTION_KEY);
>
>         if (SiteURL1Parser.validateField(position)
>                 && SiteURL1Parser.validateField(where)
>                 && SiteURL1Parser.validateField(company)
>                 && SiteURL1Parser.validateField(description)) {
>
>             LOGGER.debug("Adding position: [" + position + "] for URL: " + url.toString());
>             doc.add(POSITION_KEY, position);
>
>             LOGGER.debug("Adding location: [" + where + "] for URL: " + url.toString());
>             doc.add(LOCATION_KEY, where);
>
>             LOGGER.debug("Adding company: [" + company + "] for URL: " + url.toString());
>             doc.add(COMPANY_KEY, company);
>
>             LOGGER.debug("Adding description: [" + description + "] for URL: " + url.toString());
>             doc.add(DESCRIPTION_KEY, description);
>         }
>
>         return doc;
>     }
>
>     public Configuration getConf() {
>         return this.conf;
>     }
>
>     public void setConf(Configuration conf) {
>         this.conf = conf;
>     }
> }
>
>
>
>
> I'm running Nutch 1.0. Yes, I know it's an old one, but I cannot afford to
> migrate to a newer version at the moment.
>
>
> Thanks a lot for any hint.
>
> S
>
> ----------------------------------
> "Anyone proposing to run Windows on servers should be prepared to explain
> what they know about servers that Google, Yahoo, and Amazon don't."
> Paul Graham
>
>
> "A mathematician is a device for turning coffee into theorems."
> Paul Erdos (who obviously never met a sysadmin)
--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350