I'm not sure, but you could provide your stack trace. That would at least
make it easier.
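In the meantime, one way to keep a single broken site parser from sinking the whole indexing job is to wrap the per-document work in a catch-all and return the document unmodified when extraction fails, so the filter degrades gracefully instead of propagating an exception. Below is a minimal, Nutch-independent sketch of that pattern; the `Map`-based document/metadata and the `safeFilter` name are illustrative stand-ins, not the actual Nutch API:

```java
import java.util.HashMap;
import java.util.Map;

public class SafeFilterSketch {

    // Illustrative stand-in for NutchDocument: just a field map.
    static Map<String, String> safeFilter(Map<String, String> doc,
                                          Map<String, String> parseMeta) {
        try {
            String position = parseMeta.get("position");
            String location = parseMeta.get("location");
            // If the restyled page no longer yields these fields, bail out
            // quietly instead of letting an exception kill the whole job.
            if (position == null || location == null) {
                return doc; // skip enrichment, keep the document indexable
            }
            doc.put("position", position);
            doc.put("location", location);
            return doc;
        } catch (RuntimeException e) {
            // Log and swallow: one bad page must not fail the batch.
            System.err.println("Skipping custom fields: " + e.getMessage());
            return doc;
        }
    }

    public static void main(String[] args) {
        Map<String, String> doc = new HashMap<>();
        Map<String, String> goodMeta = new HashMap<>();
        goodMeta.put("position", "engineer");
        goodMeta.put("location", "Rome");
        safeFilter(doc, goodMeta);
        System.out.println(doc.get("position")); // engineer

        Map<String, String> doc2 = new HashMap<>();
        safeFilter(doc2, new HashMap<>()); // restyled page: no metadata
        System.out.println(doc2.isEmpty()); // true: unmodified, still indexed
    }
}
```

The same shape applies inside a real `IndexingFilter.filter()`: catch `RuntimeException` around the per-field work and return `doc` untouched, rather than letting the failure bubble up.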
On Thursday 30 June 2011 13:18:06 Stefano Cherchi wrote:
> Hi everybody,
>
> I have a 4-nodes nutch+hadoop+solr stack that indexes a bunch of external
> websites of, say, house sale ads.
>
> Everything worked fine while I used only the default Nutch IndexingFilter,
> but then I needed some customization to improve the quality of the search
> results.
>
> So I developed a set of plugins (one for each site I need to index) that
> extract site-specific data from the HTML of the parsed page and add some
> custom fields to the index (say house price, location, name of the seller
> and so on).
>
> Again, everything ran smoothly as long as the structure of the parsed pages
> remained unchanged. Unfortunately, some of the sites I index have recently
> been restyled, and that's where my troubles started: crawling, fetching,
> merging etc. all seem to complete without errors, but when Nutch invokes
> LinkDb (just before the solrindexer) to prepare the data for Solr, it
> throws a lot of EOFExceptions, the indexing job fails, and no document is
> added to Solr, even if just one of the plugins fails.
>
> My questions are: where could the problem be, and how can I avoid the
> complete failure of the indexing job? The plugin that parses the modified
> site should fail "cleanly" without affecting the whole process.
>
> This is the code of the indexing part of the plugin:
>
>
>
>
> package it.company.searchengine.nutch.plugin.indexer.html.company;
>
> import it.company.searchengine.nutch.plugin.parser.html.company.SiteURL1Parser;
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.io.Text;
> import org.apache.log4j.Logger;
> import org.apache.nutch.crawl.CrawlDatum;
> import org.apache.nutch.crawl.Inlinks;
> import org.apache.nutch.indexer.IndexingException;
> import org.apache.nutch.indexer.IndexingFilter;
> import org.apache.nutch.indexer.NutchDocument;
> import org.apache.nutch.indexer.lucene.LuceneWriter;
> import org.apache.nutch.indexer.lucene.LuceneWriter.INDEX;
> import org.apache.nutch.indexer.lucene.LuceneWriter.STORE;
> import org.apache.nutch.parse.Parse;
>
> public class SiteURL1Indexer implements IndexingFilter {
>
>     private static final Logger LOGGER = Logger.getLogger(SiteURL1Indexer.class);
>
>     public static final String POSITION_KEY = "position";
>     public static final String LOCATION_KEY = "location";
>     public static final String COMPANY_KEY = "company";
>     public static final String DESCRIPTION_KEY = "description";
>
>     private Configuration conf;
>
>     public void addIndexBackendOptions(Configuration conf) {
>         LuceneWriter.addFieldOptions(POSITION_KEY, STORE.YES, INDEX.TOKENIZED, conf);
>         LuceneWriter.addFieldOptions(LOCATION_KEY, STORE.YES, INDEX.TOKENIZED, conf);
>         LuceneWriter.addFieldOptions(COMPANY_KEY, STORE.YES, INDEX.TOKENIZED, conf);
>         LuceneWriter.addFieldOptions(DESCRIPTION_KEY, STORE.YES, INDEX.TOKENIZED, conf);
>     }
>
>     public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
>             CrawlDatum datum, Inlinks inlinks) throws IndexingException {
>
>         String position = parse.getData().getParseMeta().get(POSITION_KEY);
>         String where = parse.getData().getParseMeta().get(LOCATION_KEY);
>         String company = parse.getData().getParseMeta().get(COMPANY_KEY);
>         String description = parse.getData().getParseMeta().get(DESCRIPTION_KEY);
>
>         if (SiteURL1Parser.validateField(position)
>                 && SiteURL1Parser.validateField(where)
>                 && SiteURL1Parser.validateField(company)
>                 && SiteURL1Parser.validateField(description)) {
>
>             LOGGER.debug("Adding position: [" + position + "] for URL: " + url.toString());
>             doc.add(POSITION_KEY, position);
>
>             LOGGER.debug("Adding location: [" + where + "] for URL: " + url.toString());
>             doc.add(LOCATION_KEY, where);
>
>             LOGGER.debug("Adding company: [" + company + "] for URL: " + url.toString());
>             doc.add(COMPANY_KEY, company);
>
>             LOGGER.debug("Adding description: [" + description + "] for URL: " + url.toString());
>             doc.add(DESCRIPTION_KEY, description);
>         }
>
>         return doc;
>     }
>
>     public Configuration getConf() {
>         return this.conf;
>     }
>
>     public void setConf(Configuration conf) {
>         this.conf = conf;
>     }
> }
>
>
>
>
> I'm running Nutch 1.0. Yes, I know it's an old one, but I cannot afford to
> migrate to a newer version at the moment.
>
>
> Thanks a lot for any hint.
>
> S
>
> ----------------------------------
> "Anyone proposing to run Windows on servers should be prepared to explain
> what they know about servers that Google, Yahoo, and Amazon don't."
> Paul Graham
>
>
> "A mathematician is a device for turning coffee into theorems."
> Paul Erdos (who obviously never met a sysadmin)
--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350