Hey,
I recently developed a custom ParseFilter and an IndexingFilter with
quite similar symptoms (the DOM did not get entirely parsed for some
unknown reason, and no exceptions at all). The solution was to disable
some of the optional plugins I had copied from an example page, and
suddenly everything went fine.
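For reference, the list of active plugins lives in the plugin.includes
property of conf/nutch-site.xml, so disabling a plugin just means
narrowing that regex. A sketch (the value below is an illustrative
default-style list, not your exact setup):

```xml
<!-- conf/nutch-site.xml: narrow this regex to disable suspect plugins.
     The value below is illustrative, not a recommended configuration. -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)</value>
</property>
```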
So maybe try removing the plugins one by one until everything works
again, to narrow your problem down a bit.
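As for making a plugin fail "cleanly": one pattern is to catch
everything inside filter() and return the document untouched when the
site-specific extraction blows up, so one broken site layout cannot
abort the whole indexing job. Here is a stand-alone sketch of the idea
(plain Maps stand in for the Nutch types so it compiles without Nutch
on the classpath; all names here are made up for illustration):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of a "fail cleanly" indexing filter. Nutch-specific types are
// replaced with plain Maps so the pattern is runnable stand-alone.
public class CleanFailFilter {

    // Extracts a required field; throws if the parser did not produce it
    // (simulates what happens when a site's HTML structure changes).
    static String requireField(Map<String, String> parseMeta, String key) {
        String value = parseMeta.get(key);
        if (value == null || value.isEmpty()) {
            throw new IllegalStateException("missing field: " + key);
        }
        return value;
    }

    // Returns the document with extra fields added, or unchanged on failure.
    public static Map<String, String> filter(Map<String, String> doc,
                                             Map<String, String> parseMeta) {
        try {
            String position = requireField(parseMeta, "position");
            String location = requireField(parseMeta, "location");
            doc.put("position", position);
            doc.put("location", location);
        } catch (RuntimeException e) {
            // Log and return the document as-is instead of letting the
            // exception propagate and kill the indexing job.
            System.err.println("skipping custom fields: " + e.getMessage());
        }
        return doc;
    }

    public static void main(String[] args) {
        Map<String, String> goodMeta = new HashMap<>();
        goodMeta.put("position", "engineer");
        goodMeta.put("location", "Rome");
        Map<String, String> doc1 = filter(new HashMap<>(), goodMeta);
        System.out.println(doc1.containsKey("position")); // true

        // Broken page: metadata missing, document survives unchanged.
        Map<String, String> doc2 = filter(new HashMap<>(), new HashMap<>());
        System.out.println(doc2.isEmpty()); // true
    }
}
```

In the real filter this means the page still gets indexed, just without
your custom fields, instead of the job failing outright.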
Am 30.06.11 13:29, schrieb Markus Jelsma:
> I'm not sure, but you could provide your stack trace. That would at least
> make it easier.
>
> On Thursday 30 June 2011 13:18:06 Stefano Cherchi wrote:
>> Hi everybody,
>>
>> I have a 4-nodes nutch+hadoop+solr stack that indexes a bunch of external
>> websites of, say, house sale ads.
>>
>> Everything worked fine as long as I used only the default Nutch
>> IndexingFilter, but then I needed some customization to improve the
>> quality of the search results.
>>
>> So I developed a set of plugins (one for each site I need to index) that
>> add some custom fields to the index (say house price, location, name of
>> the seller and so on) and extract those specific data from the HTML of
>> the parsed page.
>>
>> Again, everything ran smoothly as long as the structure of the parsed
>> pages remained unchanged. Unfortunately some of the sites I want to index
>> have recently been restyled, and that is where my troubles started: all
>> the crawling, fetching, merging etc. seems to complete without errors,
>> but when Nutch invokes LinkDb (just before solrindexer) to prepare the
>> data for Solr, it throws a lot of EOFExceptions, the indexing job fails,
>> and no document is added to Solr even if just one of the plugins fails.
>>
>> My questions are: where could the problem be, and how can I avoid the
>> complete failure of the indexing job? The plugin that parses the modified
>> site should fail "cleanly" without affecting the whole process.
>>
>> This is the code of the indexing part of the plugin:
>>
>>
>>
>>
>> package it.company.searchengine.nutch.plugin.indexer.html.company;
>>
>> import it.company.searchengine.nutch.plugin.parser.html.company.SiteURL1Parser;
>> import org.apache.hadoop.conf.Configuration;
>> import org.apache.hadoop.io.Text;
>> import org.apache.log4j.Logger;
>> import org.apache.nutch.crawl.CrawlDatum;
>> import org.apache.nutch.crawl.Inlinks;
>> import org.apache.nutch.indexer.IndexingException;
>> import org.apache.nutch.indexer.IndexingFilter;
>> import org.apache.nutch.indexer.NutchDocument;
>> import org.apache.nutch.indexer.lucene.LuceneWriter;
>> import org.apache.nutch.indexer.lucene.LuceneWriter.INDEX;
>> import org.apache.nutch.indexer.lucene.LuceneWriter.STORE;
>> import org.apache.nutch.parse.Parse;
>>
>> public class SiteURL1Indexer implements IndexingFilter {
>>
>>     private static final Logger LOGGER = Logger.getLogger(SiteURL1Indexer.class);
>>
>>     public static final String POSITION_KEY = "position";
>>     public static final String LOCATION_KEY = "location";
>>     public static final String COMPANY_KEY = "company";
>>     public static final String DESCRIPTION_KEY = "description";
>>
>>     private Configuration conf;
>>
>>     public void addIndexBackendOptions(Configuration conf) {
>>         LuceneWriter.addFieldOptions(POSITION_KEY, STORE.YES, INDEX.TOKENIZED, conf);
>>         LuceneWriter.addFieldOptions(LOCATION_KEY, STORE.YES, INDEX.TOKENIZED, conf);
>>         LuceneWriter.addFieldOptions(COMPANY_KEY, STORE.YES, INDEX.TOKENIZED, conf);
>>         LuceneWriter.addFieldOptions(DESCRIPTION_KEY, STORE.YES, INDEX.TOKENIZED, conf);
>>     }
>>
>>     public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
>>             CrawlDatum datum, Inlinks inlinks) throws IndexingException {
>>
>>         String position = parse.getData().getParseMeta().get(POSITION_KEY);
>>         String where = parse.getData().getParseMeta().get(LOCATION_KEY);
>>         String company = parse.getData().getParseMeta().get(COMPANY_KEY);
>>         String description = parse.getData().getParseMeta().get(DESCRIPTION_KEY);
>>
>>         // Only add the custom fields when the parser produced all of them.
>>         if (SiteURL1Parser.validateField(position)
>>                 && SiteURL1Parser.validateField(where)
>>                 && SiteURL1Parser.validateField(company)
>>                 && SiteURL1Parser.validateField(description)) {
>>
>>             LOGGER.debug("Adding position: [" + position + "] for URL: " + url.toString());
>>             doc.add(POSITION_KEY, position);
>>
>>             LOGGER.debug("Adding location: [" + where + "] for URL: " + url.toString());
>>             doc.add(LOCATION_KEY, where);
>>
>>             LOGGER.debug("Adding company: [" + company + "] for URL: " + url.toString());
>>             doc.add(COMPANY_KEY, company);
>>
>>             LOGGER.debug("Adding description: [" + description + "] for URL: " + url.toString());
>>             doc.add(DESCRIPTION_KEY, description);
>>         }
>>
>>         return doc;
>>     }
>>
>>     public Configuration getConf() {
>>         return this.conf;
>>     }
>>
>>     public void setConf(Configuration conf) {
>>         this.conf = conf;
>>     }
>> }
>>
>>
>>
>>
>> I'm running Nutch 1.0. Yes, I know it's an old one, but I cannot afford
>> the migration to a newer version at the moment.
>>
>>
>> Thanks a lot for any hint.
>>
>> S
>>
>> ----------------------------------
>> "Anyone proposing to run Windows on servers should be prepared to explain
>> what they know about servers that Google, Yahoo, and Amazon don't."
>> Paul Graham
>>
>>
>> "A mathematician is a device for turning coffee into theorems."
>> Paul Erdos (who obviously never met a sysadmin)
>