Hi Stefano,

Any further progress with this?
I have not looked too closely at your code, and sorry to state the obvious, but there does sound to be a definite link between the number of plugins and the duplication of the data. Are you sure that every one of your plugins is only handling the pages it is supposed to? It sounds more like each of your plugins is being activated for all of your pages, running the parsing stage on every page and indexing as well. That would explain the 17x duplication. Are you using any URLFilters or anything of that nature within your plugins?

On Thu, Jul 7, 2011 at 10:54 AM, Stefano Cherchi <[email protected]> wrote:

> Hello Markus,
>
> sorry for my late reply. I have finally solved the issue. Actually, it was
> my fault: I wasn't using Nutch 1.0 (as I said) but 1.2. Now I have rolled
> back to 1.0 and everything is working fine.
>
> But another strange behavior has shown up: as I said in my first mail, I
> have a plugin for each site I want to index. Each plugin creates 4 custom
> fields in the index. At the moment 17 of these plugins are activated. Now,
> when Nutch puts data into Solr, each custom field is filled with 17
> identical strings. The data saved in the custom fields are right, so each
> plugin is correctly extracting data from the site it is intended for, but
> when it performs indexing it duplicates each datum 17x.
>
> Quite weird.
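The guard Lewis is suggesting (each plugin checking that the page really belongs to its site before adding fields) can be sketched standalone. This is a minimal, hypothetical illustration with no Nutch dependencies: a plain Map stands in for NutchDocument, the class and method names are invented, and the URL pattern is borrowed from SiteURL1Parser further down in the thread. It is not the actual plugin code.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

public class GuardedIndexer {

    // Same shape of URL the SiteURL1 parser recognises.
    private static final Pattern SITE_URL = Pattern.compile(
            "^http://www\\.SiteURL1\\.com/offer[-\\w]{3,}\\?id=[0-9]{5,10}$");

    // Stand-in for NutchDocument: field name -> value.
    // Only decorate documents whose URL this plugin is responsible for,
    // instead of unconditionally copying shared parse-metadata keys.
    public static Map<String, String> filter(Map<String, String> doc,
                                             String url, String position) {
        if (!SITE_URL.matcher(url).matches()) {
            return doc; // not our page: pass through untouched
        }
        doc.put("position", position);
        return doc;
    }

    public static void main(String[] args) {
        Map<String, String> doc = new LinkedHashMap<>();
        filter(doc, "http://www.OtherSite.com/offer-x?id=12345", "Engineer");
        System.out.println("after foreign URL: " + doc.size());          // 0
        filter(doc, "http://www.SiteURL1.com/offer-abc?id=12345", "Engineer");
        System.out.println("after own URL: " + doc.get("position"));     // Engineer
    }
}
```

If every one of the 17 indexing filters applied a guard like this, each document would only receive fields from the one plugin responsible for its site, which would remove the 17x duplication caused by all plugins writing to the same field names.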
> I have pasted here the code of both the parsing and the indexing
> extensions of one plugin:
>
> #################### INDEXING EXTENSION ####################
>
> package it.company.searchengine.nutch.plugin.indexer.html.company;
>
> import it.company.searchengine.nutch.plugin.parser.html.company.SiteURL1Parser;
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.io.Text;
> import org.apache.log4j.Logger;
> import org.apache.nutch.crawl.CrawlDatum;
> import org.apache.nutch.crawl.Inlinks;
> import org.apache.nutch.indexer.IndexingException;
> import org.apache.nutch.indexer.IndexingFilter;
> import org.apache.nutch.indexer.NutchDocument;
> import org.apache.nutch.indexer.lucene.LuceneWriter;
> import org.apache.nutch.indexer.lucene.LuceneWriter.INDEX;
> import org.apache.nutch.indexer.lucene.LuceneWriter.STORE;
> import org.apache.nutch.parse.Parse;
>
> public class SiteURL1Indexer implements IndexingFilter {
>
>     private static final Logger LOGGER = Logger.getLogger(SiteURL1Indexer.class);
>
>     public static final String POSITION_KEY = "position";
>     public static final String LOCATION_KEY = "location";
>     public static final String COMPANY_KEY = "company";
>     public static final String DESCRIPTION_KEY = "description";
>
>     private Configuration conf;
>
>     public void addIndexBackendOptions(Configuration conf) {
>         LuceneWriter.addFieldOptions(POSITION_KEY, STORE.YES, INDEX.TOKENIZED, conf);
>         LuceneWriter.addFieldOptions(LOCATION_KEY, STORE.YES, INDEX.TOKENIZED, conf);
>         LuceneWriter.addFieldOptions(COMPANY_KEY, STORE.YES, INDEX.TOKENIZED, conf);
>         LuceneWriter.addFieldOptions(DESCRIPTION_KEY, STORE.YES, INDEX.TOKENIZED, conf);
>     }
>
>     public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
>             CrawlDatum datum, Inlinks inlinks) throws IndexingException {
>
>         String position = parse.getData().getParseMeta().get(POSITION_KEY);
>         String where = parse.getData().getParseMeta().get(LOCATION_KEY);
>         String company = parse.getData().getParseMeta().get(COMPANY_KEY);
>         String description = parse.getData().getParseMeta().get(DESCRIPTION_KEY);
>
>         if (SiteURL1Parser.validateField(position)
>                 && SiteURL1Parser.validateField(where)
>                 && SiteURL1Parser.validateField(company)
>                 && SiteURL1Parser.validateField(description)) {
>
>             LOGGER.debug("Adding position: [" + position + "] for URL: " + url.toString());
>             doc.add(POSITION_KEY, position);
>
>             LOGGER.debug("Adding location: [" + where + "] for URL: " + url.toString());
>             doc.add(LOCATION_KEY, where);
>
>             LOGGER.debug("Adding company: [" + company + "] for URL: " + url.toString());
>             doc.add(COMPANY_KEY, company);
>
>             LOGGER.debug("Adding description: [" + description + "] for URL: " + url.toString());
>             doc.add(DESCRIPTION_KEY, description);
>         }
>
>         return doc;
>     }
>
>     public Configuration getConf() {
>         return this.conf;
>     }
>
>     public void setConf(Configuration conf) {
>         this.conf = conf;
>     }
> }
>
> #################### PARSING EXTENSION ####################
>
> package it.company.searchengine.nutch.plugin.parser.html.company;
>
> import java.io.BufferedReader;
> import java.io.ByteArrayInputStream;
> import java.io.IOException;
> import java.io.InputStreamReader;
> import java.util.regex.Matcher;
> import java.util.regex.Pattern;
> import org.apache.log4j.Logger;
> import org.apache.hadoop.conf.Configuration;
> import org.apache.nutch.metadata.Metadata;
> import org.apache.nutch.parse.HTMLMetaTags;
> import org.apache.nutch.parse.HtmlParseFilter;
> import org.apache.nutch.parse.Parse;
> import org.apache.nutch.parse.ParseResult;
> import org.apache.nutch.protocol.Content;
> import org.w3c.dom.DocumentFragment;
>
> public class SiteURL1Parser implements HtmlParseFilter {
>
>     public static final String POSITION_KEY = "position";
>     public static final String LOCATION_KEY = "location";
>     public static final String COMPANY_KEY = "company";
>     public static final String DESCRIPTION_KEY = "description";
>
>     private static final Logger logger = Logger.getLogger(SiteURL1Parser.class);
>     private static final String HTML_TAG_PATTERN = "<[^><]{0,}>";
>
>     private Configuration conf = null;
>
>     public ParseResult filter(Content content, ParseResult parseResult,
>             HTMLMetaTags metaTags, DocumentFragment doc) {
>
>         String currentURL = content.getUrl();
>
>         // SiteURL1.COM
>         if (currentURL.contains("SiteURL1.com")) {
>             String urlPattern = "^http://www.SiteURL1.com/offer[-\\w]{3,}[?]id[=][0-9]{5,10}$";
>             Pattern pattern = Pattern.compile(urlPattern);
>             Matcher matcher = pattern.matcher(currentURL);
>
>             if (matcher.find()) {
>                 return filterSiteURL1(content, parseResult);
>             }
>         }
>
>         return parseResult;
>     }
>
>     public Configuration getConf() {
>         return conf;
>     }
>
>     public void setConf(Configuration conf) {
>         this.conf = conf;
>     }
>
>     public static boolean validateField(String field) {
>         if (field == null)
>             return false;
>         if (field.equalsIgnoreCase(""))
>             return false;
>         if (field.equalsIgnoreCase("NULL"))
>             return false;
>         return true;
>     }
>
>     private void printExtractedFields(String position, String company,
>             String location, String description) {
>         System.out.println("");
>         System.out.println("- POSITION: " + position);
>         System.out.println("- COMPANY: " + company);
>         System.out.println("- LOCATION: " + location);
>         System.out.println("- DESCRIPTION: " + description);
>     }
>
>     private ParseResult filterSiteURL1(Content content, ParseResult parseResult) {
>
>         logger.debug("Parsing URL: " + content.getUrl());
>
>         BufferedReader reader = null;
>         String currentURL = null;
>         String line = null;
>         Parse parse = null;
>         Metadata metadata = null;
>
>         String company = null;
>         String position = null;
>         String location = null;
>         String description = null;
>
>         boolean intoLocation = false;
>         boolean intoDescription = false;
>
>         Pattern pattern = null;
>         Matcher matcher = null;
>
>         try {
>             currentURL = content.getUrl();
>             description = "";
>
>             reader = new BufferedReader(new InputStreamReader(
>                     new ByteArrayInputStream(content.getContent())));
>             pattern = Pattern.compile(HTML_TAG_PATTERN);
>
>             while ((line = reader.readLine()) != null) {
>
>                 if (line.contains("<tr><td valign=top><a href='/join/check_session.jsp?idfonte=")) {
>                     line = line.trim();
>                     matcher = pattern.matcher(line);
>                     company = matcher.replaceAll("").trim();
>                     continue;
>                 }
>
>                 if (line.contains("<tr><td><a href='/join/check_session.jsp?id=")) {
>                     line = line.trim();
>                     matcher = pattern.matcher(line);
>                     position = matcher.replaceAll("").trim();
>                     continue;
>                 }
>
>                 if (line.contains("<tr><td class=\"txt-black-regular-10\"></br><strong>Place</strong>:")) {
>                     intoLocation = true;
>                     continue;
>                 } else if (intoLocation) {
>                     line = line.trim();
>                     if (validateField(line)) {
>                         location = line;
>                         location = location.replaceAll(" - ", " - ");
>                         intoLocation = false;
>                     }
>                     continue;
>                 }
>
>                 if (line.contains("<span class=\"txt-black-regular-10\"><strong>Requirements</strong></span>:<br/><a href='/join/check_session.jsp?id=")) {
>                     intoDescription = true;
>                     line = line.trim();
>                     matcher = pattern.matcher(line);
>                     description = matcher.replaceAll("").trim();
>                 } else if (intoDescription) {
>                     line = line.trim();
>                     if (validateField(line)) {
>                         matcher = pattern.matcher(line);
>                         String tmpDescription = matcher.replaceAll("").trim();
>                         if (validateField(tmpDescription)) {
>                             if (validateField(description)) {
>                                 description = description + " " + tmpDescription;
>                             } else {
>                                 description = tmpDescription;
>                             }
>                         }
>                     }
>                 }
>
>                 if (line.contains("</a></span><br/><br/>")) {
>                     description = description.replaceAll("[\\s]{1,}", " ").trim();
>                     while (description.startsWith("Requirements")) {
>                         description = description.replaceFirst("Requirements", "").trim();
>                         if (description.startsWith(":")) {
>                             description = description.substring(1).trim();
>                         }
>                     }
>                     intoDescription = false;
>                     break;
>                 }
>             }
>
>             reader.close();
>
>             if (validateField(position)) {
>                 parse = parseResult.get(currentURL);
>                 metadata = parse.getData().getParseMeta();
>                 metadata.add(POSITION_KEY, position);
>
>                 if (validateField(company)) {
>                     metadata.add(COMPANY_KEY, company);
>                 } else {
>                     metadata.add(COMPANY_KEY, "Unknown");
>                 }
>
>                 if (validateField(location)) {
>                     metadata.add(LOCATION_KEY, location);
>                 } else {
>                     metadata.add(LOCATION_KEY, "Unknown");
>                 }
>
>                 if (validateField(description)) {
>                     metadata.add(DESCRIPTION_KEY, description);
>                 } else {
>                     metadata.add(DESCRIPTION_KEY, "");
>                 }
>             }
>         } catch (IOException e) {
>             logger.warn("IOException encountered parsing file:", e);
>         }
>
>         return parseResult;
>     }
> }
>
> ----------------------------------
> "Anyone proposing to run Windows on servers should be prepared to explain
> what they know about servers that Google, Yahoo, and Amazon don't."
> Paul Graham
>
> "A mathematician is a device for turning coffee into theorems."
> Paul Erdos (who obviously never met a sysadmin)
>
> > ________________________________
> > From: Markus Jelsma <[email protected]>
> > To: [email protected]
> > Cc: Stefano Cherchi <[email protected]>
> > Sent: Thursday, 30 June 2011 13:29
> > Subject: Re: Nutch + Hadoop + Solr: custom plugin cause EOFException while indexing
> >
> > I'm not sure, but you could provide your stacktrace. It would at least
> > make it easier.
> >
> > On Thursday 30 June 2011 13:18:06 Stefano Cherchi wrote:
> >> Hi everybody,
> >>
> >> I have a 4-node nutch+hadoop+solr stack that indexes a bunch of external
> >> websites of, say, house sale ads.
> >>
> >> Everything worked fine while I used only the default Nutch IndexingFilter,
> >> but then I needed some customization to improve the quality of the search
> >> results.
> >>
> >> So I developed a set of plugins (one for each site I need to index) that
> >> add some custom fields to the index (say house price, location, name of
> >> the seller and so on) and extract those specific data from the HTML of
> >> the parsed page.
> >>
> >> Again, everything ran smoothly as long as the structure of the parsed
> >> pages remained unchanged. Unfortunately, some of the sites that I want to
> >> index have recently undergone a restyling, and my troubles started: all
> >> the crawling, fetching, merging etc. seems to complete without errors,
> >> but when Nutch invokes LinkDb (just before solrindexer) to prepare the
> >> data to be put into Solr, it returns a lot of EOFExceptions, the indexing
> >> job fails, and no document is added to Solr, even if just one of the
> >> plugins fails.
> >>
> >> My questions are: where could the problem be, and how can I avoid the
> >> complete failure of the indexing job? The plugin that parses the modified
> >> site should manage to fail "cleanly" without affecting the whole process.
> >>
> >> This is the code of the indexing part of the plugin:
> >>
> >> [SiteURL1Indexer code snipped: identical to the INDEXING EXTENSION
> >> pasted earlier in this thread]
> >>
> >> I'm running Nutch 1.0. Yes, I know it's an old one, but I cannot afford
> >> the migration to a newer version at the moment.
> >>
> >> Thanks a lot for any hint.
> >>
> >> S
> >
> > --
> > Markus Jelsma - CTO - Openindex
> > http://www.linkedin.com/in/markus17
> > 050-8536620 / 06-50258350

--
Lewis
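Stefano's requirement that a misbehaving plugin fail "cleanly" without sinking the whole indexing job can be approached by catching per-page extraction failures inside the filter and passing the input through unchanged. This is a minimal standalone sketch with hypothetical names and no Nutch dependencies; in a real HtmlParseFilter the catch would cover all runtime exceptions (not just IOException, as the SiteURL1Parser above does) and return the original ParseResult.

```java
public class SafeFilter {

    // Stand-in for the site-specific extraction logic that may break
    // after a site restyle.
    interface Extractor {
        String extract(String html) throws Exception;
    }

    // If extraction blows up on one page, fall back instead of letting
    // the exception abort the whole batch.
    public static String filter(String html, String fallback, Extractor e) {
        try {
            return e.extract(html);
        } catch (Exception ex) {
            // In a real plugin: log the URL and exception here.
            return fallback;
        }
    }

    public static void main(String[] args) {
        Extractor stripTags = h -> h.replaceAll("<[^><]*>", "");
        String ok = filter("<b>Engineer</b>", "unparsed", stripTags);
        String bad = filter(null, "unparsed", stripTags); // NPE is caught
        System.out.println(ok + " / " + bad);             // Engineer / unparsed
    }
}
```

The design choice is that a parse filter should treat extraction failure as "this page yields no custom fields", not as a fatal error: downstream, the indexing filter already tolerates missing metadata via validateField, so documents from the restyled site would simply be indexed without the custom fields.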

