Hello Markus,

sorry for my late reply. I have finally solved the issue. Actually, it was my 
fault: I wasn't using Nutch 1.0 (as I said) but 1.2. Now I rolled back to 1.0 
and everything is working fine.

But another strange behavior showed up: as I said in my first mail, I have a 
plugin for each site I want to index. Each plugin creates 4 custom fields in 
the index. At the moment 17 of these plugins are activated. Now when Nutch puts 
data into Solr each custom field is filled with 17 identical strings. The data 
saved into the custom fields are right, so each plugin is correctly extracting 
data from the site it is intended for, but when it performs indexing it 
duplicates the datum 17x.

Quite weird.

I have pasted here the code of both the parsing and the indexing extensions of 
one plugin:

####################INDEXING EXTENSION#######################
>> package it.company.searchengine.nutch.plugin.indexer.html.company;
>> 
>> import
>> it.company.searchengine.nutch.plugin.parser.html.company.SiteURL1Parser;
>> import org.apache.hadoop.conf.Configuration;
>> import org.apache.hadoop.io.Text;
>> import org.apache.log4j.Logger;
>> import org.apache.nutch.crawl.CrawlDatum;
>> import org.apache.nutch.crawl.Inlinks;
>> import org.apache.nutch.indexer.IndexingException;
>> import org.apache.nutch.indexer.IndexingFilter;
>> import org.apache.nutch.indexer.NutchDocument;
>> import org.apache.nutch.indexer.lucene.LuceneWriter;
>> import org.apache.nutch.indexer.lucene.LuceneWriter.INDEX;
>> import org.apache.nutch.indexer.lucene.LuceneWriter.STORE;
>> import org.apache.nutch.parse.Parse;
>> 
>> public class SiteURL1Indexer implements IndexingFilter {
>> 
>>     private static final Logger LOGGER =
>> Logger.getLogger(SiteURL1Indexer.class); public static final String
>> POSITION_KEY = "position";
>>     public static final String LOCATION_KEY = "location";
>>     public static final String COMPANY_KEY = "company";
>>     public static final String DESCRIPTION_KEY = "description";
>>     private Configuration conf;
>> 
>>     public void addIndexBackendOptions(Configuration conf) {
>>         LuceneWriter.addFieldOptions(POSITION_KEY, STORE.YES,
>> INDEX.TOKENIZED, conf); LuceneWriter.addFieldOptions(LOCATION_KEY,
>> STORE.YES, INDEX.TOKENIZED, conf);
>> LuceneWriter.addFieldOptions(COMPANY_KEY, STORE.YES, INDEX.TOKENIZED,
>> conf); LuceneWriter.addFieldOptions(DESCRIPTION_KEY, STORE.YES,
>> INDEX.TOKENIZED, conf); }
>> 
>>     public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
>> CrawlDatum datum, Inlinks inlinks) throws IndexingException {
>> 
>>         String position = null;
>>         String where = null;
>>         String company = null;
>>         String description = null;
>> 
>>         position = parse.getData().getParseMeta().get(POSITION_KEY);
>>         where = parse.getData().getParseMeta().get(LOCATION_KEY);
>>         company = parse.getData().getParseMeta().get(COMPANY_KEY);
>>         description = parse.getData().getParseMeta().get(DESCRIPTION_KEY);
>> 
>>         if (SiteURL1Parser.validateField(position)
>>                 && SiteURL1Parser.validateField(where)
>>                 && SiteURL1Parser.validateField(company)
>>                 && SiteURL1Parser.validateField(description)) {
>> 
>>             LOGGER.debug("Adding position: [" + position + "] for URL: " +
>> url.toString()); doc.add(POSITION_KEY, position);
>> 
>>             LOGGER.debug("Adding location: [" + position + "] for URL: " +
>> url.toString()); doc.add(LOCATION_KEY, where);
>> 
>>             LOGGER.debug("Adding company: [" + position + "] for URL: " +
>> url.toString()); doc.add(COMPANY_KEY, company);
>> 
>>             LOGGER.debug("Adding description: [" + position + "] for URL: "
>> + url.toString()); doc.add(DESCRIPTION_KEY, description);
>> 
>>             return doc;
>> 
>>         } else {
>>             return doc;
>>         }
>>     }
>> 
>>     public Configuration getConf() {
>>         return this.conf;
>>     }
>> 
>>     public void setConf(Configuration conf) {
>>         this.conf = conf;
>>     }
>> }



################PARSING EXTENSION##################
package it.company.searchengine.nutch.plugin.parser.html.company;

import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.log4j.Logger;
import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.metadata.Metadata;
import org.apache.nutch.parse.HTMLMetaTags;
import org.apache.nutch.parse.HtmlParseFilter;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.parse.ParseResult;
import org.apache.nutch.protocol.Content;
import org.w3c.dom.DocumentFragment;

public class SiteURL1Parser implements HtmlParseFilter {

    public static final String POSITION_KEY = "position";
    public static final String LOCATION_KEY = "location";
    public static final String COMPANY_KEY = "company";
    public static final String DESCRIPTION_KEY = "description";
    private static final Logger logger = Logger.getLogger(SiteURL1Parser.class);
    private static final String HTML_TAG_PATTERN = "<[^><]{0,}>";
    private Configuration conf = null;

    public ParseResult filter(Content content, ParseResult parseResult, 
HTMLMetaTags metaTags, DocumentFragment doc) {

        String currentURL = null;
        String urlPattern = null;
        Pattern pattern = null;
        Matcher matcher = null;

        currentURL = currentURL = content.getUrl();

        //  SiteURL1.COM
        if (currentURL.contains("SiteURL1.com")) {
            urlPattern = 
"^http://www.SiteURL1.com/offer[-\\w]{3,}[?]id[=][0-9]{5,10}$";;
            pattern = Pattern.compile(urlPattern);
            matcher = pattern.matcher(currentURL);

            if (matcher.find()) {
                return filterSiteURL1(content, parseResult);
            }
        }

        return parseResult;
    }

    public Configuration getConf() {
        return conf;
    }

    public void setConf(Configuration conf) {
        this.conf = conf;
    }

    public static boolean validateField(String field) {

        if (field == null)
            return false;

        if (field.equalsIgnoreCase(""))
            return false;

        if (field.equalsIgnoreCase("NULL"))
            return false;

        return true;
    }

    private void printExtractedFields(String position, String company, String 
location, String description) {
        System.out.println("");
        System.out.println("- POSITION:    " + position);
        System.out.println("- COMPANY:     " + company);
        System.out.println("- LOCATION:    " + location);
        System.out.println("- DESCRIPTION: " + description);
    }

    private ParseResult filterSiteURL1(Content content, ParseResult 
parseResult) {

        logger.debug("Parsing URL: " + content.getUrl());

        BufferedReader reader = null;
        String currentURL = null;
        String line = null;
        Parse parse = null;
        Metadata metadata = null;

        String company = null;
        String position = null;
        String location = null;
        String description = null;

        boolean intoLocation = false;
        boolean intoDescription = false;

        Pattern pattern = null;
        Matcher matcher = null;

        try {

            currentURL = content.getUrl();
            description = new String();

            reader = new BufferedReader(new InputStreamReader(new 
ByteArrayInputStream(content.getContent())));
            pattern = Pattern.compile(HTML_TAG_PATTERN);

            while ((line = reader.readLine()) != null) {

                if (line.contains("<tr><td valign=top><a 
href='/join/check_session.jsp?idfonte=")) {
                    line = line.trim();
                    matcher = pattern.matcher(line);
                    company = matcher.replaceAll("").trim();
                    continue;
                }

                if (line.contains("<tr><td><a 
href='/join/check_session.jsp?id=")) {
                    line = line.trim();
                    matcher = pattern.matcher(line);
                    position = matcher.replaceAll("").trim();
                    continue;
                }

                if (line.contains("<tr><td 
class=\"txt-black-regular-10\"></br><strong>Place</strong>:")) {
                    intoLocation = true;
                    continue;

                } else if (intoLocation) {
                    line = line.trim();

                    if (validateField(line)) {
                        location = line;
                        location = 
location.replaceAll("&nbsp;&nbsp;-&nbsp;&nbsp;", " - ");
                        intoLocation = false;
                    }

                    continue;
                }

                if (line.contains("<span 
class=\"txt-black-regular-10\"><strong>Requirements</strong></span>:<br/><a 
href='/join/check_session.jsp?id=")) {

                    intoDescription = true;
                    line = line.trim();
                    matcher = pattern.matcher(line);
                    description = matcher.replaceAll("").trim();

                } else if (intoDescription) {

                    line = line.trim();

                    if (validateField(line)) {

                        String tmpDescription = null;
                        matcher = pattern.matcher(line);
                        tmpDescription = matcher.replaceAll("").trim();

                        if (validateField(tmpDescription)) {

                            if (validateField(description)) {
                                description = description + " " + 
tmpDescription;

                            } else {
                                description = tmpDescription;
                            }
                        }
                    }
                }

                if (line.contains("</a></span><br/><br/>")) {

                    description = description.replaceAll("[\\s]{1,}", " 
").trim();

                    while (description.startsWith("Requirements")) {

                        description = description.replaceFirst("Requirements", 
"").trim();

                        if (description.startsWith(":")) {
                            description = description.substring(1).trim();
                        }
                    }

                    intoDescription = false;
                    break;
                }

                continue;
            }

            reader.close();

            if (validateField(position)) {

                parse = parseResult.get(currentURL);
                metadata = parse.getData().getParseMeta();
                metadata.add(POSITION_KEY, position);

                if (validateField(company)) {
                    metadata.add(COMPANY_KEY, company);

                } else {
                    metadata.add(COMPANY_KEY, "Unknow");
                }

                if (validateField(location)) {
                    metadata.add(LOCATION_KEY, location);

                } else {
                    metadata.add(LOCATION_KEY, "Unknow");
                }

                if (validateField(description)) {
                    metadata.add(DESCRIPTION_KEY, description);

                } else {
                    metadata.add(DESCRIPTION_KEY, "");
                }
            }

        } catch (IOException e) {
            logger.warn("IOException encountered parsing file:", e);
        }

        return parseResult;
    }

   
}


---------------------------------- 
"Anyone proposing to run Windows on servers should be prepared to explain 
what they know about servers that Google, Yahoo, and Amazon don't."
Paul Graham


"A mathematician is a device for turning coffee into theorems."
Paul Erdos (who obviously never met a sysadmin)


>________________________________
>Da: Markus Jelsma <[email protected]>
>A: [email protected]
>Cc: Stefano Cherchi <[email protected]>
>Inviato: Giovedì 30 Giugno 2011 13:29
>Oggetto: Re: Nutch + Hadoop + Solr: custom plugin cause EOFException while 
>indexing
>
>I'm not sure but you could provide your stacktrace. Would at least make it 
>easier.
>
>On Thursday 30 June 2011 13:18:06 Stefano Cherchi wrote:
>> Hi everybody,
>> 
>> I have a 4-nodes nutch+hadoop+solr stack that indexes a bunch of external
>> websites of, say, house sale ads.
>> 
>> Everything worked fine until i used only the default Nutch IndexingFilter
>> but then I needed some customization to enhance the search results
>> quality.
>> 
>> So I developed a set of plugins (one for each site I need to index) that
>> add some custom field to the index (say house price, location, name of the
>> seller and so on) and extract those specific data from the html of the
>> parsed page.
>> 
>> Again, everything has run smoothly until the structure of the parsed pages
>> stood unchanged. Unfortunately some of the sites that I want to index have
>> recently undergone restyling and troubles started for me: now all the
>> crawling, fetching, merging etc seems to complete without errors but when
>> Nutch invokes LinkDb (just before solrindexer) to prepare data to be put
>> into Solr database it returns a lot of EOFException, the indexing job
>> fails and no document is added to Solr even if just one of the plugins
>> fails.
>> 
>> My questions are: where could the problem be and how can I avoid the
>> complete failure of the indexing job? The plugin that parses the modified
>> site should manage to fail "cleanly" without affecting the whole process.
>> 
>> This is the code of the indexing part of the plugin:
>> 
>> 
>> 
>> 
>> package it.company.searchengine.nutch.plugin.indexer.html.company;
>> 
>> import
>> it.company.searchengine.nutch.plugin.parser.html.company.SiteURL1Parser;
>> import org.apache.hadoop.conf.Configuration;
>> import org.apache.hadoop.io.Text;
>> import org.apache.log4j.Logger;
>> import org.apache.nutch.crawl.CrawlDatum;
>> import org.apache.nutch.crawl.Inlinks;
>> import org.apache.nutch.indexer.IndexingException;
>> import org.apache.nutch.indexer.IndexingFilter;
>> import org.apache.nutch.indexer.NutchDocument;
>> import org.apache.nutch.indexer.lucene.LuceneWriter;
>> import org.apache.nutch.indexer.lucene.LuceneWriter.INDEX;
>> import org.apache.nutch.indexer.lucene.LuceneWriter.STORE;
>> import org.apache.nutch.parse.Parse;
>> 
>> public class SiteURL1Indexer implements IndexingFilter {
>> 
>>     private static final Logger LOGGER =
>> Logger.getLogger(SiteURL1Indexer.class); public static final String
>> POSITION_KEY = "position";
>>     public static final String LOCATION_KEY = "location";
>>     public static final String COMPANY_KEY = "company";
>>     public static final String DESCRIPTION_KEY = "description";
>>     private Configuration conf;
>> 
>>     public void addIndexBackendOptions(Configuration conf) {
>>         LuceneWriter.addFieldOptions(POSITION_KEY, STORE.YES,
>> INDEX.TOKENIZED, conf); LuceneWriter.addFieldOptions(LOCATION_KEY,
>> STORE.YES, INDEX.TOKENIZED, conf);
>> LuceneWriter.addFieldOptions(COMPANY_KEY, STORE.YES, INDEX.TOKENIZED,
>> conf); LuceneWriter.addFieldOptions(DESCRIPTION_KEY, STORE.YES,
>> INDEX.TOKENIZED, conf); }
>> 
>>     public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
>> CrawlDatum datum, Inlinks inlinks) throws IndexingException {
>> 
>>         String position = null;
>>         String where = null;
>>         String company = null;
>>         String description = null;
>> 
>>         position = parse.getData().getParseMeta().get(POSITION_KEY);
>>         where = parse.getData().getParseMeta().get(LOCATION_KEY);
>>         company = parse.getData().getParseMeta().get(COMPANY_KEY);
>>         description = parse.getData().getParseMeta().get(DESCRIPTION_KEY);
>> 
>>         if (SiteURL1Parser.validateField(position)
>>                 && SiteURL1Parser.validateField(where)
>>                 && SiteURL1Parser.validateField(company)
>>                 && SiteURL1Parser.validateField(description)) {
>> 
>>             LOGGER.debug("Adding position: [" + position + "] for URL: " +
>> url.toString()); doc.add(POSITION_KEY, position);
>> 
>>             LOGGER.debug("Adding location: [" + position + "] for URL: " +
>> url.toString()); doc.add(LOCATION_KEY, where);
>> 
>>             LOGGER.debug("Adding company: [" + position + "] for URL: " +
>> url.toString()); doc.add(COMPANY_KEY, company);
>> 
>>             LOGGER.debug("Adding description: [" + position + "] for URL: "
>> + url.toString()); doc.add(DESCRIPTION_KEY, description);
>> 
>>             return doc;
>> 
>>         } else {
>>             return doc;
>>         }
>>     }
>> 
>>     public Configuration getConf() {
>>         return this.conf;
>>     }
>> 
>>     public void setConf(Configuration conf) {
>>         this.conf = conf;
>>     }
>> }
>> 
>> 
>> 
>> 
>> I'm running Nutch 1.0. Yes, I know it's an old one but I cannot afford the
>> migration to a newer version at the moment.
>> 
>> 
>> Thanks a lot for any hint.
>> 
>> S
>>  
>> ----------------------------------
>> "Anyone proposing to run Windows on servers should be prepared to explain
>> what they know about servers that Google, Yahoo, and Amazon don't."
>> Paul Graham
>> 
>> 
>> "A mathematician is a device for turning coffee into theorems."
>> Paul Erdos (who obviously never met a sysadmin)
>
>-- 
>Markus Jelsma - CTO - Openindex
>http://www.linkedin.com/in/markus17
>050-8536620 / 06-50258350
>
>
> 

Reply via email to