How to extend Nutch for article crawling

高睿 Fri, 14 Dec 2012 19:48:29 -0800

Hi,

I'm looking for a framework to grab articles, and I found Nutch 2.1. Here is 
my plan, with a question for each step:
1. Add the article list pages to url/seed.txt.
    Here is one problem. What I actually want indexed is the article pages, 
not the article list pages. But if I exclude the list pages, Nutch will do 
nothing, because the list pages are the entry points. So how can I index only 
the article pages and not the list pages?
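
    To make the goal concrete, here is a minimal sketch of the decision I 
would like applied at index time: list pages are still fetched and followed 
for links, but only article pages are indexed. The class name and the URL 
patterns are hypothetical (my real sites will differ), and the sketch uses no 
Nutch API; where such a check should hook into Nutch is part of my question.

```java
import java.util.regex.Pattern;

public class ArticleUrlClassifier {
    // Hypothetical layout: articles live under /article/<numeric id>.html,
    // everything else (e.g. /list/page1.html) is treated as a list page.
    private static final Pattern ARTICLE =
            Pattern.compile("^https?://[^/]+/article/\\d+\\.html$");

    /** Returns true for article pages (index them), false otherwise. */
    public static boolean shouldIndex(String url) {
        return url != null && ARTICLE.matcher(url).matches();
    }

    public static void main(String[] args) {
        // A list page: should be crawled for links but not indexed.
        System.out.println(shouldIndex("http://example.com/list/page1.html"));
        // An article page: should be indexed.
        System.out.println(shouldIndex("http://example.com/article/12345.html"));
    }
}
```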


2. Write a plugin to parse the 'author', 'date', 'article body', 'headline', 
and maybe other information out of the HTML.
    The 'Parser' plugin interface in Nutch 2.1 is:
    Parse getParse(String url, WebPage page)
    And the 'WebPage' class has some predefined attributes:
public class WebPage extends PersistentBase {
  //...
  private Utf8 baseUrl;
  // ...
  private Utf8 title;
  private Utf8 text;
  // ...
  private Map<Utf8,ByteBuffer> metadata;
  // ...
}

    So the only field where I can put my custom attributes is 'metadata'. 
Is it designed for this purpose?
    BTW, the Parser in trunk looks like 'public ParseResult getParse(Content 
content)', which seems more reasonable to me.
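
    As a rough sketch of what I have in mind for 'metadata': each custom 
attribute would be encoded as UTF-8 bytes and stored under its own key. The 
class below is a hypothetical stand-in, not Nutch code; it uses String keys 
instead of Avro's Utf8 so that it runs without any dependencies, and the key 
names are just examples.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

public class ArticleMetadata {
    // Stand-in for WebPage's Map<Utf8,ByteBuffer> metadata field.
    private final Map<String, ByteBuffer> metadata = new HashMap<>();

    /** Stores a string attribute as UTF-8 bytes. */
    public void put(String key, String value) {
        metadata.put(key, ByteBuffer.wrap(value.getBytes(StandardCharsets.UTF_8)));
    }

    /** Reads an attribute back as a string, or null if it is absent. */
    public String get(String key) {
        ByteBuffer buf = metadata.get(key);
        if (buf == null) {
            return null;
        }
        // duplicate() so decoding does not move the stored buffer's position.
        return StandardCharsets.UTF_8.decode(buf.duplicate()).toString();
    }

    public static void main(String[] args) {
        ArticleMetadata meta = new ArticleMetadata();
        meta.put("author", "Rui");
        meta.put("date", "2012-12-14");
        System.out.println(meta.get("author") + " / " + meta.get("date"));
    }
}
```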

3. After the articles are indexed into Solr, another application can query 
them by 'date' and then store the article information in MySQL.
    My question here is: can Nutch store the articles directly in MySQL? Or 
can I write a plugin to customize the indexing behavior?

Is Nutch a good choice for my purpose? If not, can you suggest another 
good-quality framework or library?
Thanks for your help.

Regards,
Rui
