Hi,
I'm looking for a framework to crawl articles, and I found Nutch 2.1. Here's
my plan, with my questions at each step:
1. Add the article list pages to url/seed.txt
There's one problem here. What I actually want indexed is the article pages,
not the article list pages. But if I filter out the list pages, Nutch will do
nothing, because the list pages are the entry points of the crawl. So how can
I index only the article pages and leave the list pages out of the index?
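To make question 1 concrete, here is a rough sketch of the split I have in
mind: list pages are fetched so their links can be followed, but only article
pages are indexed. This is plain Java, not Nutch API. The class name, method
names, and URL patterns are all made up for illustration, and I'm only assuming
that Nutch has some hook where a check like shouldIndex() could live.

```java
import java.util.regex.Pattern;

public class ArticleUrlClassifier {
    // Hypothetical URL shapes; a real site would need its own patterns.
    private static final Pattern LIST_PAGE =
        Pattern.compile("^https?://[^/]+/articles/list(\\?page=\\d+)?$");
    private static final Pattern ARTICLE_PAGE =
        Pattern.compile("^https?://[^/]+/articles/\\d+\\.html$");

    // Both kinds of page must be fetched, or the crawl never reaches the articles.
    public static boolean shouldFetch(String url) {
        return LIST_PAGE.matcher(url).matches() || ARTICLE_PAGE.matcher(url).matches();
    }

    // Only article pages should end up in the index.
    public static boolean shouldIndex(String url) {
        return ARTICLE_PAGE.matcher(url).matches();
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://example.com/articles/list?page=2",
            "http://example.com/articles/123.html",
            "http://example.com/about"
        };
        for (String url : urls) {
            System.out.println(url + " fetch=" + shouldFetch(url)
                + " index=" + shouldIndex(url));
        }
    }
}
```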
2. Write a plugin to parse the 'author', 'date', 'article body', 'headline',
and maybe other information out of the HTML.
The 'Parser' plugin interface in Nutch 2.1 is:
Parse getParse(String url, WebPage page)
And the 'WebPage' class has some predefined attributes:
public class WebPage extends PersistentBase {
    // ...
    private Utf8 baseUrl;
    // ...
    private Utf8 title;
    private Utf8 text;
    // ...
    private Map<Utf8,ByteBuffer> metadata;
    // ...
}
So the only field where I can put my custom attributes is 'metadata'.
Is it designed for this purpose?
BTW, the Parser in trunk looks like 'public ParseResult getParse(Content
content)', which seems more reasonable to me.
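For question 2, this is roughly how I would expect to pack string attributes
into a map shaped like 'metadata'. It is only a sketch using the Java standard
library: the keys are plain String instead of Avro's Utf8 so that it runs on
its own, and the class name, helper names, and key names ("author", "date")
are my own inventions, not anything defined by Nutch.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

public class MetadataSketch {
    // Store a string value as UTF-8 bytes wrapped in a ByteBuffer.
    public static void put(Map<String, ByteBuffer> metadata, String key, String value) {
        metadata.put(key, ByteBuffer.wrap(value.getBytes(StandardCharsets.UTF_8)));
    }

    // Read a value back; returns null if the key is absent.
    public static String get(Map<String, ByteBuffer> metadata, String key) {
        ByteBuffer buf = metadata.get(key);
        if (buf == null) {
            return null;
        }
        // duplicate() so that decoding does not move the stored buffer's position
        return StandardCharsets.UTF_8.decode(buf.duplicate()).toString();
    }

    public static void main(String[] args) {
        // Stand-in for WebPage.metadata (String keys here are an assumption).
        Map<String, ByteBuffer> metadata = new HashMap<String, ByteBuffer>();
        put(metadata, "author", "Rui");
        put(metadata, "date", "2013-01-15");
        System.out.println(get(metadata, "author") + " / " + get(metadata, "date"));
    }
}
```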
3. After the articles are indexed into Solr, another application can query
them by 'date' and then store the article information in MySQL.
My question here is: can Nutch store the articles directly in MySQL? Or can I
write a plugin to customize the indexing behavior?
Is Nutch a good choice for my purpose? If not, can you suggest another good
framework or library?
Thanks for your help.
Regards,
Rui