Try http://scrapy.org/
On Sat, Dec 15, 2012 at 9:17 AM, 高睿 <[email protected]> wrote:
> Hi,
>
> I'm looking for a framework to grab articles, and I found Nutch 2.1. Here
> are my plan and my questions about each step:
> 1. Add article list pages into url/seed.txt
> Here's one problem: what I actually want indexed are the article
> pages, not the article list pages. But if I don't allow the list pages to
> be indexed, Nutch does nothing, because the list page is the entry point.
> So, how can I index only the article pages and not the list pages?
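
One way to think about question 1 is to separate "which URLs may be fetched" from "which URLs may be indexed". The sketch below is plain Java, not Nutch API code: the class name UrlRoles and both URL patterns are made up for illustration, and the example.com layout is an assumption about the target site. Where the two checks would be wired into Nutch 2.1 (for instance a URL filter for the first and an indexing filter for the second) should be checked against the plugin documentation.

```java
import java.util.regex.Pattern;

// Hypothetical helper: list pages are fetched so their links can be followed,
// but only article pages are allowed into the index.
class UrlRoles {
    // Assumed URL layouts; adjust both patterns to the real site.
    static final Pattern LIST_PAGE =
        Pattern.compile("^https?://example\\.com/articles(\\?page=\\d+)?$");
    static final Pattern ARTICLE_PAGE =
        Pattern.compile("^https?://example\\.com/articles/\\d+\\.html$");

    // The crawler may fetch both kinds of page.
    static boolean shouldFetch(String url) {
        return LIST_PAGE.matcher(url).matches()
            || ARTICLE_PAGE.matcher(url).matches();
    }

    // Only article pages should reach the index.
    static boolean shouldIndex(String url) {
        return ARTICLE_PAGE.matcher(url).matches();
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://example.com/articles?page=2",
            "http://example.com/articles/42.html",
            "http://example.com/about"
        };
        for (String url : urls) {
            System.out.println(url + " fetch=" + shouldFetch(url)
                + " index=" + shouldIndex(url));
        }
    }
}
```

With this split, the seed list can still contain the list pages, since being fetched and being indexed are decided independently.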
>
> 2. Write a plugin to parse out the 'author', 'date', 'article body',
> 'headline' and maybe other information from html.
> The 'Parser' plugin interface in Nutch 2.1 is:
> Parse getParse(String url, WebPage page)
> And the 'WebPage' class has some predefined attributes:
> public class WebPage extends PersistentBase {
> //...
> private Utf8 baseUrl;
> // ...
> private Utf8 title;
> private Utf8 text;
> // ...
> private Map<Utf8,ByteBuffer> metadata;
> // ...
> }
>
> So, the only field I can put my custom attributes in is 'metadata'.
> Is it designed for this purpose?
> BTW, the Parser in trunk looks like: 'public ParseResult
> getParse(Content content)', which seems more reasonable to me.
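
On question 2: since the metadata values are ByteBuffers, custom string fields such as 'author' or 'date' would need to be encoded to bytes and decoded again when read back. The sketch below shows only that round trip in plain Java. It uses Map<String, ByteBuffer> as a stand-in for the Map<Utf8, ByteBuffer> in WebPage so that it runs without Avro on the classpath; the class name MetadataSketch, the helper names, and the sample values are invented for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper showing a UTF-8 round trip through ByteBuffer values.
class MetadataSketch {
    // Encode a string as UTF-8 bytes wrapped in a ByteBuffer.
    static ByteBuffer encode(String value) {
        return ByteBuffer.wrap(value.getBytes(StandardCharsets.UTF_8));
    }

    // Decode without disturbing the stored buffer's position:
    // duplicate() shares the bytes but has its own position and limit.
    static String decode(ByteBuffer buffer) {
        return StandardCharsets.UTF_8.decode(buffer.duplicate()).toString();
    }

    public static void main(String[] args) {
        // Stand-in for WebPage's Map<Utf8, ByteBuffer> metadata field.
        Map<String, ByteBuffer> metadata = new HashMap<>();
        metadata.put("author", encode("Rui"));
        metadata.put("date", encode("2012-12-15"));

        System.out.println(decode(metadata.get("author")));
        System.out.println(decode(metadata.get("date")));
    }
}
```

Decoding from a duplicate means the same entry can be read more than once without rewinding the buffer.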
>
> 3. After the articles are indexed into Solr, another application can query
> them by 'date' and then store the article information in MySQL.
> My question here is: can Nutch store the articles directly in MySQL?
> Or can I write a plugin to customize the indexing behavior?
>
> Is Nutch a good choice for my purpose? If not, do you guys suggest another
> good quality framework/library for me?
> Thanks for your help.
>
> Regards,
> Rui
>
--
~Nit
http://about.me/nitinhardeniya