Try http://scrapy.org/

On Sat, Dec 15, 2012 at 9:17 AM, 高睿 <[email protected]> wrote:

> Hi,
>
> I'm looking for a framework to grab articles, and I found Nutch 2.1. Here's
> my plan, with a question for each step:
> 1. Add article list pages into url/seed.txt
>     Here's one problem. What I actually want indexed is the article
> pages, not the article list pages. But if I don't allow the list pages to
> be indexed, Nutch will do nothing, because the list page is the entrance.
> So, how can I index only the article pages and not the list pages?
>
> 2. Write a plugin to parse out the 'author', 'date', 'article body',
> 'headline' and maybe other information from html.
>     The 'Parser' plugin interface in Nutch 2.1 is:
>     Parse getParse(String url, WebPage page)
>     And the 'WebPage' class has some predefined attributes:
> public class WebPage extends PersistentBase {
>   //...
>   private Utf8 baseUrl;
>   // ...
>   private Utf8 title;
>   private Utf8 text;
>   // ...
>   private Map<Utf8,ByteBuffer> metadata;
>   // ...
> }
>
>     So, the only field I can put my custom attributes in is
> 'metadata'. Is it designed for this purpose?
>     BTW, the Parser in trunk looks like: 'public ParseResult
> getParse(Content content)', which seems more reasonable to me.
>
> 3. After the articles are indexed into Solr, another application can query
> them by 'date' and then store the article information in MySQL.
>     My question here is: can Nutch store the articles directly in MySQL?
> Or can I write a plugin to customize the indexing behavior?
>
> Is Nutch a good choice for my purpose? If not, can you suggest another
> good-quality framework/library for me?
> Thanks for your help.
>
> Regards,
> Rui
>



-- 
~Nit
http://about.me/nitinhardeniya




