Re: How to extend Nutch for article crawling

Julien Nioche Mon, 17 Dec 2012 06:05:30 -0800

Hi

See comments below



> 1. Add article list pages into url/seed.txt
>     Here's one problem. What I actually want to be indexed is the article
> pages, not the article list pages. But, if I don't allow the list page to
> be indexed, Nutch will do nothing because the list page is the entrance.
> So, how can I index only the article page without list pages?
>

I think the indexer can now filter URLs, but I can't remember whether that
is in 1.x only or in 2.x as well. Anyone?
This would work if you can find a regular expression that captures the list
pages. Another approach would be to tweak the indexer so that it skips
documents containing an arbitrary metadatum (e.g. skip.indexing); this
metadatum would be set in a custom parser when processing the list pages.

I think this would be a useful feature to have anyway. URL filters use the
URL string only, and having the option to skip based on metadata would be
good IMHO.
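
To make the two ideas concrete, here is a minimal, Nutch-independent sketch.
It is not Nutch code: the class name, the URL pattern and the use of a plain
Map for metadata are all assumptions for illustration. Only the key
skip.indexing comes from the suggestion above. The point is that list pages
are still fetched and parsed (so their outlinks are followed) and only the
indexing step drops them.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Illustrative only: none of these names exist in Nutch.
class IndexSkipSketch {

    // Hypothetical URL layout: list pages look like /news/list.html or /news/index_3.html
    static final Pattern LIST_PAGE =
        Pattern.compile("^https?://[^/]+/news/(list|index)(_\\d+)?\\.html$");

    // Metadata key suggested in this thread
    static final String SKIP_KEY = "skip.indexing";

    /** Parse side: flag list pages so that the indexing step can drop them later. */
    static void markIfListPage(String url, Map<String, String> metadata) {
        if (LIST_PAGE.matcher(url).matches()) {
            metadata.put(SKIP_KEY, "true");
        }
    }

    /** Index side: returns true when the document should be sent to the index. */
    static boolean shouldIndex(Map<String, String> metadata) {
        return !"true".equals(metadata.get(SKIP_KEY));
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://example.com/news/index_2.html",
            "http://example.com/news/2012/12/17/some-article.html"
        };
        for (String url : urls) {
            Map<String, String> metadata = new HashMap<>();
            markIfListPage(url, metadata);
            System.out.println(url + " -> index? " + shouldIndex(metadata));
        }
    }
}
```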


>
> 2. Write a plugin to parse out the 'author', 'date', 'article body',
> 'headline' and maybe other information from html.
>     The 'Parser' plugin interface in Nutch 2.1 is:
>     Parse getParse(String url, WebPage page)
>     And the 'WebPage' class has some predefined attributes:
> public class WebPage extends PersistentBase {
>   //...
>   private Utf8 baseUrl;
>   // ...
>   private Utf8 title;
>   private Utf8 text;
>   // ...
>   private Map<Utf8,ByteBuffer> metadata;
>   // ...
> }
>
>     So, the only field I can put my specified attributes in is the
> 'metadata'. Is it designed for this purpose?
>     BTW, the Parser in trunk looks like: 'public ParseResult
> getParse(Content content)', and seems more reasonable for me.
>

The Parser extension point is for low-level parsing, i.e. extracting text and
metadata from binary formats, which is typically done by parse-tika. What
you want to implement is a ParseFilter extension that adds your own
entries to the parse metadata. The creative commons plugin should be a good
example to get started.
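
As a rough illustration of the kind of logic such a filter would contain,
the sketch below walks a DOM and collects the author, date and headline
into a map. It deliberately does not reproduce the Nutch ParseFilter
interface; the class name, the meta tag names and the choice of the first
h1 as the headline are assumptions. It uses only the JDK XML parser, so the
sample input has to be well-formed XHTML. In a real filter, as far as I
know, you would be handed an already-parsed DOM and would copy the
resulting entries into the parse metadata.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Illustrative only: this class is not part of Nutch.
class ArticleFieldExtractor {

    /** Collects <meta name="author|date"> values and the text of the first <h1>. */
    static Map<String, String> extract(Node root) {
        Map<String, String> fields = new LinkedHashMap<>();
        collect(root, fields);
        return fields;
    }

    private static void collect(Node node, Map<String, String> fields) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            Element el = (Element) node;
            String tag = el.getTagName().toLowerCase(Locale.ROOT);
            if (tag.equals("meta")) {
                String name = el.getAttribute("name");
                if (name.equals("author") || name.equals("date")) {
                    fields.putIfAbsent(name, el.getAttribute("content"));
                }
            } else if (tag.equals("h1")) {
                fields.putIfAbsent("headline", el.getTextContent().trim());
            }
        }
        // Recurse into children (also covers the document node itself)
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            collect(children.item(i), fields);
        }
    }

    /** Helper for the demo: parses well-formed XHTML with the JDK parser. */
    static Document parseXhtml(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XHTML", e);
        }
    }

    public static void main(String[] args) {
        String page =
            "<html><head>"
            + "<meta name=\"author\" content=\"Jane Doe\"/>"
            + "<meta name=\"date\" content=\"2012-12-17\"/>"
            + "</head><body><h1>Sample headline</h1><p>Body text.</p></body></html>";
        System.out.println(extract(parseXhtml(page)));
    }
}
```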


>
> 3. After the articles are indexed into Solr, another application can query
> it by 'date' then store the article information into Mysql.
>     My question here is: can Nutch store the article directly into Mysql?
> Or can I write a plugin to specify the index behavior?
>

You could use the MySQL backend in GORA (but it is broken AFAIK) and get
the other application to use it. Alternatively, you could write a custom
indexer that sends directly into MySQL, but that would be a bit redundant.
Do you need to use Solr at all, or is the aim simply to store in MySQL?
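
If you go down the custom indexer route, the MySQL side is plain JDBC. The
sketch below shows only that write step and is not tied to any Nutch
indexer interface. The table name, column names and class name are
assumptions, and running it needs a MySQL JDBC driver on the classpath plus
a reachable database.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.Timestamp;

// Illustrative only: hypothetical table "articles" with these five columns.
class MySqlArticleWriter {

    static final String INSERT_SQL =
        "INSERT INTO articles (url, headline, author, published, body) "
        + "VALUES (?, ?, ?, ?, ?)";

    /** Writes one article row; the caller owns the connection. */
    static void write(Connection conn, String url, String headline, String author,
                      Timestamp published, String body) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(INSERT_SQL)) {
            ps.setString(1, url);
            ps.setString(2, headline);
            ps.setString(3, author);
            ps.setTimestamp(4, published);
            ps.setString(5, body);
            ps.executeUpdate();
        }
    }

    public static void main(String[] args) throws SQLException {
        if (args.length < 3) {
            System.err.println("usage: MySqlArticleWriter <jdbc-url> <user> <password>");
            return;
        }
        // Example JDBC URL form: jdbc:mysql://localhost:3306/crawl (placeholder values)
        try (Connection conn = DriverManager.getConnection(args[0], args[1], args[2])) {
            write(conn,
                  "http://example.com/news/2012/12/17/some-article.html",
                  "Sample headline",
                  "Jane Doe",
                  Timestamp.valueOf("2012-12-17 06:05:30"),
                  "Body text.");
        }
    }
}
```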


>
> Is Nutch a good choice for my purpose? If not, do you guys suggest another
> good quality framework/library for me?
>

You can definitely do that with Nutch. There are certainly other resources
that could be used, but they might also need a bit of customisation anyway.

HTH

Julien


-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
