Re:Re: How to extend Nutch for article crawling

高睿 Tue, 18 Dec 2012 04:28:06 -0800

Hi,

Thank you very much for your comments. My replies are inline.



At 2012-12-17 22:04:48,"Julien Nioche" <[email protected]> wrote:
>Hi
>
>See comments below
>
>
>> 1. Add article list pages into url/seed.txt
>>     Here's one problem. What I actually want to be indexed is the article
>> pages, not the article list pages. But, if I don't allow the list page to
>> be indexed, Nutch will do nothing because the list page is the entrance.
>> So, how can I index only the article page without list pages?
>>
>
>I think that the indexer can now filter URLs but can't remember whether it
>is for 1.x only or is in 2.x as well. Anyone?
>This would work if you can find a regular expression that captures the list
>pages. Another approach would be to tweak the indexer so that it skips
>documents containing an arbitrary metadatum (e.g. skip.indexing); this
>metadatum would be set in a custom parser when processing the list pages.
>
>I think this would be a useful feature to have anyway. URL filters use the
>URL string only and having the option to skip based on metadata would be
>good IMHO
>
>
>>> The callback method in the IndexingFilter takes a 'url' parameter and
>>> returns a NutchDocument, so it is hard to customize it to do this.
>>> It would be better to add a 'skip' ability to the IndexingFilter, based on
>>> URL or metadata.
        @Override
        public NutchDocument filter(NutchDocument doc, String url, WebPage page)
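
To make the URL-based option concrete, here is a minimal sketch of the kind of
check such a filter could run. It is plain Java with no Nutch dependency; the
class name ListPageMatcher, the method shouldSkipIndexing and the example.com
URL layout are assumptions for illustration, not part of Nutch. Inside a real
IndexingFilter one would call it from filter(doc, url, page) and, assuming the
filter chain treats a null return as "discard this document" (worth verifying
for the Nutch version in use), return null for list pages.

```java
import java.util.regex.Pattern;

// Hypothetical helper (not a Nutch class): decides from the URL alone
// whether a page is an article list page that should not be indexed.
final class ListPageMatcher {
    private final Pattern listPagePattern;

    // listPageRegex must match the whole URL of a list page.
    ListPageMatcher(String listPageRegex) {
        this.listPagePattern = Pattern.compile(listPageRegex);
    }

    // Returns true when the URL looks like a list page; null is never skipped here.
    boolean shouldSkipIndexing(String url) {
        return url != null && listPagePattern.matcher(url).matches();
    }

    public static void main(String[] args) {
        // Assumed URL layout, for illustration only.
        ListPageMatcher matcher =
            new ListPageMatcher("^https?://[^/]+/news/list(/\\d+)?/?$");
        System.out.println(
            matcher.shouldSkipIndexing("http://example.com/news/list/3"));
        System.out.println(
            matcher.shouldSkipIndexing("http://example.com/news/2012/12/18/some-article.html"));
    }
}
```

With this approach the list pages stay in the seed list and are still fetched
and parsed for outlinks; only the indexing step would drop them.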

>>
>> 2. Write a plugin to parse out the 'author', 'date', 'article body',
>> 'headline' and maybe other information from html.
>>     The 'Parser' plugin interface in Nutch 2.1 is:
>>     Parse getParse(String url, WebPage page)
>>     And the 'WebPage' class has some predefined attributes:
>> public class WebPage extends PersistentBase {
>>   //...
>>   private Utf8 baseUrl;
>>   // ...
>>   private Utf8 title;
>>   private Utf8 text;
>>   // ...
>>   private Map<Utf8,ByteBuffer> metadata;
>>   // ...
>> }
>>
>>     So, the only field I can put my specified attributes in is the
>> 'metadata'. Is it designed for this purpose?
>>     BTW, the Parser in trunk looks like: 'public ParseResult
>> getParse(Content content)', and seems more reasonable for me.
>>
>
>The extension point Parser is for low level parsing, i.e. extracting text and
>metadata from binary formats, which is typically done by parse-tika. What
>you want to implement is an extension of ParseFilter and add your own
>entries to the parse metadata. The creative commons plugin should be a good
>example to get started
>
>
>>> Very good point. The manual I have read does cover this part. Currently, I
>>> have my own customized Parser to parse the HTML. My parser first delegates
>>> the parse request to the existing 'HtmlParser' plugin implementation, then
>>> extracts the detailed information. That is indeed inefficient.
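
As a rough illustration of the extraction step, the sketch below pulls
'author', 'date' and 'headline' out of an HTML string and stores them as UTF-8
bytes in a map shaped like the WebPage 'metadata' field. Assumptions: the
markup (meta tags and an h1), the 'article.*' key names and the class
ArticleFieldExtractor are hypothetical; String keys stand in for Utf8 so the
code has no Avro dependency; and regular expressions are used only to keep it
short. A real ParseFilter would more likely work on the parsed document than
on raw text.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not a Nutch class).
final class ArticleFieldExtractor {
    // Assumed markup: <meta name="author" content="...">; real sites will differ.
    private static final Pattern META = Pattern.compile(
        "<meta\\s+name=\"(author|date)\"\\s+content=\"([^\"]*)\"",
        Pattern.CASE_INSENSITIVE);
    // Assumed: the headline is the first <h1> element.
    private static final Pattern H1 = Pattern.compile(
        "<h1[^>]*>(.*?)</h1>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Extracts the article fields into a metadata-style map of UTF-8 bytes.
    static Map<String, ByteBuffer> extract(String html) {
        Map<String, ByteBuffer> metadata = new LinkedHashMap<>();
        Matcher meta = META.matcher(html);
        while (meta.find()) {
            String key = "article." + meta.group(1).toLowerCase(Locale.ROOT);
            put(metadata, key, meta.group(2));
        }
        Matcher h1 = H1.matcher(html);
        if (h1.find()) {
            put(metadata, "article.headline", h1.group(1).trim());
        }
        return metadata;
    }

    private static void put(Map<String, ByteBuffer> metadata, String key, String value) {
        metadata.put(key, ByteBuffer.wrap(value.getBytes(StandardCharsets.UTF_8)));
    }

    // Decodes a stored value back to a String, or returns null if the key is absent.
    static String get(Map<String, ByteBuffer> metadata, String key) {
        ByteBuffer buf = metadata.get(key);
        // duplicate() leaves the stored buffer's position untouched.
        return buf == null ? null : StandardCharsets.UTF_8.decode(buf.duplicate()).toString();
    }
}
```

An indexing step could later read these entries back with get(...) and copy
them into the fields of the indexed document.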
>>
>>
>>
>> 3. After the articles are indexed into Solr, another application can query 
>> it by 'date' then store the article information into Mysql.
>>     My question here is: can Nutch store the article directly into Mysql?
>> Or can I write a plugin to specify the index behavior?
>>
>
>you could use the mysql backend in GORA (but it is broken AFAIK) and get
>the other application to use it, alternatively you could write a custom
>indexer that sends directly into MySQL but that would be a bit redundant.
>Do you need to use SOLR at all or is the aim simply to store in MySQL?
>
>
>
>>> Good suggestion. It is not decided yet whether to depend on SOLR or not.
>>> SOLR is an amazing tool for indexing; however, I'm not quite sure whether
>>> it is good to store the 'content' inside it. By default, the 'content' is
>>> configured only to be indexed but not stored. What do you think?
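
For the 'query by date' step, here is a small sketch of building the Solr range
query for one day. The field name 'date' comes from the description above; the
class SolrDateRangeQuery and the method forDay are hypothetical, and the sketch
assumes 'date' is indexed as a Solr date field holding UTC timestamps.

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneOffset;

// Hypothetical helper (not part of Solr or Nutch).
final class SolrDateRangeQuery {
    // Builds "<field>:[<start of day> TO <start of next day>]" in UTC.
    // Both ends of a [ ... ] range are inclusive in Solr.
    static String forDay(String field, LocalDate day) {
        Instant from = day.atStartOfDay(ZoneOffset.UTC).toInstant();
        Instant to = day.plusDays(1).atStartOfDay(ZoneOffset.UTC).toInstant();
        // Instant.toString() yields ISO-8601 in UTC, e.g. 2012-12-17T00:00:00Z
        return field + ":[" + from + " TO " + to + "]";
    }

    public static void main(String[] args) {
        System.out.println(forDay("date", LocalDate.of(2012, 12, 17)));
        // prints date:[2012-12-17T00:00:00Z TO 2012-12-18T00:00:00Z]
    }
}
```

The returned string would go, URL-encoded, into the q parameter of a select
request. Solr returns only stored fields in its results, so the article body
could be read back this way only if 'content' is stored as well as indexed.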
>>
>>
>> Is Nutch a good choice for my purpose? If not, do you guys suggest another
>> good quality framework/library for me?
>>
>
>You can definitely do that with Nutch. There are certainly other resources
>that could be used but they might also need a bit of customisation anyway
>
>HTH
>
>Julien
>
>
>-- 
>Open Source Solutions for Text Engineering
>
>http://digitalpebble.blogspot.com/
>http://www.digitalpebble.com
>http://twitter.com/digitalpebble
