Hi,

Thank you very much for your comments. My replies are inline.
At 2012-12-17 22:04:48, "Julien Nioche" <[email protected]> wrote:
>Hi
>
>See comments below
>
>
>> 1. Add article list pages into url/seed.txt
>> Here's one problem. What I actually want to be indexed is the article
>> pages, not the article list pages. But, if I don't allow the list page to
>> be indexed, Nutch will do nothing because the list page is the entrance.
>> So, how can I index only the article page without list pages?
>>
>
>I think that the indexer can now filter URLs but can't remember whether it
>is for 1.x only or is in 2.x as well. Anyone?
>This would work if you can find a regular expression that captures the list
>pages. Another approach would be to tweak the indexer so that it skips
>documents containing an arbitrary metadatum (e.g. skip.indexing), this
>metadata would be set in a custom parser when processing the list pages.
>
>I think this would be a useful feature to have anyway. URL filters use the
>URL string only and having the option to skip based on metadata would be
>good IMHO
>
>
>>> The callback method in the IndexingFilter has a 'URL' parameter and returns
>>> a NutchDocument, so it is hard to customize it to do this.
>>> So, it would be better to add a 'skip' ability to the IndexingFilter, based
>>> on URL or metadata.
    @Override
    public NutchDocument filter(NutchDocument doc, String url, WebPage page)
>>
>> 2. Write a plugin to parse out the 'author', 'date', 'article body',
>> 'headline' and maybe other information from html.
>> The 'Parser' plugin interface in Nutch 2.1 is:
>> Parse getParse(String url, WebPage page)
>> And the 'WebPage' class has some predefined attributes:
>> public class WebPage extends PersistentBase {
>>   //...
>>   private Utf8 baseUrl;
>>   // ...
>>   private Utf8 title;
>>   private Utf8 text;
>>   // ...
>>   private Map<Utf8,ByteBuffer> metadata;
>>   // ...
>> }
>>
>> So, the only field I can put my specified attributes in is the
>> 'metadata'. Is it designed for this purpose?
>> BTW, the Parser in trunk looks like: 'public ParseResult
>> getParse(Content content)', and seems more reasonable for me.
>>
>
>The extension point Parser is for low level parsing i.e extract text and
>metadata from binary formats, which is done typically by parse-tika. What
>you want to implement is an extension of ParseFilter and add your own
>entries to the parse metadata. The creative commons plugin should be a good
>example to get started
>
>
>>> Very good point. The manual I have read does cover this part. Currently, I
>>> have my customized Parser to parse the HTML. My parser first delegates the
>>> parse request to the existing 'HtmlParser' plugin implementation, then
>>> extracts the detailed information. That is indeed inefficient.
>>
>>
>>
>> 3. After the articles are indexed into Solr, another application can query
>> it by 'date' then store the article information into Mysql.
>> My question here is: can Nutch store the article directly into Mysql?
>> Or can I write a plugin to specify the index behavior?
>>
>
>you could use the mysql backend in GORA (but it is broken AFAIK) and get
>the other application to use it, alternatively you could write a custom
>indexer that sends directly into MySQL but that would be a bit redundant.
>Do you need to use SOLR at all or is the aim simply to store in MySQL?
>
>
>
>>> Good suggestion. It is not decided yet whether to depend on SOLR. SOLR is
>>> an amazing tool for indexing, however I'm not quite sure whether it is good
>>> to store the 'content' inside it. By default, the 'content' is configured
>>> only to be indexed but not stored. What do you think?
>>
>>
>> Is Nutch a good choice for my purpose? If not, do you guys suggest another
>> good quality framework/library for me?
>>
>
>You can definitely do that with Nutch.
>There are certainly other resources
>that could be used but they might also need a bit of customisation anyway
>
>HTH
>
>Julien
>
>
>--
>Open Source Solutions for Text Engineering
>
>http://digitalpebble.blogspot.com/
>http://www.digitalpebble.com
>http://twitter.com/digitalpebble
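
To make the skip-by-metadata idea from point 1 concrete, here is a minimal sketch in plain Java. It deliberately does not use the Nutch classes: SkipIndexingSketch, markIfListPage, shouldIndex, the LIST_PAGE pattern and the example.com URLs are all made up for illustration, and only the skip.indexing key comes from the suggestion quoted above. In a real plugin the first step would presumably live in a ParseFilter and the second somewhere in the indexing step, but that wiring is an assumption and not something I have tested.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;

public class SkipIndexingSketch {
    // Marker key taken from the "skip.indexing" example in the thread.
    static final String SKIP_KEY = "skip.indexing";

    // Assumed URL shape for list pages; a real site needs its own pattern.
    static final Pattern LIST_PAGE =
        Pattern.compile("^https?://[^/]+/articles/list(\\?.*)?$");

    /** Parse step: flag list pages so that a later step can recognise them. */
    static void markIfListPage(String url, Map<String, ByteBuffer> metadata) {
        if (LIST_PAGE.matcher(url).matches()) {
            metadata.put(SKIP_KEY,
                ByteBuffer.wrap("true".getBytes(StandardCharsets.UTF_8)));
        }
    }

    /** Index step: true only when the page carries no skip marker. */
    static boolean shouldIndex(Map<String, ByteBuffer> metadata) {
        return !metadata.containsKey(SKIP_KEY);
    }

    public static void main(String[] args) {
        Map<String, ByteBuffer> listPage = new HashMap<>();
        Map<String, ByteBuffer> articlePage = new HashMap<>();

        markIfListPage("http://example.com/articles/list?page=2", listPage);
        markIfListPage("http://example.com/articles/1234.html", articlePage);

        System.out.println("index list page?    " + shouldIndex(listPage));
        System.out.println("index article page? " + shouldIndex(articlePage));
    }
}
```

The list page is still fetched and parsed, so its outlinks to the articles are followed; only the indexing decision changes. The map type mirrors the Map<Utf8,ByteBuffer> metadata field quoted in point 2, with String keys used here to keep the sketch free of dependencies.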

