On Thu, Jun 24, 2010 at 3:21 PM, Alexander Aristov < [email protected]> wrote:
> Hi
>
> When you are thinking about it you should also consider that nutch adds new
> links from fetched documents and so if you want to apply filter on early
> stage you wouldn't get urls which lead to new resources and they won't be
> fetched on the next stages.
>
> So you would want to specify all fetch resources explicitly in the seed
> list.
>
> But I wrote a plugin to filter out only office documents and skip html from
> being indexed.
>
> Best Regards
> Alexander Aristov

Interesting. Thanks Alexander. Do you know if there is a way to filter the
docs when I do $ nutch solrindex etc?

-Max

