On Thu, Jun 24, 2010 at 3:21 PM, Alexander Aristov <
[email protected]> wrote:

> Hi
>
> Keep in mind that Nutch adds new links from the documents it fetches. If you
> apply the filter at an early stage, you will lose the URLs that lead to new
> resources, and those resources won't be fetched in later stages.
>
> So you would have to list every resource you want fetched explicitly in the
> seed list.
>
> Instead, I wrote a plugin that keeps only office documents and skips HTML
> at indexing time.
>
> Best Regards
> Alexander Aristov


Interesting.  Thanks Alexander.

Do you know if there is a way to filter the documents when I run
$ nutch solrindex etc?

-Max
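
For illustration, the kind of check Alexander's plugin performs (keep office documents, skip HTML at indexing time) might look roughly like the sketch below. This is not his code: the class name, the method name, and the list of MIME types are all hypothetical, and the wiring into Nutch's indexing-filter extension point is not shown.

```java
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper: decides whether a document should reach the index. */
public class OfficeDocFilter {

    // Assumed set of "office" MIME types; adjust to what the crawl actually contains.
    private static final Set<String> OFFICE_TYPES = Set.of(
        "application/pdf",
        "application/msword",
        "application/vnd.ms-excel",
        "application/vnd.ms-powerpoint");

    /** Returns true if the content type should be indexed, false to skip it. */
    public static boolean shouldIndex(String contentType) {
        if (contentType == null) {
            return false; // unknown type: skip rather than guess
        }
        // Drop parameters such as "; charset=UTF-8" and normalize case.
        String base = contentType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        return OFFICE_TYPES.contains(base);
    }

    public static void main(String[] args) {
        System.out.println(shouldIndex("application/msword"));       // true
        System.out.println(shouldIndex("text/html; charset=UTF-8")); // false
    }
}
```

Because the check runs only when a document is about to be indexed, HTML pages are still fetched and parsed, so their outlinks keep feeding the crawl; they are just left out of the index.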
