Hi

One thing to keep in mind: Nutch adds new links extracted from fetched
documents, so if you apply a URL filter at an early stage you will drop the
URLs that lead to new resources, and those resources will never be fetched
in later rounds.
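For example, a crawl-urlfilter.txt that accepts only PDF URLs also blocks the HTML pages whose outlinks would lead to new PDFs, so discovery stops after the seeds (a sketch, assuming the usual regex-urlfilter syntax of `+`/`-` prefixed Java regexes):

```
# Accept PDF URLs only -- but this also rejects the HTML pages
# that link to new PDFs, so those links are never discovered.
+\.pdf$
-.
```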

So you would need to list every resource you want fetched explicitly in the
seed list.

Alternatively, I wrote a plugin that indexes only office documents and skips
HTML.
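A plugin like that hooks into Nutch's indexing-filter extension point; in Nutch 1.x, returning null from IndexingFilter.filter(...) drops the document from the index. The decision itself is just a Content-Type check, sketched below as a self-contained class (the class name and the MIME-type allow-list are my assumptions, not the actual plugin):

```java
import java.util.Arrays;
import java.util.List;

// Sketch of the decision logic an indexing-filter plugin could use to keep
// only "office" documents (PDF, Word, Excel) and skip HTML. In a real Nutch
// plugin this check would sit inside IndexingFilter.filter(...), which
// returns null to skip indexing a document.
public class OfficeDocFilter {

    // Hypothetical allow-list of MIME types to index; extend as needed.
    private static final List<String> ALLOWED = Arrays.asList(
        "application/pdf",
        "application/msword",
        "application/vnd.ms-excel"
    );

    // Returns true if a document with this Content-Type should be indexed.
    public static boolean shouldIndex(String contentType) {
        if (contentType == null) {
            return false; // no type information: skip to be safe
        }
        // Content-Type may carry parameters, e.g. "text/html; charset=utf-8".
        String mime = contentType.split(";")[0].trim().toLowerCase();
        return ALLOWED.contains(mime);
    }
}
```

Inside filter(...), the plugin would read the content type from the parse metadata and return null whenever shouldIndex(...) is false, so HTML never reaches the index.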

Best Regards
Alexander Aristov


On 24 June 2010 21:08, Max Lynch <[email protected]> wrote:

> Hi,
> I would like to crawl a list of pages but only index PDFs.  From what I
> gather I can add an exclusion for all non .pdf extensions in
> crawl-urlfilter.txt.
>
> However, I would also like to apply an additional restriction, that I only
> index pages that match a certain query.  In my head, this doesn't seem to
> be
> a great way of doing things, since the documents won't be optimized for
> searching until they are in nutch's lucene index (AFAIK), but could I do
> either of the following?
>
>   1. Write a plugin to do a naive full text search of each "content" field
>   before it is indexed and stopping the indexing if my term isn't found
>   2. Add a restriction to my solrindex import that only grabs documents
>   matching a certain query
>
> The last one appeals to me in more ways than one, since my nutch index
> isn't
> the official index for the rest of my application.  I could see myself
> applying a quick filter from nutch to solr, only giving solr what I want.
>  But that means I waste time crawling stuff I don't need.
>
> Is there a way to accomplish this, especially option #2?
>
> Thanks!
>
