Hi, I would like to crawl a list of pages but only index PDFs. From what I gather, I can add an exclusion for all non-.pdf extensions in crawl-urlfilter.txt.
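For context, here is roughly what I have in mind for crawl-urlfilter.txt, assuming the default regex-based filter syntax (one pattern per line, `+` to accept, `-` to reject, first match wins); the exact suffix list is just an illustration:

```
# skip URLs with common non-PDF suffixes (illustrative list, adjust as needed)
-\.(gif|jpg|png|css|js|html?|xml)$
# accept URLs ending in .pdf
+\.pdf$
# reject everything else
-.
```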
However, I would also like to apply an additional restriction: I only want to index pages that match a certain query. This doesn't seem like a great way of doing things, since the documents won't be optimized for searching until they are in Nutch's Lucene index (AFAIK), but could I do either of the following?

1. Write a plugin that does a naive full-text search of each "content" field before it is indexed, and stops the indexing if my term isn't found.
2. Add a restriction to my solrindex import so that it only grabs documents matching a certain query.

The second option appeals to me in more ways than one, since my Nutch index isn't the official index for the rest of my application. I could see myself applying a quick filter from Nutch to Solr, passing Solr only what I want. But that means I waste time crawling content I don't need. Is there a way to accomplish this, especially option #2? Thanks!
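To make option 1 concrete, the core check could be as simple as the sketch below. In an actual Nutch plugin this logic would live inside an indexing filter (returning null to drop the document); the class and method names here are hypothetical, just to show the naive full-text match I mean:

```java
// Sketch of the naive check behind option 1: look for a required
// term in the parsed "content" field and skip indexing otherwise.
// Hypothetical names; a real version would plug into Nutch's
// indexing-filter extension point.
public class ContentTermFilter {
    private final String requiredTerm;

    public ContentTermFilter(String requiredTerm) {
        this.requiredTerm = requiredTerm.toLowerCase();
    }

    // Returns true when the document's content contains the term
    // and should therefore be indexed.
    public boolean shouldIndex(String content) {
        return content != null
            && content.toLowerCase().contains(requiredTerm);
    }

    public static void main(String[] args) {
        ContentTermFilter f = new ContentTermFilter("nutch");
        System.out.println(f.shouldIndex("Apache Nutch crawler notes")); // true
        System.out.println(f.shouldIndex("unrelated document"));         // false
    }
}
```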

