Hi,
I would like to crawl a list of pages but only index PDFs. From what I
gather, I can add an exclusion in crawl-urlfilter.txt for every extension
other than .pdf.
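For reference, an entry along these lines might do it (a sketch, not tested;
note the filter also applies at fetch time, so the list pages that link to
the PDFs still need a matching accept rule of their own):

```
# accept PDF links
+\.pdf$
# ...plus accept patterns for the list pages being crawled through,
# then reject everything else
-.
```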

However, I would also like to apply an additional restriction: only index
pages that match a certain query. In my head this doesn't seem to be a
great way of doing things, since the documents won't be optimized for
searching until they are in Nutch's Lucene index (AFAIK), but could I do
either of the following?

   1. Write a plugin that does a naive full-text search of each "content"
   field before it is indexed and stops the indexing if my term isn't found
   2. Add a restriction to my solrindex import so that it only grabs
   documents matching a certain query
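To illustrate option #1, the core check could look something like this
(a standalone sketch of the logic only; in an actual Nutch indexing-filter
plugin the check would sit inside the plugin's filter method, and the class
and term here are made up for illustration):

```java
// Sketch of the naive full-text check from option #1: keep a document
// only if its "content" field contains a required term.
public class ContentTermFilter {
    private final String requiredTerm;

    public ContentTermFilter(String requiredTerm) {
        // Normalize once so the check is case-insensitive.
        this.requiredTerm = requiredTerm.toLowerCase();
    }

    /** Returns true if the content contains the required term;
     *  a plugin would drop the document when this returns false. */
    public boolean shouldIndex(String content) {
        return content != null && content.toLowerCase().contains(requiredTerm);
    }

    public static void main(String[] args) {
        ContentTermFilter f = new ContentTermFilter("invoice");
        System.out.println(f.shouldIndex("Quarterly invoice for services")); // true
        System.out.println(f.shouldIndex("Unrelated page text"));            // false
    }
}
```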

The second option appeals to me in more ways than one, since my Nutch index
isn't the official index for the rest of my application. I could see myself
applying a quick filter from Nutch to Solr, passing Solr only what I want.
But that means I waste time crawling stuff I don't need.
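Failing a filter at import time, I suppose a cleanup could be posted to
Solr's /update handler after each solrindex run (the field name and term
below are placeholders for whatever my real query would be):

```
<!-- delete everything whose content field does NOT match the term -->
<delete><query>-content:myterm</query></delete>
```

But that feels like indexing and then throwing away, rather than filtering
up front.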

Is there a way to accomplish this, especially option #2?

Thanks!
