> On Mon, Dec 19, 2011 at 2:17 PM, Markus Jelsma <[email protected]> wrote:
> > What do you mean by skipping over? You don't want ppt, pptx and such? In
> > all cases you need to set up URL filters specific to your scenario and
> > wishes.
> 
> I want to index all the office type documents, they're getting skipped
> over and I don't know why.
> 
> I have altered the regex-urlfilter.xml to NOT remove those, but
> they're still not getting crawled.

You need to check all the filters that are enabled through your plugin.includes
property. There's an org.apache.nutch.net.URLFilterChecker tool. It works a bit
strangely, but with the -allCombined switch you can check whether your URLs
pass or not.
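
To illustrate why changing one filter file may not be enough, here is a minimal
sketch in Python. It is not Nutch code and not the checker itself. The filter
names, rule sets and example URLs are all made up, and the "first matching rule
decides, no match means reject" behaviour is an assumption about how such
rule-based filters typically work. The point is only that a URL has to pass
every enabled filter, so a second filter can still drop what the first now
accepts.

```python
import re

# Hypothetical rule sets, each standing in for one enabled URL filter.
# "+" accepts, "-" rejects; the first rule whose regex matches decides.
REGEX_FILTER = [
    ("-", r"\.(gif|jpg|png|css|zip|ppt|xls)$"),  # assumed leftover exclusion
    ("+", r"."),                                  # accept everything else
]
SUFFIX_FILTER = [
    ("-", r"\.(pptx|docx)$"),  # assumed second filter that also rejects
    ("+", r"."),
]

# Hypothetical mapping of filter name to its rules.
FILTERS = {"regex": REGEX_FILTER, "suffix": SUFFIX_FILTER}


def apply_filter(rules, url):
    """Return True if the first matching rule accepts the URL."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False  # no rule matched: treated as rejected (assumption)


def rejected_by(filters, url):
    """Return the names of all filters that reject the URL."""
    return [name for name, rules in filters.items()
            if not apply_filter(rules, url)]


if __name__ == "__main__":
    for url in ("http://example.com/a.ppt",
                "http://example.com/b.pptx",
                "http://example.com/c.html"):
        culprits = rejected_by(FILTERS, url)
        status = "passes" if not culprits else "rejected by " + ", ".join(culprits)
        print(url, status)
```

Listing the rejecting filter by name, rather than returning a bare yes/no, is
what makes this kind of check useful: it shows which configuration file still
needs editing.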

> 
> Thanks!
> 
> -- Chris
