> On Mon, Dec 19, 2011 at 2:17 PM, Markus Jelsma <[email protected]> wrote:
> > What do you mean by skipping over? You don't want ppt pptx and things? In
> > all cases you need to set up URL filters specific for your scenario and
> > wishes.
>
> I want to index all the office type documents, they're getting skipped
> over and I don't know why.
>
> I have altered the regex-urlfilter.xml to NOT remove those, but
> they're still not getting crawled.

You need to check all filters that are enabled through your plugin.includes. There's an org.apache.nutch.net.URLFilterChecker tool for this. It works a bit strangely, but with the -allCombined switch you can check whether your URLs pass all enabled filters or not.

>
> Thanks!
>
> -- Chris
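
To illustrate why editing one rule may not be enough, here is a rough Python sketch of how a regex URL filter of this kind evaluates a URL, as far as I understand it: rules are read top to bottom, the first pattern that matches decides (+ accepts, - rejects), and a URL that matches no rule is rejected. This is not Nutch code. The two rule sets, the example.com URLs, and the function names are made up for illustration, and Python's re module is only an approximation of Java regular expressions.

```python
import re

# Hypothetical rule sets written in the "+pattern / -pattern" style.
# They are illustrative only and are not copied from any real config.
RULES_DEFAULT_STYLE = r"""
# skip some file suffixes
-\.(gif|jpg|png|css|js|zip|ppt|xls)$
# accept anything else
+.
"""

RULES_EDITED = r"""
# same as above, with the office suffixes removed from the skip list
-\.(gif|jpg|png|css|js|zip)$
+.
"""


def parse_rules(text):
    """Turn rule text into a list of (accept, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with + or -: " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def check(url, rules):
    """Return (accepted, deciding_pattern); the first matching rule wins."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept, pattern.pattern
    return False, None  # assumption: no matching rule means rejected


if __name__ == "__main__":
    for name, text in (("default-style", RULES_DEFAULT_STYLE),
                       ("edited", RULES_EDITED)):
        rules = parse_rules(text)
        for url in ("http://example.com/slides.ppt",
                    "http://example.com/deck.pptx"):
            print(name, url, check(url, rules))
```

Even when a sketch like this accepts the URL, the other URL filters enabled in plugin.includes each get their own say, which is why checking the combined result is the more reliable test.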

