The URLFilterChecker tool doesn't have a page yet... what are the syntax and parameters for it?
-- Chris

On Mon, Dec 19, 2011 at 2:33 PM, Markus Jelsma <[email protected]> wrote:
>
>> On Mon, Dec 19, 2011 at 2:17 PM, Markus Jelsma
>> <[email protected]> wrote:
>> > What do you mean by skipping over? You don't want ppt, pptx and things? In
>> > all cases you need to set up URL filters specific to your scenario and
>> > wishes.
>>
>> I want to index all the office type documents, they're getting skipped
>> over and I don't know why.
>>
>> I have altered the regex-urlfilter.xml to NOT remove those, but
>> they're still not getting crawled.
>
> You need to check all filters that are enabled through your plugin.includes.
> There's an org.apache.nutch.net.URLFilterChecker tool. It works a bit strangely,
> but with the -allCombined switch you can check whether it passes your URLs or not.
>
>>
>> Thanks!
>>
>> -- Chris

