The URLFilterChecker tool doesn't have a page yet. What are the syntax
and parameters for it?

-- Chris



On Mon, Dec 19, 2011 at 2:33 PM, Markus Jelsma
<[email protected]> wrote:
>
>> On Mon, Dec 19, 2011 at 2:17 PM, Markus Jelsma
>> <[email protected]> wrote:
>> > What do you mean by skipping over? You don't want ppt, pptx and things? In
>> > all cases you need to set up URL filters specific to your scenario and
>> > wishes.
>>
>> I want to index all the office type documents, they're getting skipped
>> over and I don't know why.
>>
>> I have altered the regex-urlfilter.txt to NOT remove those, but
>> they're still not getting crawled.
>
> You need to check all filters that are enabled through your plugin.includes.
> There's an org.apache.nutch.net.URLFilterChecker tool. It works a bit strangely,
> but with the -allCombined switch you can check whether your URLs pass or not.
>
>>
>> Thanks!
>>
>> -- Chris
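
To illustrate the point above about checking every enabled filter rather
than only the regex one, here is a minimal Python sketch. It is not Nutch
code and does not call URLFilterChecker. It assumes rule files in the
"+pattern / -pattern, first match decides" style, and it assumes that a URL
must pass every enabled filter to be kept. The rule sets, the second filter,
and the names parse_rules, passes and passes_all are all hypothetical.

```python
import re

# Hypothetical regex filter rules, edited so office documents are NOT excluded.
REGEX_RULES = r"""
# skip image and archive suffixes (ppt/pptx deliberately not listed)
-\.(gif|jpg|png|zip|gz)$
# accept anything else
+.
"""

# Hypothetical second filter that is also enabled and still drops office files.
OTHER_RULES = r"""
-\.(ppt|pptx|doc|docx)$
+.
"""


def parse_rules(text):
    """Turn '+regex' / '-regex' lines into (accept, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # ignore blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def passes(url, rules):
    """Assumed semantics: the first matching rule decides; no match rejects."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False


def passes_all(url, rule_sets):
    """A URL survives only if every enabled filter accepts it."""
    return all(passes(url, rules) for rules in rule_sets)


regex_rules = parse_rules(REGEX_RULES)
other_rules = parse_rules(OTHER_RULES)

url = "http://example.com/slides.ppt"
print(passes(url, regex_rules))                     # regex filter alone
print(passes_all(url, [regex_rules, other_rules]))  # all filters combined
```

In this sketch the edited regex rules accept the .ppt URL on their own, but
the combined check still rejects it because the second filter matches first
on the office suffix. That is the kind of situation where fixing one filter
file is not enough.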
