Re: need a little bit apache nutch ..

Lewis John Mcgibbney Thu, 05 Mar 2015 12:28:32 -0800

Please look at the URL filter you define within the plugin.includes
property in nutch-site.xml. If it is regex-urlfilter (which it is by
default), then you will need to edit the following line to remove '?':


https://github.com/apache/nutch/blob/trunk/conf/regex-urlfilter.txt.template#L33
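To illustrate what that edit changes, here is a minimal Python sketch. It is not Nutch code. It assumes the linked line is the stock rule `-[?*!@=]` followed by the catch-all `+.`, and that rules are tried top to bottom with the first matching pattern deciding the outcome. The helper names `load_rules` and `accepts` are made up for this example.

```python
import re
from urllib.parse import urljoin


def load_rules(text):
    """Parse '+regex' / '-regex' lines, skipping blanks and '#' comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # First character is the sign, the rest is the pattern.
        rules.append((line[0] == "+", re.compile(line[1:])))
    return rules


def accepts(rules, url):
    """First rule whose pattern is found in the URL decides; no match rejects."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False


# Assumed excerpt of the stock regex-urlfilter.txt.
DEFAULT_RULES = """
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
# accept anything else
+.
"""

# The same excerpt with '?' removed from the character class.
EDITED_RULES = DEFAULT_RULES.replace("-[?*!@=]", "-[*!@=]")

# A relative href such as index.php?blabla resolves against the page URL.
page = "http://www.kadinlarkulubu.com/forum/index.php"
url = urljoin(page, "index.php?blabla")

print(url)
print("default rules accept:", accepts(load_rules(DEFAULT_RULES), url))
print("edited rules accept: ", accepts(load_rules(EDITED_RULES), url))
```

Under these assumptions the resolved URL is rejected by the default rules and accepted once '?' is removed. Note that the edited pattern still contains '=', so a URL whose query string includes '=' would still be rejected by it.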

Hopefully this makes better sense.
Lewis

On Thursday, March 5, 2015, Gaplan <[email protected]> wrote:

> Thanks for the answer, Lewis.
> I can't understand this:
> "Also please ensure that your urlfilter permits '?' in URL entries"
> How can I do that?
>
> On Thu, Mar 5, 2015 at 10:17 PM, Lewis John Mcgibbney <[email protected]> wrote:
>
>> Hi,
>> Please see
>>
>> http://wiki.apache.org/nutch/FAQ#Nutch_doesn.27t_crawl_relative_URLs.3F_Some_pages_are_not_indexed_but_my_regex_file_and_everything_else_is_okay_-_what_is_going_on.3F
>>
>> Also please ensure that your urlfilter permits '?' in URL entries.
>> HTH
>> Lewis
>>
>> On Thursday, March 5, 2015, Gaplan <[email protected]> wrote:
>>
>>> Can you help me?
>>>
>>> I have to crawl the domain http://www.kadinlarkulubu.com/forum/index.php
>>> but the links in it are always relative,
>>> a href="index.php?blabla", not
>>> a href="http://www.kadinlarkulubu.com/forum/index.php?blabla".
>>> How can I configure this?
>>> Thank you for your time.
>>> OSA
>>>
>>
>>
>> --
>> *Lewis*
>>
>>
>

-- 
*Lewis*
