Please look at the URL filter you define within the plugin.includes property in nutch-site.xml. If it is regex-urlfilter (which it is by default), then you will need to edit the following line to remove '?':
https://github.com/apache/nutch/blob/trunk/conf/regex-urlfilter.txt.template#L33

Hopefully this makes better sense.
Lewis

On Thursday, March 5, 2015, Gaplan <[email protected]> wrote:

> Thanks for the answer, Lewis.
> I can't understand this:
> "Also please ensure that your urlfilter permits '?' in URL entries"
> How can I do that?
>
> On Thu, Mar 5, 2015 at 10:17 PM, Lewis John Mcgibbney <[email protected]> wrote:
>
>> Hi,
>> Please see
>>
>> http://wiki.apache.org/nutch/FAQ#Nutch_doesn.27t_crawl_relative_URLs.3F_Some_pages_are_not_indexed_but_my_regex_file_and_everything_else_is_okay_-_what_is_going_on.3F
>>
>> Also please ensure that your urlfilter permits '?' in URL entries
>> Hth
>> Lewis
>>
>> On Thursday, March 5, 2015, Gaplan <[email protected]> wrote:
>>
>>> Can you help me?
>>>
>>> I have to crawl the domain http://www.kadinlarkulubu.com/forum/index.php
>>> but the links are always
>>> a href="index.php?blabla", not
>>> a href="http://www.kadinlarkulubu.com/forum/index.php?blabla".
>>> How can I configure this?
>>> Thank you for your time.
>>> OSA
>>
>> --
>> *Lewis*

--
*Lewis*
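
P.S. To illustrate why the '?' matters, here is a rough Python sketch of how a first-match-wins regex URL filter treats the resolved link. It is not Nutch code: `parse_rules` and `accepts` are hypothetical helpers, the rule text is assumed to mirror the relevant lines of the default regex-urlfilter.txt (`-[?*!@=]` followed by `+.`), and the matching semantics (first matching rule decides, no match means reject) are my understanding of the plugin rather than something checked against the source.

```python
import re
from urllib.parse import urljoin


def parse_rules(text):
    """Parse '+regex' / '-regex' lines, skipping blanks and '#' comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in ("+", "-"):
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """Assumed semantics: the first rule whose regex is found in the URL decides."""
    for allow, pattern in rules:
        if pattern.search(url):
            return allow
    return False  # nothing matched: treat as rejected


# Assumed to mirror the relevant default rules.
DEFAULT_RULES = """
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
# accept anything else
+.
"""

# Same rules with '?' removed from the character class.
EDITED_RULES = """
-[*!@=]
+.
"""

# The relative href resolves against the page it was found on.
base = "http://www.kadinlarkulubu.com/forum/index.php"
url = urljoin(base, "index.php?blabla")

print(url)
print("default rules accept:", accepts(parse_rules(DEFAULT_RULES), url))
print("edited rules accept: ", accepts(parse_rules(EDITED_RULES), url))
```

Note that with only '?' removed, a URL that also contains '=' (for example a query like `?a=b`) would still be rejected by the same character class in this sketch.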

