Re: db_unfetched large number, but crawling not fetching any longer

2012-03-27 Thread remi tassing
I'm not sure I totally understand what you meant.
1. If you know exactly what the relative URLs get translated into, you can use a URL normalizer to rewrite them into something that makes more 'sense'.
2. The second option, if you don't want those relative links to be included at all, is to use the URL filters to exclude them.
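For option 1, a minimal sketch of what such a rule could look like in conf/regex-normalize.xml (read by the urlnormalizer-regex plugin); the host names and paths below are invented placeholders, not taken from this thread:

  <?xml version="1.0"?>
  <!-- conf/regex-normalize.xml: rewrite rules applied to URLs before they
       enter the CrawlDb. Each <regex> has a pattern and a substitution. -->
  <regex-normalize>
    <!-- Hypothetical example: map a wrongly prepended re-hosting prefix back
         to the site the relative links were originally written against. -->
    <regex>
      <pattern>^http://rehost\.example\.com/mirror/(.+)$</pattern>
      <substitution>http://original.example.com/$1</substitution>
    </regex>
  </regex-normalize>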

Re: db_unfetched large number, but crawling not fetching any longer

2012-03-26 Thread webdev1977
I guess I STILL don't understand the topN setting. Here is what I thought it would do:

Seed: file:myfileserver.com/share1
share1 dir listing: file1.pdf ... file300.pdf, dir1 ... dir20

running the following in a never ending shell script:
  generate crawl/crawldb crawl/segments -topN 1000
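For reference, topN caps how many of the highest-scoring unfetched URLs go into each generated segment, not the total number crawled over the whole run. A rough sketch of such a never-ending loop (paths and the segment lookup are illustrative, not quoted from the post):

  #!/bin/bash
  # Sketch of a continuous generate/fetch/parse/updatedb cycle (Nutch 1.x style).
  # Each pass pulls at most 1000 top-scoring unfetched URLs into a new segment.
  while true; do
    bin/nutch generate crawl/crawldb crawl/segments -topN 1000
    SEGMENT=$(ls -d crawl/segments/* | tail -1)   # newest segment just generated
    bin/nutch fetch "$SEGMENT"
    bin/nutch parse "$SEGMENT"
    bin/nutch updatedb crawl/crawldb "$SEGMENT"   # merge results, discover new links
  done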

Re: db_unfetched large number, but crawling not fetching any longer

2012-03-26 Thread webdev1977
I think I may have figured it out... but I don't know how to fix it :-( I have many PDFs and HTML files that have relative links in them. They are not from the originally hosted site, but are re-hosted. Nutch/Tika is trying to resolve the relative URLs it encounters against the URL of the page that contained them.
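One way to keep such wrongly resolved links out of the CrawlDb is a rule in conf/regex-urlfilter.txt; the pattern below is a made-up illustration, not a fix confirmed in this thread:

  # conf/regex-urlfilter.txt (hypothetical rule): skip URLs that were built by
  # resolving relative links against the re-hosted location and so don't exist.
  -^file:/*myfileserver\.com/share1/.+/images/
  # accept everything else
  +.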

Re: db_unfetched large number, but crawling not fetching any longer

2012-03-23 Thread Sebastian Nagel
Could you explain what is meant by continuously running crawl cycles? Usually, you run a crawl with a certain depth, i.e. a maximum number of cycles. If that depth is reached, the crawler stops even if there are still unfetched URLs. If the generator produces an empty fetch list in one cycle, the crawler stops as well.
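A rough sketch of such a depth-bounded loop (the DEPTH value, paths, and the empty-fetch-list check via segment counting are assumptions for illustration, not quoted from the mail):

  #!/bin/bash
  # Sketch: run at most $DEPTH generate/fetch/parse/updatedb cycles,
  # stopping early if the generator produces no new segment (empty fetch list).
  DEPTH=10
  for ((i = 1; i <= DEPTH; i++)); do
    before=$(ls crawl/segments 2>/dev/null | wc -l)
    bin/nutch generate crawl/crawldb crawl/segments -topN 1000
    after=$(ls crawl/segments 2>/dev/null | wc -l)
    if [ "$after" -eq "$before" ]; then
      echo "Empty fetch list in cycle $i, stopping."
      break
    fi
    SEGMENT=$(ls -d crawl/segments/* | tail -1)
    bin/nutch fetch "$SEGMENT"
    bin/nutch parse "$SEGMENT"
    bin/nutch updatedb crawl/crawldb "$SEGMENT"
  done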