I'm not sure I totally understand what you meant.
1. If you know exactly what the relative URLs are being translated into, you
can use a URL normalizer (the urlnormalizer-regex plugin) to rewrite them into
something that makes more 'sense'; see the sketch after this list.
2. The second option: if you don't want those relative links to be included
at all, you can use the
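For option 1, a minimal sketch of a rewrite rule in conf/regex-normalize.xml
(the file read by the urlnormalizer-regex plugin). The host, path, and target
prefix below are hypothetical; substitute the prefix your relative links are
actually being resolved against and the prefix they should map to:

    <!-- hypothetical rule: rewrite URLs that were resolved against the
         re-hosting location back to the prefix that should serve them -->
    <regex-normalize>
      <regex>
        <pattern>^file://myfileserver.com/share1/(docs/.*)$</pattern>
        <substitution>http://original-host.example.com/$1</substitution>
      </regex>
    </regex-normalize>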
I guess I STILL don't understand the topN setting. Here is what I thought it
would do:
Seed: file:myfileserver.com/share1
share1 Dir listing:
file1.pdf ... file300.pdf, dir1 ... dir20
running the following in a never-ending shell script:

    bin/nutch generate crawl/crawldb crawl/segments -topN 1000
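For what it's worth, -topN only caps how many URLs the Generator selects for
each individual fetch list; it does not limit the crawl as a whole. A minimal
sketch of what such a never-ending cycle usually looks like, assuming the
standard crawl/crawldb and crawl/segments layout from the Nutch tutorial
(adjust paths to your setup):

    #!/bin/sh
    # sketch of a continuous crawl cycle over the standard crawl/ layout
    while true; do
      # select up to 1000 of the highest-scoring unfetched URLs
      bin/nutch generate crawl/crawldb crawl/segments -topN 1000
      # the segment generate just created is the newest directory
      segment=$(ls -d crawl/segments/* | tail -1)
      bin/nutch fetch "$segment"
      bin/nutch parse "$segment"
      # fold results back into the crawldb so the next round can select
      # newly discovered URLs (e.g. the contents of dir1 ... dir20)
      bin/nutch updatedb crawl/crawldb "$segment"
    done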
I think I may have figured it out... but I don't know how to fix it :-(
I have many PDFs and HTML files that have relative links in them. They are
not from the originally hosted site, but are re-hosted. Nutch/Tika is
trying to resolve the relative URLs it encounters against the URL of the
document that contained them, producing absolute URLs that point at the
re-hosting location rather than wherever the links originally pointed.
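To see exactly which absolute outlinks Nutch/Tika produces for one of those
documents, parsechecker fetches and parses a single URL and prints the
extracted outlinks (the file path below is hypothetical):

    # print parse status and the outlinks extracted from one document
    bin/nutch parsechecker file:myfileserver.com/share1/file1.pdf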
Could you explain what is meant by continuously running crawl cycles?
Usually, you run a crawl with a certain depth, i.e. a maximum number of
cycles. If the depth is reached, the crawler stops even if there are still
unfetched URLs. If the Generator produces an empty fetch list in one cycle,
the crawler stops early as well.
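A minimal sketch of such a depth-bounded loop, assuming generate exits
non-zero when nothing is left to fetch (recent Nutch versions do this;
check yours) and the standard tutorial layout:

    #!/bin/sh
    # sketch: run at most DEPTH cycles, stop early on an empty fetch list
    DEPTH=5
    for i in $(seq 1 "$DEPTH"); do
      if ! bin/nutch generate crawl/crawldb crawl/segments -topN 1000; then
        echo "empty fetch list, stopping after $i cycles"
        break
      fi
      segment=$(ls -d crawl/segments/* | tail -1)
      bin/nutch fetch "$segment"
      bin/nutch parse "$segment"
      bin/nutch updatedb crawl/crawldb "$segment"
    done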