segslice usually does 200-300 records/sec on my machine (a top-of-the-line
box that is quite fast at everything else).
Is it just copying the segments minus the last part, or is some
processing required for each record?
Any advice on how it can be optimized?
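For what it's worth, here is the distinction I'm asking about, as a
hypothetical sketch (RecordReader/RecordWriter are made-up stand-ins,
not actual Nutch classes):

  import java.io.IOException;

  public class CopySketch {
      interface RecordReader { byte[] next() throws IOException; } // null at end
      interface RecordWriter { void append(byte[] rec) throws IOException; }

      // A pure copy: record bytes pass through untouched, so the job is
      // I/O-bound. Any per-record decode/re-encode step on top of this
      // would make it CPU-bound instead, which could explain a rate of
      // only 200-300 records/sec.
      static void copySegment(RecordReader in, RecordWriter out)
              throws IOException {
          for (byte[] rec; (rec = in.next()) != null; ) {
              out.append(rec);
          }
      }
  }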
202443 Pages consumed: 13 (at index 13). Links fetched: 233386.
202443 Suspicious outlink count = 30442 for [http://www.dmoz.org/].
202444 Pages consumed: 135000 (at index 135000). Links fetched: 272315.
If maxoutlinks is already specified in the XML config, why does Nutch
bother
will run over any website, with 50-500 threads and the default three
retry times, and the problem will sort itself out. But can something
be done for the rest of us, please?
A simple max-host-pages option (e.g. max-host-pages=500) would really be
appreciated.
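For reference, the outlink cap I mean lives in the Nutch XML config; an
override in nutch-site.xml would look roughly like this (assuming the
property is db.max.outlinks.per.page, as in nutch-default.xml; the value
is just an example):

  <property>
    <name>db.max.outlinks.per.page</name>
    <value>100</value>
    <description>Maximum number of outlinks to process per page.</description>
  </property>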
Regards,
EM
Transbuerg Tian wrote:
I have the same
this, and instead of manually
typing regexes to clean them off (which takes time), I'd strongly prefer
an automated solution if possible.
Regards,
EM
of whoever designed the page, right?
If the page tries to be smart and 'determine' what the user wants to
see, well, if you don't own that webpage there isn't anything you can do.
Best regards,
EM
content += myTranslationFunctionToLatin(content);
doc.add(Field.Text("content", content));
Or would the last line be:
doc.add(Field.UnStored("content", content));
What's the difference with regard to the Field.* object?
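For comparison, a sketch of the two variants (Lucene 1.x API; the field
name "content" is just an example). Both are tokenized and indexed, so
both are searchable; only Field.Text also stores the original value in
the index, so it can be read back later, at the cost of a bigger index:

  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.Field;

  public class FieldSketch {
      public static void main(String[] args) {
          Document doc = new Document();
          String content = "some page text";
          // Field.Text: tokenized, indexed, AND stored -- the original
          // value can be read back from the index (e.g. for summaries).
          doc.add(Field.Text("content", content));
          // Field.UnStored: tokenized and indexed but NOT stored --
          // still searchable, smaller index, value not retrievable.
          doc.add(Field.UnStored("content", content));
      }
  }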
Regards,
EM
Should the following be happening?
Short description:
- fetch a bunch of pages
- status: 5400 fetched, 27 errors
- fetch 22 more pages
- status: 5403 fetched, 27 errors
My regex-urlfilter excludes .jpg and includes '?'.
Long description:
050809 093001 fetching http://domain/text.jpg?6351
050809 093001
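A guess at what's happening, assuming the usual regex-urlfilter.txt
layout (rules are tried top to bottom and the first match decides): a
$-anchored image-suffix rule does not match a .jpg URL that carries a
query string, so the URL falls through to the '+' rule for '?'. A
sketch, with the exact rules assumed:

  # Typical suffix rule: anchored at end of URL, so it does NOT match
  # http://domain/text.jpg?6351 (that URL ends in "6351", not ".jpg").
  -\.(gif|GIF|jpg|JPG|png|PNG)$

  # The include rule for query URLs then accepts it.
  +[?]

  # Possible fix, placed above the '+' rule: also reject an image
  # suffix that is followed by a query string.
  -\.(gif|GIF|jpg|JPG|png|PNG)(\?.*)?$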
Suppose we have to fetch 3 pages.
Page A is http://something/login.php
Page B is http://yyy/rrr/ which, when fetched, redirects to page A
Page C is http://yyy/ttt/ which, when fetched, redirects to page A
When fetching A, B and C, the fetcher will fetch: A, B, A, C, A.
Is there any way to prevent the repeated fetches of A?
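In case it helps to make the idea concrete, a minimal sketch of what I
mean, not Nutch's actual fetcher code (httpGet is a made-up
placeholder): keep an in-memory set of URLs already fetched in this run
and skip a redirect target that was seen before.

  import java.util.HashSet;
  import java.util.Set;

  // Sketch only -- not Nutch's fetcher. Illustrates the seen-set that
  // would keep A from being fetched three times in the example above.
  public class DedupFetcher {
      private final Set<String> seen = new HashSet<String>();

      public void fetch(String url) {
          if (!seen.add(url)) {
              return; // already fetched in this run (e.g. a redirect target)
          }
          String redirectTarget = httpGet(url); // placeholder: returns the
                                                // redirect target, or null
          if (redirectTarget != null) {
              fetch(redirectTarget);
          }
      }

      private String httpGet(String url) {
          return null; // stand-in for the real HTTP fetch
      }
  }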
Keep up the good work,
EM in Toronto.