why is segslice so slow?

2005-10-15 Thread EM
segslice usually processes 200-300 records/sec on my machine (which is top of the line and quite fast for everything else). Is it just copying the segments minus the last part, or is some processing required for each record? Any advice on how it can be optimized?

suspicious outlink count

2005-10-12 Thread EM
202443 Pages consumed: 13 (at index 13). Links fetched: 233386.
202443 Suspicious outlink count = 30442 for [http://www.dmoz.org/].
202444 Pages consumed: 135000 (at index 135000). Links fetched: 272315.
If there is maxoutlinks already specified in the xml config, why does nutch bother

Re: Nutch Crawler, Page Rediction and Pagination

2005-09-25 Thread EM
will run over any website, with 50-500 threads the default three retry times, and the problem will solve itself out. But can something be done for the rest of us, please? A simple <max-host-pages>500</max-host-pages> would really be appreciated. Regards, EM Transbuerg Tian wrote: I have the same
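The per-host cap requested above could look roughly like the following. This is only a sketch of the idea, not Nutch code: the class `HostPageCapDemo`, its `accept` method, and the limit passed to the constructor are all hypothetical names chosen for illustration.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of Nutch: accepts at most N URLs per host.
final class HostPageCapDemo {
    private final int maxPagesPerHost;
    private final Map<String, Integer> counts = new HashMap<>();

    HostPageCapDemo(int maxPagesPerHost) {
        this.maxPagesPerHost = maxPagesPerHost;
    }

    // Returns true if the URL may be queued, false once its host is at the cap.
    // URI.create throws IllegalArgumentException on malformed input.
    boolean accept(String url) {
        String host = URI.create(url).getHost();
        if (host == null) {
            return false; // no host component, nothing to count against
        }
        host = host.toLowerCase(Locale.ROOT);
        int seen = counts.getOrDefault(host, 0);
        if (seen >= maxPagesPerHost) {
            return false;
        }
        counts.put(host, seen + 1);
        return true;
    }

    public static void main(String[] args) {
        HostPageCapDemo cap = new HostPageCapDemo(2);
        System.out.println(cap.accept("http://a.example/1"));
        System.out.println(cap.accept("http://a.example/2"));
        System.out.println(cap.accept("http://a.example/3"));
    }
}
```

A cap of this kind counts accepted URLs, not successfully fetched pages, so retries and errors would need separate handling.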

Re: Nutch Crawler, Page Rediction and Pagination

2005-09-25 Thread EM
this, and instead of manually typing regex to clean them off (which takes time) I'd strongly prefer an automated solution if possible. Regards, EM

Re: Nutch Crawler, Page Rediction and Pagination

2005-09-25 Thread EM
of whoever designed the page, right? If the page tries to be smart and 'determine' what the user wants to see, well, if you don't own that webpage there isn't anything you can do. Best regards, EM

Field.Text vs Field.UnStored

2005-08-12 Thread EM
+= myTranslationFunctionToLatin(content); doc.add(Field.Text(content, content)); Or would the last line be: doc.add(Field.UnStored(content, content)); What's the difference with regard to the Field.* object? Regards, EM

strange url counting in the fetcher

2005-08-09 Thread EM
Should the following be happening?
Short description:
- fetch bunch of pages.
- status: 5400 fetched, 27 errors
- fetch 22 more pages
- status: 5403 fetched, 27 errors
My regex-urlfilter excludes jpg and includes ?
Long description:
050809 093001 fetching http://domain/text.jpg?6351
050809 093001
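One possible reading of the log above is that a jpg exclusion anchored at the end of the URL does not apply once a query string follows the extension. The sketch below illustrates that with plain `java.util.regex`; both patterns are hypothetical stand-ins, not the poster's actual regex-urlfilter rules.

```java
import java.util.regex.Pattern;

final class SuffixFilterDemo {
    // Hypothetical exclude rule: matches only when the extension ends the URL.
    static final Pattern SUFFIX_ONLY =
        Pattern.compile("\\.(gif|jpg|png)$", Pattern.CASE_INSENSITIVE);

    // Hypothetical variant: also matches when a query string follows the extension.
    static final Pattern SUFFIX_OR_QUERY =
        Pattern.compile("\\.(gif|jpg|png)(\\?.*)?$", Pattern.CASE_INSENSITIVE);

    // True if the rule matches somewhere in the URL, i.e. the URL would be excluded.
    static boolean excluded(Pattern rule, String url) {
        return rule.matcher(url).find();
    }

    public static void main(String[] args) {
        String url = "http://domain/text.jpg?6351";
        System.out.println("suffix only:     " + excluded(SUFFIX_ONLY, url));
        System.out.println("suffix or query: " + excluded(SUFFIX_OR_QUERY, url));
    }
}
```

With the end-anchored rule, `http://domain/text.jpg?6351` is not excluded, so it would still reach the fetcher if URLs containing `?` are allowed.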

fetching redirect bug?

2005-08-05 Thread EM
Suppose we have to fetch 3 pages.
Page A is http://something/login.php
Page B is http://yyy/rrr/ which, when fetched, redirects to page A
Page C is http://yyy/ttt/ which, when fetched, redirects to page A
When fetching A, B, C the fetcher will fetch A B A C A
Is there any way to prevent the
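The usual way to avoid repeated downloads of a shared redirect target is to remember every URL already fetched in the current run and to stop following a redirect once it lands on one of them. The sketch below simulates that with an in-memory redirect table. It is not the Nutch fetcher, and `RedirectDedupDemo` and `fetchOrder` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class RedirectDedupDemo {
    // Returns the URLs actually downloaded, in order, for the given queue.
    // 'redirects' maps a URL to the URL it redirects to (absent = no redirect).
    static List<String> fetchOrder(List<String> queue, Map<String, String> redirects) {
        Set<String> seen = new HashSet<>();
        List<String> fetched = new ArrayList<>();
        for (String url : queue) {
            String current = url;
            // seen.add returns false for a URL fetched earlier, which ends the
            // chain and also protects against redirect loops.
            while (current != null && seen.add(current)) {
                fetched.add(current);
                current = redirects.get(current);
            }
        }
        return fetched;
    }

    public static void main(String[] args) {
        Map<String, String> redirects = new HashMap<>();
        redirects.put("http://yyy/rrr/", "http://something/login.php");
        redirects.put("http://yyy/ttt/", "http://something/login.php");
        List<String> queue = List.of(
            "http://something/login.php", "http://yyy/rrr/", "http://yyy/ttt/");
        System.out.println(fetchOrder(queue, redirects));
    }
}
```

Under these assumptions the login page is downloaded once and the order becomes A B C instead of A B A C A.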

My wishlist of 12 out of...

2005-08-02 Thread EM
? Keep up the good work, EM in Toronto.