Re: Nutch spider trap detection

2008-07-03 Thread brainstorm
Thanks! I guess you mean: # skip URLs with a slash-delimited segment that repeats 3+ times, to break loops -.*(/[^/]+)/[^/]+\1/[^/]+\1/ in conf/regex-urlfilter.txt, am I wrong? The DomContentUtils code under /nutch/trunk/src/java/org/apache/nutch/parse/*.java is a bit confusing to me, and I cannot see the
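As a quick sanity check on that filter pattern (outside Nutch, with made-up URLs), the same backreference syntax works in Python's re engine:

```python
import re

# The pattern from conf/regex-urlfilter.txt, minus the leading "-"
# exclusion marker: a slash-delimited segment repeated three times
# signals a likely crawler loop.
TRAP = re.compile(r".*(/[^/]+)/[^/]+\1/[^/]+\1/")

looping = "http://example.com/a/b/a/c/a/page.html"   # "/a" occurs 3 times
normal = "http://example.com/a/b/c/d/page.html"      # no repeated segment

print(bool(TRAP.match(looping)))  # True
print(bool(TRAP.match(normal)))   # False
```

In regex-urlfilter.txt the leading "-" tells Nutch to reject any URL the rest of the pattern matches.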

Preferred nutch cluster network topology ?

2008-07-03 Thread brainstorm
Regarding real-world Nutch clusters (10 nodes), what approach do you follow to maximise fetch throughput? For instance, my guess is that the classical number-crunching (HPC) scientific cluster network topology (intra-cluster private network plus 1 head node with an outside-world connection),

Indexing static html files

2008-07-03 Thread Ryan Smith
Is there a simple way to have Nutch index a folder full of other folders and HTML files? I was hoping to avoid having to run Apache to serve the HTML files and then have Nutch crawl the site on Apache. Thank you, -Ryan

Re: Indexing static html files

2008-07-03 Thread Winton Davies
Ryan, you can generate a file of file: URLs, e.g. file:///x/y/z/file1.html file:///x/y/z/file2.html. Use find and AWK accordingly to generate this. Put it in the url directory, set depth to 1, and change crawl_urlfilter.txt to admit file:///x/y/z/ (note, if you don't head-qualify it,

deducing web crawler behavior from access.log files

2008-07-03 Thread ps1c5o
I don't know if this is the right place, but if not, sorry. Like the title says, I need to be able to deduce web crawler behavior from the access log. In particular, I need to understand what this means: xx.xx.xx.x - - [12/Jun/2008:21:10:31 +0100] GET /phpmyadmin/main.php HTTP/1.0 404 1123 - -
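The quoted line follows Apache's Common Log Format (in the raw access.log the request part is wrapped in double quotes, which the email seems to have dropped). A small sketch to pull the fields apart; the IP is made up since the original is masked:

```python
import re

# Apache Common Log Format:
#   host ident authuser [date] "request" status bytes
CLF = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\S+)'
)

line = ('10.0.0.1 - - [12/Jun/2008:21:10:31 +0100] '
        '"GET /phpmyadmin/main.php HTTP/1.0" 404 1123')
m = CLF.match(line)
print(m.group("request"))  # GET /phpmyadmin/main.php HTTP/1.0
print(m.group("status"))   # 404: the requested page does not exist here
```

So the line records a GET for /phpmyadmin/main.php that the server answered with 404 (not found) and a 1123-byte response body.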

Re: deducing web crawler behavior from access.log files

2008-07-03 Thread Kunthar
Not the right place to ask :) Basically this is an HTTP web scan looking for weak spots in the web application. And yes, this is an attack. Check http://packetstormsecurity.org/ and http://www.milw0rm.com/ Peace, Kunthar On Fri, Jul 4, 2008 at 2:18 AM, ps1c5o [EMAIL PROTECTED] wrote: I don't know if this is the right

Re: problem running nutch from eclipse 3.2 in ubuntu hardy.

2008-07-03 Thread Hut
Hi, I checked out the Nutch/Hadoop source code and added their configuration files under the /conf/ folder to the build path. IDE: IntelliJ 7.0.2. OS: Ubuntu 8.04.

Re: Question about Nutch crawling

2008-07-03 Thread kevin chen
It can be any number of reasons: disallowed by robots.txt (probably the most common), session-controlled pages, or authentication. On Wed, 2008-07-02 at 10:32 -0400, Bozhao Tan wrote: Hello, I do not know why Nutch cannot crawl anything from some internet sites. Has anyone met this problem? Thanks!
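For the robots.txt case, a quick diagnostic is to test the site's rules against your crawler's agent name with Python's standard robotparser. The robots.txt body and URLs below are illustrative, inlined instead of fetched over the network:

```python
from urllib import robotparser

# Example robots.txt content; normally you would fetch
# http://<site>/robots.txt and inspect it for your agent.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

print(rp.can_fetch("Nutch", "http://example.com/private/page.html"))  # False
print(rp.can_fetch("Nutch", "http://example.com/public/page.html"))   # True
```

If can_fetch returns False for the pages you expect to crawl, that explains an empty fetch list before you go looking at sessions or authentication.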