Thanks! I guess you mean:
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/
in conf/regex-urlfilter.txt, am I wrong?
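For what it's worth, the pattern can be sanity-checked from the shell before it goes into conf/regex-urlfilter.txt. A small sketch using grep's BRE back-reference syntax (the example URLs are invented):

```shell
# A looping URL (segment /a appears 3+ times) versus a normal one --
# only the first should match the repeated-segment pattern.
printf '%s\n' \
  'http://example.com/a/b/a/c/a/d' \
  'http://example.com/a/b/c/d' \
| grep '\(/[^/][^/]*\)/[^/][^/]*\1/[^/][^/]*\1/'
# -> http://example.com/a/b/a/c/a/d
```

The same expression with ERE `+` quantifiers is what the urlfilter file itself uses; grep's BRE dialect just needs the `\(...\)` grouping and `[^/][^/]*` in place of `[^/]+`.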
The DomContentUtils code under
/nutch/trunk/src/java/org/apache/nutch/parse/*.java is a bit confusing
to me, and I cannot see the
Regarding real-world Nutch clusters (10 nodes), what approach do you
follow to maximise fetch throughput?
For instance, my guess is that the classical number-crunching (HPC)
scientific cluster topology (an intra-cluster private network
plus one head node with an outside-world connection),
Is there a simple way to have Nutch index a folder full of other folders and
HTML files?
I was hoping to avoid having to run Apache to serve the HTML files and then
have Nutch crawl the site on Apache.
Thank you,
-Ryan
Ryan,
You can generate a file of file: URLs, e.g.:
file:///x/y/z/file1.html
file:///x/y/z/file2.html
Use find and awk accordingly to generate this. Put it in the url
directory, set depth to 1, and change crawl-urlfilter.txt to
admit file:///x/y/z/ (note: if you don't head-qualify it,
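The find-and-awk step might look like this sketch (the docs/ and urls/ directory names are placeholders standing in for /x/y/z and your seed-URL directory):

```shell
# Build a scratch tree standing in for /x/y/z, then emit one file:// URL
# per HTML file into the seed directory Nutch reads its URLs from.
mkdir -p docs/sub urls
touch docs/file1.html docs/sub/file2.html
find "$PWD/docs" -name '*.html' | sort | awk '{ print "file://" $0 }' > urls/seed.txt
cat urls/seed.txt   # two file:///...html lines
```

Since $PWD already starts with a slash, prepending `file://` yields the triple-slash form Nutch expects.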
I don't know if this is the right place, but... if not, sorry.
Like the title says, I need to be able to deduce web-crawler behavior from the
access log.
In particular, I need to understand what this means:
xx.xx.xx.x - - [12/Jun/2008:21:10:31 +0100] "GET /phpmyadmin/main.php HTTP/1.0" 404 1123 "-" "-"
Not the right place to ask :)
Basically this is an HTTP scan probing for known web vulnerabilities. And yes,
this is an attack.
Check http://packetstormsecurity.org/ and http://www.milw0rm.com/
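If it helps with reading such logs: awk can tally probes like that quickly, since field 9 is the HTTP status code in common/combined log format. A sketch with invented sample lines:

```shell
# Two made-up access-log lines; count 404 responses per client IP.
printf '%s\n' \
  '10.0.0.1 - - [12/Jun/2008:21:10:31 +0100] "GET /phpmyadmin/main.php HTTP/1.0" 404 1123 "-" "-"' \
  '10.0.0.2 - - [12/Jun/2008:21:10:32 +0100] "GET /index.html HTTP/1.0" 200 512 "-" "-"' \
> access.log
awk '$9 == 404 { n[$1]++ } END { for (ip in n) print ip, n[ip] }' access.log
# -> 10.0.0.1 1
```

A client racking up many 404s against admin paths like /phpmyadmin/ is almost always a scanner, not a legitimate crawler.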
Peace,
Kunth
On Fri, Jul 4, 2008 at 2:18 AM, ps1c5o [EMAIL PROTECTED] wrote:
Hi, I checked out the Nutch/Hadoop source code and added the configuration
files under the /conf/ folder to the build path.
IDE: IntelliJ IDEA 7.0.2. OS: Ubuntu 8.04.
There can be any number of reasons:
- disallowed by robots.txt (probably the most common),
- session-controlled pages,
- authentication required.
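On the robots.txt point: a site serving a robots.txt like the following (hypothetical example) disallows all crawlers, Nutch included, and the fetcher will skip every page on it without any obvious error:

```
User-agent: *
Disallow: /
```

Fetching http://the-site/robots.txt in a browser is a quick first check before digging into fetcher logs.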
On Wed, 2008-07-02 at 10:32 -0400, Bozhao Tan wrote:
Hello, I do not know why Nutch cannot crawl anything from some internet
sites.
Has anyone run into this problem?
Thanks!