Hi,
I had a similar problem and installed a Squid proxy server. Squid
can limit bandwidth, and the integration with Nutch was pretty
simple (just configure a proxy). Furthermore, it gives you another
place to block the crawling of specific websites.
If needed, I can assist.
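For anyone wanting the same setup, a rough sketch of both sides; the pool layout, the ~32 KB/s rate, and the host/port are example values, not a tuned config:

    # squid.conf -- one aggregate delay pool capping crawler bandwidth
    delay_pools 1
    delay_class 1 1
    delay_parameters 1 32000/32000   # ~32 KB/s for all traffic through the proxy
    delay_access 1 allow all

    <!-- nutch-site.xml -- point Nutch's http plugin at the proxy -->
    <property><name>http.proxy.host</name><value>localhost</value></property>
    <property><name>http.proxy.port</name><value>3128</value></property>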
Hi I have setup a nutch (0.7.1) system running on multiple servers
following Stefan Groschupf tutorial
(http://wiki.media-style.com/display/nutchDocu/setup+multiple+search+sever).
I already had a nutch index and a set of segments so I copied some
segments to different servers.
Now I want to add
Hi, I just discovered nutch and I'm wondering if there is a tutorial
that can get me started with integrating nutch in a java project.
Searching the web I can only find docs about the command line tools!
I tried to browse the APIs, but they are quite big and I don't really
know where to start
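Until a proper tutorial exists, here is a minimal, hedged sketch against the 0.7-era search API (everything below lives in org.apache.nutch.searcher; signatures moved around in later versions, so verify against your release):

    import org.apache.nutch.searcher.Hit;
    import org.apache.nutch.searcher.HitDetails;
    import org.apache.nutch.searcher.Hits;
    import org.apache.nutch.searcher.NutchBean;
    import org.apache.nutch.searcher.Query;

    public class SearchDemo {
      public static void main(String[] args) throws Exception {
        // NutchBean locates the index/segments via the Nutch configuration
        // (the searcher.dir property), so run this where your crawl lives.
        NutchBean bean = new NutchBean();
        Query query = Query.parse("your search terms");
        Hits hits = bean.search(query, 10);          // ask for the top 10 hits
        for (int i = 0; i < hits.getLength(); i++) {
          Hit hit = hits.getHit(i);
          HitDetails details = bean.getDetails(hit); // url, title, etc.
          System.out.println(details.getValue("url")
              + " : " + details.getValue("title"));
        }
      }
    }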
Actually the process would be to generate your new
segments, move the segments to your newer/faster
server, fetch those segments there, and then copy the
fetched segments back next to your webdb and run updatedb.
You could also index your segments on the faster
server. The only process that needs the webdb is the
There are a number of Linux packages for QoS/traffic shaping; my favorite is
wondershaper. I haven't set it up since the 2.4 kernel, but it works well.
Also, if you're not inclined to do something that involved, your ISP can give
that machine's IP address a CAR (committed access rate) statement in your/their Cisco router.
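If it helps, the packaged wondershaper takes just the interface and the down/up rates in kbit/s; the numbers here are arbitrary examples:

    wondershaper eth0 8000 2000   # cap eth0 at ~8 Mbit down / 2 Mbit up
    wondershaper clear eth0       # remove the limits again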
Just to add my 2 cents: for the most part, if you have
a decent NIC you can issue OS commands to drop
the port rate of your interface to 10 Mbit and not
waste CPU cycles on shaping/proxying.
Although I do recommend Squid for this, since I too use
it to further filter/offload regex/hostname
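For the record, the usual Linux commands for forcing the port rate down, depending on which tool your driver supports:

    ethtool -s eth0 speed 10 duplex half autoneg off   # newer ethtool
    mii-tool -F 10baseT-HD eth0                        # older mii-tool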
Fuad Efendi wrote:
For ISPs around the world, the most important thing is the number of active
TCP sessions.
This is completely false. Having worked for an ISP I can assure you that
the most important metric is the amount of traffic, and its behavior
over time. TCP sessions? We don't
I agree that the simplest solution is the best: drop the network card
speed back to 10 Mbit half duplex, and then you won't be using all of your
ISP's 10 Mbit.
-J
- Original Message -
From: Byron Miller [EMAIL PROTECTED]
To: nutch-user@lucene.apache.org
Sent: Tuesday, January 17, 2006 10:14
Hi everyone,
I want to know how we can skip, while indexing, some websites
that I don't want to be indexed.
I am using the language-identifier plugin for the Marathi language; if I
find that a page is non-Marathi, I want to skip it.
Regards,
Sameer Tamsekar
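One hedged way to do this is a small custom IndexingFilter that drops non-Marathi pages. The sketch below uses the 0.7-style filter signature (0.8 passes different arguments) and assumes the language-identifier plugin has stored its guess in the "lang" field. MarathiOnlyFilter is a made-up name; you would still need the usual plugin.xml wiring, and you should verify that your Nutch version treats a null return as "skip this document":

    import org.apache.lucene.document.Document;
    import org.apache.nutch.fetcher.FetcherOutput;
    import org.apache.nutch.indexer.IndexingException;
    import org.apache.nutch.indexer.IndexingFilter;
    import org.apache.nutch.parse.Parse;

    public class MarathiOnlyFilter implements IndexingFilter {
      public Document filter(Document doc, Parse parse, FetcherOutput fo)
          throws IndexingException {
        String lang = doc.get("lang");   // set by the language-identifier plugin
        if (lang == null || !lang.startsWith("mr")) {
          return null;                   // drop non-Marathi pages from the index
        }
        return doc;
      }
    }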
Hi,
I am using Nutch 0.8 and map-red over 3 machines, and I am getting a lot of
unfetched URLs. The unfetched URLs are supposed to be fetched on the next
round of crawling, but they are not. I started from 8 URLs, and after 3
cycles here are the statistics:
060117 115357 Statistics for
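These statistics come from the CrawlDb reader, which can be re-run after each cycle; assuming the crawldb lives at crawl/crawldb:

    bin/nutch readdb crawl/crawldb -stats   # prints DB_fetched/DB_unfetched counts per status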
Hi,
I believe I have a somewhat unique problem (but hopefully not).
I ran Nutch against our regular site; the crawl finished fine and it's
all working well.
My next step is to get some of our forums into the search
database so they too can be searched.
Doing a crawl is incredibly time
Gang,
I'm trying to bring up a MapReduce system, but am confused about how
to control the logging level. It seems like most of the Nutch code is
still logging the way it used to, but the -logLevel parameter that
was getting passed to each tool's main() method no longer exists (not
that these
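One hedged workaround, assuming the nightly still routes everything through java.util.logging (org.apache.nutch.util.LogFormatter): control levels process-wide with a standard logging.properties file instead of per-tool flags. The package names below are illustrative:

    # logging.properties
    handlers=java.util.logging.ConsoleHandler
    .level=INFO
    java.util.logging.ConsoleHandler.level=ALL
    org.apache.nutch.fetcher.level=FINE

Then pass -Djava.util.logging.config.file=logging.properties on the JVM command line (edit bin/nutch's java invocation if there is no hook for extra options).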
Ken Krugler wrote:
Hello fellow Nutchers,
I followed the steps described here by Doug:
http://mail-archives.apache.org/mod_mbox/lucene-nutch-user/200509.mbox/[EMAIL PROTECTED]
...to start a test run of the new (0.8, as of 1/12/2006) version of
Nutch.
It ran for quite a while on my
Hi,
We have a serious problem fetching pages. This exception blocks the fetching of
all pages. The error appears in the datanode log file; we are using 3 machines and
MapReduce.
060116 221332 194 DataXCeiver
java.net.SocketTimeoutException: Read timed out
at
If I remember correctly, crawl looks for a directory that contains the
urls file.
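That is, something like this (directory and file names are just examples):

    mkdir urls
    echo "http://www.example.com/" > urls/seeds.txt
    bin/nutch crawl urls -dir crawl -depth 3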
On Tue, 2006-01-17 at 11:36 -0800, Chris Shepard wrote:
Hi all,
Having some problems getting nutch to run on
XP/Cygwin.
This is re nutch-2006-01-17
Intranet crawl
When I do this (after making
I'm having the exact same problem.
I noticed that changing the number of map/reduce tasks gives me
different DB_fetched results.
Looking at the logs, a lot of urls are actually missing. I can't find
their trace *anywhere* in the logs (whether on the slaves or the
master). I'm puzzled. Currently
The urls file is in both the root (nutch-nightly)
directory _and_ (a copy) in the bin dir. It bombed
earlier on account of that. This is (I presume) a
different problem.
--- Gal Nitzan [EMAIL PROTECTED] wrote:
If I remember correctly, crawl looks for a directory that contains the urls
How can I make Nutch use more RAM? I've added -Xmx3g to the Tomcat command line,
but Nutch does not seem to use this memory; it does not go above 170MB. No
matter whether I reserve 300MB or 3GB of RAM for the JVM, search runs at similar
speed and requires a similar amount of HDD reads. My test DB
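Not a JVM answer, but since Lucene mostly leans on the OS page cache rather than the Java heap for index reads, one blunt sketch is to serve the index from a RAM-backed filesystem (size and paths are made up):

    mount -t tmpfs -o size=3g tmpfs /mnt/ramindex
    cp -r /path/to/crawl /mnt/ramindex/   # then point searcher.dir at the copy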
Hi Florent,
[snip]
1. Any ideas what might have caused it to time out just now, when it
had successfully run many jobs up to that point?
2. What cruft might I need to get rid of because it died? For example,
I see a reference to /home/crawler/tmp/local/jobTracker/job_18cunz.xml
now
Gal,
Thank you very much! It's crawling as we speak.
I went back to check the tutorial, and it
seems to be in error. Do you have someone in
charge of that page that I should notify about
the omission?
Again, thanks a lot! :)
--- Gal Nitzan [EMAIL PROTECTED] wrote:
assuming
Happy it worked for you. You may change the file yourself in the
wiki...
Happy Crawling. :)
On Tue, 2006-01-17 at 16:18 -0800, Chris Shepard wrote:
Gal,
Thank you very much! It's crawling as we speak.
I went back to check the tutorial, and it
seems to be in error. Do you have
Please be very thorough when confirming that the tutorial
has an error. I didn't experience the error that Chris
reports.
I have my url-file in the directory above /bin, and I run the command:
$ bin/nutch crawl url-file.txt
Everything works for me with no errors.
Hi guys,
Please allow me to attempt to clarify this.
I was working with the 0.7 release (whatever the link
sends one to), where a direct call to the urls
file worked as well. However, after downloading
the latest nightly, I got the error posted, hence
I ass_u_med it was due to some other problem.
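To put the version difference in command form (behavior as observed in this thread; file names are examples):

    # Nutch 0.7.x: the crawl tool accepts a flat file of seed URLs
    bin/nutch crawl url-file.txt -depth 3
    # 0.8/mapred nightlies: it expects a directory containing seed files
    mkdir urls && cp url-file.txt urls/
    bin/nutch crawl urls -depth 3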
I made a small assumption/mistake in a previous post: not all of you are using
transport-layer routers (aka firewalls, or layer-4 routers).
But small in-house companies are almost always using SHDSL etc., IP over
ATM, IP over Frame Relay, ...
Hardware between the crawler and the web site always has