Re: throttling bandwidth

2006-01-17 Thread Michael Nebel
Hi, I had a similar problem and installed a Squid proxy server. Squid can limit the bandwidth, and integrating it with Nutch was pretty simple (just configure a proxy). Furthermore, it gives you another place to block the crawling of particular websites. If needed, I can assist
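
For illustration, a minimal sketch of both sides of that setup (the crawler IP, port, and rate below are made-up example values): Nutch is pointed at the proxy in conf/nutch-site.xml, and Squid caps the crawler's bandwidth with a delay pool.

    <!-- conf/nutch-site.xml: send all fetches through the local Squid -->
    <property>
      <name>http.proxy.host</name>
      <value>localhost</value>
    </property>
    <property>
      <name>http.proxy.port</name>
      <value>3128</value>
    </property>

    # squid.conf: limit the crawler host to roughly 2 Mbit/s aggregate
    acl crawler src 192.168.1.10/32
    delay_pools 1
    delay_class 1 1
    delay_parameters 1 256000/256000
    delay_access 1 allow crawler

Blocking particular sites can then be done with an additional Squid acl plus http_access deny, on top of Nutch's own URL filters.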

Nutch system running on multiple servers | fetcher

2006-01-17 Thread Håvard W. Kongsgård
Hi, I have set up a Nutch (0.7.1) system running on multiple servers following Stefan Groschupf's tutorial (http://wiki.media-style.com/display/nutchDocu/setup+multiple+search+sever). I already had a Nutch index and a set of segments, so I copied some segments to different servers. Now I want to add
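
For reference, the distributed-search part of that tutorial boils down to roughly this (host names, port, and paths are example values, and exact arguments may differ between 0.7.x releases): each box serves its local crawl data, and the web front end lists the servers in a search-servers.txt file that searcher.dir points at.

    # on each search server: serve the local index/segments
    bin/nutch server 9999 /data/nutch/crawl

    # on the web front end: search-servers.txt, one "host port" per line
    search1.example.com 9999
    search2.example.com 9999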

nutch integration

2006-01-17 Thread Ennio Tosi
Hi, I just discovered Nutch and I'm wondering if there is a tutorial that can get me started with integrating Nutch into a Java project. Searching the web I can only find docs about the command-line tools! I tried to browse the APIs, but they are quite big and I don't really know where to start
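
As a starting point, the search side can be driven from Java through org.apache.nutch.searcher.NutchBean. A rough sketch against the 0.7-era API (in 0.8 several of these calls also take a Configuration, and the query string is just an example):

    import org.apache.nutch.searcher.Hit;
    import org.apache.nutch.searcher.HitDetails;
    import org.apache.nutch.searcher.Hits;
    import org.apache.nutch.searcher.NutchBean;
    import org.apache.nutch.searcher.Query;

    public class SearchDemo {
      public static void main(String[] args) throws Exception {
        // NutchBean finds the index and segments under "searcher.dir"
        NutchBean bean = new NutchBean();
        Query query = Query.parse("nutch");
        Hits hits = bean.search(query, 10);          // top 10 hits
        for (int i = 0; i < hits.getLength(); i++) {
          Hit hit = hits.getHit(i);
          HitDetails details = bean.getDetails(hit);
          System.out.println(details.getValue("url") + " : "
              + bean.getSummary(details, query));
        }
      }
    }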

Re: Nutch system running on multiple servers | fetcher

2006-01-17 Thread Byron Miller
Actually the process would be to generate your new segments, move the segments to your newer/faster server, fetch those segments, and then copy them back next to your webdb and run updatedb there. You could also index your segments on the faster server. The only process that needs the webdb is the
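
Roughly, that workflow looks like the following (command names follow the 0.7 whole-web tutorial; the segment timestamp is a placeholder):

    # on the webdb machine: pick the next batch of URLs to fetch
    bin/nutch generate db segments
    # copy the new segment to the faster box and fetch it there
    bin/nutch fetch segments/20060117120000
    # copy the fetched segment back next to the webdb and update it
    bin/nutch updatedb db segments/20060117120000
    # optionally index the segment on the faster box as well
    bin/nutch index segments/20060117120000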

Re: throttling bandwidth

2006-01-17 Thread Jay Pound
There are a number of Linux packages for QoS/traffic shaping; my favorite is wondershaper. I haven't set it up since the 2.4 kernel, but it works well. Also, if you're not inclined to do something that involved, your ISP can give that machine's IP address a CAR statement in your/their Cisco router
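
For example, the classic wondershaper script is driven like this (the rates are example values in kbit/s):

    wondershaper eth0 8000 2000   # cap downlink to ~8 Mbit, uplink to ~2 Mbit
    wondershaper clear eth0       # remove the limits again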

Re: throttling bandwidth

2006-01-17 Thread Byron Miller
Just to add my 2 cents: for the most part, if you have a decent NIC you could issue OS commands to drop the port rate of your interface to 10 Mbit and not waste CPU cycles on shaping/proxying. Although I do recommend Squid for this, since I too use it to further filter/offload regex/hostname
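
For the NIC-speed approach, something along these lines works on most Linux boxes (the interface name is an example):

    # force eth0 down to 10 Mbit
    ethtool -s eth0 speed 10 duplex full autoneg off
    # or with the older tool, at half duplex
    mii-tool -F 10baseT-HD eth0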

Re: throttling bandwidth

2006-01-17 Thread Andrzej Bialecki
Fuad Efendi wrote: For ISPs around the world, the most important thing is the number of active TCP sessions. This is completely false. Having worked for an ISP, I can assure you that the most important metric is the amount of traffic and its behavior over time. TCP sessions? We don't

Re: throttling bandwidth

2006-01-17 Thread Jay Pound
I agree, the simplest solution is the best: drop the network card speed back to 10 Mbit half duplex, then you won't be using all of your ISP's 10 Mbit. -J - Original Message - From: Byron Miller [EMAIL PROTECTED] To: nutch-user@lucene.apache.org Sent: Tuesday, January 17, 2006 10:14

About skipping some sites

2006-01-17 Thread Sameer Tamsekar
Hi everyone, I want to know how we can skip some websites while indexing that I don't want to be indexed. I am using the language identifier plugin for the Marathi language; if I find that a page is non-Marathi, I want to skip it. Regards, Sameer Tamsekar
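
For skipping whole sites, the URL filters are the usual place; a sketch (the host names are placeholders, and the file is conf/crawl-urlfilter.txt for intranet crawls or conf/regex-urlfilter.txt for whole-web crawls):

    # first matching rule wins: reject the unwanted hosts, accept the rest
    -^http://([a-z0-9-]+\.)*unwanted-site\.com/
    -^http://([a-z0-9-]+\.)*another-site\.org/
    +.

Skipping individual pages by detected language would instead need an indexing-filter plugin, since the URL filters run before a page is fetched.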

So many unfetched URLs after depth 4 using MapReduce on 3 machines

2006-01-17 Thread Mike Smith
Hi, I am using Nutch 0.8 and MapReduce over 3 machines, and I am getting so many unfetched URLs. The unfetched URLs are supposed to be fetched on the next round of crawling, but they are not. I started from 8 URLs, and after 3 cycles here are the statistics: 060117 115357 Statistics for

adding additional custom documents to nutch-created index

2006-01-17 Thread cilquirm . 20552126
Hi, I believe I have a somewhat unique problem (but hopefully not). I ran Nutch against our regular site, the crawl finished fine, and it's all working well. My next step is to get some of our forums into the search database so they too can be searched. Doing a crawl is incredibly time
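
One route is to append documents to the existing Lucene index directly rather than crawling the forums. A rough sketch, assuming a Lucene version with the Field.Store/Field.Index constructor (older releases use Field.Text/Field.Keyword instead); the index path, URL, and field values are examples, and note that hand-added documents have no segment data behind them, so the Nutch web app will not be able to build summaries for them:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    public class AddForumPosts {
      public static void main(String[] args) throws Exception {
        // open the existing Nutch index for appending (create = false)
        IndexWriter writer =
            new IndexWriter("crawl/index", new StandardAnalyzer(), false);
        Document doc = new Document();
        doc.add(new Field("url", "http://forums.example.com/post/1",
            Field.Store.YES, Field.Index.UN_TOKENIZED));
        doc.add(new Field("title", "Forum post title",
            Field.Store.YES, Field.Index.TOKENIZED));
        doc.add(new Field("content", "Body of the forum post ...",
            Field.Store.NO, Field.Index.TOKENIZED));
        writer.addDocument(doc);
        writer.optimize();
        writer.close();
      }
    }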

How do I control log level with MapReduce?

2006-01-17 Thread Chris Schneider
Gang, I'm trying to bring up a MapReduce system, but am confused about how to control the logging level. It seems like most of the Nutch code is still logging the way it used to, but the -logLevel parameter that was getting passed to each tool's main() method no longer exists (not that these

Re: Error at end of MapReduce run with indexing

2006-01-17 Thread Florent Gluck
Ken Krugler wrote: Hello fellow Nutchers, I followed the steps described here by Doug: http://mail-archives.apache.org/mod_mbox/lucene-nutch-user/200509.mbox/[EMAIL PROTECTED] ...to start a test run of the new (0.8, as of 1/12/2006) version of Nutch. It ran for quite a while on my

Problem with fetching: Fail in DataXCeiver

2006-01-17 Thread Rafit Izhak_Ratzin
Hi, we have a serious problem fetching pages. This exception blocks the fetching of all pages. The error appears in the datanode log file, using 3 machines and MapReduce. 060116 221332 194 DataXCeiver java.net.SocketTimeoutException: Read timed out at

Re: XP/Cygwin setup problems

2006-01-17 Thread Gal Nitzan
If I remember correctly, crawl looks for a directory that contains the urls file. On Tue, 2006-01-17 at 11:36 -0800, Chris Shepard wrote: Hi all, having some problems getting Nutch to run on XP/Cygwin. This is re: the nutch-2006-01-17 intranet crawl. When I do this (after making
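
In other words, the nightly's crawl command expects a directory of seed files rather than a single flat file, e.g. (names are examples):

    mkdir urls
    echo "http://www.example.com/" > urls/seeds.txt
    bin/nutch crawl urls -dir crawl.test -depth 3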

Re: So many Unfetched Pages using MapReduce

2006-01-17 Thread Florent Gluck
I'm having the exact same problem. I noticed that changing the number of map/reduce tasks gives me different DB_fetched results. Looking at the logs, a lot of URLs are actually missing; I can't find any trace of them *anywhere* in the logs (whether on the slaves or the master). I'm puzzled. Currently

Re: XP/Cygwin setup problems

2006-01-17 Thread Chris Shepard
The urls file is in both the root (nutch-nightly) directory _and_ (a copy) in the bin dir. It bombed earlier on account of that. This is (I presume) a different problem. --- Gal Nitzan [EMAIL PROTECTED] wrote: If I remember correctly, crawl looks for a directory that contains the urls

Help pls Nutch RAM

2006-01-17 Thread Michael Sashnikov
How can I make Nutch use more RAM? I've added -Xmx3g to the Tomcat command line, but Nutch does not seem to use this memory. It does not go above 170MB. No matter whether I reserve 300MB or 3GB of RAM for the JVM, search works at a similar speed and requires a similar amount of HDD reads. My test DB

Re: Error at end of MapReduce run with indexing

2006-01-17 Thread Ken Krugler
Hi Florent, [snip] 1. Any ideas what might have caused it to time out just now, when it had successfully run many jobs up to that point? 2. What cruft might I need to get rid of because it died? For example, I see a reference to /home/crawler/tmp/local/jobTracker/job_18cunz.xml now

Re: XP/Cygwin setup problems

2006-01-17 Thread Chris Shepard
Gal, thank you very much! It's crawling as we speak. I went back to check the tutorial, and it seems to be in error. Is there someone in charge of that page whom I should notify about the omission? Again, thanks a lot! :) --- Gal Nitzan [EMAIL PROTECTED] wrote: assuming

Re: XP/Cygwin setup problems

2006-01-17 Thread Gal Nitzan
Happy it worked for you. You may change the file yourself in the wiki... Happy crawling. :) On Tue, 2006-01-17 at 16:18 -0800, Chris Shepard wrote: Gal, Thank you very much! It's crawling as we speak. I went back to check the tutorial, and it seems to be in error. Do you have

Re: XP/Cygwin setup problems

2006-01-17 Thread dave . campbell
Please be very thorough when confirming that the tutorial has an error. I didn't experience the error that Chris reports. I have my url-file in the directory above /bin, and when the command is: $ bin/nutch crawl url-file.txt everything works for me with no errors.

Re: XP/Cygwin setup problems

2006-01-17 Thread Chris Shepard
Hi guys, please allow me to attempt to clarify this. I was working with the 0.7 release (whatever the link sends one to), where a direct call to the urls file worked as well. However, after downloading the latest nightly, I got the error I posted, hence I ass_u_med it was due to some other problem.

RE: throttling bandwidth

2006-01-17 Thread Fuad Efendi
I made a small assumption/mistake in a previous post. Not all of you are using transport-layer routers (aka firewalls, or layer-4 routers). But small in-house companies are almost always using SHDSL, etc., IP over ATM, IP over Frame Relay, ... The hardware between the crawler and the website always has