Re: Possibly use a different library to parse RSS feed for improved performance and compatibility

2007-06-28 Thread Doğacan Güney
On 6/28/07, Kai_testing Middleton [EMAIL PROTECTED] wrote: I am choosing to use NUTCH-444 for my RSS functionality. Doğacan commented on how to do this; he wrote: ...if you need the functionality of NUTCH-444, I would suggest trying a nightly version of Nutch. Because NUTCH-444 by

too slow for re-parse job ..

2007-06-28 Thread qi wu
Hi, Currently I am running Nutch on a single Linux box with 1G of memory and one 3GHz Intel P4 CPU. Hadoop is running in local mode. Now I am trying to re-parse the HTML pages fetched. The process is very slow; it requires more than 10 days to process nearly 20M pages. I am wondering whether the
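For context, the throughput implied by these numbers can be checked with a quick back-of-the-envelope calculation (a rough sketch; the 20M-pages-in-10-days figure is taken from the message above):

```python
# Rough throughput estimate for the re-parse job described above.
pages = 20_000_000                  # ~20M pages, per the message
seconds = 10 * 24 * 60 * 60         # reported duration: 10 days

pages_per_second = pages / seconds
print(f"{pages_per_second:.1f} pages/s")  # roughly 23 pages/s on a single P4 box
```

That is only about 23 pages parsed per second, which suggests the job is bound by per-page parsing cost rather than I/O.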

Problem with ooParser

2007-06-28 Thread Karol Rybak
Hello, while crawling a large batch of documents I encountered a problem with ooParser. It wouldn't be a big deal; however, after that Fetcher2 stopped fetching completely, so it looks like I'll have to kill it, which is a waste of 800 000 fetched documents... Guess I'll have to fetch in smaller

Stemming with Nutch

2007-06-28 Thread Robert Young
Hi, Are the Nutch Stemming modifications available as a patch? I can't seem to find anything on issue.apache.org Thanks Rob

Re: Stemming with Nutch

2007-06-28 Thread Doğacan Güney
On 6/28/07, Robert Young [EMAIL PROTECTED] wrote: Hi, Are the Nutch Stemming modifications available as a patch? I can't seem to find anything on issue.apache.org There is some sort of stemming for German and French languages (available as plugin analysis-de and analysis-fr). I don't know how

Crawl error with hadoop

2007-06-28 Thread Emmanuel JOKE
Hi Guys, I have a cluster of 2 machines: Linux, Java 1.6. I started a crawl on a list of a few websites only. I used the command bin/nutch crawl urls/site1 -dir crawld -depth 10 -topN 10 -threads 30. I got an error at the 6th depth. CrawlDb update: starting CrawlDb update: db: crawld/crawldb

Scaling up to several machines with Lucene

2007-06-28 Thread Chun Wei Ho
Hi, We are currently running a Tomcat web application serving searches over our Lucene index (10GB) on a single server machine (Dual 3GHz CPU, 4GB RAM). Due to performance issues and to scale up to handle more traffic/search requests, we are getting another server machine. We are looking at two

Re: Stemming with Nutch

2007-06-28 Thread Enis Soztutar
Doğacan Güney wrote: On 6/28/07, Robert Young [EMAIL PROTECTED] wrote: Hi, Are the Nutch Stemming modifications available as a patch? I can't seem to find anything on issue.apache.org There is some sort of stemming for German and French languages (available as plugin analysis-de and

Using nutch to find image links

2007-06-28 Thread bbrown
Is there a way to use nutch to find image source locations? -- Berlin Brown [berlin dot brown at gmail dot com] http://botspiritcompany.com/botlist/?

RE: hadoop-site.xml Help

2007-06-28 Thread Vishal Shah
Hi Daniel, We had a similar problem earlier. In our case, the problem occurred because the slaves couldn't resolve the IP address of the master. Can you try an nslookup for my.machine.com on the slaves to see if it works? If not, you'll have to make sure your DNS server can resolve the
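The DNS check Vishal suggests can also be scripted on each slave; a minimal sketch, where my.machine.com is the placeholder master name from the message and socket.gethostbyname stands in for nslookup:

```python
import socket

def can_resolve(hostname: str) -> bool:
    """Return True if this host can resolve the given hostname to an IP."""
    try:
        socket.gethostbyname(hostname)
        return True
    except socket.gaierror:
        return False

# Run on each slave; "my.machine.com" is the placeholder master name above.
if can_resolve("my.machine.com"):
    print("master resolves")
else:
    print("slave cannot resolve master -- check /etc/hosts or your DNS server")
```

If resolution fails, adding the master's IP and hostname to /etc/hosts on the slaves is a common workaround.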

Re: not crawling relative URLs

2007-06-28 Thread Kai_testing Middleton
Ok, I guess I lied. Nutch IS capable of crawling relative URLs. Essentially what happened is that the page I was attempting to crawl, http://www.sf911truth.org, had more than 100 outlinks on it and the relative URL for about.html that I was expecting to see in my crawl.log was outlink #105.
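The behavior Kai describes is consistent with Nutch's per-page outlink cap (the db.max.outlinks.per.page property, which defaults to 100 in Nutch of this era). A minimal sketch of the truncation effect, illustrative only and not Nutch's actual code:

```python
# Illustrative sketch: a per-page outlink cap silently drops late links.
MAX_OUTLINKS_PER_PAGE = 100  # default for db.max.outlinks.per.page

def keep_outlinks(outlinks):
    """Keep only the first MAX_OUTLINKS_PER_PAGE links found on a page."""
    return outlinks[:MAX_OUTLINKS_PER_PAGE]

# about.html was outlink #105 on the page, so it falls past the cap.
page_outlinks = [f"link{i}" for i in range(1, 105)] + ["about.html"]
kept = keep_outlinks(page_outlinks)
print(len(kept))               # 100
print("about.html" in kept)    # False
```

Raising db.max.outlinks.per.page in nutch-site.xml should make such links visible to the crawl.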

IOException using feed plugin - NUTCH-444

2007-06-28 Thread Kai_testing Middleton
I have tried the NUTCH-444 feed plugin to enable spidering of RSS feeds: /nutch-2007-06-27_06-52-44/plugins/feed (that's a recent nightly build of nutch). When I attempt a crawl I get an IOException: $ nutch crawl /usr/tmp/lee_urls.txt -dir /usr/tmp/lee_apollo -depth 2 crawl started in:

Re: IOException using feed plugin - NUTCH-444

2007-06-28 Thread Kai_testing Middleton
I also note that karan had the same exception trace:

Exception in thread "main" java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
        at org.apache.nutch.crawl.Injector.inject(Injector.java:162)
        at