On 6/28/07, Kai_testing Middleton [EMAIL PROTECTED] wrote:
I am choosing to use NUTCH-444 for my RSS functionality. Doğacan commented on
how to do this; he wrote:
...if you need the functionality of NUTCH-444, I would suggest
trying a nightly version of Nutch. Because NUTCH-444 by
Hi,
Currently, I am running Nutch on a single Linux box with 1 GB of memory and one
3 GHz Intel P4 CPU. Hadoop is running in local mode. Now I am trying to
reparse the fetched HTML pages. The process is very slow; it requires more than 10
days to process nearly 20M pages. I am wondering whether the
Hello, while crawling a large batch of documents I encountered a problem
with ooParser. It wouldn't be a big deal, but after that Fetcher2
stopped fetching completely, so it looks like I'll have to kill it, which is
a waste of 800 000 fetched documents... Guess I'll have to fetch in smaller
Hi,
Are the Nutch Stemming modifications available as a patch? I can't
seem to find anything on issues.apache.org
Thanks
Rob
On 6/28/07, Robert Young [EMAIL PROTECTED] wrote:
Hi,
Are the Nutch Stemming modifications available as a patch? I can't
seem to find anything on issues.apache.org
There is some sort of stemming for German and French languages
(available as plugin analysis-de and analysis-fr). I don't know how
Hi Guys,
I have a cluster of 2 machines: Linux, Java 1.6.
I started a crawl on a list of only a few websites. I used the command
bin/nutch crawl urls/site1 -dir crawld -depth 10 -topN 10 -threads 30
I got an error at depth 6.
CrawlDb update: starting
CrawlDb update: db: crawld/crawldb
Hi,
We are currently running a Tomcat web application serving searches
over our Lucene index (10GB) on a single server machine (Dual 3GHz
CPU, 4GB RAM). Due to performance issues and to scale up to handle
more traffic/search requests, we are getting another server machine.
We are looking at two
Doğacan Güney wrote:
On 6/28/07, Robert Young [EMAIL PROTECTED] wrote:
Hi,
Are the Nutch Stemming modifications available as a patch? I can't
seem to find anything on issues.apache.org
There is some sort of stemming for German and French languages
(available as plugin analysis-de and
Is there a way to use nutch to find image source locations?
--
Berlin Brown
[berlin dot brown at gmail dot com]
http://botspiritcompany.com/botlist/?
Hi Daniel,
We had a similar problem earlier. In our case, this problem was caused
because the slaves couldn't resolve the IP address of the master.
Can you try an nslookup for my.machine.com on the slaves to see if it
works? If not, you'll have to make sure your DNS server can resolve the
Ok, I guess I lied.
Nutch IS capable of crawling relative URLs.
Essentially what happened is that the page I was attempting to crawl,
http://www.sf911truth.org, had more than 100 outlinks on it and the relative
URL for about.html that I was expecting to see in my crawl.log was outlink
#105.
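To make the cutoff concrete, here is a minimal sketch of a per-page outlink cap. It is an illustration only, not Nutch's actual code: the function name, the constant, and the 110-link page are made up for the example. As far as I know, the real setting is the db.max.outlinks.per.page property (default 100, with a negative value meaning "keep everything"), but check nutch-default.xml in your build.

```python
# Illustrative only: mimics a per-page outlink cap; this is not Nutch's code.
MAX_OUTLINKS_PER_PAGE = 100  # assumed cap, matching the "more than 100" above


def kept_outlinks(outlinks, limit=MAX_OUTLINKS_PER_PAGE):
    """Return the outlinks that survive the cap; a negative limit keeps all."""
    outlinks = list(outlinks)
    if limit < 0:
        return outlinks
    return outlinks[:limit]


# Hypothetical page with 110 links, where about.html is outlink #105 (1-based).
links = [f"link{i}.html" for i in range(1, 111)]
links[104] = "about.html"

print("about.html" in kept_outlinks(links))      # False: cut off by the cap
print("about.html" in kept_outlinks(links, -1))  # True: no cap applied
```

With a cap of 100, anything from outlink #101 onward never shows up in the crawl log, which matches what happened with about.html.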
I have tried the NUTCH-444 feed plugin to enable spidering of RSS feeds:
/nutch-2007-06-27_06-52-44/plugins/feed
(that's a recent nightly build of nutch).
When I attempt a crawl I get an IOException:
$ nutch crawl /usr/tmp/lee_urls.txt -dir /usr/tmp/lee_apollo -depth 2
crawl started in:
I also note that karan had the same exception trace:
Exception in thread "main" java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
at org.apache.nutch.crawl.Injector.inject(Injector.java:162)
at