Ken Krugler wrote:
1. We needed to modify the commons-httpclient code to fix one hang
that sometimes occurs in
[...]
So the question here is what to do with these changes. I will try to
get them integrated into the commons-httpclient code, but it may take
a while for those fixes to circle back
Hi All,
I am new to Nutch. I have downloaded the latest nutch-0.7.1. I have Microsoft
Windows installed on my PC with JAVA_HOME set. I came to know that Cygwin is
required to run Nutch? Why can't I run Nutch from Windows directly? If so, do
I need to switch to Linux (or any flavor of Unix)?
How can we create test e
Thanx for the explanation :)
-Original Message-
From: Paul Baclace [mailto:[EMAIL PROTECTED]
Sent: Thursday, November 10, 2005 5:18 AM
To: nutch-dev@lucene.apache.org
Subject: Re: Distributed nutch
In addition to Stefan Groschupf's detailed references, here are some
short, high-level answers to your questions:
I was recently benchmarking fetching at a site with lots of
bandwidth, and it seemed to me that protocol-http is capable of
faster crawling than protocol-httpclient. So I don't think we should
discard protocol-http just yet. But there's a lot of duplicate code
between these, which is difficult
Doug Cutting wrote:
>... protocol-http is capable of faster crawling than protocol-httpclient.
> So I don't think we should discard protocol-http just yet.
>What do others think?
I think:
The HttpClient-based [protocol-httpclient] uses its own threads.
[protocol-http] does not create threads.
We shou
[
http://issues.apache.org/jira/browse/NUTCH-124?page=comments#action_12357186 ]
Fuad Efendi commented on NUTCH-124:
---
Is such behavior defined in Robots Exclusion Protocol?
http://www.robotstxt.org/ If so, it should be some kind of a new field in
robots.
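As a point of reference for the question above, this is how a crawler consults robots.txt under the classic Robots Exclusion Protocol. A minimal sketch using Python's standard-library parser; the robots.txt content and the "NutchBot" agent name are hypothetical:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
robots_txt = """\
User-agent: *
Disallow: /private/
""".splitlines()

rp = RobotFileParser()
rp.parse(robots_txt)

# The classic protocol only defines User-agent / Disallow directives;
# any new field would need explicit parser support like this check has.
print(rp.can_fetch("NutchBot", "http://example.com/index.html"))      # True
print(rp.can_fetch("NutchBot", "http://example.com/private/a.html"))  # False
```

Any field beyond these directives is an extension, which is exactly why the question of whether the behavior is "defined in the protocol" matters.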
Doug, I love you
hehehe :)
Great vision for how things could work!
--- Doug Cutting <[EMAIL PROTECTED]> wrote:
>
> In the future I would like to implement a more
> automated distributed
> search system than Nutch currently has. One way to
> do this might be to
> use MapReduce. Each map
Hi Doug
On 11/10/05, Doug Cutting <[EMAIL PROTECTED]> wrote:
> Jack Tang wrote:
> > Below is google architecture in my brain:
> >
> >            DataNode A
> > Master     DataNode B     GoogleCrawler
> >            DataNode C
> >            ...
> > GoogleCrawler is kep
In addition to Stefan Groschupf's detailed references, here are some short,
high-level answers to your questions:
Rozina Sorathia wrote:
> 1. What is distributed Nutch?
Nutch is a distributed Lucene with large-scale web crawling.
> 2. How does distributed Nutch work?
Modeled after Google's Map-Reduce.
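The Map-Reduce model referred to above can be sketched in a few lines. This toy word count (illustrative only, not Nutch code) shows the map, shuffle/sort, and reduce phases that distributed Nutch jobs follow:

```python
from itertools import groupby
from operator import itemgetter

def map_fn(doc):
    # Map phase: emit (key, value) pairs for each input record.
    for word in doc.split():
        yield (word, 1)

def reduce_fn(word, counts):
    # Reduce phase: combine all values collected for one key.
    return word, sum(counts)

docs = ["nutch crawls the web", "lucene indexes the web"]

# Shuffle/sort phase: bring all pairs with the same key together.
pairs = sorted(kv for doc in docs for kv in map_fn(doc))

result = dict(
    reduce_fn(key, [v for _, v in group])
    for key, group in groupby(pairs, key=itemgetter(0))
)
print(result["web"])  # 2
```

In the real system the map and reduce tasks run on different machines and the shuffle moves data over the network, but the data flow is the same.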
[
http://issues.apache.org/jira/browse/NUTCH-36?page=comments#action_12357135 ]
Andrzej Bialecki commented on NUTCH-36:
Jack,
Have you tested the latest patches attached to this issue + your fix for
summarizer? I can test that technically speaking
[ http://issues.apache.org/jira/browse/NUTCH-109?page=all ]
Andrzej Bialecki closed NUTCH-109:
---
Resolution: Invalid
Proposed improvement is not real, and comes from different config settings.
Proposed implementation uses a component with incomp
+1
I've been planning to switch my crawler over to use protocol-
httpclient, but haven't got there yet. Interesting that there seems
to be a performance impact with the new plugin.
(In my crawl setup, I override the default HTTP plugin so I can
modify HTML content before it is written to a
Doug Cutting wrote:
Jérôme Charron wrote:
In fact, I think it could be a good idea to move the nutch language
identifier core code
to a standalone library or to lucene code.
Does it make sense? What do you think about it? What is the best
solution
(standalone vs lucene)?
One could put it
Doug Cutting wrote:
I was recently benchmarking fetching at a site with lots of bandwidth,
and it seemed to me that protocol-http is capable of faster crawling
than protocol-httpclient. So I don't think we should discard
protocol-http just yet. But there's a lot of duplicate code between
th
Massimo Miccoli wrote:
There's a problem with that solution. For some sites, protocol-httpclient now
generates a SEVERE "Narrowly avoided an infinite loop in execute" message, so
the fetcher exits and only some pages are fetched before the SEVERE message.
I don't know of a solution; for now I've switched back to protocol-http.
Arsen Popovyan wrote:
I start the namenode, datanode, jobtracker, and tasktracker. And when I try
to run these commands:
1) echo http://cnn.com/ > ./urldir/urls
2) bin/nutch ndfs -put ./urldir /urldir
3) bin/nutch inject /db -urlfile /urldir/urls
on last command I get error:
Exception in thread "main" ja
Jack Tang wrote:
Below is google architecture in my brain:
           DataNode A
Master     DataNode B     GoogleCrawler
           DataNode C
           ...
GoogleCrawler is kept running all the time. One day, it gets a fetchlist
from DataNode A, crawls all pages and i
Jérôme Charron wrote:
In fact, I think it could be a good idea to move the nutch language
identifier core code
to a standalone library or to lucene code.
Does it make sense? What do you think about it? What is the best solution
(standalone vs lucene)?
One could put it in the lucene contrib dire
Rod Taylor wrote:
The attached patches for Generator.java and Injector.java allow a
specific temporary directory to be specified. This gives Nutch the full
path to these temporary directories and seems to fix the "No input
directories" issue when using a local filesystem with multiple task
trackers.
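For context, such an override would normally live in nutch-site.xml. A hypothetical fragment is below; the property name `mapred.local.dir` is an assumption based on the mapred branch's configuration, and the attached patches may well use a different mechanism:

```xml
<!-- Hypothetical nutch-site.xml fragment: give mapred an absolute
     temporary-directory path visible to every task tracker. -->
<property>
  <name>mapred.local.dir</name>
  <value>/var/nutch/mapred/local</value>
</property>
```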
Here's the problem:
I need to get the Nutch engine running on a collection of xml documents that
I have (containing news stories). The files are named in the following
manner:
example.xml.52908
example.xml.52909
example.xml.52910
example.xml.52911
...
example.xml.53365
example.xml.53366
Each
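Since the files differ only by a numeric suffix, a small helper can collect them in order before handing them to a parser or index step. This is a hypothetical sketch (the function and directory names are mine, not part of Nutch):

```python
from pathlib import Path

def numbered_xml_files(directory, stem="example.xml"):
    """Return the stem.NNNNN files described above, sorted numerically.

    Sorting on the integer suffix (rather than alphabetically) keeps
    the order correct even if the numbers grow an extra digit.
    """
    return sorted(
        Path(directory).glob(stem + ".*"),
        key=lambda p: int(p.suffix.lstrip(".")),
    )
```

A custom parser plugin (or an offline pre-processing step that feeds Nutch) could then iterate over the returned paths.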
I was recently benchmarking fetching at a site with lots of bandwidth,
and it seemed to me that protocol-http is capable of faster crawling
than protocol-httpclient. So I don't think we should discard
protocol-http just yet. But there's a lot of duplicate code between
these, which is difficul
Hello All
My question is kind of related to the email below. I was exploring the option
to full-text index a fairly large database that's 40G in size (data alone,
excluding indices etc.). This data resides in Oracle, which has its own
full-text indexing
engine. Does anyone have a recommendation bet
> Yes, Lucene is the best fit for what you're after. Nutch is built on
> Lucene, and adds web crawling on top. You don't need a web crawler,
> so using Lucene directly is the best fit - of course you'll have to
> write code to integrate Lucene.
Erik,
I was thinking about it for a while, but don't
Yes, Lucene is the best fit for what you're after. Nutch is built on
Lucene, and adds web crawling on top. You don't need a web crawler,
so using Lucene directly is the best fit - of course you'll have to
write code to integrate Lucene.
Erik
On 9 Nov 2005, at 08:48, Klaus wrote:
H
Hello,
my name is Klaus and I'm a new member of this mailing list. I'm currently
working on my master's thesis. One of my tasks is to implement a full-text
search in an existing information system. Browsing the web, I found Lucene
and Nutch. Unfortunately I'm not sure which of these tools fits
There's a problem with that solution. For some sites, protocol-httpclient now
generates a SEVERE "Narrowly avoided an infinite loop in execute" message, so
the fetcher exits and only some pages are fetched before the SEVERE message.
I don't know of a solution; for now I've switched back to protocol-http.
Doug
Ok ...will take a note of it...
Thanx for the reply :)
-Original Message-
From: Stefan Groschupf [mailto:[EMAIL PROTECTED]
Sent: Wednesday, November 09, 2005 5:50 PM
To: nutch-dev@lucene.apache.org; nutch-user@lucene.apache.org
Subject: Re: Distributed nutch
Please do not cross post to t
Please do not cross post to the user and developer list!
Nutch uses MapReduce as its distribution mechanism.
see: http://wiki.apache.org/nutch/Presentations
mapred.pdf: "MapReduce in Nutch", 20 June 2005, Yahoo!, Sunnyvale,
CA, USA
oscon05.pdf: "Scalable Computing with MapReduce", 3 August 2005,
I have the following queries. Can anyone explain these or tell me where I
can find a detailed explanation:
1. What is distributed Nutch?
2. How does distributed Nutch work?
3. When we say distributed, what is distributed?
4. When one server goes down, what happens?
Thanks for your explanation, Andrzej.
I am going to read some of the NDFS source code and ask smarter questions later.
Thanks again.
Regards
/Jack
On 11/9/05, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
> Jack Tang wrote:
>
> >Hi Andrzej
> >
> >In document, Michael said:
> >"I'd strongly recommend usin
and three copies of chunks are distributed on the slaves. If slave 1
is 90% busy, slave 2 is 80% busy, and slave 3 is idle, how does NDFS
handle this case?
Actually you have to do that manually, but there will be an
automatic solution later.
Or could you tell me where I should start learning?
The nut
Jack Tang wrote:
Hi Andrzej
In document, Michael said:
"I'd strongly recommend using the system with a replication rate of 3
copies, 2 minimum. Desired replication can be set in nutch config file
using "ndfs.replication" property, and MIN_REPLICATION constant is
located in ndfs/FSNamesystem.java
Hi Andrzej
In document, Michael said:
"I'd strongly recommend using the system with a replication rate of 3
copies, 2 minimum. Desired replication can be set in nutch config file
using "ndfs.replication" property, and MIN_REPLICATION constant is
located in ndfs/FSNamesystem.java (and set to 1 by default).
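The "ndfs.replication" property quoted above is overridden in the Nutch configuration; a minimal fragment (placing it in nutch-site.xml, the usual location for overrides, is my assumption):

```xml
<!-- Keep 3 copies of each block, the replication rate recommended above. -->
<property>
  <name>ndfs.replication</name>
  <value>3</value>
</property>
```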