Re: protocol-http versus protocol-httpclient

2005-11-09 Thread Andrzej Bialecki
Ken Krugler wrote: 1. We needed to modify the commons-httpclient code to fix one hang that sometimes occurs in [...] So the question here is what to do with these changes. I will try to get them integrated into the commons-httpclient code, but that might take a while before they circle back

What is suitable environment?

2005-11-09 Thread KAAS INFOTECH
Hi All, I am new to Nutch. I have downloaded the latest nutch-0.7.1. I have Microsoft Windows installed on my PC with JAVA_HOME set. I have learned that Cygwin is required to run Nutch. Why can't I run Nutch directly from Windows? If not, do I need to switch to Linux (or any flavor of Unix)? How can we create test e

RE: Distributed nutch

2005-11-09 Thread Rozina Sorathia
Thanks for the explanation :) -----Original Message----- From: Paul Baclace [mailto:[EMAIL PROTECTED] Sent: Thursday, November 10, 2005 5:18 AM To: nutch-dev@lucene.apache.org Subject: Re: Distributed nutch In addition to Stefan Groschupf's detailed references, here are some short, high-level ans

Re: protocol-http versus protocol-httpclient

2005-11-09 Thread Ken Krugler
I was recently benchmarking fetching at a site with lots of bandwidth, and it seemed to me that protocol-http is capable of faster crawling than protocol-httpclient. So I don't think we should discard protocol-http just yet. But there's a lot of duplicate code between these, which is difficult

RE: protocol-http versus protocol-httpclient

2005-11-09 Thread Fuad Efendi
Doug Cutting wrote: >... protocol-http is capable of faster crawling than protocol-httpclient. > So I don't think we should discard protocol-http just yet. >What do others think? I think: the HttpClient-based plugin [protocol-httpclient] uses its own threads. [protocol-http] does not create threads. We shou

[jira] Commented: (NUTCH-124) protocol-httpclient does not follow redirects when fetching robots.txt

2005-11-09 Thread Fuad Efendi (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-124?page=comments#action_12357186 ] Fuad Efendi commented on NUTCH-124: --- Is such behavior defined in Robots Exclusion Protocol? http://www.robotstxt.org/ If so, it should be some kind of a new field in robots.

Re: Index update and Google Dance

2005-11-09 Thread Byron Miller
Doug, I love you hehehe :) Great vision for how things could work! --- Doug Cutting <[EMAIL PROTECTED]> wrote: > In the future I would like to implement a more automated distributed search system than Nutch currently has. One way to do this might be to use MapReduce. Each map

Re: Index update and Google Dance

2005-11-09 Thread Jack Tang
Hi Doug On 11/10/05, Doug Cutting <[EMAIL PROTECTED]> wrote: > Jack Tang wrote: > > Below is google architecture in my brain: > > > > DataNode A > > Master DataNode B GoogleCrawler > > DataNode C > > .. > > GoogleCrawler is kep

Re: Distributed nutch

2005-11-09 Thread Paul Baclace
In addition to Stefan Groschupf's detailed references, here are some short, high-level answers to your questions: Rozina Sorathia wrote: > 1. What is Distributed nutch Nutch is a distributed Lucene with large scale web crawling. >2. How nutch distributed works? Modeled after Google's Map-R

[jira] Commented: (NUTCH-36) Chinese in Nutch

2005-11-09 Thread Andrzej Bialecki (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-36?page=comments#action_12357135 ] Andrzej Bialecki commented on NUTCH-36: Jack, Have you tested the latest patches attached to this issue + your fix for summarizer? I can test that technically speaking

[jira] Closed: (NUTCH-109) Nutch - Fetcher - Performance Test - new Protocol-HTTPClient-Innovation

2005-11-09 Thread Andrzej Bialecki (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-109?page=all ] Andrzej Bialecki closed NUTCH-109: --- Resolution: Invalid Proposed improvement is not real, and comes from different config. settings. Proposed implementation uses a component with incomp

Re: protocol-http versus protocol-httpclient

2005-11-09 Thread Matt Kangas
+1 I've been planning to switch my crawler over to use protocol-httpclient, but haven't got there yet. Interesting that there seems to be a performance impact with the new plugin. (In my crawl setup, I override the default HTTP plugin so I can modify HTML content before it is written to a

Re: Lucene or Nutch

2005-11-09 Thread Andrzej Bialecki
Doug Cutting wrote: Jérôme Charron wrote: In fact, I think it could be a good idea to move the nutch language identifier core code to a standalone library or to lucene code. Does it make sense? What do you think about it? What is the best solution (standalone vs lucene)? One could put it

Re: protocol-http versus protocol-httpclient

2005-11-09 Thread Andrzej Bialecki
Doug Cutting wrote: I was recently benchmarking fetching at a site with lots of bandwidth, and it seemed to me that protocol-http is capable of faster crawling than protocol-httpclient. So I don't think we should discard protocol-http just yet. But there's a lot of duplicate code between th

Re: [Nutch-dev] [jira] Resolved: (NUTCH-124) protocol-httpclient does not follow redirects when fetching robots.txt

2005-11-09 Thread Doug Cutting
Massimo Miccoli wrote: There's a problem with that solution. For some sites, protocol-httpclient now generates a SEVERE "Narrowly avoided an infinite loop in execute" message, so the fetcher exits and only some pages are fetched before the SEVERE message. I don't know a solution; for now I switch back to

Re: probem with inject url to db using ndfs

2005-11-09 Thread Doug Cutting
Arsen Popovyan wrote: I start the namenode, datanode, jobtracker and tasktracker. Then I try to run these commands: 1) echo http://cnn.com/ > ./urldir/urls 2) bin/nutch ndfs -put ./urldir /urldir 3) bin/nutch inject /db -urlfile /urldir/urls On the last command I get an error: Exception in thread "main" ja

Re: Index update and Google Dance

2005-11-09 Thread Doug Cutting
Jack Tang wrote: Below is google architecture in my brain: DataNode A Master DataNode B GoogleCrawler DataNode C .. GoogleCrawler is kept running all the time. One day, it gets a fetchlist from DataNode A, crawls all pages and i

Re: Lucene or Nutch

2005-11-09 Thread Doug Cutting
Jérôme Charron wrote: In fact, I think it could be a good idea to move the nutch language identifier core code to a standalone library or to lucene code. Does it make sense? What do you think about it? What is the best solution (standalone vs lucene)? One could put it in the lucene contrib dire

Re: mapred bug -- bad part calculation?

2005-11-09 Thread Paul Baclace
Rod Taylor wrote: The attached patches for Generator.java and Injector.java allow a specific temporary directory to be specified. This gives Nutch the full path to these temporary directories and seems to fix the "No input directories" issue when using a local filesystem with multiple task tracke

Request for info regarding filesystem based index.

2005-11-09 Thread Mike Reynols
Here's the problem: I need to get the Nutch engine running on a collection of xml documents that I have (containing news stories). The files are named in the following manner: example.xml.52908 example.xml.52909 example.xml.52910 example.xml.52911 ... example.xml.53365 example.xml.53366 Each
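One practical detail with a naming scheme like this: a plain string sort can put example.xml.10000 before example.xml.9999, so it is safer to order the files by their numeric suffix. A minimal Python sketch, assuming the suffix after the last dot is always the sequence number. The function names are hypothetical and not part of Nutch or Lucene.

```python
def numeric_suffix(name):
    """Return the integer after the last dot, e.g. 52908 for 'example.xml.52908'."""
    return int(name.rsplit(".", 1)[1])


def sort_by_numeric_suffix(names):
    """Sort file names by numeric suffix, skipping names that have none."""
    numbered = [
        n for n in names
        if "." in n and n.rsplit(".", 1)[1].isdecimal()
    ]
    return sorted(numbered, key=numeric_suffix)
```

The sorted list could then be handed to whatever indexing step is used, one document per file.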

protocol-http versus protocol-httpclient

2005-11-09 Thread Doug Cutting
I was recently benchmarking fetching at a site with lots of bandwidth, and it seemed to me that protocol-http is capable of faster crawling than protocol-httpclient. So I don't think we should discard protocol-http just yet. But there's a lot of duplicate code between these, which is difficul

RE: Lucene or Nutch

2005-11-09 Thread Rajan, Renuka
Hello All My question is kind of related to the email below. I was exploring the option to full-text index a fairly large database that's 40G in size (data alone minus indices etc). This data resides in Oracle which has its own full text indexing engine. Does anyone have a recommendation bet

Re: Lucene or Nutch

2005-11-09 Thread Jérôme Charron
> Yes, Lucene is the best fit for what you're after. Nutch is built on > Lucene, and adds web crawling on top. You don't need a web crawler, > so using Lucene directly is the best fit - of course you'll have to > write code to integrate Lucene. Erik, I was thinking about it for a while, but don't

Re: Lucene or Nutch

2005-11-09 Thread Erik Hatcher
Yes, Lucene is the best fit for what you're after. Nutch is built on Lucene, and adds web crawling on top. You don't need a web crawler, so using Lucene directly is the best fit - of course you'll have to write code to integrate Lucene. Erik On 9 Nov 2005, at 08:48, Klaus wrote: H

Lucene or Nutch

2005-11-09 Thread Klaus
Hello, my name is Klaus and I'm a new member of this mailing list. I'm currently working on my master's thesis. One of my tasks is to implement a full-text search in an existing information system. Browsing the web, I found Lucene and Nutch. Unfortunately I'm not sure which of these tools fits

Re: [Nutch-dev] [jira] Resolved: (NUTCH-124) protocol-httpclient does not follow redirects when fetching robots.txt

2005-11-09 Thread Massimo Miccoli
There's a problem with that solution. For some sites, protocol-httpclient now generates a SEVERE "Narrowly avoided an infinite loop in execute" message, so the fetcher exits and only some pages are fetched before the SEVERE message. I don't know a solution; for now I switch back to protocol-http. Doug

RE: Distributed nutch

2005-11-09 Thread Rozina Sorathia
OK, will take note of it. Thanks for the reply :) -----Original Message----- From: Stefan Groschupf [mailto:[EMAIL PROTECTED] Sent: Wednesday, November 09, 2005 5:50 PM To: nutch-dev@lucene.apache.org; nutch-user@lucene.apache.org Subject: Re: Distributed nutch Please do not cross post to t

Re: Distributed nutch

2005-11-09 Thread Stefan Groschupf
Please do not cross-post to the user and developer lists! Nutch uses MapReduce as its distribution mechanism. See: http://wiki.apache.org/nutch/Presentations mapred.pdf: "MapReduce in Nutch", 20 June 2005, Yahoo!, Sunnyvale, CA, USA oscon05.pdf: "Scalable Computing with MapReduce", 3 August 2005,
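For readers new to the model, the map and reduce phases can be illustrated with a toy, single-process Python sketch. This is not Nutch's implementation or API; every name below (map_reduce, host_mapper, count_reducer) is a hypothetical illustration, and the example simply counts URLs per host.

```python
from collections import defaultdict
from urllib.parse import urlparse


def map_reduce(records, mapper, reducer):
    """Apply mapper to each record, group emitted values by key, reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}


def host_mapper(url):
    """Map step: emit (host, 1) for one URL."""
    yield urlparse(url).netloc, 1


def count_reducer(host, counts):
    """Reduce step: sum the counts emitted for one host."""
    return sum(counts)
```

In a distributed system the mapper calls would run on different machines and the grouping step would move data between them; here everything runs in one process only to show the shape of the computation.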

Distributed nutch

2005-11-09 Thread Rozina Sorathia
I have the following queries. Can anyone explain these or tell me where I can find a detailed explanation? 1. What is distributed Nutch? 2. How does distributed Nutch work? 3. When we say distributed, what is distributed? 4. When one server goes down, what happens?

Re: Index update and Google Dance

2005-11-09 Thread Jack Tang
Thanks for your explanation, Andrzej. I am going to read some NFS source code and ask smarter questions later. Thanks again. Regards /Jack On 11/9/05, Andrzej Bialecki <[EMAIL PROTECTED]> wrote: > Jack Tang wrote: > > >Hi Andrzej > > > >In document, Michael said: > >"I'd strongly recommend usin

Re: Index update and Google Dance

2005-11-09 Thread Stefan Groschupf
and three copies of chunks are distributed on the slaves. If slave 1 is 90% busy, and 2 is 80% busy, 3 is idle. How does NFS do in this case? Actually you have to do that manually, but there will be an automatic solution later. Or could you tell me where I should start learning? The nut

Re: Index update and Google Dance

2005-11-09 Thread Andrzej Bialecki
Jack Tang wrote: Hi Andrzej In document, Michael said: "I'd strongly recommend using the system with a replication rate of 3 copies, 2 minimum. Desired replication can be set in nutch config file using "ndfs.replication" property, and MIN_REPLICATION constant is located in ndfs/FSNamesystem.jav

Re: Index update and Google Dance

2005-11-09 Thread Jack Tang
Hi Andrzej In document, Michael said: "I'd strongly recommend using the system with a replication rate of 3 copies, 2 minimum. Desired replication can be set in nutch config file using "ndfs.replication" property, and MIN_REPLICATION constant is located in ndfs/FSNamesystem.java (and set to 1 by d