Re: How to fetch URLs with special characters '?' '='

2009-11-16 Thread Subhojit Roy
By adding the expression +[?=] just below the line that contains -...@], the URLs you mentioned are crawled. You could try that and see if it succeeds for you. -sroy On Wed, Nov 4, 2009 at 8:36 PM, saravan.krish saravanan-2.krishnamoorth...@cognizant.com wrote: I am trying to crawl the URL:
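The suggestion above edits Nutch's URL filter configuration. A sketch of what conf/regex-urlfilter.txt might look like after the change (the exact surrounding rules depend on your Nutch version; note that rules are applied top-down and the first matching pattern wins, so placement relative to the rule that rejects query characters matters):

```
# accept URLs containing '?' or '=' (added per the suggestion above)
+[?=]

# skip URLs containing characters that usually mark dynamic queries
# (this is the stock rule the reply refers to as "-...@]")
-[?*!@=]
```

With first-match-wins semantics, the accept rule must appear before the skip rule for query-style URLs to survive filtering.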

Re: Nutch near future - strategic directions

2009-11-16 Thread Andrzej Bialecki
Subhojit Roy wrote: Hi, Would it be possible to include in Nutch the ability to crawl/download a page only if the page has been updated since the last crawl? I had read some time back that there were plans to include such a feature. It would be a very useful feature to have, IMO. This of course
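The general idea being requested is conditional fetching. A minimal sketch of the mechanism (this is the standard HTTP If-Modified-Since handshake, not Nutch's actual implementation; function names are hypothetical):

```python
# Conditional-fetch sketch: ask the server to return 304 Not Modified
# if the page is unchanged since our last fetch, so we can skip it.
from email.utils import formatdate

def conditional_headers(last_fetch_epoch):
    """Headers for a re-fetch of a page last seen at last_fetch_epoch."""
    return {"If-Modified-Since": formatdate(last_fetch_epoch, usegmt=True)}

def should_reparse(status_code):
    # 304 Not Modified -> keep the cached copy, skip parsing/indexing.
    return status_code != 304
```

Servers are not required to honor If-Modified-Since, so a crawler still needs a content-signature fallback for dedup/change detection.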

Re: crawling / data aggregation - is nutch the right tool?

2009-11-16 Thread no spam
For now I only need to crawl hundreds of pages; previously I wrote stuff from scratch in Perl. I want something that allows me to get started quickly and allows for scale in the future. I like that Droids is a framework and I only have to do minimal work to get started. Apache-Tika is the

Re: Nutch near future - strategic directions

2009-11-16 Thread David M. Cole
At 2:44 PM +0100 11/16/09, Andrzej Bialecki wrote: This is already implemented - see the Signature / MD5Signature / TextProfileSignature. OK, then could somebody explain how to implement this feature? Does the initial indexing require a special command-line? Then does the secondary indexing
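The two signature classes mentioned take different approaches: MD5Signature hashes the raw content (exact duplicates only), while TextProfileSignature hashes a profile of frequent tokens so near-duplicates collide too. A rough analogue in Python (this approximates the idea; it is not the actual Nutch algorithm, which quantizes token frequencies):

```python
import hashlib
import re
from collections import Counter

def md5_signature(content: bytes) -> str:
    # Analogue of MD5Signature: identical bytes -> identical signature.
    return hashlib.md5(content).hexdigest()

def text_profile_signature(text: str, min_token_len: int = 2) -> str:
    # Analogue of TextProfileSignature: hash a profile of the most
    # frequent tokens, so pages with the same core text map together.
    tokens = [t for t in re.findall(r"\w+", text.lower())
              if len(t) >= min_token_len]
    top = sorted(Counter(tokens).items(), key=lambda kv: (-kv[1], kv[0]))[:20]
    profile = " ".join(f"{tok}:{n}" for tok, n in top)
    return hashlib.md5(profile.encode("utf-8")).hexdigest()
```

With a signature stored per URL, a re-crawl can compare signatures and skip re-indexing unchanged pages.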

decoding nutch readseg -dump's output

2009-11-16 Thread Yves Petinot
Hi, I'm trying to build a small Perl (could be any scripting language) utility that takes nutch readseg -dump's output as its input, decodes the content field to UTF-8 (independent of what encoding the raw page was in) and outputs that decoded content. After a little bit of experimentation,
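The decoding step being described could be sketched like this (assuming the charset declared in the dumped content metadata is available; the function name and fallback order are illustrative, not part of Nutch):

```python
# Normalize raw page bytes to a UTF-8 Python string, trying the
# declared charset first, then common fallbacks.
def to_utf8(raw, declared_charset=None):
    for enc in filter(None, [declared_charset, "utf-8", "latin-1"]):
        try:
            return raw.decode(enc)
        except (UnicodeDecodeError, LookupError):
            continue
    # Last resort: never fail, just substitute undecodable bytes.
    return raw.decode("utf-8", errors="replace")
```

latin-1 accepts any byte sequence, so the chain always terminates; the trade-off is that a wrong guess silently produces mojibake rather than an error.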

Scalability for one site

2009-11-16 Thread Mark Kerzner
Hi, I want to politely crawl a site with 1-2 million pages. With the speed of about 1-2 seconds per fetch, it will take weeks. Can I run Nutch on Hadoop, and can I coordinate the crawlers so as not to cause a DOS attack? I know that URLs from one domain are assigned to one fetch segment, and
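The "weeks" estimate checks out with simple arithmetic (midpoint values assumed for illustration):

```python
# Back-of-the-envelope crawl duration for a single polite host.
pages = 1_500_000        # middle of the 1-2 million range
delay_s = 1.5            # ~1-2 s between fetches to the same host
days = pages * delay_s / 86_400
print(round(days, 1))    # -> 26.0 days
```

Because politeness serializes fetches per host, this bound holds no matter how many Hadoop nodes run the fetcher; only the per-host delay (or the site owner's permission to lower it) changes it.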

Re: decoding nutch readseg -dump's output

2009-11-16 Thread Andrzej Bialecki
Yves Petinot wrote: Hi, I'm trying to build a small Perl (could be any scripting language) utility that takes nutch readseg -dump's output as its input, decodes the content field to UTF-8 (independent of what encoding the raw page was in) and outputs that decoded content. After a little bit

Re: Scalability for one site

2009-11-16 Thread Alex McLintock
2009/11/16 Mark Kerzner markkerz...@gmail.com: Hi, I want to politely crawl a site with 1-2 million pages. With the speed of about 1-2 seconds per fetch, it will take weeks. Can I run Nutch on Hadoop, and can I coordinate the crawlers so as not to cause a DOS attack? Nutch basically uses

Re: Scalability for one site

2009-11-16 Thread Mark Kerzner
Alex, Thank you for the answer. As for your last question - no, I don't own that site. I am looking for a specific type of information, and that is the first site I want to crawl. Mark On Mon, Nov 16, 2009 at 1:54 PM, Alex McLintock alex.mclint...@gmail.com wrote: 2009/11/16 Mark Kerzner

Re: Scalability for one site

2009-11-16 Thread Andrzej Bialecki
Mark Kerzner wrote: Hi, I want to politely crawl a site with 1-2 million pages. With the speed of about 1-2 seconds per fetch, it will take weeks. Can I run Nutch on Hadoop, and can I coordinate the crawlers so as not to cause a DOS attack? Your Hadoop cluster does not increase the

Re: Scalability for one site

2009-11-16 Thread Mark Kerzner
ROFL Thank you very much, Andrzej On Mon, Nov 16, 2009 at 2:07 PM, Andrzej Bialecki a...@getopt.org wrote: Mark Kerzner wrote: Hi, I want to politely crawl a site with 1-2 million pages. With the speed of about 1-2 seconds per fetch, it will take weeks. Can I run Nutch on Hadoop, and

Re: at the end of fetching, hung threads

2009-11-16 Thread MilleBii
Just apply the following patch. https://issues.apache.org/jira/browse/NUTCH-721 2009/11/15 MilleBii mille...@gmail.com Yes, had it in the past, and one needs to apply a certain patch... but I don't remember which one off the top of my head; search the mailing list. 2009/11/15 Kalaimathan

Re: decoding nutch readseg -dump's output

2009-11-16 Thread Yves Petinot
Thanks a lot, Andrzej, this makes perfect sense. -y Andrzej Bialecki wrote: Yves Petinot wrote: Hi, I'm trying to build a small Perl (could be any scripting language) utility that takes nutch readseg -dump's output as its input, decodes the content field to UTF-8 (independent of what

Re: crawling / data aggregation - is nutch the right tool?

2009-11-16 Thread Subhojit Roy
Apache-Tika is integrated with Nutch. All you need to do is specify the formats (that are supported by Tika/Nutch) that you would like to index, in the configuration file nutch-site.xml under plugin.includes (e.g. parse-pdf). I have used that to extract text from PDF, doc files etc. It works
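A hypothetical nutch-site.xml fragment for the setting described above (the exact plugin list must match your Nutch version's defaults; the value shown is illustrative, not a copy of any shipped configuration):

```xml
<!-- nutch-site.xml: enable parse plugins for the formats to index -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html|pdf)|index-basic</value>
  <description>Regex of plugin ids to load; add parse plugins
  (e.g. parse-pdf) for each content type you want indexed.</description>
</property>
```

Values in nutch-site.xml override the defaults in nutch-default.xml, so the whole plugin.includes value must be restated, not just the addition.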