Authenticity of URLs from DMOZ

2009-10-06 Thread Gaurang Patel
Hey, can anyone tell me what could be the reason for the following, which happened while fetching data using bin/nutch fetch: my AVG Antivirus is detecting virus threats while Nutch fetches pages from the URLs available in *crawldb*. I injected DMOZ Open Directory URLs into crawldb. Antivirus already detected

generate/fetch using multiple machines

2009-10-06 Thread Gaurang Patel
All- Any idea on how to configure Nutch to generate/fetch on multiple machines simultaneously? -Gaurang

generate, fetch- nutch commands

2009-10-05 Thread Gaurang Patel
, etc? Thanks Regards, Gaurang Patel

Number of urls in the crawl database.

2009-10-05 Thread Gaurang Patel
All- At any point in time, is there a way to know how many URLs there are in my *crawldb*? Regards, Gaurang

Re: Incremental Whole Web Crawling

2009-10-05 Thread Gaurang Patel
Hey Andrzej, Can you tell me where to set this property (generate.update.db)? I am trying to run a similar kind of crawl scenario to the one Eric is running. -Gaurang 2009/10/5 Andrzej Bialecki a...@getopt.org Eric wrote: Andrzej, Just to make sure I have this straight, set the generate.update.db

Re: whole web crawl

2009-10-05 Thread Gaurang Patel
Hey Jack, *One concern:* I am not sure where I can get 0.1 billion page URLs. I am using the DMOZ Open Directory (which has around 3M URLs) to inject the crawldb. Please help. Regards, Gaurang 2009/10/4 Jack Yu jackyu...@gmail.com 0.1 billion pages for 1.5TB On 10/5/09, Gaurang Patel
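
The figure quoted in this thread (0.1 billion pages for 1.5TB) implies an average per-page footprint, which can be scaled to the roughly 3M DMOZ URLs mentioned above. The following is only a rough sketch of that arithmetic. It assumes decimal units (1 TB = 10**12 bytes), that storage scales linearly with page count, and that each injected URL yields one stored page; none of these assumptions come from the thread, and the function names are hypothetical.

```python
# Back-of-the-envelope storage estimate based on the figure quoted in the thread.
# Assumptions (not from the thread): decimal units and linear scaling with page count.

TB = 10**12  # bytes in a decimal terabyte
GB = 10**9   # bytes in a decimal gigabyte


def bytes_per_page(total_bytes: float, pages: int) -> float:
    """Average storage per page implied by a (total size, page count) pair."""
    return total_bytes / pages


def estimate_storage_bytes(pages: int, per_page: float) -> float:
    """Scale the average per-page size to a different number of pages."""
    return pages * per_page


quoted_pages = 100_000_000   # "0.1 billion pages"
quoted_bytes = 1.5 * TB      # "1.5TB"
dmoz_urls = 3_000_000        # "around 3M urls" in the DMOZ Open Directory

per_page = bytes_per_page(quoted_bytes, quoted_pages)
dmoz_bytes = estimate_storage_bytes(dmoz_urls, per_page)

print(f"{per_page / 1000:.0f} KB per page")
print(f"{dmoz_bytes / GB:.0f} GB for {dmoz_urls:,} pages")
```

Under these assumptions the quoted figure works out to about 15 KB per page, so fetching only the injected DMOZ URLs would need on the order of 45 GB. A real crawl that follows outlinks would grow well beyond the injected set.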

Re: Incremental Whole Web Crawling

2009-10-05 Thread Gaurang Patel
Hey, Never mind. I found *generate.update.db* in *nutch-default.xml* and set it to true. Regards, Gaurang 2009/10/5 Gaurang Patel gaurangtpa...@gmail.com Hey Andrzej, Can you tell me where to set this property (generate.update.db)? I am trying to run a similar kind of crawl scenario to the one Eric

whole web crawl

2009-10-04 Thread Gaurang Patel
All- I am a novice at using Nutch. Can anyone tell me the estimated size (I suppose, in TBs) that will be required to store the crawled results for the whole web? I want to get an estimate of the memory requirements for my project, which uses the Nutch web crawler. Regards, Gaurang Patel

Re: whole web crawl

2009-10-04 Thread Gaurang Patel
Thanks Jack. This will help. -Gaurang 2009/10/4 Jack Yu jackyu...@gmail.com 0.1 billion pages for 1.5TB On 10/5/09, Gaurang Patel gaurangtpa...@gmail.com wrote: All- I am a novice at using Nutch. Can anyone tell me the estimated size (I suppose, in TBs) that will be required

Content(source code) of web pages crawled by nutch

2009-05-11 Thread Gaurang Patel
Hi All, *Can anyone help me with this problem?* *Here is my problem:* I want to get the source code of the hits I get using the Nutch crawler. I am not sure whether Nutch stores the content of a web page (i.e. the actual source code of the web page) in the crawled results. I am afraid it does not! If

Re: Content(source code) of web pages crawled by nutch

2009-05-11 Thread Gaurang Patel
dynamically. In other words, Nutch does not store the source code in the crawled results. Let me know if I am wrong. -Gaurang 2009/5/11 Susam Pal susam@gmail.com On Tue, May 12, 2009 at 8:50 AM, Gaurang Patel gaurangtpa...@gmail.com wrote: Hi All, *Can anyone help me with this problem

Re: Content(source code) of web pages crawled by nutch

2009-05-11 Thread Gaurang Patel
for helping me out anyway. -Gaurang 2009/5/11 Susam Pal susam@gmail.com On Tue, May 12, 2009 at 10:56 AM, Gaurang Patel gaurangtpa...@gmail.com wrote: Thanks Susam, This worked perfectly for me. Thanks for the reply. *One more concern:* Does this method fetch the contents (source code

Error while running the sample search: Attribute value language + "/include/header.html" is quoted with " which must be escaped when used within the value

2009-03-03 Thread Gaurang Patel
trace of the root cause is available in the Apache Tomcat/6.0.18 logs.* -- Apache Tomcat/6.0.18 Not sure what is happening. Can anyone help me with this? Regards, Gaurang Patel

Error while running the sample search: Attribute value language + "/include/header.html" is quoted with " which must be escaped when used within the value

2009-03-03 Thread Gaurang Patel
*The full stack trace of the root cause is available in the Apache Tomcat/6.0.18 logs.* -- Apache Tomcat/6.0.18 Not sure what is happening. Can anyone help me with this? Regards, Gaurang Patel