RE: where nutch store crawled data

2008-06-16 Thread POIRIER David
When executing a crawl, Nutch creates segments, based on the crawl depth if I'm not mistaken, in which the fetched content is stored. For example, if you crawl a web site named site-xyz into the directory $nutch_home/crawls/crawl-xyz, you will find the segments in the following directory:

how does nutch connect to urls internally?

2008-06-16 Thread Del Rio, Ann
Good morning, Can you please point me to the Nutch documentation where I can find out how Nutch connects to web pages when it crawls? I think it is through HTTP, but I would like to confirm and get more details so I can write a very small test Java program to connect to one of the webpages I am
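A plain HTTP GET of the kind Ann describes can be done with the JDK alone. This is a minimal, hypothetical sketch, not Nutch code: the URL and the User-Agent string are placeholders to replace with your own.

```java
import java.net.HttpURLConnection;
import java.net.URL;

// Minimal test fetcher: open an HTTP connection to one URL and print
// the response status. Roughly what a protocol-http style fetch does
// at its simplest; no parsing, no politeness, no redirect handling.
public class FetchTest {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://example.com/"); // placeholder URL
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");
        conn.setRequestProperty("User-Agent", "test-fetcher"); // placeholder
        int status = conn.getResponseCode();
        System.out.println("HTTP status: " + status);
        conn.disconnect();
    }
}
```

Compile and run with `javac FetchTest.java && java FetchTest`; a reachable page should print a 200-range status.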

Re: how does nutch connect to urls internally?

2008-06-16 Thread Susam Pal
Hi, It depends on which protocol plugin is enabled in your 'conf/nutch-site.xml'. The property to look for is 'plugin.includes' in that XML file. If it is not present in 'conf/nutch-site.xml', it means you are using the default 'plugin.includes' from 'conf/nutch-default.xml'. If protocol-http is
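To illustrate, an override of this property in 'conf/nutch-site.xml' looks like the fragment below. The plugin list shown is only illustrative; the safe approach is to copy the default value from 'conf/nutch-default.xml' and edit it.

```xml
<!-- conf/nutch-site.xml: enable protocol-http (among others) for fetching.
     Illustrative value; start from the default in conf/nutch-default.xml. -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)</value>
</property>
```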

db.ignore.external.links=true and redirects

2008-06-16 Thread Drew Hite
Hello, I would like to restrict a crawl to the domain specified in a seed URL without using the urlfilter-regex plugin. The db.ignore.external.links property looked like it would do the trick, but I've found that links that are redirected outside the seed URL get through. For example, if I start at
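For reference, the property in question is set in 'conf/nutch-site.xml' like this (the caveat from this thread stands: redirects to external hosts may still get through):

```xml
<!-- conf/nutch-site.xml: ignore outlinks pointing outside the host of
     the page they were found on. Per this thread, redirected URLs that
     land on external hosts may not be filtered by this setting. -->
<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
</property>
```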

Re: db.ignore.external.links=true and redirects

2008-06-16 Thread Drew Hite
I should have mentioned that I'm working with the trunk. On Mon, Jun 16, 2008 at 1:09 PM, Drew Hite [EMAIL PROTECTED] wrote: Hello, I would like to restrict a crawl to the domain specified in a seed URL without using the urlfilter-regex plugin. The db.ignore.external.links property looked like

RE: how does nutch connect to urls internally?

2008-06-16 Thread Del Rio, Ann
Thank you for the great and detailed information Susam! Will post back my test program when successful. Thanks, Ann Del Rio -Original Message- From: Susam Pal [mailto:[EMAIL PROTECTED] Sent: Monday, June 16, 2008 9:48 AM To: nutch-user@lucene.apache.org Subject: Re: how does nutch

ClassNotFoundException: org.apache.nutch.analysis.CommonGrams

2008-06-16 Thread John Thompson
Hi, I've been trying to debug this problem for the past few days in my recently set up Nutch installation. Essentially, the main page, about page, and help page are fine, but as soon as I try to submit a search query, I get the following stack trace (sorry for the length): *exception*

getting seed list for vertical search engine

2008-06-16 Thread DS jha
Hello, We are in the process of developing a vertical search engine for the medical industry – and I need to estimate server/sizing requirements to set up my environment – my question is, how do I estimate how many documents I will be fetching for a particular vertical? And – from where do I get

Re: getting seed list for vertical search engine

2008-06-16 Thread Otis Gospodnetic
This seems to be a common request - sizing. I think the best you can do is use existing search engines to estimate how many pages the sites you are interested in have. You will have to know the exact sites (their URLs) and make use of the site: search operator (Google, Yahoo). Yahoo also has

Re: ClassNotFoundException: org.apache.nutch.analysis.CommonGrams

2008-06-16 Thread Otis Gospodnetic
Yes, this is a pure CLASSPATH issue. I haven't built a Nutch war in a while, so I don't recall what is in it, but most likely it has WEB-INF/lib directory with some jar files. One of these ah, let's just see. Here: [EMAIL PROTECTED] trunk]$ unzip -l build/nutch-1.0-dev.war | grep jar |

Re: db.ignore.external.links=true and redirects

2008-06-16 Thread Otis Gospodnetic
Don't have the answer, but I've got a question. Does this happen only when redirections to external hosts are involved? Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Drew Hite [EMAIL PROTECTED] To: nutch-user@lucene.apache.org Sent:

Re: infinite loop-problem

2008-06-16 Thread Otis Gospodnetic
Uhuh, yes, this is most likely due to session IDs that create unique URLs that Nutch keeps processing. Look at conf/regex-normalize.xml for how you can clean up URLs. That should help. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Felix
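A session-ID-stripping rule in 'conf/regex-normalize.xml' has roughly the shape below. The pattern here is illustrative only; the stock file ships with similar (more general) rules, so check it before adding your own.

```xml
<!-- conf/regex-normalize.xml: strip a jsessionid path parameter so that
     session-specific URLs collapse to a single canonical URL.
     Illustrative pattern; the shipped file contains broader rules. -->
<regex>
  <pattern>;jsessionid=[0-9A-Za-z]+</pattern>
  <substitution></substitution>
</regex>
```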