RE: where nutch store crawled data

2008-06-16 Thread POIRIER David
When executing a crawl, Nutch creates segments, based on the crawl depth if I'm not mistaken, in which the fetched content is stored. For example, if you crawl a web site named site-xyz into the directory $nutch_home/crawls/crawl-xyz, you will find the segments in the following directory:

how does nutch connect to urls internally?

2008-06-16 Thread Del Rio, Ann
Good morning, Can you please point me to the Nutch documentation where I can find out how Nutch connects to web pages when it crawls? I think it is through HTTP, but I would like to confirm and get more details so I can write a very small test Java program to connect to one of the webpages I am
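A plain HTTP GET of the kind Ann describes can be done with the JDK alone. This is a minimal, hypothetical sketch, not Nutch code: the URL and the User-Agent string are placeholders to replace with your own.

```java
import java.net.HttpURLConnection;
import java.net.URL;

// Minimal test fetcher: open an HTTP connection to one URL and print
// the response status. Roughly what a protocol-http style fetch does
// at its simplest; no parsing, no politeness, no redirect handling.
public class FetchTest {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://example.com/"); // placeholder URL
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");
        conn.setRequestProperty("User-Agent", "test-fetcher"); // placeholder
        int status = conn.getResponseCode();
        System.out.println("HTTP status: " + status);
        conn.disconnect();
    }
}
```

Compile and run with `javac FetchTest.java && java FetchTest`; a reachable page should print a 200-range status.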

Re: how does nutch connect to urls internally?

2008-06-16 Thread Susam Pal
Hi, It depends on which protocol plugin is enabled in your 'conf/nutch-site.xml'. The property to look for is 'plugin.includes' in that XML file. If it is not present in 'conf/nutch-site.xml', it means you are using the default 'plugin.includes' from 'conf/nutch-default.xml'. If protocol-http is
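To illustrate, an override of this property in 'conf/nutch-site.xml' looks like the fragment below. The plugin list shown is only illustrative; the safe approach is to copy the default value from 'conf/nutch-default.xml' and edit it.

```xml
<!-- conf/nutch-site.xml: enable protocol-http (among others) for fetching.
     Illustrative value; start from the default in conf/nutch-default.xml. -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)</value>
</property>
```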

db.ignore.external.links=true and redirects

2008-06-16 Thread Drew Hite
Hello, I would like to restrict a crawl to the domain specified in a seed URL without using the urlfilter-regex plugin. The db.ignore.external.links property looked like it would do the trick, but I've found that links that are redirected outside the seed URL get through. For example, if I start at
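For reference, the property in question is set in 'conf/nutch-site.xml' like this (the caveat from this thread stands: redirects to external hosts may still get through):

```xml
<!-- conf/nutch-site.xml: ignore outlinks pointing outside the host of
     the page they were found on. Per this thread, redirected URLs that
     land on external hosts may not be filtered by this setting. -->
<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
</property>
```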

Re: db.ignore.external.links=true and redirects

2008-06-16 Thread Drew Hite
I should have mentioned that I'm working with the trunk. On Mon, Jun 16, 2008 at 1:09 PM, Drew Hite [EMAIL PROTECTED] wrote: Hello, I would like to restrict a crawl to the domain specified in a seed URL without using the urlfilter-regex plugin. The db.ignore.external.links property looked like

RE: how does nutch connect to urls internally?

2008-06-16 Thread Del Rio, Ann
Thank you for the great and detailed information Susam! Will post back my test program when successful. Thanks, Ann Del Rio -Original Message- From: Susam Pal [mailto:[EMAIL PROTECTED] Sent: Monday, June 16, 2008 9:48 AM To: nutch-user@lucene.apache.org Subject: Re: how does nutch

ClassNotFoundException: org.apache.nutch.analysis.CommonGrams

2008-06-16 Thread John Thompson
Hi, I've been trying to debug this problem for the past few days in my recently set up Nutch installation. Essentially, the main page, about page, and help page are fine, but as soon as I try to submit a search query, I get the following stack trace (sorry for the length): *exception*

getting seed list for vertical search engine

2008-06-16 Thread DS jha
Hello, We are in the process of developing a vertical search engine for the medical industry – and I need to estimate server/sizing requirements to set up my environment – my question is, how do I estimate how many documents I will be fetching for a particular vertical? And – from where do I get

Re: getting seed list for vertical search engine

2008-06-16 Thread Otis Gospodnetic
This seems to be a common request - sizing. I think the best you can do is use existing search engines to estimate how many pages the sites you are interested in have. You will have to know the exact sites (their URLs) and make use of the site: search operator (Google, Yahoo). Yahoo also has

Re: ClassNotFoundException: org.apache.nutch.analysis.CommonGrams

2008-06-16 Thread Otis Gospodnetic
Yes, this is a pure CLASSPATH issue. I haven't built a Nutch war in a while, so I don't recall what is in it, but most likely it has WEB-INF/lib directory with some jar files. One of these ah, let's just see. Here: [EMAIL PROTECTED] trunk]$ unzip -l build/nutch-1.0-dev.war | grep jar |

Re: db.ignore.external.links=true and redirects

2008-06-16 Thread Otis Gospodnetic
Don't have the answer, but I've got a question. Does this happen only when redirections to external hosts are involved? Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Drew Hite [EMAIL PROTECTED] To: nutch-user@lucene.apache.org Sent:

Re: infinite loop-problem

2008-06-16 Thread Otis Gospodnetic
Uhuh, yes, this is most likely due to session IDs that create unique URLs that Nutch keeps processing. Look at conf/regex-normalize.xml for how you can clean up URLs. That should help. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Felix
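A session-ID-stripping rule in 'conf/regex-normalize.xml' has roughly the shape below. The pattern here is illustrative only; the stock file ships with similar (more general) rules, so check it before adding your own.

```xml
<!-- conf/regex-normalize.xml: strip a jsessionid path parameter so that
     session-specific URLs collapse to a single canonical URL.
     Illustrative pattern; the shipped file contains broader rules. -->
<regex>
  <pattern>;jsessionid=[0-9A-Za-z]+</pattern>
  <substitution></substitution>
</regex>
```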