When executing a crawl, Nutch creates segments, based on the crawl
depth if I'm not mistaken, in which the fetched content is stored. For
example, if you crawl a web site named site-xyz into the directory
$nutch_home/crawls/crawl-xyz, you will find the segments in the
following directory:
Good morning,
Can you please point me to Nutch documentation where I can find how
Nutch connects to web pages when it crawls? I think it is through
HTTP, but I would like to confirm and get more details so I can write a
very small test Java program to connect to one of the web pages I am
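A very small test program along those lines could build and print the kind of plain HTTP/1.0 GET request a fetcher sends. This is only a sketch: the host, path, and User-Agent below are illustrative, and the actual network call (opening a java.net.Socket to port 80 and writing the request) is kept in a separate method and commented out in main so the sketch runs offline.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.Socket;

public class HttpFetchSketch {
    // Assembles a minimal HTTP/1.0 request; a blank line ends the headers.
    static String buildRequest(String host, String path) {
        return "GET " + path + " HTTP/1.0\r\n"
             + "Host: " + host + "\r\n"
             + "User-Agent: nutch-test\r\n"
             + "\r\n";
    }

    // Opens a plain TCP connection on port 80, sends the request, and
    // returns the response status line (e.g. "HTTP/1.1 200 OK").
    static String fetchStatusLine(String host, String path) throws IOException {
        try (Socket socket = new Socket(host, 80)) {
            socket.getOutputStream().write(
                    buildRequest(host, path).getBytes("US-ASCII"));
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(socket.getInputStream(), "US-ASCII"));
            return in.readLine();
        }
    }

    public static void main(String[] args) {
        // Uncomment to perform a real fetch (requires network access):
        // System.out.println(fetchStatusLine("example.com", "/"));
        System.out.print(buildRequest("example.com", "/"));
    }
}
```

Running main prints the raw request; uncommenting the fetchStatusLine call turns it into an actual one-page fetch over TCP.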
Hi,
It depends on which protocol plugin is enabled in your
'conf/nutch-site.xml'. The property to look for is 'plugin.includes'
in the XML file. If it is not present in 'conf/nutch-site.xml', it
means you are using the default 'plugin.includes' value from
'conf/nutch-default.xml'.
If protocol-http is
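For reference, an override in 'conf/nutch-site.xml' that enables protocol-http might look like the fragment below. The plugin list shown is illustrative, not the shipped default; the usual approach is to copy the 'plugin.includes' value from 'conf/nutch-default.xml' and adjust it.

```xml
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)</value>
</property>
```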
Hello,
I would like to restrict a crawl to the domain specified in a seed URL without
using the urlfilter-regex plugin. The db.ignore.external.links property
looked like it would do the trick, but I've found that links that are
redirected outside the seed URL's domain get through. For example, if I start at
I should have mentioned that I'm working with the trunk.
On Mon, Jun 16, 2008 at 1:09 PM, Drew Hite [EMAIL PROTECTED] wrote:
Hello,
I would like to restrict a crawl to the domain specified in a seed URL without
using the urlfilter-regex plugin. The db.ignore.external.links property
looked like
Thank you for the great and detailed information Susam!
Will post back my test program when successful.
Thanks,
Ann Del Rio
-Original Message-
From: Susam Pal [mailto:[EMAIL PROTECTED]
Sent: Monday, June 16, 2008 9:48 AM
To: nutch-user@lucene.apache.org
Subject: Re: how does nutch
Hi,
I've been trying to debug this problem for the past few days in my recently
set up Nutch installation. Essentially, the main page, about page, and help
page are fine, but as soon as I try to submit a search query, I get the
following stack trace (sorry for the length):
*exception*
Hello,
We are in the process of developing a vertical search engine for the
medical industry – and I need to estimate server/sizing requirements
to setup my environment – my question is, how do I estimate how many
documents I will be fetching for a particular vertical? And – from
where do I get
This seems to be a common request - sizing. I think the best you can do is use
existing search engines to estimate how many pages the sites you are interested
in have. You will have to know the exact sites (their URLs) and make use of the
site: search operator (Google, Yahoo). Yahoo also has
Yes, this is a pure CLASSPATH issue. I haven't built a Nutch war in a while,
so I don't recall what is in it, but most likely it has a WEB-INF/lib directory
with some jar files. One of these... ah, let's just see. Here:
[EMAIL PROTECTED] trunk]$ unzip -l build/nutch-1.0-dev.war | grep jar |
Don't have the answer, but I've got a question: does this happen only when
redirections to an external host are involved?
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
- Original Message
From: Drew Hite [EMAIL PROTECTED]
To: nutch-user@lucene.apache.org
Sent:
Uhuh, yes, this is most likely due to session IDs that create unique URLs that
Nutch keeps processing.
Look at conf/regex-normalize.xml for how you can clean up URLs. That should
help.
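For example, a rule in conf/regex-normalize.xml that strips a jsessionid path parameter could look like the fragment below. This is a sketch, not the shipped default rule; each <regex> entry pairs a Java regular expression with its substitution.

```xml
<regex>
  <!-- strip a jsessionid path parameter, e.g. /page;jsessionid=ABC123?x=1 -->
  <pattern>;jsessionid=[^?#&amp;]*</pattern>
  <substitution></substitution>
</regex>
```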
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
- Original Message
From: Felix