Thanks. But if I set db.ignore.external.links to false, will it affect
the quality of the search results? I read about Nutch, and it seems that it
does something similar to PageRank, like Google. If so, the search
quality will suffer if it doesn't analyze the external links.
Hi Victor,
In this case, the link analysis will be done only on the link graph
between the URLs belonging to the hosts in your seed lists that you
fetch. As you said, this might not give you a true idea of the link
popularity of your URLs. On the other hand, if you set
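For reference, the property being discussed lives in conf/nutch-site.xml; a minimal sketch (setting it to true is what restricts link analysis to the seed hosts, and the description text here is paraphrased, not copied from nutch-default.xml):

```xml
<!-- conf/nutch-site.xml: whether outlinks to other hosts are followed -->
<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
  <description>If true, outlinks pointing to hosts outside the seed
  lists are ignored, so the link graph stays within your own hosts.
  </description>
</property>
```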
Uroš Gruber wrote:
Andrzej Bialecki wrote:
Uroš Gruber wrote:
Hi,
I've made some changes in CrawlDbReader to read from the fetchlist made
by the generate command. At first I thought I had problems with
this script because some URLs from inject were missing. Then I tested
on only 6 URLs. I've
Hi there
Could someone give me some advice on using the prune index tool? I want
a command that removes all URLs that end in / or index.html.
Cheers
Aled
I am new to regex. What will the $1$3 produce in the following
element? What are the values of $1 and $3?
<regex>
  <pattern>(.*)(;jsessionid=[a-zA-Z0-9]{32})(.*)</pattern>
  <substitution>$1$3</substitution>
</regex>
If I leave the substitution as <substitution></substitution>, will this
just get rid of
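To see concretely what $1$3 produces: $1 and $3 are the text captured before and after the matched ;jsessionid=... segment, so substituting $1$3 splices the URL back together with the session id removed. The same pattern can be tried outside Nutch with sed (an illustration only; Nutch applies the rule via java.util.regex, but the group semantics are the same, with \1\3 playing the role of $1$3):

```shell
# Hypothetical URL; the jsessionid value is exactly 32 alphanumeric chars.
url='http://example.com/app;jsessionid=0123456789abcdef0123456789abcdef/page.html'

# Group 1 = text before ";jsessionid=...", group 3 = text after it.
echo "$url" | sed -E 's/(.*)(;jsessionid=[a-zA-Z0-9]{32})(.*)/\1\3/'
# prints http://example.com/app/page.html
```

Note that an empty <substitution></substitution> would not just drop the session id: since the two (.*) groups make the pattern cover the whole string, the entire matched URL would be replaced with nothing.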
Yes Cam, if you use a depth of 1 you will crawl only the first document. With a
depth of 2 you will crawl the first document and all the links found in that
document. With depth 3, you will crawl the first one, its links, and all
links found in cycle 2. And so on. Increasing your depth will increase
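For example, with the one-step crawl command in Nutch 0.8 (directory names are placeholders):

```shell
# Fetch the seed URLs, then two further rounds of newly discovered links.
bin/nutch crawl urls -dir crawl -depth 3 -topN 1000
```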
Philip Brown wrote:
I am new to regex. What will the $1$3 produce in the following
element? What are the values of $1 and $3?
<regex>
  <pattern>(.*)(;jsessionid=[a-zA-Z0-9]{32})(.*)</pattern>
  <substitution>$1$3</substitution>
</regex>
If I leave the substitution as <substitution></substitution>, will this
just get rid
Hello,
I would like to delete the /fetcher and /fetchlist directories after the
fetching process is completed, to free some disk space.
Is there any reason not to do that, or is it no problem at all?
your
Matthias
Is your environment Windows or Linux?
You are saying that most are not logged. Can you please give an example
of what is logged (and where), and also what is not.
Logging in general can be configured by editing conf/log4j.properties
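For example, the fetcher status mentioned further down this thread was enabled with a single line in conf/log4j.properties:

```properties
# conf/log4j.properties: log Hadoop (and thus fetcher) status at INFO level
log4j.logger.org.apache.hadoop=INFO
```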
--
Sami Siren
2006/9/1, AJ Chen [EMAIL PROTECTED]:
When
hi,
I found there is a case where two nearly identical URLs will be included in the
webdb. The only difference is the presence or absence of a trailing slash,
meaning: http://abc.com/ and http://abc.com will both appear in the dumped
webdb (one is from the seeds file and the other is from the outlinks of other
URLs). Will that
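One way to collapse the two forms is a rule in conf/regex-normalize.xml that appends the missing slash to bare host URLs. This is a sketch, not a stock Nutch rule:

```xml
<regex>
  <!-- URL is scheme://host with nothing after the host: add the slash -->
  <pattern>^(https?://[^/]+)$</pattern>
  <substitution>$1/</substitution>
</regex>
```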
I developed a plugin and tried to run it using nutch plugin
plugin-name fully-qualified-class-name arg1 arg2 on
Nutch 0.8.
But it says my plugin is not present or inactive.
I tried nutch plugin with the known plugin language-identifier as:
./nutch plugin languageidentifier
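A common cause of the "not present or inactive" message is that the plugin's id is not matched by plugin.includes in conf/nutch-site.xml. A sketch, where my-plugin is a placeholder for the id declared in your plugin's plugin.xml and the rest of the value mirrors a typical 0.8 default:

```xml
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)|language-identifier|my-plugin</value>
</property>
```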
My apology that the subject doesn't say what I am asking.
When I started composing the email I was thinking that
nutch plugin was not reading the config file, but further inspection
revealed that it is reading nutch-site.xml.
-Original Message-
From: Teruhiko Kurosaka
Sent:
Sami, thanks. After setting log4j.logger.org.apache.hadoop=INFO, fetcher
status is logged in hadoop.log file.
--
AJ Chen, PhD
http://web2express.org