RE: How to Make Nutch Return Search Results Belonged to the Crawl URL Li

2006-09-01 Thread victor_emailbox
Thanks. But if I set db.ignore.external.links to false, will it affect the quality of the search results? I read about Nutch, and it seems to do something similar to Google's PageRank. If so, not analyzing the external links will affect the quality of the search.

RE: How to Make Nutch Return Search Results Belonged to the Crawl URL Li

2006-09-01 Thread Vishal Shah
Hi Victor, In this case, the link analysis will be done only on the link graph between the URLs belonging to the hosts in your seed lists that you fetch. As you said, this might not give you a true idea of the link popularities of your URLs. On the other hand, if you set
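
A minimal sketch of the trade-off described above, assuming popularity is approximated by a plain inlink count. This is not Nutch code: the toy `links` data, the example hosts, and the helper names `internal_links_only` and `inlink_counts` are all hypothetical.

```python
from urllib.parse import urlsplit


def internal_links_only(links):
    """Keep only links whose source and target are on the same host."""
    return [
        (src, dst)
        for src, dst in links
        if urlsplit(src).hostname == urlsplit(dst).hostname
    ]


def inlink_counts(links):
    """Count how many links point at each target URL."""
    counts = {}
    for _, dst in links:
        counts[dst] = counts.get(dst, 0) + 1
    return counts


# Hypothetical link graph: (source page, target page).
links = [
    ("http://a.example/1", "http://a.example/2"),
    ("http://a.example/2", "http://a.example/1"),
    ("http://b.example/x", "http://a.example/1"),  # link from another host
    ("http://a.example/1", "http://b.example/x"),  # link to another host
]

all_counts = inlink_counts(links)
internal_counts = inlink_counts(internal_links_only(links))
```

With the cross-host links dropped, `http://a.example/1` loses the inlink it received from `b.example`, so its count reflects only links inside the fetched hosts.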

Re: bug or feature

2006-09-01 Thread Uroš Gruber
Uroš Gruber wrote: Andrzej Bialecki wrote: Uroš Gruber wrote: Hi, I've made some changes in CrawlDbReader to read from the fetchlist made by the generate command. At first I thought that I had problems with this script because some URLs from inject were missing. Then I tested on only 6 URLs. I've

Remove unwanted urls

2006-09-01 Thread Aled Jones
Hi there Could someone give me some advice on using the prune index tool? I want a command that removes all urls that end in / or index.html. Cheers Aled

regex-normalizer.xml substitution value?

2006-09-01 Thread Philip Brown
I am new to regex. What will the $1$3 reproduce in the following element. What values are $1$3? <regex> <pattern>(.*)(;jsessionid=[a-zA-Z0-9]{32})(.*)</pattern> <substitution>$1$3</substitution> </regex> if I leave substitution as <substitution></substitution> will this just get rid of
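
A minimal sketch of what the pattern and substitution quoted above do, written in Python rather than Nutch's Java. Python spells the group references `\1\3` where the config uses `$1$3`. The sample URL and the helper name `strip_session_id` are assumptions made for illustration, and the sketch assumes the normalizer replaces each match of the pattern with the substitution string.

```python
import re

# Same pattern as in the quoted <pattern> element:
#   group 1 = everything before the session id
#   group 2 = ";jsessionid=" plus 32 alphanumeric characters
#   group 3 = everything after the session id
PATTERN = re.compile(r"(.*)(;jsessionid=[a-zA-Z0-9]{32})(.*)")


def strip_session_id(url):
    """Replace the match with groups 1 and 3, dropping group 2."""
    return PATTERN.sub(r"\1\3", url)


# Hypothetical URL carrying a 32-character session id.
url = "http://example.com/page" + ";jsessionid=" + "A1B2C3D4" * 4 + "?q=nutch"

cleaned = strip_session_id(url)   # session id removed, rest kept
emptied = PATTERN.sub("", url)    # empty substitution: whole match removed
```

Here `cleaned` is `http://example.com/page?q=nutch`. Because the two `(.*)` groups make the pattern match the entire URL, an empty substitution leaves an empty string in this sketch rather than only removing the session id.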

Re: indexing folders with nutch

2006-09-01 Thread Lourival Júnior
Yes Cam, if you use a depth of 1 you will crawl only the first document. With a depth of 2 you will crawl the first document and all the links found in this document. With a depth of 3, you will crawl the first one, its links, and all links found in cycle 2. And so on. Increasing your depth will increasing
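
The depth behaviour described above can be sketched as a breadth-first traversal with one fetch cycle per depth level. This is an illustration only, not Nutch's implementation: the `crawl` function and the toy `outlinks` graph are hypothetical.

```python
def crawl(seed, outlinks, depth):
    """Fetch pages cycle by cycle; each cycle fetches the links found in the previous one."""
    fetched = []
    seen = {seed}
    frontier = [seed]
    for _ in range(depth):
        next_frontier = []
        for url in frontier:
            fetched.append(url)
            # Queue links discovered on this page for the next cycle.
            for link in outlinks.get(url, []):
                if link not in seen:
                    seen.add(link)
                    next_frontier.append(link)
        frontier = next_frontier
    return fetched


# Hypothetical link graph: page -> links found on that page.
outlinks = {
    "seed": ["a", "b"],
    "a": ["c"],
    "b": ["c", "d"],
    "c": ["e"],
}

depth1 = crawl("seed", outlinks, 1)
depth2 = crawl("seed", outlinks, 2)
depth3 = crawl("seed", outlinks, 3)
```

In this toy graph, depth 1 fetches only `seed`, depth 2 adds `a` and `b`, and depth 3 adds `c` and `d`, the links found during cycle 2.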

Re: regex-normalizer.xml substitution value?

2006-09-01 Thread Philip Brown
Philip Brown wrote: I am new to regex. What will the $1$3 reproduce in the following element. What values are $1$3? <regex> <pattern>(.*)(;jsessionid=[a-zA-Z0-9]{32})(.*)</pattern> <substitution>$1$3</substitution> </regex> if I leave substitution as <substitution></substitution> will this just get rid

delete segments/fetcher to free diskspace

2006-09-01 Thread NG-Marketing, M.Schneider
Hello, I would like to delete the /fetcher and the /fetchlist directories after the fetching process is completed to free some disk space. Is there any reason not to do that, or is it no problem at all? Yours, Matthias

Re: log records

2006-09-01 Thread sami siren
Is your environment Windows or Linux? You are saying that most are not logged - can you please give an example of what is logged (and where) and also what is not? Logging in general can be configured by editing conf/log4j.properties -- Sami Siren 2006/9/1, AJ Chen [EMAIL PROTECTED]: When

same urls with only extra backslash (nutch 08)

2006-09-01 Thread Feng Ji
Hi, I found a case where two otherwise identical URLs are both included in the webdb. The only difference is the presence or absence of a trailing slash, e.g. http://abc.com/ and http://abc.com will both appear in the dumped webdb (one is from the seeds file and the other is from the outlinks of other URLs). Will that
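
One way to treat the two forms as the same URL is to normalize an empty path to `/` before comparing. The sketch below uses Python's standard `urllib.parse`; it is not Nutch's normalizer, and the helper name `normalize_empty_path` is hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_empty_path(url):
    """Rewrite a URL with a host but no path so that its path is '/'."""
    parts = urlsplit(url)
    if parts.netloc and parts.path == "":
        parts = parts._replace(path="/")
    return urlunsplit(parts)


with_slash = normalize_empty_path("http://abc.com/")
without_slash = normalize_empty_path("http://abc.com")
deeper = normalize_empty_path("http://abc.com/dir")
```

After normalization both `http://abc.com/` and `http://abc.com` compare equal, while a URL with a non-empty path such as `http://abc.com/dir` is left unchanged.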

How do I specify config file for nutch plugin command ?

2006-09-01 Thread Teruhiko Kurosaka
I developed a plugin and tried to run it using nutch plugin plugin-name plugin-fully-qualified-class-name arg1 arg2 of Nutch 0.8. But it says my plugin is not present or inactive. I tried nutch plugin with known plugin language-identifier as: ./nutch plugin languageidentifier

RE: How do I specify config file for nutch plugin command ?

2006-09-01 Thread Teruhiko Kurosaka
My apologies that the subject doesn't say what I am asking. When I started composing the email I was thinking that nutch plugin was not reading the config file, but further inspection revealed that it is reading nutch-site.xml. -----Original Message----- From: Teruhiko Kurosaka Sent:

Re: log records

2006-09-01 Thread AJ Chen
Sami, thanks. After setting log4j.logger.org.apache.hadoop=INFO, fetcher status is logged in hadoop.log file. -- AJ Chen, PhD http://web2express.org