Thanks. But if I set db.ignore.external.links to false, will it affect
the quality of the search results? I read about Nutch, and it seems that it
does something similar to PageRank, like Google. If so, the search
quality will suffer if it doesn't analyze the external links.
Hi Victor,
In this case, the link analysis will be done only on the link graph
between the URLs belonging to the hosts in your seed lists that you
fetch. As you said, this might not give you a true idea of the link
popularity of your URLs. On the other hand, if you set
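For reference, the property being discussed lives in conf/nutch-site.xml; a minimal sketch (setting it to true is what restricts link analysis to the seed hosts, and the description text here is paraphrased, not copied from nutch-default.xml):

```xml
<!-- conf/nutch-site.xml: whether outlinks to other hosts are followed -->
<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
  <description>If true, outlinks pointing to hosts outside the seed
  lists are ignored, so the link graph stays within your own hosts.
  </description>
</property>
```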
Uroš Gruber wrote:
Andrzej Bialecki wrote:
Uroš Gruber wrote:
Hi,
I've made some changes in CrawlDbReader to read from the fetchlist made
by the generate command. At first I thought I had problems with
this script because some URLs from inject were missing. Then I tested
on only 6 URLs. I've
Hi there
Could someone give me some advice on using the prune index tool? I want
a command that removes all URLs that end in / or index.html.
Cheers
Aled
I am new to regex. What will the $1$3 produce in the following
element? What are the values of $1 and $3?
<regex>
  <pattern>(.*)(;jsessionid=[a-zA-Z0-9]{32})(.*)</pattern>
  <substitution>$1$3</substitution>
</regex>
If I leave the substitution as <substitution></substitution>, will this
just get rid of
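To see concretely what $1$3 produces: $1 and $3 are the text captured before and after the matched ;jsessionid=... segment, so substituting $1$3 splices the URL back together with the session id removed. The same pattern can be tried outside Nutch with sed (an illustration only; Nutch applies the rule via java.util.regex, but the group semantics are the same, with \1\3 playing the role of $1$3):

```shell
# Hypothetical URL; the jsessionid value is exactly 32 alphanumeric chars.
url='http://example.com/app;jsessionid=0123456789abcdef0123456789abcdef/page.html'

# Group 1 = text before ";jsessionid=...", group 3 = text after it.
echo "$url" | sed -E 's/(.*)(;jsessionid=[a-zA-Z0-9]{32})(.*)/\1\3/'
# prints http://example.com/app/page.html
```

Note that an empty <substitution></substitution> would not just drop the session id: since the two (.*) groups make the pattern cover the whole string, the entire matched URL would be replaced with nothing.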
Yes Cam, if you use a depth of 1 you will crawl only the first document. With a
depth of 2 you will crawl the first document and all the links found in that
document. With depth 3, you will crawl the first one, its links, and all
links found in cycle 2. And so on. Increasing your depth will increase
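For example, with the one-step crawl command in Nutch 0.8 (directory names are placeholders):

```shell
# Fetch the seed URLs, then two further rounds of newly discovered links.
bin/nutch crawl urls -dir crawl -depth 3 -topN 1000
```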
Philip Brown wrote:
I am new to regex. What will the $1$3 produce in the following
element? What are the values of $1 and $3?
<regex>
  <pattern>(.*)(;jsessionid=[a-zA-Z0-9]{32})(.*)</pattern>
  <substitution>$1$3</substitution>
</regex>
If I leave the substitution as <substitution></substitution>, will this
just get rid
Hello,
I would like to delete the /fetcher and /fetchlist directories after the
fetching process is completed, to free some disk space.
Is there any reason not to do that, or is it no problem at all?
your
Matthias
Is your environment Windows or Linux?
You are saying that most are not logged. Can you please give an example
of what is logged (and where), and also what is not.
Logging in general can be configured by editing conf/log4j.properties
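For example, the fetcher status mentioned further down this thread was enabled with a single line in conf/log4j.properties:

```properties
# conf/log4j.properties: log Hadoop (and thus fetcher) status at INFO level
log4j.logger.org.apache.hadoop=INFO
```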
--
Sami Siren
2006/9/1, AJ Chen [EMAIL PROTECTED]:
When
hi,
I found there is a case where two nearly identical URLs will be included in the
webdb. The only difference is the presence or absence of a trailing slash,
meaning: http://abc.com/ and http://abc.com will both appear in the dumped
webdb (one is from the seeds file and the other is from the outlinks of other
URLs). Will that
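One way to collapse the two forms is a rule in conf/regex-normalize.xml that appends the missing slash to bare host URLs. This is a sketch, not a stock Nutch rule:

```xml
<regex>
  <!-- URL is scheme://host with nothing after the host: add the slash -->
  <pattern>^(https?://[^/]+)$</pattern>
  <substitution>$1/</substitution>
</regex>
```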
I developed a plugin and tried to run it using nutch plugin
plugin-name fully-qualified-class-name arg1 arg2 on
Nutch 0.8.
But it says my plugin is not present or inactive.
I tried nutch plugin with the known plugin language-identifier as:
./nutch plugin languageidentifier
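A common cause of the "not present or inactive" message is that the plugin's id is not matched by plugin.includes in conf/nutch-site.xml. A sketch, where my-plugin is a placeholder for the id declared in your plugin's plugin.xml and the rest of the value mirrors a typical 0.8 default:

```xml
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)|language-identifier|my-plugin</value>
</property>
```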
My apology that the subject doesn't say what I am asking.
When I started composing the email I was thinking that
nutch plugin was not reading the config file, but further inspection
revealed that it is reading nutch-site.xml.
-Original Message-
From: Teruhiko Kurosaka
Sent:
Sami, thanks. After setting log4j.logger.org.apache.hadoop=INFO, fetcher
status is logged in hadoop.log file.
--
AJ Chen, PhD
http://web2express.org