Re: how to combine two run's result for search

2006-09-05 Thread Feng Ji
into a single crawldb and a single segments, then re-run invertlinks and index to create a single index, which can then be searched. Dennis Feng Ji wrote: Hi there, In Nutch 08, I have crawled from two webDBs independently. For each run, I did invertlinks and index. So each one

filter urls from search result

2006-09-05 Thread Feng Ji
Hi there, I want to filter out particular URLs from the search results, and I am trying to use the segment merger to do it. First, I put the target URLs in regex-urlfilter.txt and automaton-urlfilter.txt, as -http://abc.com/;. Then I ran nutch/mergesegs and nutch/index, but the search page still shows the urls I
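
For readers following this thread, here is a minimal sketch of how a list of signed regex rules like the one quoted above is commonly evaluated. It is not Nutch's own code. The class name UrlRuleSketch and the method accepts are hypothetical, and the semantics are assumptions: the first matching rule decides, a leading "-" rejects, a leading "+" accepts, and a URL that matches no rule is rejected.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration only; this does not call any Nutch API.
final class UrlRuleSketch {

    // Returns true if the URL is accepted by the rule list.
    // Assumed semantics: the first rule whose regex is found in the URL wins.
    static boolean accepts(List<String> ruleLines, String url) {
        for (String line : ruleLines) {
            // Skip blank lines and comment lines.
            if (line.isEmpty() || line.charAt(0) == '#') {
                continue;
            }
            char sign = line.charAt(0);
            Pattern pattern = Pattern.compile(line.substring(1));
            if (pattern.matcher(url).find()) {
                return sign == '+';
            }
        }
        // No rule matched: treat the URL as rejected (an assumption).
        return false;
    }

    public static void main(String[] args) {
        // "-http://abc.com/" is the rule from the message; "+." is an assumed catch-all.
        List<String> rules = Arrays.asList("-http://abc.com/", "+.");
        System.out.println(accepts(rules, "http://abc.com/page.html"));
        System.out.println(accepts(rules, "http://example.org/"));
    }
}
```

Under these assumptions, rule order matters: a catch-all accept rule placed before the exclusion rule would let the unwanted URLs through.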

how to speed up crawling procedure

2006-09-04 Thread Feng Ji
hi there, Using nutch 08, it takes me more than 1 day to crawl 30,000 pages from 1 crawldb list. I am using linux and java 1.5 on a dual-CPU dell server. My fetch settings are the defaults, which means the file size is limited. I wonder if there are other things I can do to speed up the crawling

Re: how to speed up crawling procedure

2006-09-04 Thread Feng Ji
hi Frank: Is the following config for your thread setup? fetcher.threads.per.host in nutch-default.xml thanks, Michael, On 9/4/06, Frank Kempf [EMAIL PROTECTED] wrote: Hi, this sure is a question about scaling an application in general. You could be either bottlenecked by 1. Network 2.

how to combine two run's result for search

2006-09-04 Thread Feng Ji
Hi there, In Nutch 08, I have crawled from two webDBs independently. For each run, I did invertlinks and index. So each one is searchable. Now I want to combine them together for search. I tried the merge command to merge the two indexes, but the search for the result index output dir is dull. Do I

same urls with only extra backslash (nutch 08)

2006-09-01 Thread Feng Ji
hi, I found there is a case where two identical urls will both be included in the webdb. The only difference is the presence or absence of a trailing slash. For example, http://abc.com/ and http://abc.com will both appear in the dumped webdb (one is from the seeds file and the other is from the outlinks of other urls). Will that
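
To illustrate the duplicate described above, here is a minimal sketch that maps both forms to one string by giving an empty path the value "/". It uses only java.net.URI from the JDK. The class name TrailingSlashSketch and the method normalize are hypothetical, and this is not how Nutch itself normalizes URLs.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical illustration only; this does not call any Nutch API.
final class TrailingSlashSketch {

    // If the URL has a host but an empty path, return it with "/" as the path.
    // Any other URL is returned unchanged.
    static String normalize(String url) {
        URI uri = URI.create(url);
        String path = uri.getPath();
        if (uri.getHost() != null && (path == null || path.isEmpty())) {
            try {
                return new URI(uri.getScheme(), uri.getAuthority(), "/",
                        uri.getQuery(), uri.getFragment()).toString();
            } catch (URISyntaxException e) {
                throw new IllegalArgumentException(e);
            }
        }
        return url;
    }

    public static void main(String[] args) {
        // Both forms from the message end up as the same string.
        System.out.println(normalize("http://abc.com"));
        System.out.println(normalize("http://abc.com/"));
    }
}
```

Applying a step like this before URLs are stored would make the seed entry and the outlink entry compare as equal.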

when to use cmd parse to parse a segment's pages

2006-08-30 Thread Feng Ji
hi, I am following the nutch08 tutorial. The crawl steps are inject, generate, fetch, update. But there is a command in nutch/bin called parse, which parses a segment's pages. I wonder if I should use it before update in the above steps. Currently, I didn't use parse cmd and update still see

Re: when to use cmd parse to parse a segment's pages

2006-08-30 Thread Feng Ji
Exactly! It solves my puzzle, thanks, Michael, On 8/30/06, Zaheed Haque [EMAIL PROTECTED] wrote: Hi, because you have the parse option set to true in nutch-site.xml. Try setting it to false if you want to parse manually. Or override the config with the fetch -noParsing option. Cheers On 8/30/06, Feng Ji [EMAIL

httpclient fetcher error in hadoop log

2006-08-30 Thread Feng Ji
hi there, I get a high percentage of httpclient fetch errors in the hadoop log, as follows: httpclient.HttpMethodDirector : httpclient.HttpMethodDirector - Redirect requested but followRedirects is disabled : I set up plugin.includes in nutch-site.xml as

some urls in fetch list is not being fetched

2006-08-30 Thread Feng Ji
hi there, I am running Nutch 0.8. A weird thing is that some urls are generated in the fetchlist (I printed the urls for debugging in map() of generator.java and checked the dumped text from /crawl_generate). These urls are in the fetchlist. But I couldn't find them in the log/hadoop for fetcher segment.

Re: some urls in fetch list is not being fetched

2006-08-30 Thread Feng Ji
/crawl_generate Any hint you could provide? thanks, Michael, On 8/30/06, Feng Ji [EMAIL PROTECTED] wrote: hi there, I am running Nutch 0.8. A weird thing is that some urls are generated in the fetchlist (I printed the urls for debugging in map() of generator.java and checked the dumped text from

show additional lucene index information on Nutch's Search Page

2006-08-20 Thread Feng Ji
Hi there, I used the indexer to store one additional field in the lucene index, with Field.Store.YES, Field.Index.NO. (I will only add one single field; I have seen the discussion about the performance penalty of this.) Then I want to retrieve it from nutch's search page. I took a look at how nutch gets explanation

turn on debug log on nutch-0.8.

2006-08-11 Thread Feng Ji
Hi there, I found that nutch-0.8. uses apache's commons logging system http://jakarta.apache.org/commons/logging/apidocs/index.html During development, I'd like to turn on debug mode if (log.isDebugEnabled()) { ... I checked nutch-default.xml, but can't find a place to turn it on. Does

how to show log in nutch-0.8. release package

2006-08-09 Thread Feng Ji
hi there, I found there is no log output while running the nutch-0.8. release package. For example, in fetcher.java, LOG.isInfoEnabled() returns false, so no information about the urls being fetched is shown. I wonder how to turn logging on? I checked nutch-default.xml and can't find a field. Could anyone give

Re: nutch08 indexer error

2006-08-08 Thread Feng Ji
I tried the nutch08 release. http://lucene.apache.org/nutch/#25+July+2006%3A+Nutch+0.8+Released Everything is working fine. I guess the instability of the version checked out from SVN is due to nutch09's ongoing development. Michael, On 8/8/06, Feng Ji [EMAIL PROTECTED] wrote: hi

Works by Adding Agency --- Data Re: fetcher failure

2006-08-06 Thread Feng Ji
Cheers Zaheed On 8/6/06, Feng Ji [EMAIL PROTECTED] wrote: Hi there, I wonder if anyone has had a similar experience to mine. I checked out nutch 08 today, svn checkout http://svn.apache.org/repos/asf/lucene/nutch/trunk/ nutch , with version tag of 428997 However, somehow, I got the following

fetcher failure

2006-08-05 Thread Feng Ji
Hi there, I wonder if anyone has had a similar experience to mine. I checked out nutch 08 today, svn checkout http://svn.apache.org/repos/asf/lucene/nutch/trunk/ nutch , with version tag of 428997 However, somehow, I got the following weird error log when crawling a single url Fetcher:

Re: page ranking computation in Nutch 08

2006-07-14 Thread Feng Ji
I am having difficulty finding which Java class contains these functions. thanks, Feng Ji On 6/25/06, Andrzej Bialecki [EMAIL PROTECTED] wrote: TDLN wrote: In 0.8-dev score is calculated in a ScoringFilter implementation, default is score-opic plugin

when to use STATUS_SIGNATURE in CrawlDatum

2006-07-04 Thread Feng Ji
hi, I wonder in which cases nutch sets STATUS_SIGNATURE for a CrawlDatum. I was just curious when I saw this flag in that class. thanks, Michael Ji,

page ranking computation in Nutch 08

2006-06-24 Thread Feng Ji
Hi there, I wonder which nutch/bin/ command or which java class in nutch 08 does something similar to what org.apache.nutch.tools.LinkAnalysisTool did in nutch 07, which iteratively calculates the page score for each url. thanks, Feng Ji

Re: Classnotfoundexception in https plugin

2005-07-19 Thread Feng Ji
Hi there, I have successfully checked out Nutch and compiled it, thanks for all the hints. By the way, what is the difference between Anonymous Subversion and Committer Subversion Access? I guess Committer Subversion Access has the right to check code back in. Is that right? thanks,