into a single crawldb and a single segments directory, then re-run
invertlinks and index to create a single index, which can then be
searched.
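Dennis's merge recipe can be sketched roughly as follows (the crawl1/, crawl2/, and crawl-merged/ paths are hypothetical, and the exact argument order should be checked against the usage message each `bin/nutch` command prints in 0.8):

```shell
# Merge the two crawl databases into one (CrawlDbMerger):
bin/nutch mergedb crawl-merged/crawldb crawl1/crawldb crawl2/crawldb

# Merge the segments of both crawls into a single segment (SegmentMerger):
bin/nutch mergesegs crawl-merged/segments -dir crawl1/segments -dir crawl2/segments

# Rebuild the link database and a single index from the merged data:
bin/nutch invertlinks crawl-merged/linkdb -dir crawl-merged/segments
bin/nutch index crawl-merged/indexes crawl-merged/crawldb crawl-merged/linkdb crawl-merged/segments/*
```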
Dennis
Feng Ji wrote:
Hi there,
In Nutch 0.8, I have crawled two webDBs independently.
For each run, I did invertlinks and index. So each one
Hi there,
I want to filter particular URLs out of the search results,
and I tried to use the segment merger to do it.
First, I put the target URLs in regex-urlfilter.txt and automaton-urlfilter.txt,
as -http://abc.com/
Then I ran nutch/mergesegs and nutch/index, but the search page still
shows the URLs I
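For reference, a minimal exclusion rule in conf/regex-urlfilter.txt looks like the following (each rule is a regular expression on its own line, with no trailing semicolon; abc.com stands in for the real site):

```
# Skip everything under abc.com
-^http://abc\.com/

# Accept anything else
+.
```

Note that, as far as I recall, SegmentMerger only applies URL filters when it is invoked with its -filter option; without that option the merged segment keeps all URLs regardless of the filter files.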
hi there,
Using Nutch 0.8, it takes me more than a day to crawl 30,000 pages
from one crawldb list. I am using Linux and Java 1.5 on a dual-CPU Dell
server.
My fetch settings are the defaults, meaning the file size is limited.
I wonder if there is anything else I can do to speed up the crawling
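The knobs most often tuned for fetch throughput live in conf/nutch-site.xml; the values below are illustrative only, and raising them trades politeness for speed (fetcher.threads.per.host, mentioned in the reply below, additionally caps concurrency against any single host):

```xml
<property>
  <name>fetcher.threads.fetch</name>
  <value>50</value>
  <description>Total number of fetcher threads (default is 10).</description>
</property>
<property>
  <name>fetcher.server.delay</name>
  <value>1.0</value>
  <description>Seconds between requests to the same server (default is 5.0).</description>
</property>
```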
hi Frank,
Is the following the config for your thread setup?
fetcher.threads.per.host in nutch-default.xml
thanks,
Michael,
On 9/4/06, Frank Kempf [EMAIL PROTECTED] wrote:
Hi,
this sure is a question about scaling an application in general.
You could be either bottlenecked by
1. Network
2.
Hi there,
In Nutch 0.8, I have crawled two webDBs independently.
For each run, I did invertlinks and index, so each one is searchable.
Now I want to combine them for searching. I tried the merge command to
merge the two indexes, but searching the resulting index output dir is dull.
Do I
hi,
I found a case where two nearly identical URLs are both included in the webdb; the
only difference is with/without the trailing slash.
That is, http://abc.com/ and http://abc.com will both appear in the dumped
webdb (one is from the seeds file and the other is from the outlinks of other
URLs). Will that
hi,
I am following the Nutch 0.8 tutorial. The steps to crawl are inject,
generate, fetch, update.
But there is a command in nutch/bin called parse, which parses a segment's
pages. I wonder if I should use it before update in the steps above.
Currently I don't use the parse command, and update still see
Exactly!
It solves my puzzle,
thanks,
Michael,
On 8/30/06, Zaheed Haque [EMAIL PROTECTED] wrote:
Hi
Because you have the parse option set to true in nutch-site.xml. Try setting it to false
if you want to parse manually, or override the config with fetch's
-noParsing option.
Cheers
On 8/30/06, Feng Ji [EMAIL
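Zaheed's manual-parsing workflow can be sketched as follows (the segment path is hypothetical, and the flags should be checked against the 0.8 usage messages):

```shell
# Fetch without parsing (equivalent to setting fetcher.parse=false):
bin/nutch fetch crawl/segments/20060830123456 -noParsing

# Parse the fetched content as a separate step:
bin/nutch parse crawl/segments/20060830123456

# Then update the crawldb as usual:
bin/nutch updatedb crawl/crawldb crawl/segments/20060830123456
```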
hi there,
I am getting a huge percentage of fetch errors for httpclient in the hadoop log, like
the following:
httpclient.HttpMethodDirector
:
httpclient.HttpMethodDirector - Redirect requested but followRedirects is
disabled
:
I set up plugin.includes in nutch-site.xml as
hi there,
I am running Nutch 0.8.
A weird thing is that some URLs are generated into the fetchlist (I debug-printed
the URLs in map() of Generator.java and checked the text dumped from
/crawl_generate). These URLs are in the fetchlist.
But I couldn't find them in logs/hadoop for the fetcher run on that segment.
Any hint you could provide?
thanks,
Michael,
On 8/30/06, Feng Ji [EMAIL PROTECTED] wrote:
hi there,
I am running Nutch 0.8.
A weird thing is that some URLs are generated into the fetchlist (I debug-printed
the URLs in map() of Generator.java and checked the text dumped from
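To check what a segment actually contains at each stage, the segment reader can dump it to text (the segment path is hypothetical; see the `bin/nutch readseg` usage message in 0.8 for the full option list):

```shell
# Dump the segment, including its crawl_generate entries, to a text file:
bin/nutch readseg -dump crawl/segments/20060830123456 dump-out

# Inspect which URLs were generated vs. actually fetched:
less dump-out/dump
```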
Hi there,
I used the indexer to store one additional field in the Lucene index, with
Field.Store.YES and Field.Index.NO. (I will only add a single field; I have seen
the discussion about the performance penalty of this.)
Then I want to retrieve it from Nutch's search page. I took a look at how
Nutch gets the explanation
Hi there,
I found Nutch 0.8 uses Apache's Commons Logging system:
http://jakarta.apache.org/commons/logging/apidocs/index.html
During development, I'd like to turn on debug mode:
if (LOG.isDebugEnabled()) {
...
I checked nutch-default.xml but can't find a place to turn it on.
Does
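Log levels in Nutch 0.8 are not controlled from nutch-default.xml; Commons Logging is routed through log4j, so debug output is enabled in conf/log4j.properties instead. A minimal sketch (logger names assumed from the package layout):

```
# conf/log4j.properties
# Turn on DEBUG for all of Nutch:
log4j.logger.org.apache.nutch=DEBUG

# Or for a single class only, e.g. the fetcher:
log4j.logger.org.apache.nutch.fetcher.Fetcher=DEBUG
```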
hi there,
I found there is no logging while running the Nutch 0.8 release package.
For example, in Fetcher.java, LOG.isInfoEnabled() comes out false, so no
fetching-URL information is shown.
I wonder how to turn logging on? I checked nutch-default.xml and can't find
a field.
Anyone could give
I tried the Nutch 0.8 release.
http://lucene.apache.org/nutch/#25+July+2006%3A+Nutch+0.8+Released
Everything is working fine. I guess the instability of the version checked
out from SVN is due to Nutch 0.9's ongoing development.
Michael,
On 8/8/06, Feng Ji [EMAIL PROTECTED] wrote:
hi
Cheers
Zaheed
On 8/6/06, Feng Ji [EMAIL PROTECTED] wrote:
Hi there,
I wonder if anyone has had a similar experience to mine.
I checked out Nutch 0.8 today,
svn checkout http://svn.apache.org/repos/asf/lucene/nutch/trunk/ nutch
,
at revision 428997.
However, somehow, I got the following
Hi there,
I wonder if anyone has had a similar experience to mine.
I checked out Nutch 0.8 today,
svn checkout http://svn.apache.org/repos/asf/lucene/nutch/trunk/ nutch ,
at revision 428997.
However, somehow, I got the following weird error log for a single-URL
crawl:
Fetcher:
I am having difficulty finding which Java class contains these functions.
thanks,
Feng Ji
On 6/25/06, Andrzej Bialecki [EMAIL PROTECTED] wrote:
TDLN wrote:
In 0.8-dev the score is calculated in a ScoringFilter implementation;
the default is the scoring-opic plugin.
hi,
I wonder in which cases Nutch sets STATUS_SIGNATURE on a
CrawlDatum.
I was just curious when I saw this flag in that class.
thanks,
Michael Ji,
Hi there,
I wonder which nutch/bin command or which Java class in Nutch 0.8 does
the same thing as org.apache.nutch.tools.LinkAnalysisTool did in Nutch
0.7, which iteratively calculated a page score for each URL.
thanks,
Feng Ji
Hi there,
I have successfully checked out and compiled Nutch, thanks to
all the hints.
By the way, what is the difference between
Anonymous Subversion and Committer Subversion Access?
I guess Committer Subversion Access has the right to check code back
in. Is that right?
thanks,