Jason Camp wrote:
Hi,
I'm trying to gauge whether one crawl server is performing well, and I'm
having a tough time determining whether I can increase settings to
get faster crawls, or whether I'm approaching the maximum the server can handle.
The server is a dual AMD Athlon 2200 with 2 GB of RAM.
Seems to be a bit better, doesn't it?
bash-3.00$ bin/nutch crawl urls -dir crawled -depth 2
060425 110124 parsing
jar:file:/home/../nutch-nightly/lib/hadoop-0.1-dev.jar!/hadoop-default.xml
060425 110124 parsing file:/home/../nutch-nightly/conf/nutch-default.xml
060425 110124 parsing
jar:file:/home/../nutch-nightly/lib/hadoop-0.1-dev.jar!/hadoop-default.xml
Somehow you are still using hadoop-0.1 and not 0.1.1. I am not sure if
this update will solve your problem, but it might. With the config I
sent you, I could crawl-index-search, so there must be something
else... I am not
Sorry, my mistake. Changed to 0.1.1.
Results:
bash-3.00$ bin/nutch crawl urls -dir crawled -depth 2
060425 113831 parsing
jar:file:/home/../nutch-nightly/lib/hadoop-0.1.1.jar!/hadoop-default.xml
060425 113831 parsing file:/home/../nutch-nightly/conf/nutch-default.xml
060425 113832 parsing
I've applied the patch in the ticket linked to below. I browsed the patch
to try to figure out how to use this plugin, and I'm having trouble
getting it working.
Before I get into the details, if someone has a source of information
describing
how nutch starts up and initializes plugins
Shouldn't that be subcollection:wiki instead? Also, I assumed you had
added subcollection to plugin.includes in the config file (nutch-site.xml).
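In case it helps, here is roughly what that would look like in nutch-site.xml. This is only a sketch from memory: the plugin id ("subcollection") and the other entries in the value are examples, so match them to the patch you applied and the plugins you already use rather than copying this verbatim.

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)|subcollection</value>
</property>

If I remember right the plugin also registers an indexing filter, so the segments have to be re-indexed before subcollection: queries can match anything.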
Andrew Libby wrote:
I've applied the patch in the ticket linked to below. I browsed the patch
to try to figure out how to use this plugin, and
Okay, I sort of answered my own question on this.
From looking at the index with Luke, it seemed that I could use the url field
to restrict searches. I found that the main categories in my site had
url field values equal to the top-level path in the URL. So I just added
search terms
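For example, a query along these lines (the search term and path are made up, and it assumes the query-url plugin is enabled in plugin.includes so that the url: clause is handled):

  widgets url:products

restricts hits for "widgets" to pages whose url field contains "products".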
Can somebody direct me on how to get the stored text and parse metadata
for a given url?
Dennis
NutchBean.getContent() and NutchBean.getParseData() do this, but require
a HitDetails instance. In the non-distributed case, the only required
field of the HitDetails for these calls is url. In the distributed
case, the segment field must also be provided, so that the request can
be routed
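A rough sketch of the local (non-distributed) case, written from memory against the current 0.8-dev API and not tested, so treat the signatures as approximate; the URL is made up:

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.parse.ParseData;
import org.apache.nutch.searcher.HitDetails;
import org.apache.nutch.searcher.NutchBean;
import org.apache.nutch.util.NutchConfiguration;

public class LookupByUrl {
  public static void main(String[] args) throws Exception {
    Configuration conf = NutchConfiguration.create();
    // NutchBean locates the index and segments via searcher.dir in the config.
    NutchBean bean = new NutchBean(conf);

    // Local case: a HitDetails carrying only the url field is enough.
    // Distributed case: the segment field would have to be set as well.
    String url = "http://www.example.com/some/page.html";
    HitDetails details = new HitDetails(new String[] { "url" },
                                        new String[] { url });

    byte[] rawContent = bean.getContent(details);      // stored raw content
    ParseData parseData = bean.getParseData(details);  // parse metadata
    System.out.println(parseData.getTitle());
    System.out.println(rawContent.length + " bytes of content");
  }
}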
That got me started. I think that I am not fully understanding the role
the segments directory and its contents play. It looks like it holds
parse text and parse data in map files, but what is the content folder
(also a map file)? And are the segment contents still used once the index is
created?
Truly I am just not understanding the concept of a segment.
Dennis Kubes wrote:
That got me started. I think that I am not fully understanding the
role the segments directory and its contents play. It looks like it
holds parse text and parse data in map files, but what is the content
folder
Dennis Kubes wrote:
Can somebody direct me on how to get the stored text and parse
metadata for a given url?
From a single segment, or from a set of segments?
From a single segment: please see how SegmentReader.get() does this
(although it's a bit obscured by the fact that it uses multiple
Any other Australians out there?
I created a frappr map at http://www.frappr.com/nutch if people are
interested.
regards
Ian
On 24/04/2006, at 8:17 AM, Stefan Groschupf wrote:
Berlin would work for me too. Just so we plan it well in
advance... (4 weeks).
The wizards of open source
Dennis Kubes wrote:
I think that I am not fully understanding the role
the segments directory and its contents play.
A segment is simply a set of urls fetched in the same round, and data
associated with these urls. The content subdirectory contains the raw
http content. The parse-text
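If you want to poke at a segment directly rather than going through NutchBean, each of those subdirectories is a set of Hadoop MapFiles keyed by URL, one per reduce task. A rough, untested sketch, assuming UTF8 keys, a single part-00000, and a made-up segment path:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.UTF8;
import org.apache.nutch.parse.ParseText;
import org.apache.nutch.util.NutchConfiguration;

public class DumpParseText {
  public static void main(String[] args) throws Exception {
    Configuration conf = NutchConfiguration.create();
    FileSystem fs = FileSystem.get(conf);

    // One MapFile directory per reduce task; a local crawl usually has just part-00000.
    String dir = "crawled/segments/20060425113831/parse_text/part-00000";
    MapFile.Reader reader = new MapFile.Reader(fs, dir, conf);

    ParseText value = new ParseText();
    if (reader.get(new UTF8("http://www.example.com/"), value) != null) {
      System.out.println(value.getText());   // the extracted plain text for that URL
    }
    reader.close();
  }
}

The same pattern applies to parse_data (ParseData values) and content (Content values). And yes, the segments are still used after indexing: the searcher reads them at query time to build summaries and serve cached pages, so they are not just an intermediate product.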