OK. I found that the crawl is writing crawldb to my home directory instead
of crawldb, presumably because I ran it from the wrong place, and presumably I
will be able to index this in Solr from the current location. So, good news!
Thanks.
On Thu, Sep 22, 2011 at 14:03, Markus Jelsma
Ha! But out of curiosity, why is the average score so low out of 1.0? That
seems pretty darned weak, whatever it is.
TOTAL urls: 1241
retry 0:1241
min score: 0.0
avg score: 0.0049016923
max score: 1.0
status 1 (db_unfetched):1001
status 2 (db_fetched): 224
is going on in your configuration?
It might be best to run readdb incrementally on smaller test fetches to make
sure you're fetching everything you want to.
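
One way to act on that advice is to parse the stats text after each small test fetch and watch the share of URLs still unfetched. The following is only a minimal sketch: the helper names `parse_readdb_stats` and `unfetched_fraction` are made up for illustration, and it assumes the stats lines look like the ones quoted in this thread.

```python
import re

def parse_readdb_stats(text):
    """Extract the URL total and per-status counts from readdb stats text.

    Assumes lines shaped like the ones quoted in this thread, e.g.
    'TOTAL urls: 1241' and 'status 1 (db_unfetched):1001'.
    """
    stats = {"total": None, "statuses": {}}
    for line in text.splitlines():
        # re.search tolerates any logging prefix in front of the stats line.
        status = re.search(r"status\s+\d+\s+\((\w+)\):\s*(\d+)", line)
        if status:
            stats["statuses"][status.group(1)] = int(status.group(2))
            continue
        total = re.search(r"TOTAL urls:\s*(\d+)", line)
        if total:
            stats["total"] = int(total.group(1))
    return stats

def unfetched_fraction(stats):
    """Share of known URLs that have not been fetched yet."""
    return stats["statuses"].get("db_unfetched", 0) / stats["total"]

# Sample copied from the stats quoted in this thread.
SAMPLE = """TOTAL urls: 1241
retry 0:1241
min score: 0.0
avg score: 0.0049016923
max score: 1.0
status 1 (db_unfetched):1001
status 2 (db_fetched): 224
"""

if __name__ == "__main__":
    parsed = parse_readdb_stats(SAMPLE)
    print(parsed["statuses"])
    print(f"unfetched: {unfetched_fraction(parsed):.1%}")
```

A high unfetched share after a shallow crawl may simply mean that newly discovered links have not yet been put into a fetch list; comparing the number across successive test fetches is more telling than any single value.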
On Fri, Sep 30, 2011 at 2:23 PM, Fred Zimmerman w...@nimblebooks.com
wrote:
What does this mean? Why is db_unfetched so high?
I want to know
Hi,
I have a bunch of test crawls that I carried out in the past sitting
around. Most of them are indexed by Solr, configured per nutch-config to run
again in 30 days. These old crawls are a subset of (and redundant to) my
current master crawl. How should I get rid of these old crawls so
Hi,
I am looking for advice on how to configure Nutch (and Solr) to crawl a
private Wikipedia mirror.
- It is my mirror on an intranet so I do not need to be polite to myself.
- I need to complete this 11 million page crawl as fast as I reasonably
can.
- Both crawler and mirror are
Hi -- I am having trouble with the solrindexer parameters -- I see that
Lewis had similar problems a few months ago. Any idea what I am doing wrong?
bitnami@ip-10-202-202-68:~/nutch-1.3/nutch-1.3/runtime/local$ bin/nutch
solrindex http://zimzazsearch3-1.bitnamiapp.com:8983/solr/ crawl/crawldb
' be a problem but updating the crawldb and generating fetch lists is
going to be a problem.
Are you also indexing? Then that will also be a very costly process.
Cheers
On Saturday 08 October 2011 19:29:49 Fred Zimmerman wrote:
Hi,
I am looking for advice on how to configure Nutch (and Solr
and finally to Solr.
On Monday 10 October 2011 16:28:02 Fred Zimmerman wrote:
OK, that sounds good. Tell me about the indexing. I came across an
article where someone had indexed about 10% of a wikipedia clone
http://h3x.no/2011/05/10/guide-solr-performance-tuning
who with a much
I'm still having trouble with this in 1.3. It looks as if there's something
dumb with the syntax or file structure, but I can't get it.
$ bin/nutch solrindex http://search.zimzaz.com:8983/solr crawl/crawldb
-linkdb crawl/linkdb crawl/segments/*
SolrIndexer: starting at 2011-10-25 23:26:02
OK, I've fixed the problem with the parameters giving incorrect paths to the
files. Now I get this:
$ bin/nutch solrindex http://search.zimzaz.com:8983/solr crawl/crawldb
crawl/linkdb crawl/segments/*
SolrIndexer: starting at 2011-10-26 12:57:57
java.io.IOException: Job failed!
. If that happens there's usually a field
mismatch when indexing.
On Wednesday 26 October 2011 14:59:02 Fred Zimmerman wrote:
OK, I've fixed the problem with the parameters giving incorrect paths to
the files. Now I get this:
$ bin/nutch solrindex http://search.zimzaz.com:8983/solr crawl
in the mailing lists. If it's
open
On Wednesday 26 October 2011 15:07:56 Fred Zimmerman wrote:
that's it.
org.apache.solr.common.SolrException: ERROR:unknown field 'content'
*ERROR:unknown field 'content'*
request: http://url/solr/update?wt=javabin&version=2
archives for key words or else try the
Solr user lists. I think that you are much more likely to get a
substantiated response there.
Thank you
On Wed, Oct 26, 2011 at 3:31 PM, Fred Zimmerman zimzaz@gmail.com
wrote:
I added just the content field ... I have already modified solr's
1) I resolved the issues with solrindex. It turned out to be a matter of
adding all the Nutch schema-specific fields to Solr's schema.xml. There was
one gotcha, which is that the latest Solr schema does not have a default
fieldtype text as in Nutch 1.3/schema.xml; you must use text_general. A
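
For anyone hitting the same "unknown field" error, a quick check of schema.xml before indexing can show which fields or field types are missing. This is a rough sketch only: `missing_fields` and `undefined_types` are hypothetical helper names, the sample schema is invented for the example, and the list of required fields has to come from your own Nutch schema (only `content` is taken from the error in this thread).

```python
import xml.etree.ElementTree as ET

def missing_fields(schema_xml, required):
    """Return the required field names that the schema does not declare."""
    root = ET.fromstring(schema_xml)
    declared = {field.get("name") for field in root.iter("field")}
    return [name for name in required if name not in declared]

def undefined_types(schema_xml):
    """Return field types that fields reference but the schema never defines."""
    root = ET.fromstring(schema_xml)
    # Check both spellings of the type element; schemas have used each.
    defined = {t.get("name") for t in root.iter("fieldType")}
    defined |= {t.get("name") for t in root.iter("fieldtype")}
    used = {field.get("type") for field in root.iter("field")}
    return sorted(used - defined)

# Invented sample: 'title' refers to a 'text' type that is not defined,
# and there is no 'content' field at all.
SAMPLE_SCHEMA = """
<schema name="example" version="1.4">
  <types>
    <fieldType name="string" class="solr.StrField"/>
    <fieldType name="text_general" class="solr.TextField"/>
  </types>
  <fields>
    <field name="id" type="string" indexed="true" stored="true"/>
    <field name="title" type="text" indexed="true" stored="true"/>
  </fields>
</schema>
"""

if __name__ == "__main__":
    print("missing fields:", missing_fields(SAMPLE_SCHEMA, ["id", "content"]))
    print("undefined types:", undefined_types(SAMPLE_SCHEMA))
```

The check looks only at explicit `<field>` declarations, so a name that is covered by a `<dynamicField>` pattern would still be reported as missing.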
I'm having the exact same problem. I am trying to isolate whether it is a
Solr problem or a Nutch+Solr problem.
On Wed, Oct 26, 2011 at 11:54 PM, arkadi.kosmy...@csiro.au wrote:
Hi,
I am working with a Nutch 1.4 snapshot and having a very strange problem
that makes the system run out of
fetch commands scheduled the first time I do bin/crawl?
https://www.pascaldimassimo.com/2010/06/11/how-to-re-crawl-with-nutch/
--
Fred Zimmerman
When I run bin/crawl once and it generates a segment list with a bunch of
fetch dates in the future, does nutch proactively run those fetches on
those future dates, or do I have to do something to make that happen?