Re: not writing anything to crawldb

2011-09-22 Thread Fred Zimmerman
OK. I found that the crawl is writing the crawldb to my home directory instead of the expected crawldb location, presumably because I ran it from the wrong place, and presumably I will be able to index this in Solr from the current location. So, good news! Thx. On Thu, Sep 22, 2011 at 14:03, Markus Jelsma

Re: not writing anything to crawldb

2011-09-22 Thread Fred Zimmerman
Ha! But out of curiosity, why is the average score so low out of 1.0? That seems pretty darned weak, whatever it is.
TOTAL urls: 1241
retry 0: 1241
min score: 0.0
avg score: 0.0049016923
max score: 1.0
status 1 (db_unfetched): 1001
status 2 (db_fetched): 224
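For anyone reading these statistics, a minimal Python sketch for turning them into per-status shares of the total. It only restates the figures quoted in the message above and does not explain why the average score is low. The function name `parse_stats` is hypothetical and is not part of Nutch. The two statuses shown sum to 1225 rather than 1241 because the quoted output is cut off.

```python
import re

# Figures copied from the statistics quoted in the message above.
STATS = """\
TOTAL urls: 1241
retry 0: 1241
min score: 0.0
avg score: 0.0049016923
max score: 1.0
status 1 (db_unfetched): 1001
status 2 (db_fetched): 224
"""


def parse_stats(text):
    """Return (total, {status_name: count}) parsed from the statistics text.

    Hypothetical helper for illustration; it tolerates a missing space
    after the colon, as in the original flattened output.
    """
    total = int(re.search(r"TOTAL urls:\s*(\d+)", text).group(1))
    statuses = {
        name: int(count)
        for name, count in re.findall(r"status \d+ \((\w+)\):\s*(\d+)", text)
    }
    return total, statuses


total, statuses = parse_stats(STATS)
for name, count in statuses.items():
    # Share of all known URLs that are in this status.
    print(f"{name}: {count} of {total} ({count / total:.1%})")
```

Running the same calculation after each small test fetch shows whether the fetched share is actually growing.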

Re: Interpreting Nutch results

2011-09-30 Thread Fred Zimmerman
is going on in your configuration? It might be best to run readdb incrementally on smaller test fetches to make sure you're fetching everything you want to. On Fri, Sep 30, 2011 at 2:23 PM, Fred Zimmerman w...@nimblebooks.com wrote: What does this mean? Why is db_unfetched so high? I want to know

when and how to delete old crawls?

2011-10-05 Thread Fred Zimmerman
Hi, I have a bunch of test crawls that I carried out in the past sitting around. Most of them are indexed by Solr and configured per nutch-config to run again in 30 days. These old crawls are a subset of (and redundant to) my current master crawl. How should I get rid of these old crawls so

advice, config files for crawling private wikipedia mirror

2011-10-08 Thread Fred Zimmerman
Hi, I am looking for advice on how to configure Nutch (and Solr) to crawl a private Wikipedia mirror.
- It is my mirror on an intranet so I do not need to be polite to myself.
- I need to complete this 11 million page crawl as fast as I reasonably can.
- Both crawler and mirror are

solrindexer parameters -- input path does not exist: crawl_fetch, parse_data, etc.

2011-10-08 Thread Fred Zimmerman
Hi -- I am having trouble with the solrindexer parameters -- I see that Lewis had similar problems a few months ago. Any idea what I am doing wrong?
bitnami@ip-10-202-202-68:~/nutch-1.3/nutch-1.3/runtime/local$ bin/nutch solrindex http://zimzazsearch3-1.bitnamiapp.com:8983/solr/ crawl/crawldb

Re: advice, config files for crawling private wikipedia mirror

2011-10-10 Thread Fred Zimmerman
' be a problem but updating the crawldb and generating fetch lists is going to be a problem. Are you also indexing? Then that will also be a very costly process. Cheers On Saturday 08 October 2011 19:29:49 Fred Zimmerman wrote: HI, I am looking for advice on how to configure Nutch (and Solr

Re: advice, config files for crawling private wikipedia mirror

2011-10-10 Thread Fred Zimmerman
and finally to Solr. On Monday 10 October 2011 16:28:02 Fred Zimmerman wrote: OK, that sounds good. Tell me about the indexing. I came across an article where someone had indexed about 10% of a wikipedia clone http://h3x.no/2011/05/10/guide-solr-performance-tuning who with a much

Re: solrindexer parameters -- input path does not exist: crawl_fetch, parse_data, etc.

2011-10-25 Thread Fred Zimmerman
I'm still having trouble with this in 1.3. It looks as if there's something dumb with the syntax or file structure, but I can't find it.
$ bin/nutch solrindex http://search.zimzaz.com:8983/solr crawl/crawldb -linkdb crawl/linkdb crawl/segments/*
SolrIndexer: starting at 2011-10-25 23:26:02

Re: solrindexer parameters -- input path does not exist: crawl_fetch, parse_data, etc.

2011-10-26 Thread Fred Zimmerman
OK, I've fixed the problem with the parameters giving incorrect paths to the files. Now I get this:
$ bin/nutch solrindex http://search.zimzaz.com:8983/solr crawl/crawldb crawl/linkdb crawl/segments/*
SolrIndexer: starting at 2011-10-26 12:57:57
java.io.IOException: Job failed!

Re: solrindexer parameters -- input path does not exist: crawl_fetch, parse_data, etc.

2011-10-26 Thread Fred Zimmerman
. If that happens there's usually a field mismatch when indexing. On Wednesday 26 October 2011 14:59:02 Fred Zimmerman wrote: OK, I've fixed the problem with the parameters giving incorrect paths to the files. Now I get this: $ bin/nutch solrindex http://search.zimzaz.com:8983/solr crawl

Re: solrindexer parameters -- input path does not exist: crawl_fetch, parse_data, etc.

2011-10-26 Thread Fred Zimmerman
in the mailing lists. If it's open On Wednesday 26 October 2011 15:07:56 Fred Zimmerman wrote: that's it. org.apache.solr.common.SolrException: ERROR:unknown field 'content' *ERROR:unknown field 'content'* request: http://url/solr/update?wt=javabin&version=2

Re: solrindexer parameters -- input path does not exist: crawl_fetch, parse_data, etc.

2011-10-26 Thread Fred Zimmerman
archives for key words or else try the Solr user lists. I think that you are much more likely to get a substantiated response there. Thank you On Wed, Oct 26, 2011 at 3:31 PM, Fred Zimmerman zimzaz@gmail.com wrote: I added just the content field ... I have already modified solr's

1) success 2) how to tell Nutch index everything

2011-10-26 Thread Fred Zimmerman
1) I resolved the issues with solrindex. It turned out to be a matter of adding all the Nutch schema-specific fields to Solr's schema.xml. There was one gotcha, which is that the latest Solr schema does not have a default fieldtype text as in Nutch 1.3/schema.xml; you must use text_general. A
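A minimal Python sketch of the check described in (1): list the fields that the Nutch schema declares but the Solr schema lacks, and flag any field type those fields rely on that Solr does not define. The two XML snippets are hypothetical, heavily trimmed stand-ins rather than the real schema files, and the helper names are illustrative only. The fields `content`, `text` and `text_general` come from this thread; the other entries are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed stand-in for the Nutch-provided schema.xml.
NUTCH_SCHEMA = """
<schema name="nutch">
  <fields>
    <field name="id" type="string"/>
    <field name="url" type="string"/>
    <field name="title" type="text"/>
    <field name="content" type="text"/>
  </fields>
</schema>
"""

# Hypothetical, trimmed stand-in for the schema.xml shipped with Solr.
SOLR_SCHEMA = """
<schema name="example">
  <types>
    <fieldType name="string" class="solr.StrField"/>
    <fieldType name="text_general" class="solr.TextField"/>
  </types>
  <fields>
    <field name="id" type="string"/>
    <field name="title" type="text_general"/>
  </fields>
</schema>
"""


def fields(schema_xml):
    """Map each declared field name to its type."""
    root = ET.fromstring(schema_xml.strip())
    return {f.get("name"): f.get("type") for f in root.iter("field")}


def field_types(schema_xml):
    """Return the set of field type names the schema defines."""
    root = ET.fromstring(schema_xml.strip())
    return {t.get("name") for t in root.iter("fieldType")}


def missing_fields(nutch_xml, solr_xml):
    """Fields declared in the Nutch schema but absent from the Solr schema."""
    solr = fields(solr_xml)
    return {n: t for n, t in fields(nutch_xml).items() if n not in solr}


def undefined_types(nutch_xml, solr_xml):
    """Types used by the missing fields that the Solr schema does not define."""
    return set(missing_fields(nutch_xml, solr_xml).values()) - field_types(solr_xml)


print("fields to add:", missing_fields(NUTCH_SCHEMA, SOLR_SCHEMA))
print("types to remap:", undefined_types(NUTCH_SCHEMA, SOLR_SCHEMA))
```

With these sample inputs, the second check reports `text`, which mirrors the gotcha above: the field has to be declared with `text_general` instead.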

Re: OutOfMemoryError when indexing into Solr

2011-10-27 Thread Fred Zimmerman
I'm having the exact same problem. I am trying to isolate whether it is a Solr problem or a Nutch+Solr problem. On Wed, Oct 26, 2011 at 11:54 PM, arkadi.kosmy...@csiro.au wrote: Hi, I am working with a Nutch 1.4 snapshot and having a very strange problem that makes the system run out of

A couple of basic questions re scheduled crawls.

2018-07-26 Thread Fred Zimmerman
fetch commands scheduled the first time I do bin/crawl? https://www.pascaldimassimo.com/2010/06/11/how-to-re-crawl-with-nutch/ -- Fred Zimmerman

how do fetch wait times work?

2018-04-09 Thread Fred Zimmerman
When I run bin/crawl once and it generates a segment list with a bunch of fetch dates in the future, does Nutch proactively run those fetches on those future dates, or do I have to do something to make that happen?