If you follow the tutorial, then the command should be:

$ bin/nutch generate crawl/crawldb crawldb/segments
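The exception in the quoted trace complains that crawldb/current does not exist, which suggests the first argument pointed at a directory that was never injected into. A minimal sketch of a guard around the generate step follows. The function name crawldb_ready and the CRAWLDB and SEGMENTS variables are illustrative, not part of Nutch; the defaults simply mirror the command above, and the "urls" seed directory in the hint is assumed from the tutorial.

```shell
#!/bin/sh
# Sketch only: check that the crawldb has been populated before generating.
# crawldb_ready, CRAWLDB and SEGMENTS are hypothetical names for illustration.

crawldb_ready() {
    # A populated crawldb contains a "current" subdirectory; the
    # InvalidInputException in the trace reports that this path is missing.
    [ -d "$1/current" ]
}

# Defaults mirror the command given above; override via the environment.
CRAWLDB="${CRAWLDB:-crawl/crawldb}"
SEGMENTS="${SEGMENTS:-crawldb/segments}"

if crawldb_ready "$CRAWLDB"; then
    bin/nutch generate "$CRAWLDB" "$SEGMENTS"
else
    # "urls" is the seed directory name assumed from the tutorial.
    echo "No crawldb found at $CRAWLDB; run 'bin/nutch inject $CRAWLDB urls' first" >&2
fi
```

Run from runtime/local, this either starts the generate job or says which path it looked at, which is quicker to read than the Hadoop stack trace.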
On Wed, 9 May 2012 17:05:51 +0100, Lewis John Mcgibbney <[email protected]> wrote:
Which segments are you trying to generate from? Do you maybe need to include them individually, or use a wildcard?

bin/nutch generate crawldb crawldb/segments/*
bin/nutch generate crawldb crawldb/segments/segmentNo

?

On Wed, May 9, 2012 at 3:33 PM, Stephan Kristyn wrote:

Ok now at the heading "Step-by-Step: Fetching" I get

-bash-4.1$ bin/nutch generate crawldb crawldb/segments
Generator: starting at 2012-05-09 14:32:44
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: jobtracker is 'local', generating exactly one partition.
Generator: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/kristyns/apache-nutch-1.4-bin/runtime/local/crawldb/current
    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
    at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
    at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
    at org.apache.nutch.crawl.Generator.generate(Generator.java:538)
    at org.apache.nutch.crawl.Generator.run(Generator.java:704)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.crawl.Generator.main(Generator.java:660)

Strange...

On 09.05.2012 16:04, Stephan Kristyn wrote:

Hi, it seems like I forgot to fetch the crawled URLs, as mentioned in the tutorial: http://wiki.apache.org/nutch/NutchTutorial

I'll let you know if and how that worked out for me.

On 09.05.2012 14:28, Stephan Kristyn wrote:

This is the query that the Solr interface generates when I enter "test" and hit the search button:

http://myDomain:8983/solr/select/?q=test&version=2.2&start=0&rows=10&indent=on

Maybe this is a question better suited for the Solr ML?

From: Lewis John Mcgibbney [mailto:[email protected]]
Sent: Wednesday, 9 May 2012 13:34
To: [email protected]
Subject: Re: HTTP ERROR 400

Are you attempting to index to Solr, or is this simply when you start your Solr server?

On Wed, May 9, 2012 at 12:21 PM, Stephan Kristyn wrote:

I copied over the schema and everything else in conf from Nutch.

$ cp apache-nutch-1.4-bin/runtime/local/conf/* apache-solr-3.6.0/example/solr/conf/

On 09.05.2012 12:32, Lewis John Mcgibbney wrote:

Which schema are you using with your Solr server?

On Wed, May 9, 2012 at 11:17 AM, Stephan Kristyn wrote:

Also, entering

java -jar post.jar *.xml

on RHEL6 I get

INFO: [] webapp=/solr path=/update params={} status=400 QTime=42
SimplePostTool: FATAL: Solr returned an error #400 ERROR: [doc=GB18030TEST] unknown field 'name'

Thanks, Stephan

On 09.05.2012 12:11, Stephan Kristyn wrote:

Hi, after installing Nutch and Solr I get

HTTP ERROR 400
Problem accessing /solr/select/. Reason:
    undefined field text
Powered by Jetty://

Any ideas how to fix this?

Thanks, Stephan

--
stephan kristyn
partner operations manager
"The Internet? Is that thing still around?" - Homer Simpson
[email protected]
direct +49 (0)89 231 97 207
mobile +49 (0) 162 28899 02
yahoo! deutschland gmbh
theresienhoehe 12, munich, 80339, germany
phone (408) 349 3300
fax (408) 349 3301

--
Lewis
-- Markus Jelsma - CTO - Openindex

