[Error Crawling Job Failed] NUTCH 1.9
Hello.

I get an error message when I run the command:

    crawl seed/seed.txt crawl -depth 3 -topN 5

Error message:

    SOLRIndexWriter
        solr.server.url : URL of the SOLR instance (mandatory)
        solr.commit.size : buffer size when sending to SOLR (default 1000)
        solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
        solr.auth : use authentication (default false)
        solr.auth.username : use authentication (default false)
        solr.auth : username for authentication
        solr.auth.password : password for authentication
    Indexer: java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
        at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:114)
        at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:176)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:186)

Can anyone explain why this happened?

Best regards,
M.Muchlis
RE: [Error Crawling Job Failed] NUTCH 1.9
Hi - see the logs for more details.

Markus
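With the default log4j setup, a local Nutch 1.x run writes the full stack trace to logs/hadoop.log under the runtime directory, so something like the following shows the underlying cause (the path assumes a stock source build; adjust it if your install logs elsewhere):

    tail -n 100 runtime/local/logs/hadoop.log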
Re: [Error Crawling Job Failed] NUTCH 1.9
Here is the log:

    2014-11-03 16:56:06,530 INFO  indexer.IndexingJob - Indexer: starting at 2014-11-03 16:56:06
    2014-11-03 16:56:06,582 INFO  indexer.IndexingJob - Indexer: deleting gone documents: false
    2014-11-03 16:56:06,582 INFO  indexer.IndexingJob - Indexer: URL filtering: false
    2014-11-03 16:56:06,582 INFO  indexer.IndexingJob - Indexer: URL normalizing: false
    2014-11-03 16:56:06,800 ERROR solr.SolrIndexWriter - Missing SOLR URL. Should be set via -D solr.server.url
    SOLRIndexWriter
        solr.server.url : URL of the SOLR instance (mandatory)
        solr.commit.size : buffer size when sending to SOLR (default 1000)
        solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
        solr.auth : use authentication (default false)
        solr.auth.username : use authentication (default false)
        solr.auth : username for authentication
        solr.auth.password : password for authentication
    2014-11-03 16:56:06,802 ERROR indexer.IndexingJob - Indexer: java.lang.RuntimeException: Missing SOLR URL. Should be set via -D solr.server.url
    [same SOLRIndexWriter parameter listing as above]
        at org.apache.nutch.indexwriter.solr.SolrIndexWriter.setConf(SolrIndexWriter.java:192)
        at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:159)
        at org.apache.nutch.indexer.IndexWriters.<init>(IndexWriters.java:57)
        at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:91)
        at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:176)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:186)
RE: [Error Crawling Job Failed] NUTCH 1.9
Well, here it is:

    java.lang.RuntimeException: Missing SOLR URL. Should be set via -Dsolr.server.url
Re: [Error Crawling Job Failed] NUTCH 1.9
Hi Markus,

Where do I set the Solr URL (-Dsolr.server.url)?
Re: [Error Crawling Job Failed] NUTCH 1.9
Like this?

    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    <!-- Put site-specific property overrides in this file. -->
    <configuration>
      <property>
        <name>http.agent.name</name>
        <value>My Nutch Spider</value>
      </property>
      <property>
        <name>solr.server.url</name>
        <value>http://localhost:8983/solr/ http://localhost:8983/solr/</value>
      </property>
    </configuration>

On Mon, Nov 3, 2014 at 5:41 PM, Markus Jelsma markus.jel...@openindex.io wrote:

    You can set solr.server.url in your nutch-site.xml or pass it via command line as -Dsolr.server.url=URL
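A concrete sketch of the command-line alternative Markus mentions, using paths from this thread (it assumes the 1.9 Indexer usage of crawldb first, then an optional -linkdb, then segments; the -D option is parsed by Hadoop's ToolRunner, so it must come before the positional arguments, and the segment name is the one that appears in the log further down):

    bin/nutch index -Dsolr.server.url=http://localhost:8983/solr/ \
        crawl/crawldb -linkdb crawl/linkdb crawl/segments/20141103163424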
Re: [Error Crawling Job Failed] NUTCH 1.9
Hi Markus,

When I run this command:

    nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*

I get an error. Here is the log:

    2014-11-03 17:55:04,602 INFO  indexer.IndexingJob - Indexer: starting at 2014-11-03 17:55:04
    2014-11-03 17:55:04,652 INFO  indexer.IndexingJob - Indexer: deleting gone documents: false
    2014-11-03 17:55:04,652 INFO  indexer.IndexingJob - Indexer: URL filtering: false
    2014-11-03 17:55:04,652 INFO  indexer.IndexingJob - Indexer: URL normalizing: false
    2014-11-03 17:55:04,860 INFO  indexer.IndexWriters - Adding org.apache.nutch.indexwriter.solr.SolrIndexWriter
    2014-11-03 17:55:04,861 INFO  indexer.IndexingJob - Active IndexWriters :
    SOLRIndexWriter
        solr.server.url : URL of the SOLR instance (mandatory)
        solr.commit.size : buffer size when sending to SOLR (default 1000)
        solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
        solr.auth : use authentication (default false)
        solr.auth.username : use authentication (default false)
        solr.auth : username for authentication
        solr.auth.password : password for authentication
    2014-11-03 17:55:04,865 INFO  indexer.IndexerMapReduce - IndexerMapReduce: crawldb: crawl/indexes
    2014-11-03 17:55:04,865 INFO  indexer.IndexerMapReduce - IndexerMapReduces: adding segment: crawl/crawldb
    2014-11-03 17:55:04,978 INFO  indexer.IndexerMapReduce - IndexerMapReduces: adding segment: crawl/linkdb
    2014-11-03 17:55:04,979 INFO  indexer.IndexerMapReduce - IndexerMapReduces: adding segment: crawl/segments/20141103163424
    2014-11-03 17:55:04,980 INFO  indexer.IndexerMapReduce - IndexerMapReduces: adding segment: crawl/segments/20141103175027
    2014-11-03 17:55:04,981 INFO  indexer.IndexerMapReduce - IndexerMapReduces: adding segment: crawl/segments/20141103175109
    2014-11-03 17:55:05,033 WARN  util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    2014-11-03 17:55:05,110 ERROR security.UserGroupInformation - PriviledgedActionException as:me cause:org.apache.hadoop.mapred.InvalidInputException:
    Input path does not exist: file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/crawldb/crawl_fetch
    Input path does not exist: file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/crawldb/crawl_parse
    Input path does not exist: file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/crawldb/parse_data
    Input path does not exist: file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/crawldb/parse_text
    Input path does not exist: file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/linkdb/crawl_fetch
    Input path does not exist: file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/linkdb/crawl_parse
    Input path does not exist: file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/linkdb/parse_data
    Input path does not exist: file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/linkdb/parse_text
    Input path does not exist: file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/segments/20141103163424/crawl_parse
    Input path does not exist: file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/segments/20141103163424/parse_data
    Input path does not exist: file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/segments/20141103163424/parse_text
    Input path does not exist: file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/indexes/current
    2014-11-03 17:55:05,112 ERROR indexer.IndexingJob - Indexer: org.apache.hadoop.mapred.InvalidInputException:
    [same twelve "Input path does not exist" entries as above]
        at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:197)
        at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:40)
        at
RE: [Error Crawling Job Failed] NUTCH 1.9
No, like this:

    <property>
      <name>solr.server.url</name>
      <value>http://localhost:8983/solr/</value>
    </property>
RE: [Error Crawling Job Failed] NUTCH 1.9
Oh - if you need to index multiple segments, don't use segments/* but -dir segments/
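Putting the two corrections together, a hedged sketch of the indexer invocation (again assuming the 1.9 Indexer usage of crawldb first, then -linkdb, then the segments; passing the crawldb and linkdb in segment position is also why the log above shows "adding segment: crawl/crawldb" and "adding segment: crawl/linkdb"):

    bin/nutch index -Dsolr.server.url=http://localhost:8983/solr/ \
        crawl/crawldb -linkdb crawl/linkdb -dir crawl/segments/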
Re: [Error Crawling Job Failed] NUTCH 1.9
Hi Markus,

When I try indexing into Solr with:

    crawl seed.txt crawl http://localhost:8983/solr/ -depth 3 -topN 5

and then query Solr at http://localhost:8983/solr/#/collection1/query, I get 0 records. Here is the log:

    2014-11-03 18:18:54,307 INFO  crawl.Injector - Injector: starting at 2014-11-03 18:18:54
    2014-11-03 18:18:54,308 INFO  crawl.Injector - Injector: crawlDb: crawl/crawldb
    2014-11-03 18:18:54,308 INFO  crawl.Injector - Injector: urlDir: seed
    2014-11-03 18:18:54,309 INFO  crawl.Injector - Injector: Converting injected urls to crawl db entries.
    2014-11-03 18:18:54,546 WARN  util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    2014-11-03 18:18:54,601 WARN  snappy.LoadSnappy - Snappy native library not loaded
    2014-11-03 18:18:55,119 INFO  regex.RegexURLNormalizer - can't find rules for scope 'inject', using default
    2014-11-03 18:18:55,821 INFO  crawl.Injector - Injector: Total number of urls rejected by filters: 0
    2014-11-03 18:18:55,821 INFO  crawl.Injector - Injector: Total number of urls after normalization: 1
    2014-11-03 18:18:55,822 INFO  crawl.Injector - Injector: Merging injected urls into crawl db.
    2014-11-03 18:18:56,057 INFO  crawl.Injector - Injector: overwrite: false
    2014-11-03 18:18:56,057 INFO  crawl.Injector - Injector: update: false
    2014-11-03 18:18:56,904 INFO  crawl.Injector - Injector: URLs merged: 1
    2014-11-03 18:18:56,913 INFO  crawl.Injector - Injector: Total new urls injected: 0
    2014-11-03 18:18:56,914 INFO  crawl.Injector - Injector: finished at 2014-11-03 18:18:56, elapsed: 00:00:02

These were my steps for the first crawl:

1. crawl seed.txt crawl -depth 3 -topN 5 > log.txt
2. crawl seed.txt crawl http://localhost:8983/solr/ -depth 3 -topN 5

Is that the correct sequence?

Reference: http://wiki.apache.org/nutch/NutchTutorial#a3.5._Using_the_crawl_script
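For what it's worth, the -depth and -topN flags belong to the old bin/nutch crawl command rather than to the bin/crawl shell script the referenced tutorial describes; the script takes positional arguments. A minimal sketch, assuming the 1.9 script signature crawl <seedDir> <crawlDir> <solrURL> <numberOfRounds> and a seed/ directory that contains seed.txt:

    bin/crawl seed/ crawl/ http://localhost:8983/solr/ 3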
Re: When to delete the segments?
I am only indexing the parsed data in Solr, so there is no way for me to know when to delete a segment in an automated fashion by considering the parsed data alone. However, I just realized that a _SUCCESS file is created within the segment once it is fetched. I will use that as an indicator to automate the deletion of the segment folders.

On Mon, Nov 3, 2014 at 12:56 AM, remi tassing tassingr...@gmail.com wrote:

    If you are able to determine what is done with the parsed data, then you could delete the segment as soon as that job is completed. As I mentioned earlier, if the data is to be pushed to Solr (e.g. with bin/nutch solrindex http://127.0.0.1:8983/solr crawl/crawldb $SEGMENT), then after indexing is done you can get rid of the segment.

On Mon, Nov 3, 2014 at 12:16 PM, Meraj A. Khan mera...@gmail.com wrote:

    Thanks. How do I definitively determine whether a segment has been completely parsed, if I were to set up an hourly crontab to delete the segments from HDFS? I have seen that the presence of the crawl_parse directory in the segments directory at least indicates that parsing has started, but I think that directory would be created as soon as parsing begins. So as not to delete a segment prematurely, while it is still being fetched, what should I be looking for in my script?

On Sun, Nov 2, 2014 at 7:58 PM, remi tassing tassingr...@gmail.com wrote:

    The next fetch time is computed after updatedb is issued with that segment. So as long as you don't need the parsed data anymore, you can delete the segment (e.g. after indexing through Solr...).

On Mon, Nov 3, 2014 at 8:41 AM, Meraj A. Khan mera...@gmail.com wrote:

    Hi All, I am deleting the segments as soon as they are fetched and parsed. I have read in previous posts that it is safe to delete a segment only if it is older than db.default.fetch.interval. My understanding is that one does not have to wait for the segment to be older than db.default.fetch.interval, but can delete it as soon as the segment is parsed. Is my understanding correct? I want to delete segments as soon as possible so as to save as much disk space as possible. Thanks.
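A minimal cleanup sketch along the lines described above, assuming segments live on HDFS under crawl/segments and that indexing and updatedb have already run for them. The paths, the hourly-crontab use case, and the reliance on the _SUCCESS marker follow this thread rather than any official guarantee, and the -ls output parsing may need adjusting for your Hadoop version:

    #!/bin/sh
    # Delete every segment that contains the _SUCCESS marker, i.e. whose
    # MapReduce output was fully written. Run this only after the segment
    # has been indexed (e.g. bin/nutch solrindex) and updatedb has been
    # issued with it, per the advice above.
    SEGMENTS=crawl/segments

    # 'hadoop fs -ls' prints the path in the last column; skip the
    # "Found N items" header by matching on the segments directory.
    for seg in $(hadoop fs -ls "$SEGMENTS" | awk '{print $NF}' | grep "$SEGMENTS/"); do
      if hadoop fs -test -e "$seg/_SUCCESS"; then
        echo "Deleting fully processed segment: $seg"
        hadoop fs -rm -r "$seg"   # on Hadoop 1.x: hadoop fs -rmr "$seg"
      fi
    done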