Hi Markus,

When I try to crawl and index into Solr with:

*crawl seed.txt crawl http://localhost:8983/solr/ -depth 3 -topN 5*
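
(If I am reading that tutorial section right, the 1.9 crawl script takes four positional arguments rather than the old -depth/-topN flags; assuming seed is the directory holding seed.txt and 3 is the number of rounds, I would expect something like:)

  bin/crawl seed crawl http://localhost:8983/solr/ 3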
and then query Solr at http://localhost:8983/solr/#/collection1/query, I get 0 records.

Here are the logs:

2014-11-03 18:18:54,307 INFO crawl.Injector - Injector: starting at 2014-11-03 18:18:54
2014-11-03 18:18:54,308 INFO crawl.Injector - Injector: crawlDb: crawl/crawldb
2014-11-03 18:18:54,308 INFO crawl.Injector - Injector: urlDir: seed
2014-11-03 18:18:54,309 INFO crawl.Injector - Injector: Converting injected urls to crawl db entries.
2014-11-03 18:18:54,546 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2014-11-03 18:18:54,601 WARN snappy.LoadSnappy - Snappy native library not loaded
2014-11-03 18:18:55,119 INFO regex.RegexURLNormalizer - can't find rules for scope 'inject', using default
2014-11-03 18:18:55,821 INFO crawl.Injector - Injector: Total number of urls rejected by filters: 0
2014-11-03 18:18:55,821 INFO crawl.Injector - Injector: Total number of urls after normalization: 1
2014-11-03 18:18:55,822 INFO crawl.Injector - Injector: Merging injected urls into crawl db.
2014-11-03 18:18:56,057 INFO crawl.Injector - Injector: overwrite: false
2014-11-03 18:18:56,057 INFO crawl.Injector - Injector: update: false
2014-11-03 18:18:56,904 INFO crawl.Injector - Injector: URLs merged: 1
2014-11-03 18:18:56,913 INFO crawl.Injector - Injector: Total new urls injected: 0
2014-11-03 18:18:56,914 INFO crawl.Injector - Injector: finished at 2014-11-03 18:18:56, elapsed: 00:00:02

Here are the steps for my first crawl:

1. crawl seed.txt crawl -depth 3 -topN 5 > log.txt
2. *crawl seed.txt crawl http://localhost:8983/solr/ -depth 3 -topN 5*

*Are those the correct steps?*

*Reference: http://wiki.apache.org/nutch/NutchTutorial#a3.5._Using_the_crawl_script*
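>
> Something like this, I assume, given the crawl/ layout in this thread
> and the 1.9 Indexer usage (crawldb first, optional -linkdb, then the
> segments):
>
>   bin/nutch index crawl/crawldb -linkdb crawl/linkdb -dir crawl/segments/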

On Mon, Nov 3, 2014 at 6:05 PM, Markus Jelsma <[email protected]> wrote:

> Oh - if you need to index multiple segments, don't use segments/* but
> -dir segments/
>
>
> -----Original message-----
> > From:Muhamad Muchlis <[email protected]>
> > Sent: Monday 3rd November 2014 12:00
> > To: [email protected]
> > Subject: Re: [Error Crawling Job Failed] NUTCH 1.9
> >
> > Hi Markus,
> >
> > When I run this command:
> >
> > *nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/**
> >
> > I got an error; here is the log:
> >
> > 2014-11-03 17:55:04,602 INFO indexer.IndexingJob - Indexer: starting at 2014-11-03 17:55:04
> > 2014-11-03 17:55:04,652 INFO indexer.IndexingJob - Indexer: deleting gone documents: false
> > 2014-11-03 17:55:04,652 INFO indexer.IndexingJob - Indexer: URL filtering: false
> > 2014-11-03 17:55:04,652 INFO indexer.IndexingJob - Indexer: URL normalizing: false
> > 2014-11-03 17:55:04,860 INFO indexer.IndexWriters - Adding org.apache.nutch.indexwriter.solr.SolrIndexWriter
> > 2014-11-03 17:55:04,861 INFO indexer.IndexingJob - Active IndexWriters :
> > SOLRIndexWriter
> > solr.server.url : URL of the SOLR instance (mandatory)
> > solr.commit.size : buffer size when sending to SOLR (default 1000)
> > solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
> > solr.auth : use authentication (default false)
> > solr.auth.username : use authentication (default false)
> > solr.auth : username for authentication
> > solr.auth.password : password for authentication
> >
> > 2014-11-03 17:55:04,865 INFO indexer.IndexerMapReduce - IndexerMapReduce: crawldb: crawl/indexes
> > 2014-11-03 17:55:04,865 INFO indexer.IndexerMapReduce - IndexerMapReduces: adding segment: crawl/crawldb
> > 2014-11-03 17:55:04,978 INFO indexer.IndexerMapReduce - IndexerMapReduces: adding segment: crawl/linkdb
> > 2014-11-03 17:55:04,979 INFO indexer.IndexerMapReduce - IndexerMapReduces: adding segment: crawl/segments/20141103163424
> > 2014-11-03 17:55:04,980 INFO indexer.IndexerMapReduce - IndexerMapReduces: adding segment: crawl/segments/20141103175027
> > 2014-11-03 17:55:04,981 INFO indexer.IndexerMapReduce - IndexerMapReduces: adding segment: crawl/segments/20141103175109
> > 2014-11-03 17:55:05,033 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> > 2014-11-03 17:55:05,110 ERROR security.UserGroupInformation - PriviledgedActionException as:me cause:org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/crawldb/crawl_fetch
> > Input path does not exist: file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/crawldb/crawl_parse
> > Input path does not exist: file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/crawldb/parse_data
> > Input path does not exist: file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/crawldb/parse_text
> > Input path does not exist: file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/linkdb/crawl_fetch
> > Input path does not exist: file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/linkdb/crawl_parse
> > Input path does not exist: file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/linkdb/parse_data
> > Input path does not exist: file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/linkdb/parse_text
> > Input path does not exist: file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/segments/20141103163424/crawl_parse
> > Input path does not exist: file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/segments/20141103163424/parse_data
> > Input path does not exist: file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/segments/20141103163424/parse_text
> > Input path does not exist: file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/indexes/current
> > 2014-11-03 17:55:05,112 ERROR indexer.IndexingJob - Indexer: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/crawldb/crawl_fetch
> > Input path does not exist: file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/crawldb/crawl_parse
> > Input path does not exist: file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/crawldb/parse_data
> > Input path does not exist: file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/crawldb/parse_text
> > Input path does not exist: file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/linkdb/crawl_fetch
> > Input path does not exist: file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/linkdb/crawl_parse
> > Input path does not exist: file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/linkdb/parse_data
> > Input path does not exist: file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/linkdb/parse_text
> > Input path does not exist: file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/segments/20141103163424/crawl_parse
> > Input path does not exist: file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/segments/20141103163424/parse_data
> > Input path does not exist: file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/segments/20141103163424/parse_text
> > Input path does not exist: file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/indexes/current
> > at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:197)
> > at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:40)
> > at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:208)
> > at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:1081)
> > at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:1073)
> > at org.apache.hadoop.mapred.JobClient.access$700(JobClient.java:179)
> > at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:983)
> > at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:936)
> > at java.security.AccessController.doPrivileged(Native Method)
> > at javax.security.auth.Subject.doAs(Subject.java:422)
> > at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
> > at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:936)
> > at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:910)
> > at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1353)
> > at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:114)
> > at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:176)
> > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> > at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:186)
> >
> > Please advise.
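> >
> > (For reference: I believe the indexer expects each segment directory to
> > contain crawl_fetch, crawl_parse, parse_data and parse_text, which would
> > explain why passing crawldb and linkdb in segment positions fails, e.g.:)
> >
> >   $ ls crawl/segments/20141103163424/
> >   content  crawl_fetch  crawl_generate  crawl_parse  parse_data  parse_text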
> >
> > On Mon, Nov 3, 2014 at 5:47 PM, Muhamad Muchlis <[email protected]> wrote:
> >
> > > Like this?
> > >
> > > <?xml version="1.0"?>
> > > <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
> > >
> > > <!-- Put site-specific property overrides in this file. -->
> > >
> > > <configuration>
> > >
> > > <property>
> > >   <name>http.agent.name</name>
> > >   <value>My Nutch Spider</value>
> > > </property>
> > >
> > > *<property>*
> > > *  <name>solr.server.url</name>*
> > > *  <value>http://localhost:8983/solr/</value>*
> > > *</property>*
> > >
> > > </configuration>
> > >
> > >
> > > On Mon, Nov 3, 2014 at 5:41 PM, Markus Jelsma <[email protected]> wrote:
> > >
> > >> You can set solr.server.url in your nutch-site.xml or pass it via
> > >> command line as -Dsolr.server.url=<URL>
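> > >>
> > >> For example, assuming the local Solr URL from earlier in this thread:
> > >>
> > >>   bin/nutch index -Dsolr.server.url=http://localhost:8983/solr/ crawl/crawldb -linkdb crawl/linkdb -dir crawl/segments/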
> > >>
> > >> -----Original message-----
> > >> > From:Muhamad Muchlis <[email protected]>
> > >> > Sent: Monday 3rd November 2014 11:37
> > >> > To: [email protected]
> > >> > Subject: Re: [Error Crawling Job Failed] NUTCH 1.9
> > >> >
> > >> > Hi Markus,
> > >> >
> > >> > Where can I find the setting for the Solr URL? -D
> > >> >
> > >> > On Mon, Nov 3, 2014 at 5:31 PM, Markus Jelsma <[email protected]> wrote:
> > >> >
> > >> > > Well, here it is:
> > >> > > java.lang.RuntimeException: Missing SOLR URL. Should be set via
> > >> > > -Dsolr.server.url
> > >> > >
> > >> > > -----Original message-----
> > >> > > > From:Muhamad Muchlis <[email protected]>
> > >> > > > Sent: Monday 3rd November 2014 10:58
> > >> > > > To: [email protected]
> > >> > > > Subject: Re: [Error Crawling Job Failed] NUTCH 1.9
> > >> > > >
> > >> > > > 2014-11-03 16:56:06,530 INFO indexer.IndexingJob - Indexer: starting at 2014-11-03 16:56:06
> > >> > > > 2014-11-03 16:56:06,582 INFO indexer.IndexingJob - Indexer: deleting gone documents: false
> > >> > > > 2014-11-03 16:56:06,582 INFO indexer.IndexingJob - Indexer: URL filtering: false
> > >> > > > 2014-11-03 16:56:06,582 INFO indexer.IndexingJob - Indexer: URL normalizing: false
> > >> > > > 2014-11-03 16:56:06,800 ERROR solr.SolrIndexWriter - Missing SOLR URL. Should be set via -D solr.server.url
> > >> > > > SOLRIndexWriter
> > >> > > > solr.server.url : URL of the SOLR instance (mandatory)
> > >> > > > solr.commit.size : buffer size when sending to SOLR (default 1000)
> > >> > > > solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
> > >> > > > solr.auth : use authentication (default false)
> > >> > > > solr.auth.username : use authentication (default false)
> > >> > > > solr.auth : username for authentication
> > >> > > > solr.auth.password : password for authentication
> > >> > > >
> > >> > > > 2014-11-03 16:56:06,802 ERROR indexer.IndexingJob - Indexer: java.lang.RuntimeException: Missing SOLR URL. Should be set via -D solr.server.url
> > >> > > > SOLRIndexWriter
> > >> > > > solr.server.url : URL of the SOLR instance (mandatory)
> > >> > > > solr.commit.size : buffer size when sending to SOLR (default 1000)
> > >> > > > solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
> > >> > > > solr.auth : use authentication (default false)
> > >> > > > solr.auth.username : use authentication (default false)
> > >> > > > solr.auth : username for authentication
> > >> > > > solr.auth.password : password for authentication
> > >> > > >
> > >> > > > at org.apache.nutch.indexwriter.solr.SolrIndexWriter.setConf(SolrIndexWriter.java:192)
> > >> > > > at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:159)
> > >> > > > at org.apache.nutch.indexer.IndexWriters.<init>(IndexWriters.java:57)
> > >> > > > at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:91)
> > >> > > > at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:176)
> > >> > > > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> > >> > > > at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:186)
> > >> > > >
> > >> > > > On Mon, Nov 3, 2014 at 3:41 PM, Markus Jelsma <[email protected]> wrote:
> > >> > > >
> > >> > > > > Hi - see the logs for more details.
> > >> > > > > Markus
> > >> > > > >
> > >> > > > > -----Original message-----
> > >> > > > > > From:Muhamad Muchlis <[email protected]>
> > >> > > > > > Sent: Monday 3rd November 2014 9:15
> > >> > > > > > To: [email protected]
> > >> > > > > > Subject: [Error Crawling Job Failed] NUTCH 1.9
> > >> > > > > >
> > >> > > > > > Hello.
> > >> > > > > >
> > >> > > > > > I get an error message when I run the command:
> > >> > > > > >
> > >> > > > > > *crawl seed/seed.txt crawl -depth 3 -topN 5*
> > >> > > > > >
> > >> > > > > > Error Message:
> > >> > > > > >
> > >> > > > > > SOLRIndexWriter
> > >> > > > > > solr.server.url : URL of the SOLR instance (mandatory)
> > >> > > > > > solr.commit.size : buffer size when sending to SOLR (default 1000)
> > >> > > > > > solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
> > >> > > > > > solr.auth : use authentication (default false)
> > >> > > > > > solr.auth.username : use authentication (default false)
> > >> > > > > > solr.auth : username for authentication
> > >> > > > > > solr.auth.password : password for authentication
> > >> > > > > >
> > >> > > > > > Indexer: java.io.IOException: Job failed!
> > >> > > > > > at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
> > >> > > > > > at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:114)
> > >> > > > > > at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:176)
> > >> > > > > > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> > >> > > > > > at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:186)
> > >> > > > > >
> > >> > > > > > Can anyone explain why this happened?
> > >> > > > > >
> > >> > > > > > Best regards,
> > >> > > > > >
> > >> > > > > > M.Muchlis

