Hi Markus,

When I run the crawl with Solr indexing:

crawl seed.txt crawl http://localhost:8983/solr/ -depth 3 -topN 5

and then query Solr at http://localhost:8983/solr/#/collection1/query, it returns 0 records.


Here are the logs:

2014-11-03 18:18:54,307 INFO  crawl.Injector - Injector: starting at
2014-11-03 18:18:54
2014-11-03 18:18:54,308 INFO  crawl.Injector - Injector: crawlDb:
crawl/crawldb
2014-11-03 18:18:54,308 INFO  crawl.Injector - Injector: urlDir: seed
2014-11-03 18:18:54,309 INFO  crawl.Injector - Injector: Converting
injected urls to crawl db entries.
2014-11-03 18:18:54,546 WARN  util.NativeCodeLoader - Unable to load
native-hadoop library for your platform... using builtin-java classes where
applicable
2014-11-03 18:18:54,601 WARN  snappy.LoadSnappy - Snappy native library not
loaded
2014-11-03 18:18:55,119 INFO  regex.RegexURLNormalizer - can't find rules
for scope 'inject', using default
2014-11-03 18:18:55,821 INFO  crawl.Injector - Injector: Total number of
urls rejected by filters: 0
2014-11-03 18:18:55,821 INFO  crawl.Injector - Injector: Total number of
urls after normalization: 1
2014-11-03 18:18:55,822 INFO  crawl.Injector - Injector: Merging injected
urls into crawl db.
2014-11-03 18:18:56,057 INFO  crawl.Injector - Injector: overwrite: false
2014-11-03 18:18:56,057 INFO  crawl.Injector - Injector: update: false
2014-11-03 18:18:56,904 INFO  crawl.Injector - Injector: URLs merged: 1
2014-11-03 18:18:56,913 INFO  crawl.Injector - Injector: Total new urls
injected: 0
2014-11-03 18:18:56,914 INFO  crawl.Injector - Injector: finished at
2014-11-03 18:18:56, elapsed: 00:00:02
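A note on the log above: "URLs merged: 1" together with "Total new urls injected: 0" suggests the seed URL was already present in the crawldb from an earlier run, so the injector had nothing new to add. One way to check the crawldb state is the readdb tool; a minimal sketch, assuming a standard Nutch install with bin/nutch on the path and the crawldb at crawl/crawldb:

```shell
# Build the readdb invocation for inspecting the crawldb.
# Assumption: bin/nutch exists and the crawldb lives at crawl/crawldb.
CRAWLDB="crawl/crawldb"
READDB_CMD="bin/nutch readdb $CRAWLDB -stats"   # prints URL counts by fetch status
echo "$READDB_CMD"
```

Running the echoed command should show how many URLs are db_unfetched vs. db_fetched, which tells you whether the fetch/parse rounds actually produced anything to index.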


Here are the steps for my first crawl:

1. crawl seed.txt crawl -depth 3 -topN 5 > log.txt
2. crawl seed.txt crawl http://localhost:8983/solr/ -depth 3 -topN 5

Are those the correct steps?

Reference: http://wiki.apache.org/nutch/NutchTutorial#a3.5._Using_the_crawl_script
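For comparison, the crawl script in Nutch 1.9 takes positional arguments rather than -depth/-topN (those flags belonged to the older, since-removed `nutch crawl` command). A hedged sketch of the invocation, assuming the 1.9 bin/crawl signature of `<seedDir> <crawlDir> <solrURL> <numberOfRounds>`:

```shell
# Sketch of a Nutch 1.9 bin/crawl invocation (assumption: the 1.9 script
# takes its arguments positionally and no longer understands -depth/-topN).
SEED_DIR="seed"                         # directory holding seed.txt
CRAWL_DIR="crawl"                       # crawldb/linkdb/segments live here
SOLR_URL="http://localhost:8983/solr/"  # Solr core to index into
ROUNDS=3                                # crawl iterations, roughly the old -depth

CRAWL_CMD="bin/crawl $SEED_DIR $CRAWL_DIR $SOLR_URL $ROUNDS"
echo "$CRAWL_CMD"
```

When the Solr URL is passed this way, the script drives inject, generate, fetch, parse, updatedb, and the index step itself, so a separate `nutch index` run should not be needed.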




On Mon, Nov 3, 2014 at 6:05 PM, Markus Jelsma <[email protected]>
wrote:

> Oh - if you need to index multiple segments, don't use segments/* but -dir
> segments/
>
>
> -----Original message-----
> > From:Muhamad Muchlis <[email protected]>
> > Sent: Monday 3rd November 2014 12:00
> > To: [email protected]
> > Subject: Re: [Error Crawling Job Failed] NUTCH 1.9
> >
> > Hi Markus,
> >
> > When i run this command :
> >
> > nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*
> >
> >
> >
> > I got an error here is the log :
> >
> > 2014-11-03 17:55:04,602 INFO  indexer.IndexingJob - Indexer: starting at
> > 2014-11-03 17:55:04
> > 2014-11-03 17:55:04,652 INFO  indexer.IndexingJob - Indexer: deleting
> gone
> > documents: false
> > 2014-11-03 17:55:04,652 INFO  indexer.IndexingJob - Indexer: URL
> filtering:
> > false
> > 2014-11-03 17:55:04,652 INFO  indexer.IndexingJob - Indexer: URL
> > normalizing: false
> > 2014-11-03 17:55:04,860 INFO  indexer.IndexWriters - Adding
> > org.apache.nutch.indexwriter.solr.SolrIndexWriter
> > 2014-11-03 17:55:04,861 INFO  indexer.IndexingJob - Active IndexWriters :
> > SOLRIndexWriter
> > solr.server.url : URL of the SOLR instance (mandatory)
> > solr.commit.size : buffer size when sending to SOLR (default 1000)
> > solr.mapping.file : name of the mapping file for fields (default
> > solrindex-mapping.xml)
> > solr.auth : use authentication (default false)
> > solr.auth.username : use authentication (default false)
> > solr.auth : username for authentication
> > solr.auth.password : password for authentication
> >
> >
> > 2014-11-03 17:55:04,865 INFO  indexer.IndexerMapReduce -
> IndexerMapReduce:
> > crawldb: crawl/indexes
> > 2014-11-03 17:55:04,865 INFO  indexer.IndexerMapReduce -
> IndexerMapReduces:
> > adding segment: crawl/crawldb
> > 2014-11-03 17:55:04,978 INFO  indexer.IndexerMapReduce -
> IndexerMapReduces:
> > adding segment: crawl/linkdb
> > 2014-11-03 17:55:04,979 INFO  indexer.IndexerMapReduce -
> IndexerMapReduces:
> > adding segment: crawl/segments/20141103163424
> > 2014-11-03 17:55:04,980 INFO  indexer.IndexerMapReduce -
> IndexerMapReduces:
> > adding segment: crawl/segments/20141103175027
> > 2014-11-03 17:55:04,981 INFO  indexer.IndexerMapReduce -
> IndexerMapReduces:
> > adding segment: crawl/segments/20141103175109
> > 2014-11-03 17:55:05,033 WARN  util.NativeCodeLoader - Unable to load
> > native-hadoop library for your platform... using builtin-java classes
> where
> > applicable
> > 2014-11-03 17:55:05,110 ERROR security.UserGroupInformation -
> > PriviledgedActionException as:me
> > cause:org.apache.hadoop.mapred.InvalidInputException: Input path does not
> > exist:
> >
> file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/crawldb/crawl_fetch
> > Input path does not exist:
> >
> file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/crawldb/crawl_parse
> > Input path does not exist:
> >
> file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/crawldb/parse_data
> > Input path does not exist:
> >
> file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/crawldb/parse_text
> > Input path does not exist:
> >
> file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/linkdb/crawl_fetch
> > Input path does not exist:
> >
> file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/linkdb/crawl_parse
> > Input path does not exist:
> >
> file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/linkdb/parse_data
> > Input path does not exist:
> >
> file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/linkdb/parse_text
> > Input path does not exist:
> >
> file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/segments/20141103163424/crawl_parse
> > Input path does not exist:
> >
> file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/segments/20141103163424/parse_data
> > Input path does not exist:
> >
> file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/segments/20141103163424/parse_text
> > Input path does not exist:
> >
> file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/indexes/current
> > 2014-11-03 17:55:05,112 ERROR indexer.IndexingJob - Indexer:
> > org.apache.hadoop.mapred.InvalidInputException: Input path does not
> exist:
> >
> file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/crawldb/crawl_fetch
> > Input path does not exist:
> >
> file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/crawldb/crawl_parse
> > Input path does not exist:
> >
> file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/crawldb/parse_data
> > Input path does not exist:
> >
> file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/crawldb/parse_text
> > Input path does not exist:
> >
> file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/linkdb/crawl_fetch
> > Input path does not exist:
> >
> file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/linkdb/crawl_parse
> > Input path does not exist:
> >
> file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/linkdb/parse_data
> > Input path does not exist:
> >
> file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/linkdb/parse_text
> > Input path does not exist:
> >
> file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/segments/20141103163424/crawl_parse
> > Input path does not exist:
> >
> file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/segments/20141103163424/parse_data
> > Input path does not exist:
> >
> file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/segments/20141103163424/parse_text
> > Input path does not exist:
> >
> file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/indexes/current
> > at
> >
> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:197)
> > at
> >
> org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:40)
> > at
> >
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:208)
> > at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:1081)
> > at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:1073)
> > at org.apache.hadoop.mapred.JobClient.access$700(JobClient.java:179)
> > at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:983)
> > at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:936)
> > at java.security.AccessController.doPrivileged(Native Method)
> > at javax.security.auth.Subject.doAs(Subject.java:422)
> > at
> >
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
> > at
> org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:936)
> > at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:910)
> > at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1353)
> > at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:114)
> > at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:176)
> > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> > at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:186)
> >
> > Advice me please..
> >
> >
> > On Mon, Nov 3, 2014 at 5:47 PM, Muhamad Muchlis <[email protected]>
> wrote:
> >
> > > Like this ?
> > >
> > > <?xml version="1.0"?>
> > > <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
> > >
> > > <!-- Put site-specific property overrides in this file. -->
> > >
> > > <configuration>
> > >
> > > <property>
> > >  <name>http.agent.name</name>
> > >  <value>My Nutch Spider</value>
> > > </property>
> > >
> > > <property>
> > >  <name>solr.server.url</name>
> > >  <value>http://localhost:8983/solr/</value>
> > > </property>
> > >
> > >
> > > </configuration>
> > >
> > >
> > > On Mon, Nov 3, 2014 at 5:41 PM, Markus Jelsma <
> [email protected]>
> > > wrote:
> > >
> > >> You can set solr.server.url in your nutch-site.xml or pass it via
> command
> > >> line as -Dsolr.server.url=<URL>
> > >>
> > >>
> > >>
> > >> -----Original message-----
> > >> > From:Muhamad Muchlis <[email protected]>
> > >> > Sent: Monday 3rd November 2014 11:37
> > >> > To: [email protected]
> > >> > Subject: Re: [Error Crawling Job Failed] NUTCH 1.9
> > >> >
> > >> > Hi Markus,
> > >> >
> > >> > Where can I find the settings solr url?  -D
> > >> >
> > >> > On Mon, Nov 3, 2014 at 5:31 PM, Markus Jelsma <
> > >> [email protected]>
> > >> > wrote:
> > >> >
> > >> > > Well, here is is:
> > >> > > java.lang.RuntimeException: Missing SOLR URL. Should be set via
> > >> > > -Dsolr.server.url
> > >> > >
> > >> > >
> > >> > >
> > >> > > -----Original message-----
> > >> > > > From:Muhamad Muchlis <[email protected]>
> > >> > > > Sent: Monday 3rd November 2014 10:58
> > >> > > > To: [email protected]
> > >> > > > Subject: Re: [Error Crawling Job Failed] NUTCH 1.9
> > >> > > >
> > >> > > > 2014-11-03 16:56:06,530 INFO  indexer.IndexingJob - Indexer:
> > >> starting at
> > >> > > > 2014-11-03 16:56:06
> > >> > > > 2014-11-03 16:56:06,582 INFO  indexer.IndexingJob - Indexer:
> > >> deleting
> > >> > > gone
> > >> > > > documents: false
> > >> > > > 2014-11-03 16:56:06,582 INFO  indexer.IndexingJob - Indexer: URL
> > >> > > filtering:
> > >> > > > false
> > >> > > > 2014-11-03 16:56:06,582 INFO  indexer.IndexingJob - Indexer: URL
> > >> > > > normalizing: false
> > >> > > > 2014-11-03 16:56:06,800 ERROR solr.SolrIndexWriter - Missing
> SOLR
> > >> URL.
> > >> > > > Should be set via -D solr.server.url
> > >> > > > SOLRIndexWriter
> > >> > > > solr.server.url : URL of the SOLR instance (mandatory)
> > >> > > > solr.commit.size : buffer size when sending to SOLR (default
> 1000)
> > >> > > > solr.mapping.file : name of the mapping file for fields (default
> > >> > > > solrindex-mapping.xml)
> > >> > > > solr.auth : use authentication (default false)
> > >> > > > solr.auth.username : use authentication (default false)
> > >> > > > solr.auth : username for authentication
> > >> > > > solr.auth.password : password for authentication
> > >> > > >
> > >> > > > 2014-11-03 16:56:06,802 ERROR indexer.IndexingJob - Indexer:
> > >> > > > java.lang.RuntimeException: Missing SOLR URL. Should be set via
> -D
> > >> > > > solr.server.url
> > >> > > > SOLRIndexWriter
> > >> > > > solr.server.url : URL of the SOLR instance (mandatory)
> > >> > > > solr.commit.size : buffer size when sending to SOLR (default
> 1000)
> > >> > > > solr.mapping.file : name of the mapping file for fields (default
> > >> > > > solrindex-mapping.xml)
> > >> > > > solr.auth : use authentication (default false)
> > >> > > > solr.auth.username : use authentication (default false)
> > >> > > > solr.auth : username for authentication
> > >> > > > solr.auth.password : password for authentication
> > >> > > >
> > >> > > > at
> > >> > > >
> > >> > >
> > >>
> org.apache.nutch.indexwriter.solr.SolrIndexWriter.setConf(SolrIndexWriter.java:192)
> > >> > > > at
> > >> > > >
> > >> > >
> > >>
> org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:159)
> > >> > > > at
> > >> org.apache.nutch.indexer.IndexWriters.<init>(IndexWriters.java:57)
> > >> > > > at
> org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:91)
> > >> > > > at
> org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:176)
> > >> > > > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> > >> > > > at
> org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:186)
> > >> > > >
> > >> > > >
> > >> > > > On Mon, Nov 3, 2014 at 3:41 PM, Markus Jelsma <
> > >> > > [email protected]>
> > >> > > > wrote:
> > >> > > >
> > >> > > > > Hi - see the logs for more details.
> > >> > > > > Markus
> > >> > > > >
> > >> > > > > -----Original message-----
> > >> > > > > > From:Muhamad Muchlis <[email protected]>
> > >> > > > > > Sent: Monday 3rd November 2014 9:15
> > >> > > > > > To: [email protected]
> > >> > > > > > Subject: [Error Crawling Job Failed] NUTCH 1.9
> > >> > > > > >
> > >> > > > > > Hello.
> > >> > > > > >
> > >> > > > > > I get an error message when I run the command:
> > >> > > > > >
> > >> > > > > > *crawl seed/seed.txt crawl -depth 3 -topN 5*
> > >> > > > > >
> > >> > > > > >
> > >> > > > > > Error Message :
> > >> > > > > >
> > >> > > > > > SOLRIndexWriter
> > >> > > > > > solr.server.url : URL of the SOLR instance (mandatory)
> > >> > > > > > solr.commit.size : buffer size when sending to SOLR (default
> > >> 1000)
> > >> > > > > > solr.mapping.file : name of the mapping file for fields
> (default
> > >> > > > > > solrindex-mapping.xml)
> > >> > > > > > solr.auth : use authentication (default false)
> > >> > > > > > solr.auth.username : use authentication (default false)
> > >> > > > > > solr.auth : username for authentication
> > >> > > > > > solr.auth.password : password for authentication
> > >> > > > > >
> > >> > > > > >
> > >> > > > > > Indexer: java.io.IOException: Job failed!
> > >> > > > > > at
> > >> org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
> > >> > > > > > at
> > >> org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:114)
> > >> > > > > > at
> > >> org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:176)
> > >> > > > > > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> > >> > > > > > at
> > >> org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:186)
> > >> > > > > >
> > >> > > > > >
> > >> > > > > > Can anyone explain why this happened ?
> > >> > > > > >
> > >> > > > > >
> > >> > > > > >
> > >> > > > > >
> > >> > > > > >
> > >> > > > > > Best regard's
> > >> > > > > >
> > >> > > > > > M.Muchlis
> > >> > > > > >
> > >> > > > >
> > >> > > >
> > >> > >
> > >>
> > >
> > >
> >
>
