[Error Crawling Job Failed] NUTCH 1.9

2014-11-03 Thread Muhamad Muchlis
Hello.

I get an error message when I run the command:

*crawl seed/seed.txt crawl -depth 3 -topN 5*


Error Message :

SOLRIndexWriter
solr.server.url : URL of the SOLR instance (mandatory)
solr.commit.size : buffer size when sending to SOLR (default 1000)
solr.mapping.file : name of the mapping file for fields (default
solrindex-mapping.xml)
solr.auth : use authentication (default false)
solr.auth.username : use authentication (default false)
solr.auth : username for authentication
solr.auth.password : password for authentication


Indexer: java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:114)
at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:176)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:186)


Can anyone explain why this happened ?





Best regards,

M.Muchlis


RE: [Error Crawling Job Failed] NUTCH 1.9

2014-11-03 Thread Markus Jelsma
Hi - see the logs for more details.
Markus
 
-Original message-
 From:Muhamad Muchlis tru3@gmail.com
 Sent: Monday 3rd November 2014 9:15
 To: user@nutch.apache.org
 Subject: [Error Crawling Job Failed] NUTCH 1.9
 
 Hello.
 
 I get an error message when I run the command:
 
 *crawl seed/seed.txt crawl -depth 3 -topN 5*
 
 
 Error Message :
 
 SOLRIndexWriter
 solr.server.url : URL of the SOLR instance (mandatory)
 solr.commit.size : buffer size when sending to SOLR (default 1000)
 solr.mapping.file : name of the mapping file for fields (default
 solrindex-mapping.xml)
 solr.auth : use authentication (default false)
 solr.auth.username : use authentication (default false)
 solr.auth : username for authentication
 solr.auth.password : password for authentication
 
 
 Indexer: java.io.IOException: Job failed!
 at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
 at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:114)
 at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:176)
 at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
 at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:186)
 
 
 Can anyone explain why this happened ?
 
 
 
 
 
 Best regard's
 
 M.Muchlis
 


Re: [Error Crawling Job Failed] NUTCH 1.9

2014-11-03 Thread Muhamad Muchlis
2014-11-03 16:56:06,530 INFO  indexer.IndexingJob - Indexer: starting at
2014-11-03 16:56:06
2014-11-03 16:56:06,582 INFO  indexer.IndexingJob - Indexer: deleting gone
documents: false
2014-11-03 16:56:06,582 INFO  indexer.IndexingJob - Indexer: URL filtering:
false
2014-11-03 16:56:06,582 INFO  indexer.IndexingJob - Indexer: URL
normalizing: false
2014-11-03 16:56:06,800 ERROR solr.SolrIndexWriter - Missing SOLR URL.
Should be set via -D solr.server.url
SOLRIndexWriter
solr.server.url : URL of the SOLR instance (mandatory)
solr.commit.size : buffer size when sending to SOLR (default 1000)
solr.mapping.file : name of the mapping file for fields (default
solrindex-mapping.xml)
solr.auth : use authentication (default false)
solr.auth.username : use authentication (default false)
solr.auth : username for authentication
solr.auth.password : password for authentication

2014-11-03 16:56:06,802 ERROR indexer.IndexingJob - Indexer:
java.lang.RuntimeException: Missing SOLR URL. Should be set via -D
solr.server.url
SOLRIndexWriter
solr.server.url : URL of the SOLR instance (mandatory)
solr.commit.size : buffer size when sending to SOLR (default 1000)
solr.mapping.file : name of the mapping file for fields (default
solrindex-mapping.xml)
solr.auth : use authentication (default false)
solr.auth.username : use authentication (default false)
solr.auth : username for authentication
solr.auth.password : password for authentication

at
org.apache.nutch.indexwriter.solr.SolrIndexWriter.setConf(SolrIndexWriter.java:192)
at
org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:159)
at org.apache.nutch.indexer.IndexWriters.init(IndexWriters.java:57)
at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:91)
at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:176)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:186)


On Mon, Nov 3, 2014 at 3:41 PM, Markus Jelsma markus.jel...@openindex.io
wrote:

 Hi - see the logs for more details.
 Markus

 -Original message-
  From:Muhamad Muchlis tru3@gmail.com
  Sent: Monday 3rd November 2014 9:15
  To: user@nutch.apache.org
  Subject: [Error Crawling Job Failed] NUTCH 1.9
 
  Hello.
 
  I get an error message when I run the command:
 
  *crawl seed/seed.txt crawl -depth 3 -topN 5*
 
 
  Error Message :
 
  SOLRIndexWriter
  solr.server.url : URL of the SOLR instance (mandatory)
  solr.commit.size : buffer size when sending to SOLR (default 1000)
  solr.mapping.file : name of the mapping file for fields (default
  solrindex-mapping.xml)
  solr.auth : use authentication (default false)
  solr.auth.username : use authentication (default false)
  solr.auth : username for authentication
  solr.auth.password : password for authentication
 
 
  Indexer: java.io.IOException: Job failed!
  at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
  at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:114)
  at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:176)
  at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
  at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:186)
 
 
  Can anyone explain why this happened ?
 
 
 
 
 
  Best regard's
 
  M.Muchlis
 



RE: [Error Crawling Job Failed] NUTCH 1.9

2014-11-03 Thread Markus Jelsma
Well, here it is:
java.lang.RuntimeException: Missing SOLR URL. Should be set via 
-Dsolr.server.url
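
For reference, a minimal sketch of passing the property on the command line when running the indexing step directly, as the error message suggests; the Solr URL and crawl paths are placeholders, and the exact Indexer arguments should be confirmed by running bin/nutch index without arguments on 1.9:

  # assumes a local Solr core on the default port and a crawl directory named crawl/
  bin/nutch index -Dsolr.server.url=http://localhost:8983/solr/ \
    crawl/crawldb -linkdb crawl/linkdb -dir crawl/segments/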

 
 
-Original message-
 From:Muhamad Muchlis tru3@gmail.com
 Sent: Monday 3rd November 2014 10:58
 To: user@nutch.apache.org
 Subject: Re: [Error Crawling Job Failed] NUTCH 1.9
 
 2014-11-03 16:56:06,530 INFO  indexer.IndexingJob - Indexer: starting at
 2014-11-03 16:56:06
 2014-11-03 16:56:06,582 INFO  indexer.IndexingJob - Indexer: deleting gone
 documents: false
 2014-11-03 16:56:06,582 INFO  indexer.IndexingJob - Indexer: URL filtering:
 false
 2014-11-03 16:56:06,582 INFO  indexer.IndexingJob - Indexer: URL
 normalizing: false
 2014-11-03 16:56:06,800 ERROR solr.SolrIndexWriter - Missing SOLR URL.
 Should be set via -D solr.server.url
 SOLRIndexWriter
 solr.server.url : URL of the SOLR instance (mandatory)
 solr.commit.size : buffer size when sending to SOLR (default 1000)
 solr.mapping.file : name of the mapping file for fields (default
 solrindex-mapping.xml)
 solr.auth : use authentication (default false)
 solr.auth.username : use authentication (default false)
 solr.auth : username for authentication
 solr.auth.password : password for authentication
 
 2014-11-03 16:56:06,802 ERROR indexer.IndexingJob - Indexer:
 java.lang.RuntimeException: Missing SOLR URL. Should be set via -D
 solr.server.url
 SOLRIndexWriter
 solr.server.url : URL of the SOLR instance (mandatory)
 solr.commit.size : buffer size when sending to SOLR (default 1000)
 solr.mapping.file : name of the mapping file for fields (default
 solrindex-mapping.xml)
 solr.auth : use authentication (default false)
 solr.auth.username : use authentication (default false)
 solr.auth : username for authentication
 solr.auth.password : password for authentication
 
 at
 org.apache.nutch.indexwriter.solr.SolrIndexWriter.setConf(SolrIndexWriter.java:192)
 at
 org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:159)
 at org.apache.nutch.indexer.IndexWriters.init(IndexWriters.java:57)
 at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:91)
 at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:176)
 at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
 at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:186)
 
 
 On Mon, Nov 3, 2014 at 3:41 PM, Markus Jelsma markus.jel...@openindex.io
 wrote:
 
  Hi - see the logs for more details.
  Markus
 
  -Original message-
   From:Muhamad Muchlis tru3@gmail.com
   Sent: Monday 3rd November 2014 9:15
   To: user@nutch.apache.org
   Subject: [Error Crawling Job Failed] NUTCH 1.9
  
   Hello.
  
   I get an error message when I run the command:
  
   *crawl seed/seed.txt crawl -depth 3 -topN 5*
  
  
   Error Message :
  
   SOLRIndexWriter
   solr.server.url : URL of the SOLR instance (mandatory)
   solr.commit.size : buffer size when sending to SOLR (default 1000)
   solr.mapping.file : name of the mapping file for fields (default
   solrindex-mapping.xml)
   solr.auth : use authentication (default false)
   solr.auth.username : use authentication (default false)
   solr.auth : username for authentication
   solr.auth.password : password for authentication
  
  
   Indexer: java.io.IOException: Job failed!
   at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
   at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:114)
   at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:176)
   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
   at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:186)
  
  
   Can anyone explain why this happened ?
  
  
  
  
  
   Best regard's
  
   M.Muchlis
  
 
 


Re: [Error Crawling Job Failed] NUTCH 1.9

2014-11-03 Thread Muhamad Muchlis
Hi Markus,

Where can I find the setting for the Solr URL (-Dsolr.server.url)?

On Mon, Nov 3, 2014 at 5:31 PM, Markus Jelsma markus.jel...@openindex.io
wrote:

 Well, here is is:
 java.lang.RuntimeException: Missing SOLR URL. Should be set via
 -Dsolr.server.url



 -Original message-
  From:Muhamad Muchlis tru3@gmail.com
  Sent: Monday 3rd November 2014 10:58
  To: user@nutch.apache.org
  Subject: Re: [Error Crawling Job Failed] NUTCH 1.9
 
  2014-11-03 16:56:06,530 INFO  indexer.IndexingJob - Indexer: starting at
  2014-11-03 16:56:06
  2014-11-03 16:56:06,582 INFO  indexer.IndexingJob - Indexer: deleting
 gone
  documents: false
  2014-11-03 16:56:06,582 INFO  indexer.IndexingJob - Indexer: URL
 filtering:
  false
  2014-11-03 16:56:06,582 INFO  indexer.IndexingJob - Indexer: URL
  normalizing: false
  2014-11-03 16:56:06,800 ERROR solr.SolrIndexWriter - Missing SOLR URL.
  Should be set via -D solr.server.url
  SOLRIndexWriter
  solr.server.url : URL of the SOLR instance (mandatory)
  solr.commit.size : buffer size when sending to SOLR (default 1000)
  solr.mapping.file : name of the mapping file for fields (default
  solrindex-mapping.xml)
  solr.auth : use authentication (default false)
  solr.auth.username : use authentication (default false)
  solr.auth : username for authentication
  solr.auth.password : password for authentication
 
  2014-11-03 16:56:06,802 ERROR indexer.IndexingJob - Indexer:
  java.lang.RuntimeException: Missing SOLR URL. Should be set via -D
  solr.server.url
  SOLRIndexWriter
  solr.server.url : URL of the SOLR instance (mandatory)
  solr.commit.size : buffer size when sending to SOLR (default 1000)
  solr.mapping.file : name of the mapping file for fields (default
  solrindex-mapping.xml)
  solr.auth : use authentication (default false)
  solr.auth.username : use authentication (default false)
  solr.auth : username for authentication
  solr.auth.password : password for authentication
 
  at
 
 org.apache.nutch.indexwriter.solr.SolrIndexWriter.setConf(SolrIndexWriter.java:192)
  at
 
 org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:159)
  at org.apache.nutch.indexer.IndexWriters.init(IndexWriters.java:57)
  at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:91)
  at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:176)
  at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
  at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:186)
 
 
  On Mon, Nov 3, 2014 at 3:41 PM, Markus Jelsma 
 markus.jel...@openindex.io
  wrote:
 
   Hi - see the logs for more details.
   Markus
  
   -Original message-
From:Muhamad Muchlis tru3@gmail.com
Sent: Monday 3rd November 2014 9:15
To: user@nutch.apache.org
Subject: [Error Crawling Job Failed] NUTCH 1.9
   
Hello.
   
I get an error message when I run the command:
   
*crawl seed/seed.txt crawl -depth 3 -topN 5*
   
   
Error Message :
   
SOLRIndexWriter
solr.server.url : URL of the SOLR instance (mandatory)
solr.commit.size : buffer size when sending to SOLR (default 1000)
solr.mapping.file : name of the mapping file for fields (default
solrindex-mapping.xml)
solr.auth : use authentication (default false)
solr.auth.username : use authentication (default false)
solr.auth : username for authentication
solr.auth.password : password for authentication
   
   
Indexer: java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:114)
at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:176)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:186)
   
   
Can anyone explain why this happened ?
   
   
   
   
   
Best regard's
   
M.Muchlis
   
  
 



Re: [Error Crawling Job Failed] NUTCH 1.9

2014-11-03 Thread Muhamad Muchlis
Like this ?

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>

<property>
 <name>http.agent.name</name>
 <value>My Nutch Spider</value>
</property>

<property>
 <name>solr.server.url</name>
 <value>http://localhost:8983/solr/</value>
</property>


</configuration>


On Mon, Nov 3, 2014 at 5:41 PM, Markus Jelsma markus.jel...@openindex.io
wrote:

 You can set solr.server.url in your nutch-site.xml or pass it via command
 line as -Dsolr.server.url=URL



 -Original message-
  From:Muhamad Muchlis tru3@gmail.com
  Sent: Monday 3rd November 2014 11:37
  To: user@nutch.apache.org
  Subject: Re: [Error Crawling Job Failed] NUTCH 1.9
 
  Hi Markus,
 
  Where can I find the settings solr url?  -D
 
  On Mon, Nov 3, 2014 at 5:31 PM, Markus Jelsma 
 markus.jel...@openindex.io
  wrote:
 
   Well, here is is:
   java.lang.RuntimeException: Missing SOLR URL. Should be set via
   -Dsolr.server.url
  
  
  
   -Original message-
From:Muhamad Muchlis tru3@gmail.com
Sent: Monday 3rd November 2014 10:58
To: user@nutch.apache.org
Subject: Re: [Error Crawling Job Failed] NUTCH 1.9
   
2014-11-03 16:56:06,530 INFO  indexer.IndexingJob - Indexer:
 starting at
2014-11-03 16:56:06
2014-11-03 16:56:06,582 INFO  indexer.IndexingJob - Indexer: deleting
   gone
documents: false
2014-11-03 16:56:06,582 INFO  indexer.IndexingJob - Indexer: URL
   filtering:
false
2014-11-03 16:56:06,582 INFO  indexer.IndexingJob - Indexer: URL
normalizing: false
2014-11-03 16:56:06,800 ERROR solr.SolrIndexWriter - Missing SOLR
 URL.
Should be set via -D solr.server.url
SOLRIndexWriter
solr.server.url : URL of the SOLR instance (mandatory)
solr.commit.size : buffer size when sending to SOLR (default 1000)
solr.mapping.file : name of the mapping file for fields (default
solrindex-mapping.xml)
solr.auth : use authentication (default false)
solr.auth.username : use authentication (default false)
solr.auth : username for authentication
solr.auth.password : password for authentication
   
2014-11-03 16:56:06,802 ERROR indexer.IndexingJob - Indexer:
java.lang.RuntimeException: Missing SOLR URL. Should be set via -D
solr.server.url
SOLRIndexWriter
solr.server.url : URL of the SOLR instance (mandatory)
solr.commit.size : buffer size when sending to SOLR (default 1000)
solr.mapping.file : name of the mapping file for fields (default
solrindex-mapping.xml)
solr.auth : use authentication (default false)
solr.auth.username : use authentication (default false)
solr.auth : username for authentication
solr.auth.password : password for authentication
   
at
   
  
 org.apache.nutch.indexwriter.solr.SolrIndexWriter.setConf(SolrIndexWriter.java:192)
at
   
  
 org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:159)
at org.apache.nutch.indexer.IndexWriters.init(IndexWriters.java:57)
at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:91)
at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:176)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:186)
   
   
On Mon, Nov 3, 2014 at 3:41 PM, Markus Jelsma 
   markus.jel...@openindex.io
wrote:
   
 Hi - see the logs for more details.
 Markus

 -Original message-
  From:Muhamad Muchlis tru3@gmail.com
  Sent: Monday 3rd November 2014 9:15
  To: user@nutch.apache.org
  Subject: [Error Crawling Job Failed] NUTCH 1.9
 
  Hello.
 
  I get an error message when I run the command:
 
  *crawl seed/seed.txt crawl -depth 3 -topN 5*
 
 
  Error Message :
 
  SOLRIndexWriter
  solr.server.url : URL of the SOLR instance (mandatory)
  solr.commit.size : buffer size when sending to SOLR (default
 1000)
  solr.mapping.file : name of the mapping file for fields (default
  solrindex-mapping.xml)
  solr.auth : use authentication (default false)
  solr.auth.username : use authentication (default false)
  solr.auth : username for authentication
  solr.auth.password : password for authentication
 
 
  Indexer: java.io.IOException: Job failed!
  at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
  at
 org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:114)
  at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:176)
  at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
  at
 org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:186)
 
 
  Can anyone explain why this happened ?
 
 
 
 
 
  Best regard's
 
  M.Muchlis
 

   
  



Re: [Error Crawling Job Failed] NUTCH 1.9

2014-11-03 Thread Muhamad Muchlis
Hi Markus,

When I run this command:

*nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/**



I got an error; here is the log:

2014-11-03 17:55:04,602 INFO  indexer.IndexingJob - Indexer: starting at
2014-11-03 17:55:04
2014-11-03 17:55:04,652 INFO  indexer.IndexingJob - Indexer: deleting gone
documents: false
2014-11-03 17:55:04,652 INFO  indexer.IndexingJob - Indexer: URL filtering:
false
2014-11-03 17:55:04,652 INFO  indexer.IndexingJob - Indexer: URL
normalizing: false
2014-11-03 17:55:04,860 INFO  indexer.IndexWriters - Adding
org.apache.nutch.indexwriter.solr.SolrIndexWriter
2014-11-03 17:55:04,861 INFO  indexer.IndexingJob - Active IndexWriters :
SOLRIndexWriter
solr.server.url : URL of the SOLR instance (mandatory)
solr.commit.size : buffer size when sending to SOLR (default 1000)
solr.mapping.file : name of the mapping file for fields (default
solrindex-mapping.xml)
solr.auth : use authentication (default false)
solr.auth.username : use authentication (default false)
solr.auth : username for authentication
solr.auth.password : password for authentication


2014-11-03 17:55:04,865 INFO  indexer.IndexerMapReduce - IndexerMapReduce:
crawldb: crawl/indexes
2014-11-03 17:55:04,865 INFO  indexer.IndexerMapReduce - IndexerMapReduces:
adding segment: crawl/crawldb
2014-11-03 17:55:04,978 INFO  indexer.IndexerMapReduce - IndexerMapReduces:
adding segment: crawl/linkdb
2014-11-03 17:55:04,979 INFO  indexer.IndexerMapReduce - IndexerMapReduces:
adding segment: crawl/segments/20141103163424
2014-11-03 17:55:04,980 INFO  indexer.IndexerMapReduce - IndexerMapReduces:
adding segment: crawl/segments/20141103175027
2014-11-03 17:55:04,981 INFO  indexer.IndexerMapReduce - IndexerMapReduces:
adding segment: crawl/segments/20141103175109
2014-11-03 17:55:05,033 WARN  util.NativeCodeLoader - Unable to load
native-hadoop library for your platform... using builtin-java classes where
applicable
2014-11-03 17:55:05,110 ERROR security.UserGroupInformation -
PriviledgedActionException as:me
cause:org.apache.hadoop.mapred.InvalidInputException: Input path does not
exist:
file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/crawldb/crawl_fetch
Input path does not exist:
file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/crawldb/crawl_parse
Input path does not exist:
file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/crawldb/parse_data
Input path does not exist:
file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/crawldb/parse_text
Input path does not exist:
file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/linkdb/crawl_fetch
Input path does not exist:
file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/linkdb/crawl_parse
Input path does not exist:
file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/linkdb/parse_data
Input path does not exist:
file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/linkdb/parse_text
Input path does not exist:
file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/segments/20141103163424/crawl_parse
Input path does not exist:
file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/segments/20141103163424/parse_data
Input path does not exist:
file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/segments/20141103163424/parse_text
Input path does not exist:
file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/indexes/current
2014-11-03 17:55:05,112 ERROR indexer.IndexingJob - Indexer:
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist:
file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/crawldb/crawl_fetch
Input path does not exist:
file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/crawldb/crawl_parse
Input path does not exist:
file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/crawldb/parse_data
Input path does not exist:
file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/crawldb/parse_text
Input path does not exist:
file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/linkdb/crawl_fetch
Input path does not exist:
file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/linkdb/crawl_parse
Input path does not exist:
file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/linkdb/parse_data
Input path does not exist:
file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/linkdb/parse_text
Input path does not exist:
file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/segments/20141103163424/crawl_parse
Input path does not exist:
file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/segments/20141103163424/parse_data
Input path does not exist:
file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/segments/20141103163424/parse_text
Input path does not exist:
file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/indexes/current
at
org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:197)
at
org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:40)
at

RE: [Error Crawling Job Failed] NUTCH 1.9

2014-11-03 Thread Markus Jelsma


 
No, like this:

  <property>
    <name>solr.server.url</name>
    <value>http://localhost:8983/solr/</value>
  </property>
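
An equivalent route that needs neither nutch-site.xml nor a -D flag is the solrindex wrapper, which takes the Solr URL as its first argument and otherwise mirrors the index command; a sketch with placeholder paths:

  bin/nutch solrindex http://localhost:8983/solr/ crawl/crawldb -linkdb crawl/linkdb -dir crawl/segments/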

-Original message-
 From:Muhamad Muchlis tru3@gmail.com
 Sent: Monday 3rd November 2014 11:47
 To: user@nutch.apache.org
 Subject: Re: [Error Crawling Job Failed] NUTCH 1.9
 
 Like this ?
 
 <?xml version="1.0"?>
 <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
 
 <!-- Put site-specific property overrides in this file. -->
 
 <configuration>
 
 <property>
  <name>http.agent.name</name>
  <value>My Nutch Spider</value>
 </property>
 
 <property>
  <name>solr.server.url</name>
  <value>http://localhost:8983/solr/</value>
 </property>
 
 
 </configuration>
 
 
 On Mon, Nov 3, 2014 at 5:41 PM, Markus Jelsma markus.jel...@openindex.io
 wrote:
 
  You can set solr.server.url in your nutch-site.xml or pass it via command
  line as -Dsolr.server.url=URL
 
 
 
  -Original message-
   From:Muhamad Muchlis tru3@gmail.com
   Sent: Monday 3rd November 2014 11:37
   To: user@nutch.apache.org
   Subject: Re: [Error Crawling Job Failed] NUTCH 1.9
  
   Hi Markus,
  
   Where can I find the settings solr url?  -D
  
   On Mon, Nov 3, 2014 at 5:31 PM, Markus Jelsma 
  markus.jel...@openindex.io
   wrote:
  
Well, here is is:
java.lang.RuntimeException: Missing SOLR URL. Should be set via
-Dsolr.server.url
   
   
   
-Original message-
 From:Muhamad Muchlis tru3@gmail.com
 Sent: Monday 3rd November 2014 10:58
 To: user@nutch.apache.org
 Subject: Re: [Error Crawling Job Failed] NUTCH 1.9

 2014-11-03 16:56:06,530 INFO  indexer.IndexingJob - Indexer:
  starting at
 2014-11-03 16:56:06
 2014-11-03 16:56:06,582 INFO  indexer.IndexingJob - Indexer: deleting
gone
 documents: false
 2014-11-03 16:56:06,582 INFO  indexer.IndexingJob - Indexer: URL
filtering:
 false
 2014-11-03 16:56:06,582 INFO  indexer.IndexingJob - Indexer: URL
 normalizing: false
 2014-11-03 16:56:06,800 ERROR solr.SolrIndexWriter - Missing SOLR
  URL.
 Should be set via -D solr.server.url
 SOLRIndexWriter
 solr.server.url : URL of the SOLR instance (mandatory)
 solr.commit.size : buffer size when sending to SOLR (default 1000)
 solr.mapping.file : name of the mapping file for fields (default
 solrindex-mapping.xml)
 solr.auth : use authentication (default false)
 solr.auth.username : use authentication (default false)
 solr.auth : username for authentication
 solr.auth.password : password for authentication

 2014-11-03 16:56:06,802 ERROR indexer.IndexingJob - Indexer:
 java.lang.RuntimeException: Missing SOLR URL. Should be set via -D
 solr.server.url
 SOLRIndexWriter
 solr.server.url : URL of the SOLR instance (mandatory)
 solr.commit.size : buffer size when sending to SOLR (default 1000)
 solr.mapping.file : name of the mapping file for fields (default
 solrindex-mapping.xml)
 solr.auth : use authentication (default false)
 solr.auth.username : use authentication (default false)
 solr.auth : username for authentication
 solr.auth.password : password for authentication

 at

   
  org.apache.nutch.indexwriter.solr.SolrIndexWriter.setConf(SolrIndexWriter.java:192)
 at

   
  org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:159)
 at org.apache.nutch.indexer.IndexWriters.init(IndexWriters.java:57)
 at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:91)
 at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:176)
 at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
 at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:186)


 On Mon, Nov 3, 2014 at 3:41 PM, Markus Jelsma 
markus.jel...@openindex.io
 wrote:

  Hi - see the logs for more details.
  Markus
 
  -Original message-
   From:Muhamad Muchlis tru3@gmail.com
   Sent: Monday 3rd November 2014 9:15
   To: user@nutch.apache.org
   Subject: [Error Crawling Job Failed] NUTCH 1.9
  
   Hello.
  
   I get an error message when I run the command:
  
   *crawl seed/seed.txt crawl -depth 3 -topN 5*
  
  
   Error Message :
  
   SOLRIndexWriter
   solr.server.url : URL of the SOLR instance (mandatory)
   solr.commit.size : buffer size when sending to SOLR (default
  1000)
   solr.mapping.file : name of the mapping file for fields (default
   solrindex-mapping.xml)
   solr.auth : use authentication (default false)
   solr.auth.username : use authentication (default false)
   solr.auth : username for authentication
   solr.auth.password : password for authentication
  
  
   Indexer: java.io.IOException: Job failed!
   at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
   at
  

RE: [Error Crawling Job Failed] NUTCH 1.9

2014-11-03 Thread Markus Jelsma
Oh - if you need to index multiple segments, don't use segments/* but -dir 
segments/
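
Read together with the Indexer usage in 1.9 (crawldb first, the link database behind a -linkdb flag, segments last, either listed individually or via -dir), the failing command above would become something like the sketch below. It assumes solr.server.url is already set in nutch-site.xml as discussed earlier, and it drops crawl/indexes because this indexer writes to Solr rather than to a local index directory:

  bin/nutch index crawl/crawldb -linkdb crawl/linkdb -dir crawl/segments/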
 
 
-Original message-
 From:Muhamad Muchlis tru3@gmail.com
 Sent: Monday 3rd November 2014 12:00
 To: user@nutch.apache.org
 Subject: Re: [Error Crawling Job Failed] NUTCH 1.9
 
 Hi Markus,
 
 When i run this command :
 
 *nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/**
 
 
 
 I got an error here is the log :
 
 2014-11-03 17:55:04,602 INFO  indexer.IndexingJob - Indexer: starting at
 2014-11-03 17:55:04
 2014-11-03 17:55:04,652 INFO  indexer.IndexingJob - Indexer: deleting gone
 documents: false
 2014-11-03 17:55:04,652 INFO  indexer.IndexingJob - Indexer: URL filtering:
 false
 2014-11-03 17:55:04,652 INFO  indexer.IndexingJob - Indexer: URL
 normalizing: false
 2014-11-03 17:55:04,860 INFO  indexer.IndexWriters - Adding
 org.apache.nutch.indexwriter.solr.SolrIndexWriter
 2014-11-03 17:55:04,861 INFO  indexer.IndexingJob - Active IndexWriters :
 SOLRIndexWriter
 solr.server.url : URL of the SOLR instance (mandatory)
 solr.commit.size : buffer size when sending to SOLR (default 1000)
 solr.mapping.file : name of the mapping file for fields (default
 solrindex-mapping.xml)
 solr.auth : use authentication (default false)
 solr.auth.username : use authentication (default false)
 solr.auth : username for authentication
 solr.auth.password : password for authentication
 
 
 2014-11-03 17:55:04,865 INFO  indexer.IndexerMapReduce - IndexerMapReduce:
 crawldb: crawl/indexes
 2014-11-03 17:55:04,865 INFO  indexer.IndexerMapReduce - IndexerMapReduces:
 adding segment: crawl/crawldb
 2014-11-03 17:55:04,978 INFO  indexer.IndexerMapReduce - IndexerMapReduces:
 adding segment: crawl/linkdb
 2014-11-03 17:55:04,979 INFO  indexer.IndexerMapReduce - IndexerMapReduces:
 adding segment: crawl/segments/20141103163424
 2014-11-03 17:55:04,980 INFO  indexer.IndexerMapReduce - IndexerMapReduces:
 adding segment: crawl/segments/20141103175027
 2014-11-03 17:55:04,981 INFO  indexer.IndexerMapReduce - IndexerMapReduces:
 adding segment: crawl/segments/20141103175109
 2014-11-03 17:55:05,033 WARN  util.NativeCodeLoader - Unable to load
 native-hadoop library for your platform... using builtin-java classes where
 applicable
 2014-11-03 17:55:05,110 ERROR security.UserGroupInformation -
 PriviledgedActionException as:me
 cause:org.apache.hadoop.mapred.InvalidInputException: Input path does not
 exist:
 file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/crawldb/crawl_fetch
 Input path does not exist:
 file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/crawldb/crawl_parse
 Input path does not exist:
 file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/crawldb/parse_data
 Input path does not exist:
 file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/crawldb/parse_text
 Input path does not exist:
 file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/linkdb/crawl_fetch
 Input path does not exist:
 file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/linkdb/crawl_parse
 Input path does not exist:
 file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/linkdb/parse_data
 Input path does not exist:
 file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/linkdb/parse_text
 Input path does not exist:
 file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/segments/20141103163424/crawl_parse
 Input path does not exist:
 file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/segments/20141103163424/parse_data
 Input path does not exist:
 file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/segments/20141103163424/parse_text
 Input path does not exist:
 file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/indexes/current
 2014-11-03 17:55:05,112 ERROR indexer.IndexingJob - Indexer:
 org.apache.hadoop.mapred.InvalidInputException: Input path does not exist:
 file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/crawldb/crawl_fetch
 Input path does not exist:
 file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/crawldb/crawl_parse
 Input path does not exist:
 file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/crawldb/parse_data
 Input path does not exist:
 file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/crawldb/parse_text
 Input path does not exist:
 file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/linkdb/crawl_fetch
 Input path does not exist:
 file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/linkdb/crawl_parse
 Input path does not exist:
 file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/linkdb/parse_data
 Input path does not exist:
 file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/linkdb/parse_text
 Input path does not exist:
 file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/segments/20141103163424/crawl_parse
 Input path does not exist:
 file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/segments/20141103163424/parse_data
 Input path does not exist:
 

Re: [Error Crawling Job Failed] NUTCH 1.9

2014-11-03 Thread Muhamad Muchlis
Hi Markus,

When I run the crawl with Solr indexing: *crawl seed.txt crawl
http://localhost:8983/solr/ -depth 3 -topN 5*

and then query Solr at http://localhost:8983/solr/#/collection1/query, I get
0 records.


here is the Logs :

2014-11-03 18:18:54,307 INFO  crawl.Injector - Injector: starting at
2014-11-03 18:18:54
2014-11-03 18:18:54,308 INFO  crawl.Injector - Injector: crawlDb:
crawl/crawldb
2014-11-03 18:18:54,308 INFO  crawl.Injector - Injector: urlDir: seed
2014-11-03 18:18:54,309 INFO  crawl.Injector - Injector: Converting
injected urls to crawl db entries.
2014-11-03 18:18:54,546 WARN  util.NativeCodeLoader - Unable to load
native-hadoop library for your platform... using builtin-java classes where
applicable
2014-11-03 18:18:54,601 WARN  snappy.LoadSnappy - Snappy native library not
loaded
2014-11-03 18:18:55,119 INFO  regex.RegexURLNormalizer - can't find rules
for scope 'inject', using default
2014-11-03 18:18:55,821 INFO  crawl.Injector - Injector: Total number of
urls rejected by filters: 0
2014-11-03 18:18:55,821 INFO  crawl.Injector - Injector: Total number of
urls after normalization: 1
2014-11-03 18:18:55,822 INFO  crawl.Injector - Injector: Merging injected
urls into crawl db.
2014-11-03 18:18:56,057 INFO  crawl.Injector - Injector: overwrite: false
2014-11-03 18:18:56,057 INFO  crawl.Injector - Injector: update: false
2014-11-03 18:18:56,904 INFO  crawl.Injector - Injector: URLs merged: 1
2014-11-03 18:18:56,913 INFO  crawl.Injector - Injector: Total new urls
injected: 0
2014-11-03 18:18:56,914 INFO  crawl.Injector - Injector: finished at
2014-11-03 18:18:56, elapsed: 00:00:02


Here are the steps for my first crawl:

1. crawl seed.txt crawl -depth 3 -topN 5 > log.txt
2. *crawl seed.txt crawl http://localhost:8983/solr/ -depth 3 -topN 5*

*Are these the correct steps?*

*reference:
http://wiki.apache.org/nutch/NutchTutorial#a3.5._Using_the_crawl_script*
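
For reference, the crawl script described in that tutorial section takes positional arguments (seed directory, crawl directory, Solr URL, number of rounds); the -depth and -topN flags belong to the old bin/nutch crawl command rather than to this script. A sketch using the directory names from this thread, with the round count chosen arbitrarily:

  bin/crawl seed/ crawl/ http://localhost:8983/solr/ 3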




On Mon, Nov 3, 2014 at 6:05 PM, Markus Jelsma markus.jel...@openindex.io
wrote:

 Oh - if you need to index multiple segments, don't use segments/* but -dir
 segments/


 -Original message-
  From:Muhamad Muchlis tru3@gmail.com
  Sent: Monday 3rd November 2014 12:00
  To: user@nutch.apache.org
  Subject: Re: [Error Crawling Job Failed] NUTCH 1.9
 
  Hi Markus,
 
  When i run this command :
 
  *nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/**
 
 
 
  I got an error here is the log :
 
  2014-11-03 17:55:04,602 INFO  indexer.IndexingJob - Indexer: starting at
  2014-11-03 17:55:04
  2014-11-03 17:55:04,652 INFO  indexer.IndexingJob - Indexer: deleting
 gone
  documents: false
  2014-11-03 17:55:04,652 INFO  indexer.IndexingJob - Indexer: URL
 filtering:
  false
  2014-11-03 17:55:04,652 INFO  indexer.IndexingJob - Indexer: URL
  normalizing: false
  2014-11-03 17:55:04,860 INFO  indexer.IndexWriters - Adding
  org.apache.nutch.indexwriter.solr.SolrIndexWriter
  2014-11-03 17:55:04,861 INFO  indexer.IndexingJob - Active IndexWriters :
  SOLRIndexWriter
  solr.server.url : URL of the SOLR instance (mandatory)
  solr.commit.size : buffer size when sending to SOLR (default 1000)
  solr.mapping.file : name of the mapping file for fields (default
  solrindex-mapping.xml)
  solr.auth : use authentication (default false)
  solr.auth.username : use authentication (default false)
  solr.auth : username for authentication
  solr.auth.password : password for authentication
 
 
  2014-11-03 17:55:04,865 INFO  indexer.IndexerMapReduce -
 IndexerMapReduce:
  crawldb: crawl/indexes
  2014-11-03 17:55:04,865 INFO  indexer.IndexerMapReduce -
 IndexerMapReduces:
  adding segment: crawl/crawldb
  2014-11-03 17:55:04,978 INFO  indexer.IndexerMapReduce -
 IndexerMapReduces:
  adding segment: crawl/linkdb
  2014-11-03 17:55:04,979 INFO  indexer.IndexerMapReduce -
 IndexerMapReduces:
  adding segment: crawl/segments/20141103163424
  2014-11-03 17:55:04,980 INFO  indexer.IndexerMapReduce -
 IndexerMapReduces:
  adding segment: crawl/segments/20141103175027
  2014-11-03 17:55:04,981 INFO  indexer.IndexerMapReduce -
 IndexerMapReduces:
  adding segment: crawl/segments/20141103175109
  2014-11-03 17:55:05,033 WARN  util.NativeCodeLoader - Unable to load
  native-hadoop library for your platform... using builtin-java classes
 where
  applicable
  2014-11-03 17:55:05,110 ERROR security.UserGroupInformation -
  PriviledgedActionException as:me
  cause:org.apache.hadoop.mapred.InvalidInputException: Input path does not
  exist:
 
 file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/crawldb/crawl_fetch
  Input path does not exist:
 
 file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/crawldb/crawl_parse
  Input path does not exist:
 
 file:/home/me/SoftwareDevelopment/Crawling/dataCrawl/crawl/crawldb/parse_data
  Input path does not exist:
 
 

Re: When to delete the segments?

2014-11-03 Thread Meraj A. Khan
I am only indexing the parsed data in Solr, so there is no way for me
to know when to delete a segment in an automated fashion by
considering the parsed data alone. However, I just realized that a
_SUCCESS file is created within the segment once it is fetched. I will
use that as an indicator to automate the deletion of the segment
folders.
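
A minimal sketch of that automation, assuming (as observed above) that a finished segment carries a _SUCCESS marker at its top level and that segments live under crawl/segments; both the path and the marker check are assumptions to verify against the actual crawl layout, and on HDFS the same test and delete would use hadoop fs -test -e and hadoop fs -rm -r instead:

  #!/bin/bash
  # remove every segment directory whose fetch/parse job left a _SUCCESS marker
  for seg in crawl/segments/*/; do
    if [ -e "$seg/_SUCCESS" ]; then
      echo "removing completed segment $seg"
      rm -r "$seg"
    fi
  done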



On Mon, Nov 3, 2014 at 12:56 AM, remi tassing tassingr...@gmail.com wrote:
 If you are able to determine what is done with the parsed data, then you
 could delete the segment as soon as that job is completed.

 As I mentioned earlier, if the data is to be pushed to Solr (e.g. with
 bin/nutch solrindex http://127.0.0.1:8983/solr crawl/crawldb $SEGMENT),
 then after indexing is done you can get rid of the segment

 On Mon, Nov 3, 2014 at 12:16 PM, Meraj A. Khan mera...@gmail.com wrote:

 Thanks .

 How do I definitively determine if a segment has been completely
 parsed, if I were to set up an hourly crontab to delete the segments
 from HDFS? I have seen that the presence of the crawl_parse directory
 in the segments directory at least indicates that the parsing has
 started, but I think the directory would be created as soon as the
 parsing begins.

 So as not to delete a segment prematurely, while it is still being
 fetched, what should I be looking for in my script?

 On Sun, Nov 2, 2014 at 7:58 PM, remi tassing tassingr...@gmail.com
 wrote:
  The next fetching time is computed after updatedb is issued with that
  segment
 
  So as long as you don't need the parsed data anymore then you can delete
  the segment (e.g. after indexing through Solr...).
 
 
 
  On Mon, Nov 3, 2014 at 8:41 AM, Meraj A. Khan mera...@gmail.com wrote:
 
  Hi All,
 
  I am deleting the segments as soon as they are fetched and parsed. I
  have read in previous posts that it is safe to delete a segment
  only if it is older than db.default.fetch.interval, but my
  understanding is that one does not have to wait for the segment to be
  older than db.default.fetch.interval and can delete it as soon as the
  segment is parsed.
 
  Is my understanding correct ? I want to delete the segment as soon as
  possible so as to save as much disk space as possible.
 
  Thanks.