Hi, I have a small problem using Nutch to crawl a website. Could someone tell me what is causing this error when I run the crawl?

InjectorJob: org.apache.gora.util.GoraException: java.lang.RuntimeException: org.apache.hadoop.hbase.MasterNotRunningException: Retried 10 times
	at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:167)
	at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:135)
	at org.apache.nutch.storage.StorageUtils.createWebStore(StorageUtils.java:75)
	at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:221)
	at org.apache.nutch.crawl.InjectorJob.inject(InjectorJob.java:251)
	at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:273)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
	at org.apache.nutch.crawl.InjectorJob.main(InjectorJob.java:282)
Caused by: java.lang.RuntimeException: org.apache.hadoop.hbase.MasterNotRunningException: Retried 10 times
	at org.apache.gora.hbase.store.HBaseStore.initialize(HBaseStore.java:127)
	at org.apache.gora.store.DataStoreFactory.initializeDataStore(DataStoreFactory.java:102)
	at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:161)
	... 7 more
Caused by: org.apache.hadoop.hbase.MasterNotRunningException: Retried 10 times
	at org.apache.hadoop.hbase.client.HBaseAdmin.<init>(HBaseAdmin.java:127)
	at org.apache.gora.hbase.store.HBaseStore.initialize(HBaseStore.java:109)
	... 9 more

On Tue, Feb 25, 2014 at 6:01 AM, Sebastian Nagel <[email protected]> wrote:
>> https://issues.apache.org/jira/browse/NUTCH-1140
> Thanks for digging this up!
>
>> Why is index-more adding this?
> Maybe, to have some title for MIME types
> which have no title (e.g., plain text).
> That could be the intension.
> The code is old (> 9 years) and the web
> has changed since. The original
> RFC http://www.ietf.org/rfc/rfc1806.txt
> for the content-disposition header
> is even older (1995).
>
>
> On 02/24/2014 10:40 PM, John Lafitte wrote:
>> Okay, I invoked it the way you mentioned and I get the same result.
>> However, I tried it without index-more included and I no longer have the
>> additional title. Why is index-more adding this?
>>
>>
>> On Mon, Feb 24, 2014 at 3:24 PM, Sebastian Nagel <[email protected]> wrote:
>>
>>>> I'm not sure I'm allowed to post it publicly.
>>> A minimalistic and anonymized example would be fine.
>>> However, if it's really the HTTP header it will
>>> be hard to make it reproducible.
>>>
>>>> I'm using the default parser-plugins.xml which shows parse-tika before
>>>> feed. I don't have feed in my plugin.includes, but if I modify
>>>> parser-plugins.xml and plugin.includes to try to favor the feed I still get
>>>> the same results. I might be doing something wrong.
>>>
>>> It's possible to set plugin.includes (and other properties) just for
>>> tools like indexchecker, parsechecker, etc:
>>>
>>> % bin/nutch indexchecker -Dplugin.includes="feed|index-(basic|more)|protocol-http" .../rss.xml
>>>
>>>
>>> On 02/24/2014 09:59 PM, John Lafitte wrote:
>>>> I think the channel/image/title idea was probably wrong. It looks like the
>>>> extra title field is actually the http header Content-Disposition: inline;
>>>> filename="jobexport.xml". I can email you the url privately of the
>>>> specific RSS feed I'm using for this issue, but since it's a client site
>>>> I'm not sure I'm allowed to post it publicly.
>>>>
>>>> I'm using the default parser-plugins.xml which shows parse-tika before
>>>> feed. I don't have feed in my plugin.includes, but if I modify
>>>> parser-plugins.xml and plugin.includes to try to favor the feed I still get
>>>> the same results. I might be doing something wrong.
>>>>
>>>>
>>>> On Mon, Feb 24, 2014 at 2:20 PM, Sebastian Nagel <[email protected]> wrote:
>>>>
>>>>> Hi John,
>>>>>
>>>>> can you attach an (short) example document to reproduce the problem?
>>>>> I was not able to reproduce it with the example in
>>>>> http://de.wikipedia.org/wiki/RSS
>>>>> which contains channel/image/title.
>>>>>
>>>>> Which parser plugin is used: "feed" or "parse-tika"?
>>>>> (In doubt, please, add the value of property "plugin.includes")
>>>>>
>>>>> Sebastian
>>>>>
>>>>>
>>>>> On 02/24/2014 08:31 PM, John Lafitte wrote:
>>>>>> I am using Nutch 1.7 and Solr 4.6.1. I'm having a problem with indexing
>>>>>> RSS that has channel/title then channel/image/title it tries to add both of
>>>>>> them then fails when doing solrindex because title isn't multivalued.
>>>>>>
>>>>>> I've used nutch indexchecker and I see the two titles being returned. The
>>>>>> extra title is the value that in the content-disposition: filename http
>>>>>> header. I only see one title when I run nutch readseg. So I'm a little
>>>>>> confused why it's
>>>>>>
>>>>>> I have made title multivalued in the solr schema and it seems to work that
>>>>>> way, but it seems wrong to me. Documents shouldn't have more than one
>>>>>> title. What is the correct way to fix this?

