Lewis, is it possible to crawl with a crawlId but have HBase create only the 'webpage' table, without the crawlId prefix, just like Cassandra does?
My other problems, the DBUpdateJob exception on some random URLs and the repeated/mixed HTML of all the URLs present in seed.txt, have also disappeared with the HBase backend.

But in my ParseFilter plugin I still get null from page.getText() and '0' for some other properties:

fetchTime: 0
prevFetchTime: 0
fetchInterval: 0
retriesSinceFetch: 0
modifiedTime: 0
prevModifiedTime: 0
protocolStatus: (null)

Am I supposed to get proper values here, or is this the expected output in a ParseFilter plugin? A simplified sketch of my filter is at the bottom of this mail, below the quoted thread.

Thanks,
Tony

PS. I am now getting the correct HTML in the ParseFilter with the HBase backend.

On Wed, Jun 26, 2013 at 1:13 PM, Tony Mullins <[email protected]> wrote:

> Hi Lewis,
>
> As you said, UpdateDBJob doesn't expect a crawlId, but if I give it a crawlId
> like -crawlId c10 then it doesn't create a new 'webpage' table, it uses the
> pre-existing 'c10_webpage' table instead, and the solrindex job also
> successfully inserts the docs into Solr.
>
> I just wonder how a new Nutch user like me could solve these kinds of
> issues without the community's help.
>
> Thanks for the help.
> Tony.
>
>
> On Wed, Jun 26, 2013 at 1:10 AM, Lewis John Mcgibbney <
> [email protected]> wrote:
>
>> Hi Tony,
>>
>> On Tue, Jun 25, 2013 at 1:10 AM, Tony Mullins <[email protected]> wrote:
>>
>> > So what should I do now to run my complete cycle of Nutch 2.x jobs and
>> > insert my docs into Solr?
>>
>> I'm not using HBase as a backend, however I know that, as per the crawl
>> script, updatedb doesn't use the crawlId parameter. Try adding the
>> parameter, please, and see if it works.
>> Thanks
>> Lewis
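
For reference, here is a simplified, self-contained sketch of roughly what my filter does, written against the Nutch 2.x ParseFilter / FieldPluggable interfaces as I understand them. The class name DebugParseFilter is just a placeholder, not my real plugin, and the getters I call are my assumption of the Avro-generated WebPage accessors:

import java.util.Collection;
import java.util.Collections;

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.parse.HTMLMetaTags;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.parse.ParseFilter;
import org.apache.nutch.storage.WebPage;
import org.w3c.dom.DocumentFragment;

// Placeholder debug filter: it only prints the WebPage fields that come back
// as null / 0 for me, then passes the Parse object through unchanged.
public class DebugParseFilter implements ParseFilter {

  private Configuration conf;

  @Override
  public Parse filter(String url, WebPage page, Parse parse,
      HTMLMetaTags metaTags, DocumentFragment doc) {
    // These are the values I am printing; on my setup text is null
    // and the time/interval fields are all 0.
    System.out.println("text: " + page.getText());
    System.out.println("fetchTime: " + page.getFetchTime());
    System.out.println("prevFetchTime: " + page.getPrevFetchTime());
    System.out.println("fetchInterval: " + page.getFetchInterval());
    System.out.println("retriesSinceFetch: " + page.getRetriesSinceFetch());
    System.out.println("modifiedTime: " + page.getModifiedTime());
    System.out.println("prevModifiedTime: " + page.getPrevModifiedTime());
    System.out.println("protocolStatus: " + page.getProtocolStatus());
    return parse;
  }

  @Override
  public Collection<WebPage.Field> getFields() {
    // I currently return an empty set here; I am not sure whether I am
    // supposed to declare fields like TEXT or FETCH_TIME instead.
    return Collections.emptySet();
  }

  @Override
  public void setConf(Configuration conf) {
    this.conf = conf;
  }

  @Override
  public Configuration getConf() {
    return conf;
  }
}

That is the kind of per-page output I get: text is null and the numeric fields are all 0, while the HTML itself now comes through correctly, as mentioned above.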

