Ok, so Avro did not work well for me. I got a test grid with HBase and started using that for now. All steps ran without errors, and I can see my crawled doc in HBase. However, after running the Solr integration and querying Solr, I get back nothing. The index files look very tiny. The one thing I noticed is a message that appears during almost every step:

13/06/27 20:37:53 INFO store.HBaseStore: Keyclass and nameclass match but mismatching table names mappingfile schema is 'webpage' vs actual schema 'crawl2_webpage' , assuming they are the same.

This looks suspicious, and I think it is what is causing the Solr index to be empty. Googling suggested I should edit nutch-default.xml; I tried that and rebuilt the job, but the message still appears.

On Thu, Jun 27, 2013 at 10:30 AM, h b <[email protected]> wrote:

> Ok, I ran ant, ant jar and ant job, and that seems to have picked up the
> config changes.
> Now the inject output shows that it is using AvroStore as Gora storage.
>
> Now I am getting a NullPointerException:
>
> java.lang.NullPointerException
>     at org.apache.gora.mapreduce.GoraOutputFormat.setOutputPath(GoraOutputFormat.java:70)
>     at org.apache.gora.mapreduce.GoraOutputFormat.getRecordWriter(GoraOutputFormat.java:91)
>     at org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.<init>(MapTask.java:521)
>     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:636)
>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323)
>     at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:396)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1177)
>     at org.apache.hadoop.mapred.Child.main(Child.java:264)
>
> which does not look Nutch related. I will work on this and write back
> if I get stuck on something else, or will write back if I succeed.
>
>
> On Thu, Jun 27, 2013 at 10:18 AM, h b <[email protected]> wrote:
>
>> Hi Lewis,
>>
>> Sorry for missing that one. So I updated the top level conf and rebuilt
>> the job.
>>
>> cat ~/nutch/apache-nutch-2.2/conf/nutch-site.xml
>>
>> ......
>> <property>
>> <name>storage.data.store.class</name>
>> <value>org.apache.gora.avro.store.AvroStore</value>
>> </property>
>> ......
>>
>> cd ~/nutch/apache-nutch-2.2/
>> ant job
>> cd ~/nutch/apache-nutch-2.2/runtime/deploy/
>>
>>
>> bin/nutch inject urls -crawlId crawl1
>> 13/06/27 17:12:01 INFO crawl.InjectorJob: InjectorJob: starting at 2013-06-27 17:12:01
>> 13/06/27 17:12:01 INFO crawl.InjectorJob: InjectorJob: Injecting urlDir: urls
>> 13/06/27 17:12:01 INFO crawl.InjectorJob: InjectorJob: Using class org.apache.gora.memory.store.MemStore as the Gora storage class.
>>
>> It still shows me MemStore.
>>
>> In the jobtracker I see that the [crawl1]inject urls job does not have a
>> urls_injected property.
>> I have a *db.score.injected* of 1.0, but I don't think that says anything
>> about the URLs injected.
>>
>>
>>
>> On Thu, Jun 27, 2013 at 7:09 AM, Lewis John Mcgibbney <
>> [email protected]> wrote:
>>
>>> Hi,
>>> Please re-read my mail.
>>> If you are using the deploy directory, e.g. running on a Hadoop cluster,
>>> then make sure to edit nutch-site.xml from within the top level conf
>>> directory, _not_ the conf directory in runtime/local.
>>> If you look at the ant runtime target in the build script you will see the
>>> code which generates the runtime directory structure.
>>> Make changes to conf/nutch-site.xml, build the job jar, navigate to
>>> runtime/deploy, run the code.
>>> It's easier to make the job jar and scripts in deploy available to the job
>>> tracker.
>>> You also didn't comment on the counters for the inject job. Do you see any?
>>> Best
>>> Lewis
>>>
>>> On Wednesday, June 26, 2013, h b <[email protected]> wrote:
>>> > Here is an example of what I am saying about the config changes not taking
>>> > effect.
>>> >
>>> > cd runtime/deploy
>>> > cat ../local/conf/nutch-site.xml
>>> > ......
>>> >
>>> > <property>
>>> > <name>storage.data.store.class</name>
>>> > <value>org.apache.gora.avro.store.AvroStore</value>
>>> > </property>
>>> > .....
>>> >
>>> > cd ../..
>>> >
>>> > ant job
>>> >
>>> > cd runtime/deploy
>>> > bin/nutch inject urls -crawlId crawl1
>>> > .....
>>> > 13/06/27 06:34:29 INFO crawl.InjectorJob: InjectorJob: Using class org.apache.gora.memory.store.MemStore as the Gora storage class.
>>> > .....
>>> >
>>> > So nutch-site.xml was changed to use AvroStore as the storage class and the
>>> > job was rebuilt. I reran inject, and its output still shows that it is
>>> > trying to use MemStore.
>>> >
>>> >
>>> > On Wed, Jun 26, 2013 at 11:05 PM, Lewis John Mcgibbney <
>>> > [email protected]> wrote:
>>> >
>>> >> The Gora MemStore was introduced to deal predominantly with test scenarios.
>>> >> This is justified as the 2.x code is pulled nightly and after every commit
>>> >> and tested.
>>> >> It is not thread safe and should not be used (until we fix some issues)
>>> >> for any kind of serious deployment.
>>> >> From your inject task on the job tracker, you will be able to see
>>> >> 'urls_injected' counters which represent the number of urls actually
>>> >> persisted through Gora into the datastore.
>>> >> I understand that HBase is not an option. Gora should also support writing
>>> >> the output into Avro sequence files... which can be pumped into HDFS. We
>>> >> have done some work on this so I suppose that right now is as good a time
>>> >> as any for you to try it out.
>>> >> Use org.apache.gora.avro.store.AvroStore as the default datastore, I think.
>>> >> You can double check by looking into gora.properties.
>>> >> As a note, you should use nutch-site.xml within the top level conf
>>> >> directory for all your Nutch configuration. You should then create a new
>>> >> job jar for use in Hadoop by calling 'ant job' after the changes are made.
>>> >> hth
>>> >> Lewis
>>> >>
>>> >> On Wednesday, June 26, 2013, h b <[email protected]> wrote:
>>> >> > The quick responses flowing are very encouraging. Thanks Tejas.
>>> >> > Tejas, as I mentioned earlier, I actually ran it step by step.
>>> >> >
>>> >> > So first I ran the inject command and then readdb with the dump option,
>>> >> > and did not see anything in the dump files, which leads me to say that the
>>> >> > inject did not work. I verified the regex-urlfilter and made sure that my
>>> >> > URL is not getting filtered.
>>> >> >
>>> >> > I agree that the second link is about configuring HBase as a storageDB.
>>> >> > However, I do not have HBase installed and don't foresee getting it
>>> >> > installed any time soon, hence using HBase for storage is not an option,
>>> >> > so I am going to have to stick to Gora with the memory store.
>>> >> >
>>> >> >
>>> >> > On Wed, Jun 26, 2013 at 10:02 PM, Tejas Patil <[email protected]> wrote:
>>> >> >
>>> >> >> On Wed, Jun 26, 2013 at 9:53 PM, h b <[email protected]> wrote:
>>> >> >>
>>> >> >> > Thanks for the response Lewis.
>>> >> >> > I did read these links. I mostly followed the first link and tried both
>>> >> >> > the 3.2 and 3.3 sections. Using bin/crawl gave me a null pointer exception
>>> >> >> > on Solr, so I figured that I should first deal with getting the crawl part
>>> >> >> > to work and then deal with Solr indexing. Hence I went back to trying it
>>> >> >> > stepwise.
>>> >> >> >
>>> >> >>
>>> >> >> You should try running the crawl using individual commands and see where
>>> >> >> the problem is. The Nutch tutorial which Lewis pointed you to had those
>>> >> >> commands. Even peeking into the bin/crawl script would also help, as it
>>> >> >> calls the nutch commands.
>>> >> >>
>>> >> >> >
>>> >> >> > As for the second link, it is more about using HBase as the store instead
>>> >> >> > of Gora. This is not really an option for me yet, because my grid does not
>>> >> >> > have HBase installed yet. Getting it done is not much under my control.
>>> >> >> >
>>> >> >>
>>> >> >> HBase is one of the datastores supported by Apache Gora. That tutorial
>>> >> >> speaks about how to configure Nutch (actually Gora) to use HBase as a
>>> >> >> backend. So, it's wrong to say that the tutorial was about HBase and not
>>> >> >> Gora.
>>> >> >>
>>> >> >> >
>>> >> >> > The FAQ link is the one I had not gone through until I checked your
>>> >> >> > response, but I do not find answers to any of my questions
>>> >> >> > (directly/indirectly) in it.
>>> >> >> >
>>> >> >>
>>> >> >> Ok
>>> >> >>
>>> >> >> >
>>> >> >> > On Wed, Jun 26, 2013 at 7:44 PM, Lewis John Mcgibbney <
>>> >> >> > > *Lewis*
>>>
>>> --
>>> *Lewis*

