Ok, I tweaked the code a bit to extract the html as is from the parser, only to realize that it is too much text and the crawl goes too deep. So I am looking to see if I can somehow limit the depth. The Nutch 1.x docs mention a -depth parameter; however, I do not see this in nutch-default.xml under Nutch 2.x. The -topN parameter controls the number of links per depth. So for Nutch 2.x, where/how do I set the depth?
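[Editor's note: a hedged sketch of how depth is usually expressed in Nutch 2.x. There is no -depth property in nutch-default.xml because depth is not a property: each generate/fetch/parse/updatedb cycle goes one link-hop deeper, so the depth is the number of rounds you run (the 2.x bin/crawl script takes this count as an argument). The DEPTH variable and the commented bin/nutch calls below are illustrative, not an official interface.]

```shell
# Sketch: "depth" in Nutch 2.x = number of crawl rounds, not a config key.
DEPTH=2
for round in $(seq 1 "$DEPTH"); do
  echo "round $round: generate -> fetch -> parse -> updatedb"
  # bin/nutch generate -topN 1000   # -topN still caps links per round
  # bin/nutch fetch -all
  # bin/nutch parse -all
  # bin/nutch updatedb
done
```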
On Fri, Jun 28, 2013 at 11:32 AM, h b <[email protected]> wrote:

  Ok, so I also got this working with Solr 4 with no errors; I think the
  key was not using a crawl id. I had to comment out the updateLog in
  solrconfig.xml because I got a "_version_"-related error.

  My next question is: my Solr document, and for that matter even the
  HBase value of the html content, is not HTML. It appears that Nutch is
  extracting text only. How do I retain the HTML content "as is"?

On Fri, Jun 28, 2013 at 10:54 AM, Tejas Patil <[email protected]> wrote:

  Kewl !!

  I wonder why "org.apache.solr.common.SolrException: undefined field
  text" happens. Anybody who can throw light on this?

On Fri, Jun 28, 2013 at 10:45 AM, h b <[email protected]> wrote:

  Thanks Tejas. I tried these steps; one step I added was updatedb:

      bin/nutch updatedb

  Just to be consistent with the doc, and your suggestion on some other
  thread, I used Solr 3.6 instead of 4.x. I copied the schema.xml from
  nutch/conf (root level) and started Solr. It failed with:

      SEVERE: org.apache.solr.common.SolrException: undefined field text

  One of the Google threads suggested I ignore this error, so I ignored
  it and indexed anyway. So now I got it to work. Playing some more with
  the queries.

On Fri, Jun 28, 2013 at 9:52 AM, Tejas Patil <[email protected]> wrote:

  The "storage.schema.webpage" seems messed up but I don't have ample
  time now to look into it. Here is what I would suggest to get things
  working:

  [1] Remove all the old data from HBase

  (I assume that HBase is running while you do this)

      cd $HBASE_HOME
      ./bin/hbase shell

  In the HBase shell, use "list" to see all the tables, and delete all
  of those related to Nutch (the ones named *webpage). Remove them using
  the "disable" and "drop" commands. E.g., if one of the tables is
  "webpage", you would run:

      disable 'webpage'
      drop 'webpage'

  [2] Run crawl

  I assume that you have not changed "storage.schema.webpage" in
  nutch-site.xml and nutch-default.xml. If you have, revert it to:

      <property>
        <name>storage.schema.webpage</name>
        <value>webpage</value>
        <description>This value holds the schema name used for Nutch web db.
          Note that Nutch ignores the value in the gora mapping files, and uses
          this as the webpage schema name.
        </description>
      </property>

  Run the crawl commands:

      bin/nutch inject urls/
      bin/nutch generate -topN 50000 -noFilter -adddays 0
      bin/nutch fetch -all -threads 5
      bin/nutch parse -all

  [3] Perform indexing

  I assume that you have Solr set up and NUTCH_HOME/conf/schema.xml
  copied into ${SOLR_HOME}/example/solr/conf/. See bullets 4-6 in [0]
  for details. Start Solr and run the indexing command:

      bin/nutch solrindex $SOLR_URL -all

  [0] : http://wiki.apache.org/nutch/NutchTutorial

  Thanks,
  Tejas

On Thu, Jun 27, 2013 at 1:47 PM, h b <[email protected]> wrote:

  Ok, so Avro did not work quite well for me. I got a test grid with
  HBase, and I started using that for now. All steps ran without errors
  and I see my crawled doc in HBase. However, after running the Solr
  integration and querying Solr, I get back nothing. The index files
  look very tiny. The one thing I noted is a message during almost every
  step:

      13/06/27 20:37:53 INFO store.HBaseStore: Keyclass and nameclass match
      but mismatching table names mappingfile schema is 'webpage' vs actual
      schema 'crawl2_webpage' , assuming they are the same.

  This looks suspicious and I think this is the one causing the Solr
  index to be empty. Googling suggested I should edit nutch-default.xml;
  I tried that and rebuilt the job, but no luck with this message.

On Thu, Jun 27, 2013 at 10:30 AM, h b <[email protected]> wrote:

  Ok, I ran ant, ant jar, and ant job, and that seems to have picked up
  the config changes. Now the inject output shows that it is using
  AvroStore as the Gora storage.

  Now I am getting a NullPointerException:

      java.lang.NullPointerException
        at org.apache.gora.mapreduce.GoraOutputFormat.setOutputPath(GoraOutputFormat.java:70)
        at org.apache.gora.mapreduce.GoraOutputFormat.getRecordWriter(GoraOutputFormat.java:91)
        at org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.<init>(MapTask.java:521)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:636)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323)
        at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1177)
        at org.apache.hadoop.mapred.Child.main(Child.java:264)

  which does not look Nutch related. I will work on this and write back
  if I get stuck on something else, or will write back if I succeed.

On Thu, Jun 27, 2013 at 10:18 AM, h b <[email protected]> wrote:

  Hi Lewis,

  Sorry for missing that one. So I updated the top-level conf and
  rebuilt the job:

      cat ~/nutch/apache-nutch-2.2/conf/nutch-site.xml
      ......
      <property>
        <name>storage.data.store.class</name>
        <value>org.apache.gora.avro.store.AvroStore</value>
      </property>
      ......

      cd ~/nutch/apache-nutch-2.2/
      ant job
      cd ~/nutch/apache-nutch-2.2/runtime/deploy/

      bin/nutch inject urls -crawlId crawl1
      13/06/27 17:12:01 INFO crawl.InjectorJob: InjectorJob: starting at 2013-06-27 17:12:01
      13/06/27 17:12:01 INFO crawl.InjectorJob: InjectorJob: Injecting urlDir: urls
      13/06/27 17:12:01 INFO crawl.InjectorJob: InjectorJob: Using class
      org.apache.gora.memory.store.MemStore as the Gora storage class.

  It still shows me MemStore.

  In the jobtracker I see a [crawl1]inject urls job, but it does not
  have a urls_injected property. I have a db.score.injected of 1.0, but
  I don't think that says anything about urls injected.

On Thu, Jun 27, 2013 at 7:09 AM, Lewis John Mcgibbney <[email protected]> wrote:

  Hi,
  Please re-read my mail. If you are using the deploy directory, e.g.
  running on a Hadoop cluster, then make sure to edit nutch-site.xml
  from within the top-level conf directory, _not_ the conf directory in
  runtime/local. If you look at the ant runtime target in the build
  script you will see the code which generates the runtime directory
  structure. Make changes to conf/nutch-site.xml, build the job jar,
  navigate to runtime/deploy, run the code. It's easier to make the job
  jar and scripts in deploy available to the job tracker.

  You also didn't comment on the counters for the inject job. Do you see
  any?

  Best
  Lewis

On Wednesday, June 26, 2013, h b <[email protected]> wrote:

  Here is an example of what I am saying about the config changes not
  taking effect:

      cd runtime/deploy
      cat ../local/conf/nutch-site.xml
      ......
      <property>
        <name>storage.data.store.class</name>
        <value>org.apache.gora.avro.store.AvroStore</value>
      </property>
      .....

      cd ../..
      ant job
      cd runtime/deploy
      bin/nutch inject urls -crawlId crawl1
      .....
      13/06/27 06:34:29 INFO crawl.InjectorJob: InjectorJob: Using class
      org.apache.gora.memory.store.MemStore as the Gora storage class.
      .....

  So nutch-site.xml was changed to use AvroStore as the storage class,
  the job was rebuilt, and I reran inject, the output of which still
  shows that it is trying to use MemStore.

On Wed, Jun 26, 2013 at 11:05 PM, Lewis John Mcgibbney <[email protected]> wrote:

  The Gora MemStore was introduced to deal predominantly with test
  scenarios. This is justified as the 2.x code is pulled nightly and
  after every commit and tested. It is not thread safe and should not be
  used (until we fix some issues) for any kind of serious deployment.

  From your inject task on the job tracker, you will be able to see
  'urls_injected' counters which represent the number of urls actually
  persisted through Gora into the datastore.

  I understand that HBase is not an option. Gora should also support
  writing the output into Avro sequence files... which can be pumped
  into HDFS. We have done some work on this, so I suppose that right now
  is as good a time as any for you to try it out. Use the default
  datastore as org.apache.gora.avro.store.AvroStore, I think. You can
  double check by looking into gora.properties.

  As a note, you should use nutch-site.xml within the top-level conf
  directory for all your Nutch configuration. You should then create a
  new job jar for use in Hadoop by calling 'ant job' after the changes
  are made.

  hth
  Lewis

On Wednesday, June 26, 2013, h b <[email protected]> wrote:

  The quick responses flowing are very encouraging. Thanks Tejas.

  Tejas, as I mentioned earlier, I actually ran it step by step. So
  first I ran the inject command and then readdb with the dump option,
  and did not see anything in the dump files; that leads me to say that
  the inject did not work. I verified the regex-urlfilter and made sure
  that my url is not getting filtered.

  I agree that the second link is about configuring HBase as a storage
  DB. However, I do not have HBase installed and don't foresee getting
  it installed any time soon, hence using HBase for storage is not an
  option, so I am going to have to stick to Gora with the memory store.

On Wed, Jun 26, 2013 at 10:02 PM, Tejas Patil <[email protected]> wrote:

  On Wed, Jun 26, 2013 at 9:53 PM, h b <[email protected]> wrote:

  > Thanks for the response Lewis. I did read these links; I mostly
  > followed the first link and tried both the 3.2 and 3.3 sections.
  > Using bin/crawl gave me a null pointer exception on Solr, so I
  > figured that I should first deal with getting the crawl part to
  > work and then deal with Solr indexing. Hence I went back to trying
  > it stepwise.

  You should try running the crawl using individual commands and see
  where the problem is. The Nutch tutorial which Lewis pointed you to
  has those commands. Even peeking into the bin/crawl script would also
  help, as it calls the nutch commands.

  > As for the second link, it is more about using HBase as the store
  > instead of Gora. This is not really an option for me yet, because
  > my grid does not have HBase installed yet. Getting it done is not
  > much under my control.

  HBase is one of the datastores supported by Apache Gora. That tutorial
  speaks about how to configure Nutch (actually Gora) to use HBase as a
  backend. So, it's wrong to say that the tutorial was about HBase and
  not Gora.

  > The FAQ link is the one I had not gone through until I checked your
  > response, but I do not find answers to any of my questions
  > (directly/indirectly) in it.

  Ok

  > On Wed, Jun 26, 2013 at 7:44 PM, Lewis John Mcgibbney <

  --
  Lewis

