Binoy, I copied over the schema.xml file for Solr from Nutch's config dir, re-ran the command, and it works.
Thank you very much for your help. Your advice is greatly appreciated.

Tom

On Tue, Feb 23, 2016 at 3:55 AM, Binoy Dalal <[email protected]> wrote:
> There are some issues with your schema.xml file for Solr.
> Did you copy over the schema file from Nutch's config dir to your Solr
> core's conf?
>
> As you can see from the unknown field error, the field meta_description is
> missing from your Solr schema, so when a document with this field is
> indexed, Solr doesn't recognise the meta_description field and throws an
> error.
>
> So either create this field or copy over Nutch's schema file to your Solr
> core's conf.
>
> On Tue, 23 Feb 2016, 14:08 Tom Running <[email protected]> wrote:
> > Here are the Solr and Nutch log files. I found a few errors in these
> > logs but am not quite sure how to fix them.
> > Perhaps I do not quite understand how Nutch, Solr and HBase work
> > together; that is why it is so difficult to get them to work together
> > correctly.
> >
> > How do these three packages work together? This is how I understand it.
> > Please correct me if I am not on the right track.
> >
> > Use Nutch 2.3.1 to crawl data, using HBase (0.98.8) as the database for
> > Nutch's db and Nutch's crawled content, then use nutch solrindex
> > http://localhost:8983/solr/ -all to tell Solr (4.10.3) to index
> > Nutch's crawl data that resides in the HBase database?
> >
> > How does Solr know where to get its data? In this case, the data that we
> > want Solr to use is in the HBase table. Do I have to perform a POST that
> > points to HBase or something of that sort?
> >
> > Thank you for looking into this problem.
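For the archive: creating the missing meta_description field by hand would mean adding a definition along these lines to the Solr core's conf/schema.xml. This is a sketch only; the field type used here is an assumption, and the authoritative definition is the one in the schema.xml that ships with Nutch.

```xml
<!-- Sketch of the missing field definition for the Solr core's schema.xml.
     The type "text_general" is an assumption; copy the exact name, type,
     and attributes from Nutch's conf/schema.xml to be safe. -->
<field name="meta_description" type="text_general" indexed="true" stored="true"/>
```

After editing schema.xml, the core has to be reloaded (or Solr restarted) for the new field to take effect.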
I am going crazy (-:
> >
> > *********** SOLR's log file **************************
> >
> > INFO  - 2016-02-23 02:47:15.951; org.apache.solr.core.QuerySenderListener; QuerySenderListener sending requests to Searcher@6ff86671[collection1] main{StandardDirectoryReader(segments_1:1:nrt)}
> > INFO  - 2016-02-23 02:47:15.951; org.apache.solr.core.QuerySenderListener; QuerySenderListener done.
> > INFO  - 2016-02-23 02:47:15.951; org.apache.solr.core.SolrCore; [collection1] Registered new searcher Searcher@6ff86671[collection1] main{StandardDirectoryReader(segments_1:1:nrt)}
> > INFO  - 2016-02-23 02:47:15.952; org.apache.solr.core.SolrCore; [collection1] Closing main searcher on request.
> > INFO  - 2016-02-23 02:47:15.953; org.apache.solr.servlet.SolrDispatchFilter; [admin] webapp=null path=/admin/cores params={action=RELOAD&_=1456213620483&core=collection1&wt=json} status=0 QTime=1375
> > INFO  - 2016-02-23 02:47:15.986; org.apache.solr.servlet.SolrDispatchFilter; [admin] webapp=null path=/admin/cores params={_=1456213621893&wt=json} status=0 QTime=1
> > [root@localhost logs]# 1959432 [qtp969637605-11] INFO org.apache.solr.update.processor.LogUpdateProcessor - [collection1] webapp=/solr path=/update params={wt=javabin&version=2} {} 0 55
> > 1959433 [qtp969637605-11] ERROR org.apache.solr.core.SolrCore - org.apache.solr.common.SolrException:
> > ERROR: [doc=com.alco.www:http] unknown field 'meta_description'
> >         at org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:185)
> >         at org.apache.solr.update.AddUpdateCommand.getLuceneDocument(AddUpdateCommand.java:78)
> >         at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
> >         at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
> >         at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
> >         at java.lang.Thread.run(Thread.java:745)
> >
> > 1959465 [qtp969637605-11] INFO org.apache.solr.update.processor.LogUpdateProcessor - [collection1] webapp=/solr path=/update params={wt=javabin&version=2} {} 0 2
> > 1959466 [qtp969637605-11] ERROR org.apache.solr.core.SolrCore - org.apache.solr.common.SolrException:
> > ERROR: [doc=com.alco.www:http] unknown field 'meta_description'
> >         at org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:185)
> >         at org.apache.solr.update.AddUpdateCommand.getLuceneDocument(AddUpdateCommand.java:78)
> >
> > ****************************************************************
> >
> > NUTCH (hadoop.log)
> >
> > 2016-02-23 02:58:40,659 INFO  anchor.AnchorIndexingFilter - Anchor deduplication is: off
> > 2016-02-23 02:58:40,659 INFO  indexer.IndexingFilters - Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter
> > 2016-02-23 02:58:40,659 INFO  indexer.IndexingFilters - Adding org.apache.nutch.indexer.metadata.MetadataIndexer
> > 2016-02-23 02:58:40,701 WARN  store.HBaseStore - Mismatching schema's names. Mappingfile schema: 'webpage'. PersistentClass schema's name: 'webpage_webpage'. Assuming they are the same.
> > 2016-02-23 02:58:41,158 INFO  solr.SolrIndexWriter - Adding 1 documents
> > 2016-02-23 02:58:41,574 INFO  solr.SolrIndexWriter - Adding 1 documents
> > 2016-02-23 02:58:41,614 WARN  mapred.LocalJobRunner - job_local1682084779_0001
> > java.lang.Exception: org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: ERROR: [doc=com.galco.www:http] unknown field 'meta_description'
> >         at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
> >         at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:522)
> > Caused by: org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: ERROR: [doc=com.galco.www:http] unknown field 'meta_description'
> >         at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:491)
> >         at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:197)
> >         at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
> >         at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:68)
> >         at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:54)
> >         at org.apache.nutch.indexwriter.solr.SolrIndexWriter.close(SolrIndexWriter.java:97)
> >         at org.apache.nutch.indexer.IndexWriters.close(IndexWriters.java:114)
> >         at org.apache.nutch.indexer.IndexerOutputFormat$1.close(IndexerOutputFormat.java:54)
> >         at org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.close(MapTask.java:647)
> >         at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:770)
> >         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
> >         at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:243)
> >         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
> >         at java.util.concurrent.FutureTask.run(FutureTask.java:262)
> >         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> >         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> >         at java.lang.Thread.run(Thread.java:745)
> > 2016-02-23 02:58:41,653 ERROR indexer.IndexingJob - SolrIndexerJob: java.lang.RuntimeException: job failed: name=[webpage]Indexer, jobid=job_local1682084779_0001
> >         at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:119)
> >         at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:154)
> >         at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:176)
> >         at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:202)
> >         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
> >         at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:211)
> >
> > On Mon, Feb 22, 2016 at 9:52 PM, Binoy Dalal <[email protected]> wrote:
> > > What errors do you see in hadoop.log and solr's solr.log?
> > > Post that stack trace.
> > >
> > > On Tue, 23 Feb 2016, 07:29 Tom Running <[email protected]> wrote:
> > > > I got errors when running this command:
> > > > ./crawl ../urls/ TestCrawler http://localhost:8983/solr 1
> > > > Any idea where to go from here?
> > > >
> > > > Thank you.
> > > > Tom
> > > >
> > > > ParserJob: finished at 2016-02-22 20:22:47, time elapsed: 00:00:11
> > > > CrawlDB update for TestCrawler
> > > > /home/nutch/runtime/local/bin/nutch updatedb -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true 1456190521-30847 -crawlId TestCrawler
> > > > DbUpdaterJob: starting at 2016-02-22 20:22:49
> > > > DbUpdaterJob: batchId: 1456190521-30847
> > > > DbUpdaterJob: finished at 2016-02-22 20:22:58, time elapsed: 00:00:09
> > > > Indexing TestCrawler on SOLR index -> http://localhost:8983/solr
> > > > /home/nutch/runtime/local/bin/nutch index -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true -D solr.server.url=http://localhost:8983/solr -all -crawlId TestCrawler
> > > > IndexingJob: starting
> > > > SolrIndexerJob: java.lang.RuntimeException: job failed: name=[TestCrawler]Indexer, jobid=job_local1592190856_0001
> > > >         at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:119)
> > > >         at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:154)
> > > >         at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:176)
> > > >         at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:202)
> > > >         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
> > > >         at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:211)
> > > >
> > > > Error running: /home/nutch/runtime/local/bin/nutch index -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true -D solr.server.url=http://localhost:8983/solr -all -crawlId TestCrawler
> > > > Failed with exit value 255.
> > > >
> > > > On Mon, Feb 22, 2016 at 1:58 AM, Binoy Dalal <[email protected]> wrote:
> > > > > CrawlID: put any name
> > > > > Number of rounds: >=1
> > > > > Seed dir and Solr URL are correct as you have them
> > > > >
> > > > > On Mon, 22 Feb 2016, 12:17 Tom Running <[email protected]> wrote:
> > > > > > Binoy,
> > > > > >
> > > > > > I do see the information on the console and also a lot of information in hbase.
> > > > > >
> > > > > > I tried ./crawl but am not quite sure where to locate the following information:
> > > > > >
> > > > > > Usage: crawl <seedDir> <crawlID> [<solrUrl>] <numberOfRounds>
> > > > > >
> > > > > > seedDir          ../urls/seed.txt ?
> > > > > > crawlID          ?
> > > > > > solrUrl          I am guessing this will be http://localhost:8983/solr/
> > > > > > numberOfRounds   ?
> > > > > >
> > > > > > Could you provide some advice on how to determine the above information?
> > > > > >
> > > > > > Thanks,
> > > > > > Tom
> > > > > >
> > > > > > On Mon, Feb 22, 2016 at 1:19 AM, Binoy Dalal <[email protected]> wrote:
> > > > > > > When you run the inject and generate commands, do you see your site
> > > > > > > being added in the console output?
> > > > > > > Also, while fetching and parsing you should be able to see the number of
> > > > > > > successful fetch and parse actions in your console. Ideally this should
> > > > > > > be equal to or more than the number of sites you've put in the seed.txt
> > > > > > > file.
> > > > > > > If this is not the case then there is some issue with either your
> > > > > > > seed.txt file or the regex-urlfilter file.
> > > > > > >
> > > > > > > While running the crawl command, you don't need to index to solr
> > > > > > > separately. The command will do it for you.
> > > > > > > Run ./crawl to see usage instructions.
> > > > > > >
> > > > > > > On Mon, 22 Feb 2016, 11:41 Tom Running <[email protected]> wrote:
> > > > > > > > Yes, I did run these before running ./nutch solrindex
> > > > > > > > http://localhost:8983/solr/ -all and got nothing.
> > > > > > > >
> > > > > > > > From /home/nutch/runtime/local/bin/
> > > > > > > >
> > > > > > > > ./nutch inject ../urls/seed.txt
> > > > > > > > ./nutch readdb
> > > > > > > > ./nutch generate -topN 2500
> > > > > > > > ./nutch fetch -all
> > > > > > > > ./nutch parse -all
> > > > > > > > ./nutch updatedb
> > > > > > > >
> > > > > > > > I did not run the crawl command.
> > > > > > > >
> > > > > > > > Would I just run ./crawl ??
> > > > > > > > Then run this again: ./nutch solrindex http://localhost:8983/solr/ -all
> > > > > > > >
> > > > > > > > Thank you very much for responding to my questions.
> > > > > > > >
> > > > > > > > Tom
> > > > > > > >
> > > > > > > > On Sun, Feb 21, 2016 at 11:25 PM, Binoy Dalal <[email protected]> wrote:
> > > > > > > > > Just to be clear, you did run the preceding nutch commands to
> > > > > > > > > inject, generate, fetch and parse the URLs, right?
> > > > > > > > >
> > > > > > > > > Additionally, try the ./crawl command to directly crawl and index
> > > > > > > > > everything to solr without having to manually run all the steps.
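Putting the usage string and Binoy's parameter answers together, an invocation of the crawl script looks like the following. All the values here are examples taken from or implied by the thread, not a prescribed command:

```
# Usage: crawl <seedDir> <crawlID> [<solrUrl>] <numberOfRounds>
#   seedDir          directory containing seed.txt (here: ../urls/)
#   crawlID          any identifier you choose (here: TestCrawler)
#   solrUrl          the Solr endpoint to index into
#   numberOfRounds   how many generate/fetch/parse/updatedb cycles to run (>=1)
./crawl ../urls/ TestCrawler http://localhost:8983/solr/ 2
```

The script runs inject once, then repeats generate, fetch, parse, and updatedb for each round, and finally indexes to the given Solr URL, which is why a separate solrindex run is unnecessary.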
> > > > > > > > >
> > > > > > > > > On Mon, 22 Feb 2016, 07:24 Tom Running <[email protected]> wrote:
> > > > > > > > > > I am trying to get Nutch to run solrindex and am having a problem.
> > > > > > > > > > I am using the instructions from this document:
> > > > > > > > > > http://wiki.apache.org/nutch/Nutch2Tutorial. Everything works
> > > > > > > > > > except when I run the following command:
> > > > > > > > > >
> > > > > > > > > > ./nutch solrindex http://localhost:8983/solr -all
> > > > > > > > > >
> > > > > > > > > > ****** It came back with the following info ******
> > > > > > > > > > ****** It seems to have a problem with indexing ******
> > > > > > > > > > IndexingJob: starting
> > > > > > > > > > Active IndexWriters :
> > > > > > > > > > SOLRIndexWriter
> > > > > > > > > >         solr.server.url : URL of the SOLR instance (mandatory)
> > > > > > > > > >         solr.commit.size : buffer size when sending to SOLR (default 1000)
> > > > > > > > > >         solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
> > > > > > > > > >         solr.auth : use authentication (default false)
> > > > > > > > > >         solr.auth.username : username for authentication
> > > > > > > > > >         solr.auth.password : password for authentication
> > > > > > > > > > IndexingJob: done.
> > > > > > > > > >
> > > > > > > > > > When I launch the Solr web UI, I cannot query or find anything
> > > > > > > > > > under the default collection1 or the gettingstarted_shard1_replica1
> > > > > > > > > > or gettingstarted_shard2_replica1 collections.
> > > > > > > > > >
> > > > > > > > > > I have also tried this option (with collection1) and still am not
> > > > > > > > > > able to query anything:
> > > > > > > > > > ./nutch solrindex http://localhost:8983/solr/collection1 -all
> > > > > > > > > >
> > > > > > > > > > After downloading SOLR 4.10.3 I started it as is with the command
> > > > > > > > > > /home/solr/bin/solr start -e cloud -noprompt
> > > > > > > > > >
> > > > > > > > > > I did not modify any configuration files or post any files or
> > > > > > > > > > directories from within SOLR. I am assuming this command, ./nutch
> > > > > > > > > > solrindex http://localhost:8983/solr/collection1 -all, will do all
> > > > > > > > > > the posting and indexing for SOLR.
> > > > > > > > > >
> > > > > > > > > > Any idea what I am missing here? Any advice on where to go from
> > > > > > > > > > here would be greatly appreciated.
> > > > > > > > > >
> > > > > > > > > > I did try copying /nutch/runtime/local/conf/*.* into SOLR and it
> > > > > > > > > > did not make any difference.
> > > > > > > > > >
> > > > > > > > > > Thank you.
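One note on the cloud example mentioned above: the shard names gettingstarted_shard1_replica1 and gettingstarted_shard2_replica1 suggest the cloud start created a collection named gettingstarted, so pointing solrindex at the bare root URL (or at a core that does not exist) would not land documents in it. Assuming that example collection name, targeting it explicitly would look like:

```
# Hypothetical: index into the collection created by the cloud example.
# The collection name "gettingstarted" is inferred from the shard names
# mentioned in the thread; adjust it to match your actual collection.
./nutch solrindex http://localhost:8983/solr/gettingstarted -all
```

The schema fix (adding meta_description, or copying Nutch's schema.xml into the collection's conf) is still required regardless of which collection is targeted.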
> > > > > > > > > > Tom
> > > > > > > > > >
> > > > > > > > > --
> > > > > > > > > Regards,
> > > > > > > > > Binoy Dalal
> > > > > > >
> > > > > > > --
> > > > > > > Regards,
> > > > > > > Binoy Dalal
> > > > >
> > > > > --
> > > > > Regards,
> > > > > Binoy Dalal
> > >
> > > --
> > > Regards,
> > > Binoy Dalal
>
> --
> Regards,
> Binoy Dalal

