Here are the Solr and Nutch log files. I found a few errors in these logs and am not quite sure how to fix them. Perhaps I do not quite understand how Nutch, Solr and HBase work together; that is why it is so difficult for me to get them working together correctly.
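A note on the "unknown field 'meta_description'" errors in the logs below: Solr 4.x rejects any document containing a field that is not declared in the core's schema.xml, so one common fix is to declare the field that Nutch's MetadataIndexer emits. A hedged sketch of the schema.xml entry (the field name is taken from the log; the type and options are assumptions that depend on your schema):

```xml
<!-- In collection1's conf/schema.xml, inside the <fields> section.          -->
<!-- "meta_description" matches the field named in the SolrException below; -->
<!-- "text_general" is an assumed type present in the stock 4.10 schema.    -->
<field name="meta_description" type="text_general" indexed="true" stored="true"/>
```

After editing the schema, reload the core (or restart Solr) and rerun the index step. Alternatively, a dynamic field such as `<dynamicField name="meta_*" type="text_general" indexed="true" stored="true"/>` would catch any metadata field Nutch emits without declaring each one.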
How do these three packages work together? This is how I understand it; please correct me if I am not on the right track: use Nutch 2.3.1 to crawl data, use HBase (0.98.8) as the backing store for Nutch's crawl database and crawled content, and use `nutch solrindex http://localhost:8983/solr/ -all` to tell Solr (4.10.3) to index Nutch's crawl data that resides in the HBase database? How does Solr know where to get its data? In this case, the data we want Solr to use is in the HBase table. Do I have to perform a POST that points at HBase, or something of that sort? Thank you for looking into this problem. I am going crazy (-:

*********** SOLR's log file **************************
INFO  - 2016-02-23 02:47:15.951; org.apache.solr.core.QuerySenderListener; QuerySenderListener sending requests to Searcher@6ff86671[collection1] main{StandardDirectoryReader(segments_1:1:nrt)}
INFO  - 2016-02-23 02:47:15.951; org.apache.solr.core.QuerySenderListener; QuerySenderListener done.
INFO  - 2016-02-23 02:47:15.951; org.apache.solr.core.SolrCore; [collection1] Registered new searcher Searcher@6ff86671[collection1] main{StandardDirectoryReader(segments_1:1:nrt)}
INFO  - 2016-02-23 02:47:15.952; org.apache.solr.core.SolrCore; [collection1] Closing main searcher on request.
INFO  - 2016-02-23 02:47:15.953; org.apache.solr.servlet.SolrDispatchFilter; [admin] webapp=null path=/admin/cores params={action=RELOAD&_=1456213620483&core=collection1&wt=json} status=0 QTime=1375
INFO  - 2016-02-23 02:47:15.986; org.apache.solr.servlet.SolrDispatchFilter; [admin] webapp=null path=/admin/cores params={_=1456213621893&wt=json} status=0 QTime=1
[root@localhost logs]#
1959432 [qtp969637605-11] INFO  org.apache.solr.update.processor.LogUpdateProcessor - [collection1] webapp=/solr path=/update params={wt=javabin&version=2} {} 0 55
1959433 [qtp969637605-11] ERROR org.apache.solr.core.SolrCore - org.apache.solr.common.SolrException: ERROR: [doc=com.alco.www:http] unknown field 'meta_description'
        at org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:185)
        at org.apache.solr.update.AddUpdateCommand.getLuceneDocument(AddUpdateCommand.java:78)
        at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
        at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
        at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
        at java.lang.Thread.run(Thread.java:745)
1959465 [qtp969637605-11] INFO  org.apache.solr.update.processor.LogUpdateProcessor - [collection1] webapp=/solr path=/update params={wt=javabin&version=2} {} 0 2
1959466 [qtp969637605-11] ERROR org.apache.solr.core.SolrCore - org.apache.solr.common.SolrException: ERROR: [doc=com.alco.www:http] unknown field 'meta_description'
        at org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:185)
        at org.apache.solr.update.AddUpdateCommand.getLuceneDocument(AddUpdateCommand.java:78)

**************** NUTCH (hadoop.log) ****************
2016-02-23 02:58:40,659 INFO  anchor.AnchorIndexingFilter - Anchor deduplication is: off
2016-02-23 02:58:40,659 INFO  indexer.IndexingFilters - Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter
2016-02-23 02:58:40,659 INFO  indexer.IndexingFilters - Adding org.apache.nutch.indexer.metadata.MetadataIndexer
2016-02-23 02:58:40,701 WARN  store.HBaseStore - Mismatching schema's names. Mappingfile schema: 'webpage'. PersistentClass schema's name: 'webpage_webpage'. Assuming they are the same.
2016-02-23 02:58:41,158 INFO  solr.SolrIndexWriter - Adding 1 documents
2016-02-23 02:58:41,574 INFO  solr.SolrIndexWriter - Adding 1 documents
2016-02-23 02:58:41,614 WARN  mapred.LocalJobRunner - job_local1682084779_0001
java.lang.Exception: org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: ERROR: [doc=com.galco.www:http] unknown field 'meta_description'
        at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:522)
Caused by: org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: ERROR: [doc=com.galco.www:http] unknown field 'meta_description'
        at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:491)
        at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:197)
        at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
        at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:68)
        at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:54)
        at org.apache.nutch.indexwriter.solr.SolrIndexWriter.close(SolrIndexWriter.java:97)
        at org.apache.nutch.indexer.IndexWriters.close(IndexWriters.java:114)
        at org.apache.nutch.indexer.IndexerOutputFormat$1.close(IndexerOutputFormat.java:54)
        at org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.close(MapTask.java:647)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:770)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
        at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:243)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
2016-02-23 02:58:41,653 ERROR indexer.IndexingJob - SolrIndexerJob: java.lang.RuntimeException: job failed: name=[webpage]Indexer, jobid=job_local1682084779_0001
        at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:119)
        at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:154)
        at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:176)
        at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:202)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
        at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:211)

On Mon, Feb 22, 2016 at 9:52 PM, Binoy Dalal <[email protected]> wrote:

> What errors do you see in hadoop.log and solr's solr.log?
> Post that stack trace.
>
> On Tue, 23 Feb 2016, 07:29 Tom Running <[email protected]> wrote:
>
> > Got errors when running this command:
> > ./crawl ../urls/ TestCrawler http://localhost:8983/solr 1
> > Have any idea where to go from here?
> >
> > thank you.
> > Tom
> >
> >
> > ParserJob: finished at 2016-02-22 20:22:47, time elapsed: 00:00:11
> > CrawlDB update for TestCrawler
> > /home/nutch/runtime/local/bin/nutch updatedb -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true 1456190521-30847 -crawlId TestCrawler
> > DbUpdaterJob: starting at 2016-02-22 20:22:49
> > DbUpdaterJob: batchId: 1456190521-30847
> > DbUpdaterJob: finished at 2016-02-22 20:22:58, time elapsed: 00:00:09
> > Indexing TestCrawler on SOLR index -> http://localhost:8983/solr
> > /home/nutch/runtime/local/bin/nutch index -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true -D solr.server.url=http://localhost:8983/solr -all -crawlId TestCrawler
> > IndexingJob: starting
> > SolrIndexerJob: java.lang.RuntimeException: job failed: name=[TestCrawler]Indexer, jobid=job_local1592190856_0001
> >         at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:119)
> >         at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:154)
> >         at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:176)
> >         at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:202)
> >         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
> >         at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:211)
> >
> > Error running: /home/nutch/runtime/local/bin/nutch index -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true -D solr.server.url=http://localhost:8983/solr -all -crawlId TestCrawler
> > Failed with exit value 255.
> >
> > On Mon, Feb 22, 2016 at 1:58 AM, Binoy Dalal <[email protected]> wrote:
> >
> > > CrawlID: put any number
> > > Number of rounds: >=1
> > > Seed dir and solr URL are proper
> > >
> > > On Mon, 22 Feb 2016, 12:17 Tom Running <[email protected]> wrote:
> > >
> > > > Binoy,
> > > >
> > > > I do see the information on the console and also a lot of information in hbase.
> > > >
> > > > I tried ./crawl but am not quite sure where to locate the following information:
> > > >
> > > > Usage: crawl <seedDir> <crawlID> [<solrUrl>] <numberOfRounds>
> > > >
> > > > seedDir ../urls/seed.txt ?
> > > > crawlID ?
> > > > solrUrl I am guessing this will be http://localhost:8983/solr/
> > > > numberOfRounds ?
> > > >
> > > > Could you provide some advice on how to determine the above information?
> > > >
> > > > Thanks,
> > > > Tom
> > > >
> > > > On Mon, Feb 22, 2016 at 1:19 AM, Binoy Dalal <[email protected]> wrote:
> > > >
> > > > > When you run the inject and generate commands, in the console output do you see your site being added?
> > > > > Also, while fetching and parsing you should be able to see the number of successful fetch and parse actions in your console. Ideally this should be equal to or more than the number of sites you've put in the seed.txt file.
> > > > > If this is not the case then there is some issue with either your seed.txt file or the regex-urlfilter file.
> > > > >
> > > > > While running the crawl command, you don't need to index to solr separately. The command will do it for you.
> > > > > Run ./crawl to see usage instructions.
> > > > >
> > > > > On Mon, 22 Feb 2016, 11:41 Tom Running <[email protected]> wrote:
> > > > >
> > > > > > Yes, I did run these before running ./nutch solrindex http://localhost:8983/solr/ -all and got nothing.
> > > > > >
> > > > > > From /home/nutch/runtime/local/bin/ :
> > > > > >
> > > > > > ./nutch inject ../urls/seed.txt
> > > > > > ./nutch readdb
> > > > > > ./nutch generate -topN 2500
> > > > > > ./nutch fetch -all
> > > > > > ./nutch parse -all
> > > > > > ./nutch updatedb
> > > > > >
> > > > > > Did not run the crawl command.
> > > > > >
> > > > > > Would I just run ./crawl ??
> > > > > > Then run this again: ./nutch solrindex http://localhost:8983/solr/ -all
> > > > > >
> > > > > > Thank you very much for responding to my questions.
> > > > > >
> > > > > > Tom
> > > > > >
> > > > > > On Sun, Feb 21, 2016 at 11:25 PM, Binoy Dalal <[email protected]> wrote:
> > > > > >
> > > > > > > Just to be clear, you did run the preceding nutch commands to inject, generate, fetch and parse the URLs, right?
> > > > > > >
> > > > > > > Additionally, try the ./crawl command to directly crawl and index everything to solr without having to manually run all the steps.
> > > > > > >
> > > > > > > On Mon, 22 Feb 2016, 07:24 Tom Running <[email protected]> wrote:
> > > > > > >
> > > > > > > > I am trying to get Nutch to run solrindex and am having a problem. I am following the instructions from this document: http://wiki.apache.org/nutch/Nutch2Tutorial. Everything works except when I ran the following command.
> > > > > > > >
> > > > > > > > ./nutch solrindex http://localhost:8983/solr -all
> > > > > > > >
> > > > > > > > ****** it came back with the following info *****
> > > > > > > > ****** It seems to have a problem with indexing ****
> > > > > > > > IndexingJob: starting
> > > > > > > > Active IndexWriters :
> > > > > > > > SOLRIndexWriter
> > > > > > > >         solr.server.url : URL of the SOLR instance (mandatory)
> > > > > > > >         solr.commit.size : buffer size when sending to SOLR (default 1000)
> > > > > > > >         solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
> > > > > > > >         solr.auth : use authentication (default false)
> > > > > > > >         solr.auth.username : username for authentication
> > > > > > > >         solr.auth.password : password for authentication
> > > > > > > > IndexingJob: done.
> > > > > > > >
> > > > > > > > When I launch the SOLR Web UI I cannot query or find anything under the default collection1 or gettingstarted_shard1_replica1 or gettingstarted_shard2_replica1.
> > > > > > > >
> > > > > > > > I have also tried this option (with collection1) and am still not able to query anything:
> > > > > > > > ./nutch solrindex http://localhost:8983/solr/collection1 -all
> > > > > > > >
> > > > > > > > After downloading SOLR 4.10.3, I started it as-is with the command:
> > > > > > > > /home/solr/bin/solr start -e cloud -noprompt
> > > > > > > >
> > > > > > > > I did not modify any configuration file, nor did I post any file or directory from within SOLR.
> > > > > > > > I am assuming the command ./nutch solrindex http://localhost:8983/solr/collection1 will do all the posting and indexing for SOLR.
> > > > > > > >
> > > > > > > > Any idea what I am missing here? Any advice on where to go from here would be greatly appreciated.
> > > > > > > >
> > > > > > > > I did try copying /nutch/runtime/local/conf/*.* into SOLR and it did not make any difference.
> > > > > > > >
> > > > > > > > Thank you.
> > > > > > > >
> > > > > > > > Tom
> > > > > > >
> > > > > > > --
> > > > > > > Regards,
> > > > > > > Binoy Dalal
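For reference, the solr.mapping.file mentioned in the IndexWriter output above points at Nutch's conf/solrindex-mapping.xml, which controls how Nutch field names are mapped onto Solr field names before the documents are sent to Solr's update handler. That answers the POST question: no manual POST against HBase is needed; Nutch's indexing job reads the crawled pages from HBase (via Gora) and POSTs the documents to Solr itself, as the SolrIndexWriter/HttpSolrServer stack trace shows. A hedged sketch of what such a mapping file looks like (the meta_description line is an illustrative assumption, not a stock entry):

```xml
<!-- conf/solrindex-mapping.xml: maps Nutch document fields (source)  -->
<!-- onto Solr schema fields (dest). Every dest field must exist in   -->
<!-- Solr's schema.xml, or indexing fails with "unknown field".       -->
<mapping>
  <fields>
    <field dest="content" source="content"/>
    <field dest="title" source="title"/>
    <field dest="host" source="host"/>
    <field dest="url" source="url"/>
    <!-- illustrative assumption: requires meta_description in schema.xml -->
    <field dest="meta_description" source="meta_description"/>
  </fields>
  <uniqueKey>id</uniqueKey>
</mapping>
```

Keeping this file and Solr's schema.xml in agreement is what makes the solrindex step succeed or fail.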

