Here are the Solr and Nutch log files. I found a few errors in these logs and am not quite sure how to fix them. Perhaps I do not quite understand how Nutch, Solr and HBase work together; that is why it is so difficult for me to get them working together correctly.
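A note on the "unknown field 'meta_description'" errors in the logs below: Solr 4.x rejects any document containing a field that is not declared in the core's schema.xml, so one common fix is to declare the field that Nutch's MetadataIndexer emits. A hedged sketch of the schema.xml entry (the field name is taken from the log; the type and options are assumptions that depend on your schema):

```xml
<!-- In collection1's conf/schema.xml, inside the <fields> section.          -->
<!-- "meta_description" matches the field named in the SolrException below; -->
<!-- "text_general" is an assumed type present in the stock 4.10 schema.    -->
<field name="meta_description" type="text_general" indexed="true" stored="true"/>
```

After editing the schema, reload the core (or restart Solr) and rerun the index step. Alternatively, a dynamic field such as `<dynamicField name="meta_*" type="text_general" indexed="true" stored="true"/>` would catch any metadata field Nutch emits without declaring each one.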
How do these three packages work together? This is how I understand it; please correct me if I am not on the right track: use Nutch 2.3.1 to crawl data, use HBase (0.98.8) as the backing store for Nutch's crawl database and crawled content, and use `nutch solrindex http://localhost:8983/solr/ -all` to tell Solr (4.10.3) to index Nutch's crawl data that resides in the HBase database? How does Solr know where to get its data? In this case, the data we want Solr to use is in the HBase table. Do I have to perform a POST that points at HBase, or something of that sort? Thank you for looking into this problem. I am going crazy (-:

*********** SOLR's log file **************************
INFO  - 2016-02-23 02:47:15.951; org.apache.solr.core.QuerySenderListener; QuerySenderListener sending requests to Searcher@6ff86671[collection1] main{StandardDirectoryReader(segments_1:1:nrt)}
INFO  - 2016-02-23 02:47:15.951; org.apache.solr.core.QuerySenderListener; QuerySenderListener done.
INFO  - 2016-02-23 02:47:15.951; org.apache.solr.core.SolrCore; [collection1] Registered new searcher Searcher@6ff86671[collection1] main{StandardDirectoryReader(segments_1:1:nrt)}
INFO  - 2016-02-23 02:47:15.952; org.apache.solr.core.SolrCore; [collection1] Closing main searcher on request.
INFO  - 2016-02-23 02:47:15.953; org.apache.solr.servlet.SolrDispatchFilter; [admin] webapp=null path=/admin/cores params={action=RELOAD&_=1456213620483&core=collection1&wt=json} status=0 QTime=1375
INFO  - 2016-02-23 02:47:15.986; org.apache.solr.servlet.SolrDispatchFilter; [admin] webapp=null path=/admin/cores params={_=1456213621893&wt=json} status=0 QTime=1
[root@localhost logs]#
1959432 [qtp969637605-11] INFO  org.apache.solr.update.processor.LogUpdateProcessor - [collection1] webapp=/solr path=/update params={wt=javabin&version=2} {} 0 55
1959433 [qtp969637605-11] ERROR org.apache.solr.core.SolrCore - org.apache.solr.common.SolrException: ERROR: [doc=com.alco.www:http] unknown field 'meta_description'
        at org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:185)
        at org.apache.solr.update.AddUpdateCommand.getLuceneDocument(AddUpdateCommand.java:78)
        at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
        at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
        at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
        at java.lang.Thread.run(Thread.java:745)
1959465 [qtp969637605-11] INFO  org.apache.solr.update.processor.LogUpdateProcessor - [collection1] webapp=/solr path=/update params={wt=javabin&version=2} {} 0 2
1959466 [qtp969637605-11] ERROR org.apache.solr.core.SolrCore - org.apache.solr.common.SolrException: ERROR: [doc=com.alco.www:http] unknown field 'meta_description'
        at org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:185)
        at org.apache.solr.update.AddUpdateCommand.getLuceneDocument(AddUpdateCommand.java:78)

**************** NUTCH (hadoop.log) ****************
2016-02-23 02:58:40,659 INFO  anchor.AnchorIndexingFilter - Anchor deduplication is: off
2016-02-23 02:58:40,659 INFO  indexer.IndexingFilters - Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter
2016-02-23 02:58:40,659 INFO  indexer.IndexingFilters - Adding org.apache.nutch.indexer.metadata.MetadataIndexer
2016-02-23 02:58:40,701 WARN  store.HBaseStore - Mismatching schema's names. Mappingfile schema: 'webpage'. PersistentClass schema's name: 'webpage_webpage'. Assuming they are the same.
2016-02-23 02:58:41,158 INFO  solr.SolrIndexWriter - Adding 1 documents
2016-02-23 02:58:41,574 INFO  solr.SolrIndexWriter - Adding 1 documents
2016-02-23 02:58:41,614 WARN  mapred.LocalJobRunner - job_local1682084779_0001
java.lang.Exception: org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: ERROR: [doc=com.galco.www:http] unknown field 'meta_description'
        at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:522)
Caused by: org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: ERROR: [doc=com.galco.www:http] unknown field 'meta_description'
        at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:491)
        at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:197)
        at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
        at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:68)
        at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:54)
        at org.apache.nutch.indexwriter.solr.SolrIndexWriter.close(SolrIndexWriter.java:97)
        at org.apache.nutch.indexer.IndexWriters.close(IndexWriters.java:114)
        at org.apache.nutch.indexer.IndexerOutputFormat$1.close(IndexerOutputFormat.java:54)
        at org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.close(MapTask.java:647)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:770)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
        at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:243)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
2016-02-23 02:58:41,653 ERROR indexer.IndexingJob - SolrIndexerJob: java.lang.RuntimeException: job failed: name=[webpage]Indexer, jobid=job_local1682084779_0001
        at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:119)
        at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:154)
        at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:176)
        at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:202)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
        at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:211)

On Mon, Feb 22, 2016 at 9:52 PM, Binoy Dalal <[email protected]> wrote:

> What errors do you see in hadoop.log and solr's solr.log?
> Post that stack trace.
>
> On Tue, 23 Feb 2016, 07:29 Tom Running <[email protected]> wrote:
>
> > Got errors when running this command:
> > ./crawl ../urls/ TestCrawler http://localhost:8983/solr 1
> > Have any idea where to go from here?
> >
> > thank you.
> > Tom
> >
> >
> > ParserJob: finished at 2016-02-22 20:22:47, time elapsed: 00:00:11
> > CrawlDB update for TestCrawler
> > /home/nutch/runtime/local/bin/nutch updatedb -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true 1456190521-30847 -crawlId TestCrawler
> > DbUpdaterJob: starting at 2016-02-22 20:22:49
> > DbUpdaterJob: batchId: 1456190521-30847
> > DbUpdaterJob: finished at 2016-02-22 20:22:58, time elapsed: 00:00:09
> > Indexing TestCrawler on SOLR index -> http://localhost:8983/solr
> > /home/nutch/runtime/local/bin/nutch index -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true -D solr.server.url=http://localhost:8983/solr -all -crawlId TestCrawler
> > IndexingJob: starting
> > SolrIndexerJob: java.lang.RuntimeException: job failed: name=[TestCrawler]Indexer, jobid=job_local1592190856_0001
> >         at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:119)
> >         at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:154)
> >         at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:176)
> >         at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:202)
> >         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
> >         at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:211)
> >
> > Error running: /home/nutch/runtime/local/bin/nutch index -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true -D solr.server.url=http://localhost:8983/solr -all -crawlId TestCrawler
> > Failed with exit value 255.
> >
> > On Mon, Feb 22, 2016 at 1:58 AM, Binoy Dalal <[email protected]> wrote:
> >
> > > CrawlID: put any number
> > > Number of rounds: >=1
> > > Seed dir and solr URL are proper
> > >
> > > On Mon, 22 Feb 2016, 12:17 Tom Running <[email protected]> wrote:
> > >
> > > > Binoy,
> > > >
> > > > I do see the information on the console and also a lot of information in hbase.
> > > >
> > > > I tried ./crawl but am not quite sure where to locate the following information:
> > > >
> > > > Usage: crawl <seedDir> <crawlID> [<solrUrl>] <numberOfRounds>
> > > >
> > > > seedDir ../urls/seed.txt ?
> > > > crawlID ?
> > > > solrUrl I am guessing this will be http://localhost:8983/solr/
> > > > numberOfRounds ?
> > > >
> > > > Could you provide some advice on how to determine the above information?
> > > >
> > > > Thanks,
> > > > Tom
> > > >
> > > > On Mon, Feb 22, 2016 at 1:19 AM, Binoy Dalal <[email protected]> wrote:
> > > >
> > > > > When you run the inject and generate commands, in the console output do you see your site being added?
> > > > > Also, while fetching and parsing you should be able to see the number of successful fetch and parse actions in your console. Ideally this should be equal to or more than the number of sites you've put in the seed.txt file.
> > > > > If this is not the case then there is some issue with either your seed.txt file or the regex-urlfilter file.
> > > > >
> > > > > While running the crawl command, you don't need to index to solr separately. The command will do it for you.
> > > > > Run ./crawl to see usage instructions.
> > > > >
> > > > > On Mon, 22 Feb 2016, 11:41 Tom Running <[email protected]> wrote:
> > > > >
> > > > > > Yes, I did run these before running ./nutch solrindex http://localhost:8983/solr/ -all and got nothing.
> > > > > >
> > > > > > From /home/nutch/runtime/local/bin/ :
> > > > > >
> > > > > > ./nutch inject ../urls/seed.txt
> > > > > > ./nutch readdb
> > > > > > ./nutch generate -topN 2500
> > > > > > ./nutch fetch -all
> > > > > > ./nutch parse -all
> > > > > > ./nutch updatedb
> > > > > >
> > > > > > Did not run the crawl command.
> > > > > >
> > > > > > Would I just run ./crawl ??
> > > > > > Then run this again: ./nutch solrindex http://localhost:8983/solr/ -all
> > > > > >
> > > > > > Thank you very much for responding to my questions.
> > > > > >
> > > > > > Tom
> > > > > >
> > > > > > On Sun, Feb 21, 2016 at 11:25 PM, Binoy Dalal <[email protected]> wrote:
> > > > > >
> > > > > > > Just to be clear, you did run the preceding nutch commands to inject, generate, fetch and parse the URLs, right?
> > > > > > >
> > > > > > > Additionally, try the ./crawl command to directly crawl and index everything to solr without having to manually run all the steps.
> > > > > > >
> > > > > > > On Mon, 22 Feb 2016, 07:24 Tom Running <[email protected]> wrote:
> > > > > > >
> > > > > > > > I am trying to get Nutch to run solrindex and am having a problem. I am following the instructions from this document: http://wiki.apache.org/nutch/Nutch2Tutorial. Everything works except when I ran the following command.
> > > > > > > >
> > > > > > > > ./nutch solrindex http://localhost:8983/solr -all
> > > > > > > >
> > > > > > > > ****** it came back with the following info *****
> > > > > > > > ****** It seems to have a problem with indexing ****
> > > > > > > > IndexingJob: starting
> > > > > > > > Active IndexWriters :
> > > > > > > > SOLRIndexWriter
> > > > > > > >         solr.server.url : URL of the SOLR instance (mandatory)
> > > > > > > >         solr.commit.size : buffer size when sending to SOLR (default 1000)
> > > > > > > >         solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
> > > > > > > >         solr.auth : use authentication (default false)
> > > > > > > >         solr.auth.username : username for authentication
> > > > > > > >         solr.auth.password : password for authentication
> > > > > > > > IndexingJob: done.
> > > > > > > >
> > > > > > > > When I launch the SOLR Web UI I cannot query or find anything under the default collection1 or gettingstarted_shard1_replica1 or gettingstarted_shard2_replica1.
> > > > > > > >
> > > > > > > > I have also tried this option (with collection1) and am still not able to query anything:
> > > > > > > > ./nutch solrindex http://localhost:8983/solr/collection1 -all
> > > > > > > >
> > > > > > > > After downloading SOLR 4.10.3, I started it as-is with the command:
> > > > > > > > /home/solr/bin/solr start -e cloud -noprompt
> > > > > > > >
> > > > > > > > I did not modify any configuration file, nor did I post any file or directory from within SOLR.
> > > > > > > > I am assuming the command ./nutch solrindex http://localhost:8983/solr/collection1 will do all the posting and indexing for SOLR.
> > > > > > > >
> > > > > > > > Any idea what I am missing here? Any advice on where to go from here would be greatly appreciated.
> > > > > > > >
> > > > > > > > I did try copying /nutch/runtime/local/conf/*.* into SOLR and it did not make any difference.
> > > > > > > >
> > > > > > > > Thank you.
> > > > > > > >
> > > > > > > > Tom
> > > > > > >
> > > > > > > --
> > > > > > > Regards,
> > > > > > > Binoy Dalal
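For reference, the solr.mapping.file mentioned in the IndexWriter output above points at Nutch's conf/solrindex-mapping.xml, which controls how Nutch field names are mapped onto Solr field names before the documents are sent to Solr's update handler. That answers the POST question: no manual POST against HBase is needed; Nutch's indexing job reads the crawled pages from HBase (via Gora) and POSTs the documents to Solr itself, as the SolrIndexWriter/HttpSolrServer stack trace shows. A hedged sketch of what such a mapping file looks like (the meta_description line is an illustrative assumption, not a stock entry):

```xml
<!-- conf/solrindex-mapping.xml: maps Nutch document fields (source)  -->
<!-- onto Solr schema fields (dest). Every dest field must exist in   -->
<!-- Solr's schema.xml, or indexing fails with "unknown field".       -->
<mapping>
  <fields>
    <field dest="content" source="content"/>
    <field dest="title" source="title"/>
    <field dest="host" source="host"/>
    <field dest="url" source="url"/>
    <!-- illustrative assumption: requires meta_description in schema.xml -->
    <field dest="meta_description" source="meta_description"/>
  </fields>
  <uniqueKey>id</uniqueKey>
</mapping>
```

Keeping this file and Solr's schema.xml in agreement is what makes the solrindex step succeed or fail.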

