Lewis,

Thanks for looking at this. Solr has the newest patched schema and I restarted
Tomcat.

I set DEBUG for SolrIndexerJob in the log4j.properties file:

log4j.logger.org.apache.nutch.indexer.solr.SolrIndexerJob=DEBUG,cmdstdout

>Can I also suggest that you experiment with the crawl script (which
>accompanies the nutch script) instead of using the deprecated crawl
>command.

Where is this script? The bin folder has only the nutch script.

>Perhaps attempt to set the SolrIndexerJob logging to DEBUG and review
>your hadoop.log as well. I can confirm that I was able to get Nutch
>trunk working with a standalone Solr 4.0 multicore server with the
>patch applied just last week.

I am using Nutch 2.1, not trunk. Does that make any difference to the behavior
of the nutch script?
Can you give me the main points, or maybe a script, of the full steps you
followed when you tested this and got it working last week?


I am getting this in hadoop.log:

2012-11-13 10:34:50,466 INFO  solr.SolrIndexerJob - SolrIndexerJob: starting
2012-11-13 10:34:50,805 INFO  plugin.PluginRepository - Plugins: looking
in: /home/eakarsu/searchProject/apache-nutch-2.1/runtime/local/plugins
2012-11-13 10:34:50,867 INFO  plugin.PluginRepository - Plugin
Auto-activation mode: [true]
2012-11-13 10:34:50,867 INFO  plugin.PluginRepository - Registered Plugins:
2012-11-13 10:34:50,867 INFO  plugin.PluginRepository -     the nutch core
extension points (nutch-extensionpoints)
2012-11-13 10:34:50,867 INFO  plugin.PluginRepository -     Basic URL
Normalizer (urlnormalizer-basic)
2012-11-13 10:34:50,868 INFO  plugin.PluginRepository -     Basic Indexing
Filter (index-basic)
2012-11-13 10:34:50,868 INFO  plugin.PluginRepository -     Html Parse
Plug-in (parse-html)
2012-11-13 10:34:50,868 INFO  plugin.PluginRepository -     HTTP Framework
(lib-http)
2012-11-13 10:34:50,868 INFO  plugin.PluginRepository -     Pass-through
URL Normalizer (urlnormalizer-pass)
2012-11-13 10:34:50,868 INFO  plugin.PluginRepository -     Regex URL
Filter (urlfilter-regex)
2012-11-13 10:34:50,868 INFO  plugin.PluginRepository -     Http Protocol
Plug-in (protocol-http)
2012-11-13 10:34:50,868 INFO  plugin.PluginRepository -     Regex URL
Normalizer (urlnormalizer-regex)
2012-11-13 10:34:50,868 INFO  plugin.PluginRepository -     Tika Parser
Plug-in (parse-tika)
2012-11-13 10:34:50,868 INFO  plugin.PluginRepository -     OPIC Scoring
Plug-in (scoring-opic)
2012-11-13 10:34:50,868 INFO  plugin.PluginRepository -     CyberNeko HTML
Parser (lib-nekohtml)
2012-11-13 10:34:50,868 INFO  plugin.PluginRepository -     Anchor Indexing
Filter (index-anchor)
2012-11-13 10:34:50,868 INFO  plugin.PluginRepository -     Regex URL
Filter Framework (lib-regex-filter)
2012-11-13 10:34:50,868 INFO  plugin.PluginRepository - Registered
Extension-Points:
2012-11-13 10:34:50,868 INFO  plugin.PluginRepository -     Nutch URL
Normalizer (org.apache.nutch.net.URLNormalizer)
2012-11-13 10:34:50,868 INFO  plugin.PluginRepository -     Nutch Protocol
(org.apache.nutch.protocol.Protocol)
2012-11-13 10:34:50,868 INFO  plugin.PluginRepository -     Parse Filter
(org.apache.nutch.parse.ParseFilter)
2012-11-13 10:34:50,868 INFO  plugin.PluginRepository -     Nutch URL
Filter (org.apache.nutch.net.URLFilter)
2012-11-13 10:34:50,868 INFO  plugin.PluginRepository -     Nutch Indexing
Filter (org.apache.nutch.indexer.IndexingFilter)
2012-11-13 10:34:50,868 INFO  plugin.PluginRepository -     Nutch Content
Parser (org.apache.nutch.parse.Parser)
2012-11-13 10:34:50,868 INFO  plugin.PluginRepository -     Nutch Scoring
(org.apache.nutch.scoring.ScoringFilter)
2012-11-13 10:34:50,872 INFO  basic.BasicIndexingFilter - Maximum title
length for indexing set to: 100
2012-11-13 10:34:50,872 INFO  indexer.IndexingFilters - Adding
org.apache.nutch.indexer.basic.BasicIndexingFilter
2012-11-13 10:34:50,875 INFO  anchor.AnchorIndexingFilter - Anchor
deduplication is: off
2012-11-13 10:34:50,875 INFO  indexer.IndexingFilters - Adding
org.apache.nutch.indexer.anchor.AnchorIndexingFilter
2012-11-13 10:34:51,891 WARN  util.NativeCodeLoader - Unable to load
native-hadoop library for your platform... using builtin-java classes where
applicable
2012-11-13 10:34:52,765 INFO  mapreduce.GoraRecordReader -
gora.buffer.read.limit = 10000
2012-11-13 10:34:52,818 INFO  solr.SolrMappingReader - source: content
dest: content
2012-11-13 10:34:52,818 INFO  solr.SolrMappingReader - source: site dest:
site
2012-11-13 10:34:52,818 INFO  solr.SolrMappingReader - source: title dest:
title
2012-11-13 10:34:52,818 INFO  solr.SolrMappingReader - source: host dest:
host
2012-11-13 10:34:52,818 INFO  solr.SolrMappingReader - source: segment
dest: segment
2012-11-13 10:34:52,818 INFO  solr.SolrMappingReader - source: boost dest:
boost
2012-11-13 10:34:52,818 INFO  solr.SolrMappingReader - source: digest dest:
digest
2012-11-13 10:34:52,818 INFO  solr.SolrMappingReader - source: tstamp dest:
tstamp
2012-11-13 10:34:52,821 INFO  basic.BasicIndexingFilter - Maximum title
length for indexing set to: 100
2012-11-13 10:34:52,821 INFO  indexer.IndexingFilters - Adding
org.apache.nutch.indexer.basic.BasicIndexingFilter
2012-11-13 10:34:52,821 INFO  anchor.AnchorIndexingFilter - Anchor
deduplication is: off
2012-11-13 10:34:52,821 INFO  indexer.IndexingFilters - Adding
org.apache.nutch.indexer.anchor.AnchorIndexingFilter
2012-11-13 10:34:55,434 WARN  mapred.FileOutputCommitter - Output path is
null in cleanup
2012-11-13 10:34:56,455 ERROR solr.SolrIndexerJob - SolrIndexerJob:
org.apache.solr.common.SolrException: Not Found

Not Found

request: http://localhost:8080/sol40/update
    at
org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:430)
    at
org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:244)
    at
org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:105)
    at org.apache.solr.client.solrj.SolrServer.commit(SolrServer.java:86)
    at org.apache.solr.client.solrj.SolrServer.commit(SolrServer.java:75)
    at
org.apache.nutch.indexer.solr.SolrIndexerJob.indexSolr(SolrIndexerJob.java:60)
    at
org.apache.nutch.indexer.solr.SolrIndexerJob.run(SolrIndexerJob.java:75)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at
org.apache.nutch.indexer.solr.SolrIndexerJob.main(SolrIndexerJob.java:84)
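
The error above is an HTTP 404 on http://localhost:8080/sol40/update. A minimal
sketch of how the update URL could be probed outside of Nutch follows. The
helper names are hypothetical, and the core name "collection1" is only an
assumption about a default multicore layout; it is not confirmed by anything in
this thread.

```python
import urllib.error
import urllib.request


def candidate_update_urls(base_url, cores=("collection1",)):
    """Build the update URL for the base path and for each assumed core name.

    The core names are guesses (hypothetical); replace them with the cores
    actually defined in your solr.xml.
    """
    base = base_url.rstrip("/")
    urls = [base + "/update"]
    urls += [f"{base}/{core}/update" for core in cores]
    return urls


def http_status(url, timeout=5):
    """Return the HTTP status code for a GET on url, or None if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 404.
        return err.code
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, and similar.
        return None
```

Printing http_status(url) for each url in
candidate_update_urls("http://localhost:8080/sol40") should show which path, if
any, answers with something other than 404. That path would then be the one to
pass to the indexing step as the Solr URL.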


On Tue, Nov 13, 2012 at 9:53 AM, Lewis John Mcgibbney <
[email protected]> wrote:

> Hi,
>
> On Tue, Nov 13, 2012 at 2:36 PM, Erol Akarsu <[email protected]> wrote:
> > Lewis,
> >
> > I applied the patch you told me about. I replaced the schema.xml of the
> > Solr 4 installation with schema-solr4.xml. The Solr 4.0 system is up and
> > running and I can see its web page at http://localhost:8080/sol40.
>
> You would need to either rename schema-solr4.xml to schema.xml, then copy
> this to your Tomcat Solr installation before starting/restarting the
> server, or alternatively copy the contents of the newly patched file to
> the existing Solr schema.xml.
>
> >
> > I followed the tutorial blindly. Crawling went fine but it seems very slow
> > compared to before the patch was applied.
>
> Considering the patch only applies to the Solr indexing stage, crawl
> performance should not be affected in the slightest. Especially when
> you are not passing the solr server URL during the crawl phase. Can I
> also suggest that you experiment with the crawl script (which
> accompanies the nutch script) instead of using the deprecated crawl
> command.
>
> Perhaps attempt to set the SolrIndexerJob logging to DEBUG and review
> your hadoop.log as well. I can confirm that I was able to get Nutch
> trunk working with a standalone Solr 4.0 multicore server with the
> patch applied just last week.
>
> As I said, Markus has also suggested some additions to the patch so
> maybe try catching some irregularities... trial and error.
>
> hth
>
> Lewis
>
