Uncommenting <copyField source="rawcontent" dest="text"/> in schema.xml fixed the issue with SOLR.
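For anyone else hitting this: the error means a copyField's source must name a declared field (or a glob). A minimal sketch of the relevant schema.xml pair — the "binary" type for rawcontent is my assumption from the stock 2.x schema, so check it against your copy:

```xml
<!-- rawcontent must be declared before any copyField references it;
     otherwise Solr fails at core load with "copyField source
     :'rawcontent' is not a glob and doesn't match any explicit field
     or dynamicField". -->
<field name="rawcontent" type="binary" indexed="false" stored="true"/>

<!-- This copyField is only valid once its source field above exists. -->
<copyField source="rawcontent" dest="text"/>
```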
Now there are no error messages but also no parsing :(.

My seed.txt:
---------------------------------------------------------------------------
http://intranet.rand.org/eprm/rand-initiated-research/proposals/fy2015/index.html
http://intranet.rand.org/eprm/rand-initiated-research/2015.html
http://intranet.rand.org/eprm/rand-initiated-research/faq.html
http://intranet.rand.org/eprm/rand-initiated-research/index.html
---------------------------------------------------------------------------

My nutch-site.xml:
---------------------------------------------------------------------------
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>nutch Mongo Solr Crawler</value>
  </property>
  <property>
    <name>storage.data.store.class</name>
    <value>org.apache.gora.mongodb.store.MongoStore</value>
    <description>Default class for storing data</description>
  </property>
  <property>
    <name>plugin.includes</name>
    <value>protocol-(http|httpclient)|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic|indexer-solr</value>
    <description>Regular expression naming plugin directory names to
    include. Any plugin not matching this expression is excluded.
    In any case you need at least include the nutch-extensionpoints plugin.
    By default Nutch includes crawling just HTML and plain text via HTTP,
    and basic indexing and search plugins. In order to use HTTPS please
    enable protocol-httpclient, but be aware of possible intermittent
    problems with the underlying commons-httpclient library.
    </description>
  </property>
</configuration>
---------------------------------------------------------------------------

My regex-urlfilter.txt:
---------------------------------------------------------------------------
# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# accept anything else
+.
---------------------------------------------------------------------------

I see these warnings in my hadoop.log:

2015-09-30 17:32:53,466 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2015-09-30 17:32:54,571 WARN conf.Configuration - file:/tmp/hadoop-sdrulea/mapred/staging/sdrulea1728069154/.staging/job_local1728069154_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
2015-09-30 17:32:54,573 WARN conf.Configuration - file:/tmp/hadoop-sdrulea/mapred/staging/sdrulea1728069154/.staging/job_local1728069154_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
2015-09-30 17:32:54,652 WARN conf.Configuration - file:/tmp/hadoop-sdrulea/mapred/local/localRunner/sdrulea/job_local1728069154_0001/job_local1728069154_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
2015-09-30 17:32:54,654 WARN conf.Configuration - file:/tmp/hadoop-sdrulea/mapred/local/localRunner/sdrulea/job_local1728069154_0001/job_local1728069154_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.

Any ideas?

On 9/30/15, 5:21 PM, "Drulea, Sherban" <[email protected]> wrote:

>Hi Lewis,
>
>On 9/30/15, 11:05 AM, "Lewis John Mcgibbney" <[email protected]>
>wrote:
>
>>Hi Sherban,
>>
>>On Wed, Sep 30, 2015 at 6:46 AM, <[email protected]> wrote:
>>
>>> I tried with SOLR 4.9.1.
>>
>>OK. As I said Solr 4.6 is supported but never mind.
>
>OK. I'm using SOLR 4.6.0.
>
>I replaced solr-4.6.0/example/solr/collection1/conf/schema.xml with the
>file from https://github.com/apache/nutch/blob/2.x/conf/schema.xml.
>
>When I start SOLR 4.6.0 with "java -jar start.jar", I get this error:
>
>1094 [coreLoadExecutor-3-thread-1] INFO org.apache.solr.update.SolrIndexConfig IndexWriter infoStream solr logging is enabled
>1097 [coreLoadExecutor-3-thread-1] INFO org.apache.solr.core.SolrConfig Using Lucene MatchVersion: LUCENE_46
>1160 [coreLoadExecutor-3-thread-1] INFO org.apache.solr.core.Config Loaded SolrConfig: solrconfig.xml
>1164 [coreLoadExecutor-3-thread-1] INFO org.apache.solr.schema.IndexSchema Reading Solr Schema from schema.xml
>1176 [coreLoadExecutor-3-thread-1] INFO org.apache.solr.schema.IndexSchema [collection1] Schema name=nutch
>1241 [coreLoadExecutor-3-thread-1] INFO org.apache.solr.schema.IndexSchema default search field in schema is text
>1242 [coreLoadExecutor-3-thread-1] INFO org.apache.solr.schema.IndexSchema query parser default operator is OR
>1242 [coreLoadExecutor-3-thread-1] INFO org.apache.solr.schema.IndexSchema unique key field: id
>1243 [coreLoadExecutor-3-thread-1] ERROR org.apache.solr.core.CoreContainer Unable to create core: collection1
>org.apache.solr.common.SolrException: copyField source :'rawcontent' is
>not a glob and doesn't match any explicit field or dynamicField..
>Schema file is /Users/sdrulea/Downloads/solr-4.6.0/example/solr/collection1/schema.xml
> at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:608)
> at org.apache.solr.schema.IndexSchema.<init>(IndexSchema.java:166)
> at org.apache.solr.schema.IndexSchemaFactory.create(IndexSchemaFactory.java:55)
> at org.apache.solr.schema.IndexSchemaFactory.buildIndexSchema(IndexSchemaFactory.java:69)
> at org.apache.solr.core.CoreContainer.createFromLocal(CoreContainer.java:554)
> at org.apache.solr.core.CoreContainer.create(CoreContainer.java:592)
> at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:271)
> at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:263)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
>Caused by: org.apache.solr.common.SolrException: copyField source
>:'rawcontent' is not a glob and doesn't match any explicit field or
>dynamicField.
> at org.apache.solr.schema.IndexSchema.registerCopyField(IndexSchema.java:855)
> at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:592)
> ...
> 13 more
>
>1245 [coreLoadExecutor-3-thread-1] ERROR org.apache.solr.core.CoreContainer null:org.apache.solr.common.SolrException: Unable to create core: collection1
> at org.apache.solr.core.CoreContainer.recordAndThrow(CoreContainer.java:977)
> at org.apache.solr.core.CoreContainer.create(CoreContainer.java:601)
> at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:271)
> at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:263)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
>Caused by: org.apache.solr.common.SolrException: copyField source
>:'rawcontent' is not a glob and doesn't match any explicit field or
>dynamicField.. Schema file is
>/Users/sdrulea/Downloads/solr-4.6.0/example/solr/collection1/schema.xml
> at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:608)
> at org.apache.solr.schema.IndexSchema.<init>(IndexSchema.java:166)
> at org.apache.solr.schema.IndexSchemaFactory.create(IndexSchemaFactory.java:55)
> at org.apache.solr.schema.IndexSchemaFactory.buildIndexSchema(IndexSchemaFactory.java:69)
> at org.apache.solr.core.CoreContainer.createFromLocal(CoreContainer.java:554)
> at org.apache.solr.core.CoreContainer.create(CoreContainer.java:592)
> ... 8 more
>Caused by: org.apache.solr.common.SolrException: copyField source
>:'rawcontent' is not a glob and doesn't match any explicit field or
>dynamicField.
> at org.apache.solr.schema.IndexSchema.registerCopyField(IndexSchema.java:855)
> at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:592)
> ...
> 13 more
>
>1247 [main] INFO org.apache.solr.servlet.SolrDispatchFilter user.dir=/Users/sdrulea/Downloads/solr-4.6.0/example
>1247 [main] INFO org.apache.solr.servlet.SolrDispatchFilter SolrDispatchFilter.init() done
>1263 [main] INFO org.eclipse.jetty.server.AbstractConnector Started [email protected]:8983
>
>The only changes I made to schema.xml were to comment out the lines with
>"protwords.txt" as the tutorial suggested. Has anyone tested the 2.3.1
>schema.xml with SOLR 4.6.1?
>
>>> I copied /release-2.3.1/runtime/local/conf/schema.xml to
>>> solr-4.9.1/example/solr/collection1/conf/schema.xml
>>
>>Good.
>>
>>> Result of /release-2.3.1/runtime/local/bin/crawl urls method_centers
>>> http://localhost:8983/solr 2
>>>
>>> InjectorJob: total number of urls rejected by filters: 1
>>
>>Notice that your regex urlfilter is rejecting one of your seed URLs.
>
>One of my original URLs ended with "/". I added index.html and that fixed
>the rejection.
>
>InjectorJob: total number of urls rejected by filters: 0
>InjectorJob: total number of urls injected after normalization and
>filtering: 11
>
>>> InjectorJob: total number of urls injected after normalization and
>>> filtering: 5
>>
>>[...snip]
>>
>>> GeneratorJob: generated batch id: 1443556518-1067112789 containing 0 URLs
>>> Generate returned 1 (no new segments created)
>>> Escaping loop: no more URLs to fetch now
>>>
>>> There are 6 URLs in my urls/seeds.txt file. Why does it say 0 URLs?
>>
>>1 was rejected as explained above. Additionally, it seems like there is
>>also an error fetching your seeds and parsing out hyperlinks. I would
>>encourage you to check the early stages of configuring and prepping your
>>crawler. Some configuration is incorrect... possibly more problems with
>>your regex urlfilters.
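To rule the urlfilters in or out, I approximated them outside Nutch. This is only a rough Python sketch of my regex-urlfilter.txt, not the real urlfilter-regex plugin; it assumes Nutch's semantics of trying rules top to bottom, first match wins, '-' rejects, '+' accepts:

```python
import re

# Mirror of my regex-urlfilter.txt rules, in order ('-' reject, '+' accept).
RULES = [
    ('-', r'^(file|ftp|mailto):'),
    ('-', r'\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|'
          r'wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|'
          r'mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$'),
    ('-', r'[?*!@=]'),
    ('-', r'.*(/[^/]+)/[^/]+\1/[^/]+\1/'),
    ('+', r'.'),
]

def passes(url):
    """Return True if the first rule that matches the URL accepts it."""
    for sign, pattern in RULES:
        if re.search(pattern, url):
            return sign == '+'
    return False  # no rule matched: rejected by default

# My seed passes these rules...
assert passes('http://intranet.rand.org/eprm/rand-initiated-research/index.html')
# ...while obvious junk is still rejected:
assert not passes('mailto:[email protected]')
assert not passes('http://example.com/logo.png')
assert not passes('http://example.com/search?q=nutch')
```

Since the seeds pass this approximation, the 0-URL generate batch looks more like a fetch/storage problem than a filter problem.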
>
>My regex-urlfilter.txt is unmodified:
>
># skip file: ftp: and mailto: urls
>-^(file|ftp|mailto):
>
># skip image and other suffixes we can't yet parse
># for a more extensive coverage use the urlfilter-suffix plugin
>-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$
>
># skip URLs containing certain characters as probable queries, etc.
>-[?*!@=]
>
># skip URLs with slash-delimited segment that repeats 3+ times, to break
>loops
>-.*(/[^/]+)/[^/]+\1/[^/]+\1/
>
># accept anything else
>+.
>
>I copied plugin.includes to local/conf/nutch-site.xml. I added httpclient &
>indexer-solr:
>
><property>
>  <name>plugin.includes</name>
>  <value>protocol-(http|httpclient)|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic|indexer-solr</value>
>  <description>Regular expression naming plugin directory names to
>  include. Any plugin not matching this expression is excluded.
>  In any case you need at least include the nutch-extensionpoints plugin.
>  By default Nutch includes crawling just HTML and plain text via HTTP,
>  and basic indexing and search plugins. In order to use HTTPS please
>  enable protocol-httpclient, but be aware of possible intermittent
>  problems with the underlying commons-httpclient library.
>  </description>
></property>
>
>Nutch still doesn't parse any links. Any ideas?
>
>InjectorJob: total number of urls injected after normalization and
>filtering: 11
>
>/Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch fetch -D
>mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D
>mapred.reduce.tasks.speculative.execution=false -D
>mapred.map.tasks.speculative.execution=false -D
>mapred.compress.map.output=true -D fetcher.timelimit.mins=180
>1443657910-4394 -crawlId method_centers -threads 50
>FetcherJob: starting at 2015-09-30 17:05:14
>FetcherJob: batchId: 1443657910-4394
>FetcherJob: threads: 50
>FetcherJob: parsing: false
>FetcherJob: resuming: false
>FetcherJob : timelimit set for : 1443668714323
>Using queue mode : byHost
>Fetcher: threads: 50
>QueueFeeder finished: total 0 records. Hit by time limit :0
>...
>Fetcher: throughput threshold sequence: 5
>0/0 spinwaiting/active, 0 pages, 0 errors, 0.0 0 pages/s, 0 0 kb/s, 0 URLs in 0 queues
>
>-activeThreads=0
>Using queue mode : byHost
>Fetcher: threads: 50
>QueueFeeder finished: total 0 records. Hit by time limit :0
>...
>
>-finishing thread FetcherThread49, activeThreads=0
>0/0 spinwaiting/active, 0 pages, 0 errors, 0.0 0 pages/s, 0 0 kb/s, 0 URLs in 0 queues
>
>Parsing:
>/Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch parse -D
>mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D
>mapred.reduce.tasks.speculative.execution=false -D
>mapred.map.tasks.speculative.execution=false -D
>mapred.compress.map.output=true -D
>mapred.skip.attempts.to.start.skipping=2 -D
>mapred.skip.map.max.skip.records=1 1443657910-4394 -crawlId method_centers
>ParserJob: starting at 2015-09-30 17:05:27
>ParserJob: resuming: false
>ParserJob: forced reparse: false
>ParserJob: batchId: 1443657910-4394
>ParserJob: success
>ParserJob: finished at 2015-09-30 17:05:29, time elapsed: 00:00:02
>CrawlDB update for method_centers
>/Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch updatedb -D
>mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D
>mapred.reduce.tasks.speculative.execution=false -D
>mapred.map.tasks.speculative.execution=false -D
>mapred.compress.map.output=true 1443657910-4394 -crawlId method_centers
>DbUpdaterJob: starting at 2015-09-30 17:05:30
>DbUpdaterJob: batchId: 1443657910-4394
>DbUpdaterJob: finished at 2015-09-30 17:05:32, time elapsed: 00:00:02
>Indexing method_centers on SOLR index -> http://localhost:8983/solr
>/Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch index -D
>mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D
>mapred.reduce.tasks.speculative.execution=false -D
>mapred.map.tasks.speculative.execution=false -D
>mapred.compress.map.output=true -D
>solr.server.url=http://localhost:8983/solr -all -crawlId method_centers
>
>>> The index job worked but there's no data in SOLR. Is there a known good
>>> version of SOLR that works with 2.3.1 schema.xml? Are the tutorial
>>> instructions still valid?
>>
>>No, it did not. It failed. Look at the hadoop.log.
>>Also please look at your solr.log, it will provide you with better insight
>>into what is wrong with your Solr server and what messages are failing.
>>Thanks
>
>The nutch schema.xml doesn't work on my SOLR 4.6.0:
>
>IndexingJob: starting
>No IndexWriters activated - check your configuration
>IndexingJob: done.
>
>SOLR dedup -> http://localhost:8983/solr
>/Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch solrdedup -D
>mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D
>mapred.reduce.tasks.speculative.execution=false -D
>mapred.map.tasks.speculative.execution=false -D
>mapred.compress.map.output=true http://localhost:8983/solr
>Exception in thread "main" org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
>Expected content type application/octet-stream but got text/html;charset=ISO-8859-1. <html>
><head>
><meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"/>
><title>Error 500 {msg=SolrCore 'collection1' is not available due to init
>failure: copyField source :'rawcontent' is not a glob and doesn't match
>any explicit field or dynamicField.. Schema file is
>/Users/sdrulea/Downloads/solr-4.6.0/example/solr/collection1/schema.xml,trace=org.apache.solr.common.SolrException:
>SolrCore 'collection1' is not available due to init failure: copyField
>source :'rawcontent' is not a glob and doesn't match any explicit field
>or dynamicField..
>Schema file is /Users/sdrulea/Downloads/solr-4.6.0/example/solr/collection1/schema.xml
> at org.apache.solr.core.CoreContainer.getCore(CoreContainer.java:818)
> at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:297)
> at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:197)
> at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)
> at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455)
> at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
> at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557)
> at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
> at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1075)
> at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:384)
> at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
> at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1009)
> at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
> at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
> at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
> at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
> at org.eclipse.jetty.server.Server.handle(Server.java:368)
> at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:489)
> at org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)
> at org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:942)
> at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1004)
> at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:640)
> at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
> at org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
> at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
> at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
> at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
> at java.lang.Thread.run(Thread.java:745)
>Caused by: org.apache.solr.common.SolrException: copyField source
>:'rawcontent' is not a glob and doesn't match any explicit field or
>dynamicField.. Schema file is
>/Users/sdrulea/Downloads/solr-4.6.0/example/solr/collection1/schema.xml
> at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:608)
> at org.apache.solr.schema.IndexSchema.<init>(IndexSchema.java:166)
> at org.apache.solr.schema.IndexSchemaFactory.create(IndexSchemaFactory.java:55)
> at org.apache.solr.schema.IndexSchemaFactory.buildIndexSchema(IndexSchemaFactory.java:69)
> at org.apache.solr.core.CoreContainer.createFromLocal(CoreContainer.java:554)
> at org.apache.solr.core.CoreContainer.create(CoreContainer.java:592)
> at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:271)
> at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:263)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> ...
> 1 more
>Caused by: org.apache.solr.common.SolrException: copyField source
>:'rawcontent' is not a glob and doesn't match any explicit field or
>dynamicField.
> at org.apache.solr.schema.IndexSchema.registerCopyField(IndexSchema.java:855)
> at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:592)
> ... 13 more
>
>Cheers,
>Sherban
>
>__________________________________________________________________________
>
>This email message is for the sole use of the intended recipient(s) and
>may contain confidential information. Any unauthorized review, use,
>disclosure or distribution is prohibited. If you are not the intended
>recipient, please contact the sender by reply email and destroy all copies
>of the original message.

