Hi Lewis,
On 9/30/15, 11:05 AM, "Lewis John Mcgibbney" <[email protected]> wrote:

>Hi Sherban,
>
>On Wed, Sep 30, 2015 at 6:46 AM, <[email protected]> wrote:
>>
>> I tried with SOLR 4.9.1.
>>
>
>OK. As I said Solr 4.6 is supported but never mind.

OK. I'm using SOLR 4.6.0. I replaced solr-4.6.0/example/solr/collection1/conf/schema.xml with the file from https://github.com/apache/nutch/blob/2.x/conf/schema.xml. When I start SOLR 4.6.0 with "java -jar start.jar", I get this error:

1094 [coreLoadExecutor-3-thread-1] INFO org.apache.solr.update.SolrIndexConfig IndexWriter infoStream solr logging is enabled
1097 [coreLoadExecutor-3-thread-1] INFO org.apache.solr.core.SolrConfig Using Lucene MatchVersion: LUCENE_46
1160 [coreLoadExecutor-3-thread-1] INFO org.apache.solr.core.Config Loaded SolrConfig: solrconfig.xml
1164 [coreLoadExecutor-3-thread-1] INFO org.apache.solr.schema.IndexSchema Reading Solr Schema from schema.xml
1176 [coreLoadExecutor-3-thread-1] INFO org.apache.solr.schema.IndexSchema [collection1] Schema name=nutch
1241 [coreLoadExecutor-3-thread-1] INFO org.apache.solr.schema.IndexSchema default search field in schema is text
1242 [coreLoadExecutor-3-thread-1] INFO org.apache.solr.schema.IndexSchema query parser default operator is OR
1242 [coreLoadExecutor-3-thread-1] INFO org.apache.solr.schema.IndexSchema unique key field: id
1243 [coreLoadExecutor-3-thread-1] ERROR org.apache.solr.core.CoreContainer Unable to create core: collection1
org.apache.solr.common.SolrException: copyField source :'rawcontent' is not a glob and doesn't match any explicit field or dynamicField.. Schema file is /Users/sdrulea/Downloads/solr-4.6.0/example/solr/collection1/schema.xml
    at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:608)
    at org.apache.solr.schema.IndexSchema.<init>(IndexSchema.java:166)
    at org.apache.solr.schema.IndexSchemaFactory.create(IndexSchemaFactory.java:55)
    at org.apache.solr.schema.IndexSchemaFactory.buildIndexSchema(IndexSchemaFactory.java:69)
    at org.apache.solr.core.CoreContainer.createFromLocal(CoreContainer.java:554)
    at org.apache.solr.core.CoreContainer.create(CoreContainer.java:592)
    at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:271)
    at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:263)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.solr.common.SolrException: copyField source :'rawcontent' is not a glob and doesn't match any explicit field or dynamicField.
    at org.apache.solr.schema.IndexSchema.registerCopyField(IndexSchema.java:855)
    at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:592)
    ... 13 more
1245 [coreLoadExecutor-3-thread-1] ERROR org.apache.solr.core.CoreContainer null:org.apache.solr.common.SolrException: Unable to create core: collection1
    at org.apache.solr.core.CoreContainer.recordAndThrow(CoreContainer.java:977)
    at org.apache.solr.core.CoreContainer.create(CoreContainer.java:601)
    at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:271)
    at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:263)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.solr.common.SolrException: copyField source :'rawcontent' is not a glob and doesn't match any explicit field or dynamicField.. Schema file is /Users/sdrulea/Downloads/solr-4.6.0/example/solr/collection1/schema.xml
    at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:608)
    at org.apache.solr.schema.IndexSchema.<init>(IndexSchema.java:166)
    at org.apache.solr.schema.IndexSchemaFactory.create(IndexSchemaFactory.java:55)
    at org.apache.solr.schema.IndexSchemaFactory.buildIndexSchema(IndexSchemaFactory.java:69)
    at org.apache.solr.core.CoreContainer.createFromLocal(CoreContainer.java:554)
    at org.apache.solr.core.CoreContainer.create(CoreContainer.java:592)
    ... 8 more
Caused by: org.apache.solr.common.SolrException: copyField source :'rawcontent' is not a glob and doesn't match any explicit field or dynamicField.
    at org.apache.solr.schema.IndexSchema.registerCopyField(IndexSchema.java:855)
    at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:592)
    ... 13 more
1247 [main] INFO org.apache.solr.servlet.SolrDispatchFilter user.dir=/Users/sdrulea/Downloads/solr-4.6.0/example
1247 [main] INFO org.apache.solr.servlet.SolrDispatchFilter SolrDispatchFilter.init() done
1263 [main] INFO org.eclipse.jetty.server.AbstractConnector Started [email protected]:8983

The only changes I made to schema.xml were to comment out the lines with "protwords.txt", as the tutorial suggested. Has anyone tested the 2.3.1 schema.xml with SOLR 4.6.1?
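I'm guessing the workaround is either to declare the missing source field or to comment out the copyField that references it, roughly like this (the type, attributes and dest below are placeholders I made up, not values taken from the real 2.3.1 schema):

    <!-- option 1 (sketch): define the missing source field so the copyField can resolve -->
    <field name="rawcontent" type="string" stored="true" indexed="false"/>

    <!-- option 2 (sketch): or comment out the copyField whose source is rawcontent -->
    <!-- <copyField source="rawcontent" dest="..."/> -->

I haven't verified either of these against the schema that ships with 2.3.1, though, so corrections welcome.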
>>
>> I copied /release-2.3.1/runtime/local/conf/schema.xml to
>> solr-4.9.1/example/solr/collection1/conf/schema.xml
>>
>
>Good.
>
>>
>> Result of /release-2.3.1/runtime/local/bin/crawl urls method_centers
>> http://localhost:8983/solr 2
>>
>> InjectorJob: total number of urls rejected by filters: 1
>>
>
>Notice that your regex urlfilter is rejecting one of your seed URLs.

One of my original URLs ended with "/". I added index.html and that fixed the rejection:

InjectorJob: total number of urls rejected by filters: 0
InjectorJob: total number of urls injected after normalization and filtering: 11

>>
>> InjectorJob: total number of urls injected after normalization and
>> filtering: 5
>>
>
>[...snip]
>
>> GeneratorJob: generated batch id: 1443556518-1067112789 containing 0 URLs
>> Generate returned 1 (no new segments created)
>> Escaping loop: no more URLs to fetch now
>>
>> There are 6 URLs in my urls/seeds.txt file. Why does it say 0 URLs?
>>
>
>1 was rejected as explained above. Additionally, it seems like there is
>also an error fetching your seeds and parsing out hyperlinks. I would
>encourage you to check the early stages of configuring and prepping your
>crawler. Some configuration is incorrect... possibly more problems with
>your regex urlfilters.
My regex-urlfilter.txt is unmodified:

# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# accept anything else
+.

I copied plugin.includes to local/conf/nutch-site.xml and added httpclient & indexer-solr:

<property>
  <name>plugin.includes</name>
  <value>protocol-(http|httpclient)|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic|indexer-solr</value>
  <description>Regular expression naming plugin directory names to
  include. Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins. In order to use HTTPS please enable
  protocol-httpclient, but be aware of possible intermittent problems with the
  underlying commons-httpclient library.
  </description>
</property>

Nutch still doesn't parse any links. Any ideas?
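In case it helps to see that property in context, a nutch-site.xml carrying the override looks roughly like this (just a sketch: the http.agent.name value is a placeholder, not a real one, and the description element is trimmed):

    <?xml version="1.0"?>
    <configuration>
      <!-- placeholder value; Nutch expects http.agent.name to be set as well -->
      <property>
        <name>http.agent.name</name>
        <value>MyTestCrawler</value>
      </property>
      <!-- plugin.includes as quoted above, with httpclient and indexer-solr added -->
      <property>
        <name>plugin.includes</name>
        <value>protocol-(http|httpclient)|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic|indexer-solr</value>
      </property>
    </configuration>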
Here is more of the crawl output:

InjectorJob: total number of urls injected after normalization and filtering: 11
/Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch fetch -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true -D fetcher.timelimit.mins=180 1443657910-4394 -crawlId method_centers -threads 50
FetcherJob: starting at 2015-09-30 17:05:14
FetcherJob: batchId: 1443657910-4394
FetcherJob: threads: 50
FetcherJob: parsing: false
FetcherJob: resuming: false
FetcherJob : timelimit set for : 1443668714323
Using queue mode : byHost
Fetcher: threads: 50
QueueFeeder finished: total 0 records. Hit by time limit :0
...
Fetcher: throughput threshold sequence: 5
0/0 spinwaiting/active, 0 pages, 0 errors, 0.0 0 pages/s, 0 0 kb/s, 0 URLs in 0 queues
-activeThreads=0
Using queue mode : byHost
Fetcher: threads: 50
QueueFeeder finished: total 0 records. Hit by time limit :0
...
-finishing thread FetcherThread49, activeThreads=0
0/0 spinwaiting/active, 0 pages, 0 errors, 0.0 0 pages/s, 0 0 kb/s, 0 URLs in 0 queues

Parsing :
/Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch parse -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true -D mapred.skip.attempts.to.start.skipping=2 -D mapred.skip.map.max.skip.records=1 1443657910-4394 -crawlId method_centers
ParserJob: starting at 2015-09-30 17:05:27
ParserJob: resuming: false
ParserJob: forced reparse: false
ParserJob: batchId: 1443657910-4394
ParserJob: success
ParserJob: finished at 2015-09-30 17:05:29, time elapsed: 00:00:02
CrawlDB update for method_centers
/Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch updatedb -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true 1443657910-4394 -crawlId method_centers
DbUpdaterJob: starting at 2015-09-30 17:05:30
DbUpdaterJob: batchId: 1443657910-4394
DbUpdaterJob: finished at 2015-09-30 17:05:32, time elapsed: 00:00:02
Indexing method_centers on SOLR index -> http://localhost:8983/solr
/Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch index -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true -D solr.server.url=http://localhost:8983/solr -all -crawlId method_centers

>>
>> The index job worked but there's no data in SOLR. Is there a known good
>> version of SOLR that works with 2.3.1 schema.xml? Are the tutorial
>> instructions still valid?
>>
>
>No, it did not. It failed. Look at the hadoop.log.
>Also please look at your solr.log, it will provide you with better insight
>into what is wrong with your Solr server and what messages are failing.
>Thanks

The nutch schema.xml doesn't work on my SOLR 4.6.0:

IndexingJob: starting
No IndexWriters activated - check your configuration
IndexingJob: done.

SOLR dedup -> http://localhost:8983/solr
/Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch solrdedup -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true http://localhost:8983/solr
Exception in thread "main" org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Expected content type application/octet-stream but got text/html;charset=ISO-8859-1.
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"/>
<title>Error 500 {msg=SolrCore 'collection1' is not available due to init failure: copyField source :'rawcontent' is not a glob and doesn't match any explicit field or dynamicField.. Schema file is /Users/sdrulea/Downloads/solr-4.6.0/example/solr/collection1/schema.xml,trace=org.apache.solr.common.SolrException: SolrCore 'collection1' is not available due to init failure: copyField source :'rawcontent' is not a glob and doesn't match any explicit field or dynamicField.. Schema file is /Users/sdrulea/Downloads/solr-4.6.0/example/solr/collection1/schema.xml
    at org.apache.solr.core.CoreContainer.getCore(CoreContainer.java:818)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:297)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:197)
    at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)
    at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
    at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557)
    at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
    at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1075)
    at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:384)
    at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
    at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1009)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
    at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
    at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
    at org.eclipse.jetty.server.Server.handle(Server.java:368)
    at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:489)
    at org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)
    at org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:942)
    at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1004)
    at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:640)
    at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
    at org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
    at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
    at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
    at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.solr.common.SolrException: copyField source :'rawcontent' is not a glob and doesn't match any explicit field or dynamicField.. Schema file is /Users/sdrulea/Downloads/solr-4.6.0/example/solr/collection1/schema.xml
    at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:608)
    at org.apache.solr.schema.IndexSchema.<init>(IndexSchema.java:166)
    at org.apache.solr.schema.IndexSchemaFactory.create(IndexSchemaFactory.java:55)
    at org.apache.solr.schema.IndexSchemaFactory.buildIndexSchema(IndexSchemaFactory.java:69)
    at org.apache.solr.core.CoreContainer.createFromLocal(CoreContainer.java:554)
    at org.apache.solr.core.CoreContainer.create(CoreContainer.java:592)
    at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:271)
    at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:263)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    ... 1 more
Caused by: org.apache.solr.common.SolrException: copyField source :'rawcontent' is not a glob and doesn't match any explicit field or dynamicField.
    at org.apache.solr.schema.IndexSchema.registerCopyField(IndexSchema.java:855)
    at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:592)
    ... 13 more

Cheers,
Sherban

