Hi Lewis,
On 9/30/15, 11:05 AM, "Lewis John Mcgibbney" <[email protected]> wrote:

>Hi Sherban,
>
>On Wed, Sep 30, 2015 at 6:46 AM, <[email protected]> wrote:
>>
>> I tried with SOLR 4.9.1.
>>
>
>OK. As I said Solr 4.6 is supported but never mind.

OK. I'm using SOLR 4.6.0. I replaced solr-4.6.0/example/solr/collection1/conf/schema.xml with the file from https://github.com/apache/nutch/blob/2.x/conf/schema.xml. When I start SOLR 4.6.0 with "java -jar start.jar", I get this error:

1094 [coreLoadExecutor-3-thread-1] INFO org.apache.solr.update.SolrIndexConfig IndexWriter infoStream solr logging is enabled
1097 [coreLoadExecutor-3-thread-1] INFO org.apache.solr.core.SolrConfig Using Lucene MatchVersion: LUCENE_46
1160 [coreLoadExecutor-3-thread-1] INFO org.apache.solr.core.Config Loaded SolrConfig: solrconfig.xml
1164 [coreLoadExecutor-3-thread-1] INFO org.apache.solr.schema.IndexSchema Reading Solr Schema from schema.xml
1176 [coreLoadExecutor-3-thread-1] INFO org.apache.solr.schema.IndexSchema [collection1] Schema name=nutch
1241 [coreLoadExecutor-3-thread-1] INFO org.apache.solr.schema.IndexSchema default search field in schema is text
1242 [coreLoadExecutor-3-thread-1] INFO org.apache.solr.schema.IndexSchema query parser default operator is OR
1242 [coreLoadExecutor-3-thread-1] INFO org.apache.solr.schema.IndexSchema unique key field: id
1243 [coreLoadExecutor-3-thread-1] ERROR org.apache.solr.core.CoreContainer Unable to create core: collection1
org.apache.solr.common.SolrException: copyField source :'rawcontent' is not a glob and doesn't match any explicit field or dynamicField.. Schema file is /Users/sdrulea/Downloads/solr-4.6.0/example/solr/collection1/schema.xml
    at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:608)
    at org.apache.solr.schema.IndexSchema.<init>(IndexSchema.java:166)
    at org.apache.solr.schema.IndexSchemaFactory.create(IndexSchemaFactory.java:55)
    at org.apache.solr.schema.IndexSchemaFactory.buildIndexSchema(IndexSchemaFactory.java:69)
    at org.apache.solr.core.CoreContainer.createFromLocal(CoreContainer.java:554)
    at org.apache.solr.core.CoreContainer.create(CoreContainer.java:592)
    at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:271)
    at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:263)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.solr.common.SolrException: copyField source :'rawcontent' is not a glob and doesn't match any explicit field or dynamicField.
    at org.apache.solr.schema.IndexSchema.registerCopyField(IndexSchema.java:855)
    at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:592)
    ... 13 more
1245 [coreLoadExecutor-3-thread-1] ERROR org.apache.solr.core.CoreContainer null:org.apache.solr.common.SolrException: Unable to create core: collection1
    at org.apache.solr.core.CoreContainer.recordAndThrow(CoreContainer.java:977)
    at org.apache.solr.core.CoreContainer.create(CoreContainer.java:601)
    at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:271)
    at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:263)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.solr.common.SolrException: copyField source :'rawcontent' is not a glob and doesn't match any explicit field or dynamicField.. Schema file is /Users/sdrulea/Downloads/solr-4.6.0/example/solr/collection1/schema.xml
    at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:608)
    at org.apache.solr.schema.IndexSchema.<init>(IndexSchema.java:166)
    at org.apache.solr.schema.IndexSchemaFactory.create(IndexSchemaFactory.java:55)
    at org.apache.solr.schema.IndexSchemaFactory.buildIndexSchema(IndexSchemaFactory.java:69)
    at org.apache.solr.core.CoreContainer.createFromLocal(CoreContainer.java:554)
    at org.apache.solr.core.CoreContainer.create(CoreContainer.java:592)
    ... 8 more
Caused by: org.apache.solr.common.SolrException: copyField source :'rawcontent' is not a glob and doesn't match any explicit field or dynamicField.
    at org.apache.solr.schema.IndexSchema.registerCopyField(IndexSchema.java:855)
    at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:592)
    ... 13 more
1247 [main] INFO org.apache.solr.servlet.SolrDispatchFilter user.dir=/Users/sdrulea/Downloads/solr-4.6.0/example
1247 [main] INFO org.apache.solr.servlet.SolrDispatchFilter SolrDispatchFilter.init() done
1263 [main] INFO org.eclipse.jetty.server.AbstractConnector Started [email protected]:8983

The only changes I made to schema.xml were to comment out the lines with "protwords.txt", as the tutorial suggested. Has anyone tested the 2.3.1 schema.xml with SOLR 4.6.1?
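I'm guessing the workaround is either to declare the missing source field or to comment out the copyField that references it, roughly like this (the type, attributes and dest below are placeholders I made up, not values taken from the real 2.3.1 schema):

    <!-- option 1 (sketch): define the missing source field so the copyField can resolve -->
    <field name="rawcontent" type="string" stored="true" indexed="false"/>

    <!-- option 2 (sketch): or comment out the copyField whose source is rawcontent -->
    <!-- <copyField source="rawcontent" dest="..."/> -->

I haven't verified either of these against the schema that ships with 2.3.1, though, so corrections welcome.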
>>
>> I copied /release-2.3.1/runtime/local/conf/schema.xml to
>> solr-4.9.1/example/solr/collection1/conf/schema.xml
>>
>
>Good.
>
>>
>> Result of /release-2.3.1/runtime/local/bin/crawl urls method_centers
>> http://localhost:8983/solr 2
>>
>> InjectorJob: total number of urls rejected by filters: 1
>>
>
>Notice that your regex urlfilter is rejecting one of your seed URLs.

One of my original URLs ended with "/". I added index.html and that fixed the rejection:

InjectorJob: total number of urls rejected by filters: 0
InjectorJob: total number of urls injected after normalization and filtering: 11

>>
>> InjectorJob: total number of urls injected after normalization and
>> filtering: 5
>>
>
>[...snip]
>
>> GeneratorJob: generated batch id: 1443556518-1067112789 containing 0 URLs
>> Generate returned 1 (no new segments created)
>> Escaping loop: no more URLs to fetch now
>>
>> There are 6 URLs in my urls/seeds.txt file. Why does it say 0 URLs?
>>
>
>1 was rejected as explained above. Additionally, it seems like there is
>also an error fetching your seeds and parsing out hyperlinks. I would
>encourage you to check the early stages of configuring and prepping your
>crawler. Some configuration is incorrect... possibly more problems with
>your regex urlfilters.
My regex-urlfilter.txt is unmodified:

# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# accept anything else
+.

I copied plugin.includes to local/conf/nutch-site.xml and added httpclient & indexer-solr:

<property>
  <name>plugin.includes</name>
  <value>protocol-(http|httpclient)|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic|indexer-solr</value>
  <description>Regular expression naming plugin directory names to
  include. Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins. In order to use HTTPS please enable
  protocol-httpclient, but be aware of possible intermittent problems with the
  underlying commons-httpclient library.
  </description>
</property>

Nutch still doesn't parse any links. Any ideas?
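In case it helps to see that property in context, a nutch-site.xml carrying the override looks roughly like this (just a sketch: the http.agent.name value is a placeholder, not a real one, and the description element is trimmed):

    <?xml version="1.0"?>
    <configuration>
      <!-- placeholder value; Nutch expects http.agent.name to be set as well -->
      <property>
        <name>http.agent.name</name>
        <value>MyTestCrawler</value>
      </property>
      <!-- plugin.includes as quoted above, with httpclient and indexer-solr added -->
      <property>
        <name>plugin.includes</name>
        <value>protocol-(http|httpclient)|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic|indexer-solr</value>
      </property>
    </configuration>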
Here is more of the crawl output:

InjectorJob: total number of urls injected after normalization and filtering: 11
/Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch fetch -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true -D fetcher.timelimit.mins=180 1443657910-4394 -crawlId method_centers -threads 50
FetcherJob: starting at 2015-09-30 17:05:14
FetcherJob: batchId: 1443657910-4394
FetcherJob: threads: 50
FetcherJob: parsing: false
FetcherJob: resuming: false
FetcherJob : timelimit set for : 1443668714323
Using queue mode : byHost
Fetcher: threads: 50
QueueFeeder finished: total 0 records. Hit by time limit :0
...
Fetcher: throughput threshold sequence: 5
0/0 spinwaiting/active, 0 pages, 0 errors, 0.0 0 pages/s, 0 0 kb/s, 0 URLs in 0 queues
-activeThreads=0
Using queue mode : byHost
Fetcher: threads: 50
QueueFeeder finished: total 0 records. Hit by time limit :0
...
-finishing thread FetcherThread49, activeThreads=0
0/0 spinwaiting/active, 0 pages, 0 errors, 0.0 0 pages/s, 0 0 kb/s, 0 URLs in 0 queues

Parsing :
/Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch parse -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true -D mapred.skip.attempts.to.start.skipping=2 -D mapred.skip.map.max.skip.records=1 1443657910-4394 -crawlId method_centers
ParserJob: starting at 2015-09-30 17:05:27
ParserJob: resuming: false
ParserJob: forced reparse: false
ParserJob: batchId: 1443657910-4394
ParserJob: success
ParserJob: finished at 2015-09-30 17:05:29, time elapsed: 00:00:02
CrawlDB update for method_centers
/Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch updatedb -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true 1443657910-4394 -crawlId method_centers
DbUpdaterJob: starting at 2015-09-30 17:05:30
DbUpdaterJob: batchId: 1443657910-4394
DbUpdaterJob: finished at 2015-09-30 17:05:32, time elapsed: 00:00:02
Indexing method_centers on SOLR index -> http://localhost:8983/solr
/Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch index -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true -D solr.server.url=http://localhost:8983/solr -all -crawlId method_centers

>>
>> The index job worked but there's no data in SOLR. Is there a known good
>> version of SOLR that works with 2.3.1 schema.xml? Are the tutorial
>> instructions still valid?
>>
>
>No, it did not. It failed. Look at the hadoop.log.
>Also please look at your solr.log, it will provide you with better insight
>into what is wrong with your Solr server and what messages are failing.
>Thanks

The nutch schema.xml doesn't work on my SOLR 4.6.0:

IndexingJob: starting
No IndexWriters activated - check your configuration
IndexingJob: done.

SOLR dedup -> http://localhost:8983/solr
/Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch solrdedup -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true http://localhost:8983/solr
Exception in thread "main" org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Expected content type application/octet-stream but got text/html;charset=ISO-8859-1.
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"/>
<title>Error 500 {msg=SolrCore 'collection1' is not available due to init failure: copyField source :'rawcontent' is not a glob and doesn't match any explicit field or dynamicField.. Schema file is /Users/sdrulea/Downloads/solr-4.6.0/example/solr/collection1/schema.xml,trace=org.apache.solr.common.SolrException: SolrCore 'collection1' is not available due to init failure: copyField source :'rawcontent' is not a glob and doesn't match any explicit field or dynamicField.. Schema file is /Users/sdrulea/Downloads/solr-4.6.0/example/solr/collection1/schema.xml
    at org.apache.solr.core.CoreContainer.getCore(CoreContainer.java:818)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:297)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:197)
    at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)
    at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
    at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557)
    at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
    at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1075)
    at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:384)
    at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
    at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1009)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
    at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
    at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
    at org.eclipse.jetty.server.Server.handle(Server.java:368)
    at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:489)
    at org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)
    at org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:942)
    at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1004)
    at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:640)
    at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
    at org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
    at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
    at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
    at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.solr.common.SolrException: copyField source :'rawcontent' is not a glob and doesn't match any explicit field or dynamicField.. Schema file is /Users/sdrulea/Downloads/solr-4.6.0/example/solr/collection1/schema.xml
    at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:608)
    at org.apache.solr.schema.IndexSchema.<init>(IndexSchema.java:166)
    at org.apache.solr.schema.IndexSchemaFactory.create(IndexSchemaFactory.java:55)
    at org.apache.solr.schema.IndexSchemaFactory.buildIndexSchema(IndexSchemaFactory.java:69)
    at org.apache.solr.core.CoreContainer.createFromLocal(CoreContainer.java:554)
    at org.apache.solr.core.CoreContainer.create(CoreContainer.java:592)
    at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:271)
    at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:263)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    ... 1 more
Caused by: org.apache.solr.common.SolrException: copyField source :'rawcontent' is not a glob and doesn't match any explicit field or dynamicField.
    at org.apache.solr.schema.IndexSchema.registerCopyField(IndexSchema.java:855)
    at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:592)
    ... 13 more

Cheers,
Sherban

