Hi Luis,

The cause of this issue is hard to pin down. I have a possible solution, but I am not sure about it :) When I looked at your nutch-site.xml, I saw that you have activated only a few plugins. If you don't use urlfilter-validator, the parser generates invalid URLs such as http://# or ???###eee when it parses websites. When I tried to reproduce your error, I got the same issue whenever one of my documents had an invalid URL. So I think your issue is caused by invalid URLs. Can you add the urlfilter-validator plugin to your nutch-site.xml, drop your db and Solr collection, and start crawling again?
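
If you want to check whether urlfilter-validator is really active, a small Python sketch like the one below might help. It assumes that Nutch decides which plugins to load by matching each plugin id against the plugin.includes regular expression. The helper name plugin_enabled and the sample configuration are made up for illustration (this is not your real file), and Python's regex syntax is only roughly the same as Java's.

```python
import re
import xml.etree.ElementTree as ET


def plugin_enabled(nutch_site_xml: str, plugin_id: str) -> bool:
    """Return True if plugin_id fully matches the plugin.includes regex."""
    root = ET.fromstring(nutch_site_xml)
    for prop in root.iter("property"):
        if prop.findtext("name", "").strip() == "plugin.includes":
            pattern = prop.findtext("value", "").strip()
            return re.fullmatch(pattern, plugin_id) is not None
    # Property not set in this file: the sketch reports False,
    # although a real setup would fall back to nutch-default.xml.
    return False


# Hypothetical configuration, only for illustration.
SAMPLE = """<configuration>
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-(regex|validator)|parse-html</value>
  </property>
</configuration>"""

if __name__ == "__main__":
    print(plugin_enabled(SAMPLE, "urlfilter-validator"))
```

To use it on a real setup, read conf/nutch-site.xml into a string and pass it in place of SAMPLE.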

A few tips:

To delete your table in the hbase shell:
truncate "webpage"

To delete your Solr index:
localhost:8983/solr/update?stream.body=<delete><query>*:*</query></delete> (this deletes your index)
localhost:8983/solr/update?commit=true (sometimes needed for the delete to take effect)
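
The two Solr requests contain characters like < and > that some clients do not accept in a raw URL, so here is a rough Python sketch that builds them URL-encoded. The function name solr_delete_all_urls and the default base URL are only assumptions for illustration; the sketch just builds the URLs and does not send anything.

```python
from urllib.parse import urlencode


def solr_delete_all_urls(base="http://localhost:8983/solr"):
    """Build the delete-everything URL and the commit URL for a Solr update handler."""
    delete_body = "<delete><query>*:*</query></delete>"
    # urlencode escapes <, >, /, : and * so the XML survives as a query parameter.
    delete_url = base + "/update?" + urlencode({"stream.body": delete_body})
    commit_url = base + "/update?" + urlencode({"commit": "true"})
    return delete_url, commit_url


if __name__ == "__main__":
    for url in solr_delete_all_urls():
        print(url)
```

You can paste the printed URLs into a browser, or fetch them with any HTTP client, first the delete and then the commit.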

Your second question is really one for the Solr mailing list; I believe they can give you more information about Solr. But if you want to look at your index, you can use this URL:
http://localhost:8983/solr/collection1/select?q=*%3A*&start=0&rows=30&wt=xml&indent=true
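
If you prefer to check the number of committed documents from a script, a sketch like this could be a starting point. It builds the same kind of select URL and reads the numFound attribute from Solr's XML response. The names select_url and num_found and the sample response are assumptions made up for illustration; a real response also contains a responseHeader and many more fields.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode


def select_url(base="http://localhost:8983/solr/collection1", rows=30):
    """Build a match-all query URL against the select handler."""
    params = {"q": "*:*", "start": 0, "rows": rows, "wt": "xml", "indent": "true"}
    return base + "/select?" + urlencode(params)


def num_found(response_xml: str) -> int:
    """Read the total hit count from a Solr XML response."""
    result = ET.fromstring(response_xml).find("result")
    return int(result.get("numFound"))


# Hypothetical, shortened response, only for illustration.
SAMPLE = """<response>
  <result name="response" numFound="2" start="0">
    <doc><str name="id">http://example.com/</str></doc>
    <doc><str name="id">http://example.com/about</str></doc>
  </result>
</response>"""

if __name__ == "__main__":
    print(select_url())
    print(num_found(SAMPLE))
```

If numFound is 0 after indexing, the documents were probably not committed.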

Actually, you are right: we need a Solr integration document. I have learned that well. Would you write a document about Solr integration with Nutch? I can review it, and we will publish it on our wiki. What do you think?
I hope to hear good news from you :)
Have a nice day
Talat


On 21-10-2013 17:52, Luis Armando Roca Fumero wrote:
Sorry I forgot the solr.log:  http://pastebin.com/XAL58zbL
I hope you can help me, thanks in advance
Luis Armando
________________________________________
From: Luis Armando Roca Fumero
Sent: Monday, October 21, 2013 09:25 a.m.
To: [email protected]
Subject: RE: Nutch 1.7 and Solr 4.4.0 Integrate

Good morning, friends:
Since I have not been able to solve my problem with Nutch 1.7/2.2.1 and Solr 4.4.0,
I am going to describe what I have done from the beginning.
1 - I downloaded Solr 4.4.0
2 - I downloaded Nutch 1.7
3 - I copied the file schema-solr4.xml to example/solr/collection1/conf and
renamed it to schema.xml
4 - When I started Solr 4.4.0, there was the following error: msg=SolrCore
'collection1' is not available due to init failure:
Unable to use updateLog: _version_ field must exist in schema, using indexed="true" stored="true" and
multiValued="false" (_version_ does not exist), trace=org.apache.solr.common.SolrException: SolrCore 'collection1' is
not available due to init failure: Unable to use updateLog: _version_ field must exist in schema, using indexed="true"
stored="true" and multiValued="false" (_version_ does not exist)
5 - To resolve this error, I added the following line to schema.xml:
<field name="_version_" indexed="true" type="long" stored="true"/>
6 - The Nutch configuration files can be found here:
    nutch-site.xml : http://pastebin.com/Dh3tTacL
    regex-urlfilter : http://pastebin.com/eRdxPB1b
    seed.txt : http://pastebin.com/unNgJdmU
7 - When I run the following command:
./bin/nutch solrdedup http://localhost:8983/solr/

I get this hadoop.log file:
2013-10-21 14:22:31,645 INFO  solr.SolrDeleteDuplicates - SolrDeleteDuplicates: starting at 2013-10-21 14:22:31
2013-10-21 14:22:31,647 INFO  solr.SolrDeleteDuplicates - SolrDeleteDuplicates: Solr url: http://localhost:8983/solr/
2013-10-21 14:22:32,050 WARN  util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2013-10-21 14:22:32,927 WARN  mapred.FileOutputCommitter - Output path is null in cleanup
2013-10-21 14:22:32,928 WARN  mapred.LocalJobRunner - job_local741622751_0001
java.lang.Exception: java.lang.NullPointerException
         at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:354)
Caused by: java.lang.NullPointerException
         at org.apache.hadoop.io.Text.encode(Text.java:388)
         at org.apache.hadoop.io.Text.set(Text.java:178)
         at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat$1.next(SolrDeleteDuplicates.java:270)
         at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat$1.next(SolrDeleteDuplicates.java:241)
         at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:230)
         at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:210)
         at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
         at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:430)
         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:366)
         at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:223)
         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
         at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
         at java.util.concurrent.FutureTask.run(FutureTask.java:166)
         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
         at java.lang.Thread.run(Thread.java:724)



Talat, can you explain to me how to check the Solr index for committed documents?
Sorry, I'm new to Solr and Nutch.
I don't know what I'm doing wrong. Is it necessary to change to Solr 3.x, or is Solr
4.4.0 fine? Can someone give me a step-by-step tutorial for integrating Solr and
Nutch? I have followed the Nutch tutorial on the web:
http://wiki.apache.org/nutch/NutchTutorial , but I can't get the job done.

Any ideas are welcome
Thanks for your time, friends,
Luis Armando
________________________________________
From: Talat UYARER [[email protected]]
Sent: Friday, October 18, 2013 10:59 p.m.
To: [email protected]
Subject: Re: Nutch 1.7 and Solr 4.4.0 Integrate

Hi Luis,

I am not sure what is causing that. Did you check your Solr index for
committed documents? Maybe it didn't commit. You don't need to run all the
Nutch jobs again; the other jobs work fine. You can run only the dedup job with:
bin/nutch solrdedup solr_url
After that, can you share your solr.log?

Talat




