Re: nutch 1.4, solr 3.4 configuration error
I had a similar error. I couldn't find any documentation on which nutch and solr versions are compatible. For instance, we're using nutch 1.6 on hadoop 1.0.4 with solrj 3.4.0 and indexing crawled segments into solr 4.2.0. But I remember that I could find a compatible version of solrj for nutch 1.4 (because of using hadoop). You can upgrade your nutch from 1.4 to 1.6 easily. I also suggest you check your solrindex-mapping.xml in your /conf directory. Best, Tugcem. On Fri, Jun 7, 2013 at 12:58 AM, Chris Hostetter hossman_luc...@fucit.org wrote: [...]
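A minimal sketch of Hoss's two guesses as corrected invocations, using the crawl command quoted below (the -topN value was elided in the thread, so 50 here is only a placeholder; /solr and collection1 are the assumed defaults):

# guess 1: Solr deployed under the /solr webapp context
./nutch crawl urls -dir myCrawl2 -solr http://localhost:8080/solr -depth 2 -topN 50

# guess 2: a specific collection under that context
./nutch crawl urls -dir myCrawl2 -solr http://localhost:8080/solr/collection1 -depth 2 -topN 50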
Re: nutch 1.4, solr 3.4 configuration error
Can you check if you have the correct solrj client library version in both nutch and the Solr server?
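A quick, hedged way to compare the two sides (the install paths are examples; each jar's version is visible in its file name):

ls apache-nutch-1.4/lib | grep -i solrj        # client jar bundled with Nutch, e.g. solr-solrj-3.4.0.jar
ls apache-solr-3.4.0/dist | grep -i solrj      # jar shipped with the Solr server distribution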
Re: nutch 1.4, solr 3.4 configuration error
: ./nutch crawl urls -dir myCrawl2 -solr http://localhost:8080 -depth 2 -topN ...
: Caused by: org.apache.solr.common.SolrException: Not Found
: Not Found
: request: http://localhost:8080/select?q=id:[* TO *]&fl=id&rows=1&wt=javabin&version=2
...
: Other possibly helpful information:
: 1) The solr admin screen comes up fine in the browser.

At which URL does the Solr admin screen come up fine in your browser? Best guess...

1) you have solr installed such that it uses the webcontext /solr but you gave the wrong url to nutch (ie: try -solr http://localhost:8080/solr)

2) you are using multiple collections, and you may need to configure nutch to know about which collection you are using (ie: try -solr http://localhost:8080/solr/collection1)

...if neither of those helps, i would suggest you follow up with the nutch-user list, as the nutch community is probably in the best position to help you configure nutch to work with Solr (and vice versa). -Hoss
Re: nutch and solr
Now all works! But I have another problem if I use a connector with my solr-nutch setup. This is the error:

Grave: java.lang.RuntimeException: org.apache.lucene.index.CorruptIndexException: Unknown format version: -11
at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1068)
at org.apache.solr.core.SolrCore.<init>(SolrCore.java:579)
at org.apache.solr.core.CoreContainer.create(CoreContainer.java:428)
at org.apache.solr.core.CoreContainer.load(CoreContainer.java:278)
at org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:117)
at connector.SolrConnector.init(SolrConnector.java:33)
at connector.SolrConnector.getInstance(SolrConnector.java:69)
at connector.SolrConnector.getSolrServer(SolrConnector.java:77)
at connector.QueryServlet.doGet(QueryServlet.java:117)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:621)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:722)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:305)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)
at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:224)
at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:169)
at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:472)
at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:168)
at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:98)
at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:927)
at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:118)
at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:407)
at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:987)
at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:579)
at org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:309)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.java:722)
Caused by: org.apache.lucene.index.CorruptIndexException: Unknown format version: -11
at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:247)
at org.apache.lucene.index.DirectoryReader$1.doBody(DirectoryReader.java:72)
at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:683)
at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:69)
at org.apache.lucene.index.IndexReader.open(IndexReader.java:476)
at org.apache.lucene.index.IndexReader.open(IndexReader.java:403)
at org.apache.solr.core.StandardIndexReaderFactory.newReader(StandardIndexReaderFactory.java:38)
at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1057)
... 26 more

SUGGESTIONS? thanks, alessio

On 25 February 2012 10:52, alessio crisantemi alessio.crisant...@gmail.com wrote: [...]
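On the CorruptIndexException above: 'Unknown format version: -11' generally means the index on disk was written by a newer Lucene/Solr than the one trying to read it (here, the connector embeds an older Solr core that cannot read the index the 3.x-era indexer produced). A hedged way to confirm is Lucene's CheckIndex tool, run with the lucene-core jar of the side doing the reading; the jar name and index path below are placeholders:

java -cp lucene-core-3.4.0.jar org.apache.lucene.index.CheckIndex /path/to/solr/data/index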
Re: nutch and solr
This is the problem! Because in my root there is a url! I write you my step-by-step configuration of nutch: (I use cygwin because I work on windows)

*1. Extract the Nutch package*

*2. Configure Solr*

*a. Copy the provided Nutch schema from directory apache-nutch-1.0/conf to directory apache-solr-1.3.0/example/solr/conf (override the existing file).* To allow Solr to create the snippets for search results we need to store the content in addition to indexing it:

*b. Change schema.xml so that the stored attribute of field "content" is true.*

<field name="content" type="text" stored="true" indexed="true"/>

We want to be able to tweak the relevancy of queries easily so we'll create a new dismax request handler configuration for our use case:

*d. Open apache-solr-1.3.0/example/solr/conf/solrconfig.xml and paste the following fragment into it*

<requestHandler name="/nutch" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <str name="echoParams">explicit</str>
    <float name="tie">0.01</float>
    <str name="qf">content^0.5 anchor^1.0 title^1.2</str>
    <str name="pf">content^0.5 anchor^1.5 title^1.2 site^1.5</str>
    <str name="fl">url</str>
    <str name="mm">2&lt;-1 5&lt;-2 6&lt;90%</str>
    <int name="ps">100</int>
    <bool name="hl">true</bool>
    <str name="q.alt">*:*</str>
    <str name="hl.fl">title url content</str>
    <str name="f.title.hl.fragsize">0</str>
    <str name="f.title.hl.alternateField">title</str>
    <str name="f.url.hl.fragsize">0</str>
    <str name="f.url.hl.alternateField">url</str>
    <str name="f.content.hl.fragmenter">regex</str>
  </lst>
</requestHandler>

*3. Start Solr*

cd apache-solr-1.3.0/example
java -jar start.jar

*4. Configure Nutch*

*a. Open nutch-site.xml in directory apache-nutch-1.0/conf, replace its contents with the following (we specify our crawler name, active plugins and limit the maximum url count for a single host per run to 100):*

<?xml version="1.0"?>
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>nutch-solr-integration</value>
  </property>
  <property>
    <name>generate.max.per.host</name>
    <value>100</value>
  </property>
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-html|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>
</configuration>

*b. Open regex-urlfilter.txt in directory apache-nutch-1.0/conf, replace its content with the following:*

-^(https|telnet|file|ftp|mailto):

# skip some suffixes
-\.(swf|SWF|doc|DOC|mp3|MP3|WMV|wmv|txt|TXT|rtf|RTF|avi|AVI|m3u|M3U|flv|FLV|WAV|wav|mp4|MP4|avi|AVI|rss|RSS|xml|XML|pdf|PDF|js|JS|gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# allow urls in the google.it domain
+^http:*//([a-z0-9\-A-Z]*\.)*google.it/*

# deny anything else
-.

*5. Create a seed list (the initial urls to fetch)*

mkdir urls (creates a 'urls' folder)
echo "http://www.google.it/" > urls/seed.txt

*6. Inject seed url(s) to nutch crawldb (execute in nutch directory)*

bin/nutch inject crawl/crawldb urls

AND HERE, THE ERROR MESSAGE about the empty path. Why, in your opinion? thank you, alessio

On 24 February 2012 17:51, tamanjit.bin...@yahoo.co.in wrote: The empty path message is because nutch is unable to find a url in the url location that you provide. Kindly ensure there is a url there.
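If inject still reports the empty path, a hedged sanity check (paths as in the steps above) is to confirm the seed file exists, is non-empty, and that the command is run from the directory containing bin/nutch:

cd apache-nutch-1.0
cat urls/seed.txt                  # should print exactly one URL per line, no blank lines
bin/nutch inject crawl/crawldb urls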
Re: nutch and solr
The empty path message is because nutch is unable to find a url in the url location that you provide. Kindly ensure there is a url there.
Re: nutch and solr
thanks for your reply, but it doesn't work. The same message: can't convert empty path, and more: it cannot find the class org.apache.nutch.crawl.injector. On 22 February 2012 06:14, tamanjit.bin...@yahoo.co.in wrote: Try this command: bin/nutch crawl urls/<folder name>/<url file>.txt -dir crawl/<folder name> -threads 10 -depth 2 -topN 1000 [...]
Re: nutch and solr
Try this command:

bin/nutch crawl urls/<folder name>/<url file>.txt -dir crawl/<folder name> -threads 10 -depth 2 -topN 1000

Your folder structure will look like this:

nutch folder
|-- urls
|    |-- folder name
|         |-- url file.txt
|-- crawl
     |-- folder name

The folder name will be for different domains. So for each domain folder in the urls folder there has to be a corresponding folder (with the same name) in the crawl folder.
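A concrete instance of that layout with the placeholders filled in (example.com is just a stand-in domain):

mkdir -p urls/example.com crawl/example.com
echo "http://www.example.com/" > urls/example.com/seed.txt
bin/nutch crawl urls/example.com/seed.txt -dir crawl/example.com -threads 10 -depth 2 -topN 1000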
Re: nutch in solr
Doesn't tomcat run on port 8080, and not port 8983? Or did you change tomcat's default port to 8983? On Feb 5, 2012 5:17 AM, alessio crisantemi alessio.crisant...@gmail.com wrote:

Hi All, I have some problems with the integration of Nutch in Solr and Tomcat. I followed the Nutch tutorial for the integration and now I can crawl a website: all works right. But if I try the Solr integration, I can't index into Solr. Below is the nutch output after the command:

bin/nutch crawl urls -solr http://127.0.0.1:8983/solr/ -depth 3 -topN 5

I read: java.lang.RuntimeException: Invalid version (expected 2, but 1) or the data in not in 'javabin' format

MAYBE THERE IS A PROBLEM BETWEEN NUTCH VERSION 1.4 AND SOLR 1.4.1? MAYBE IT REQUIRES A 3.X SOLR VERSION?

thanks, a.

crawl started in: crawl-20120203151719
rootUrlDir = urls
threads = 10
depth = 3
solrUrl=http://127.0.0.1:8983/solr/
topN = 5
Injector: starting at 2012-02-03 15:17:20
Injector: crawlDb: crawl-20120203151719/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: finished at 2012-02-03 15:17:31, elapsed: 00:00:10
Generator: starting at 2012-02-03 15:17:31
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 5
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: crawl-20120203151719/segments/20120203151735
Generator: finished at 2012-02-03 15:17:39, elapsed: 00:00:07
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
Fetcher: starting at 2012-02-03 15:17:39
Fetcher: segment: crawl-20120203151719/segments/20120203151735
Using queue mode : byHost
Fetcher: threads: 10
Fetcher: time-out divisor: 2
QueueFeeder finished: total 1 records + hit by time limit :0
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
fetching http://www.gioconews.it/
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=3
-finishing thread FetcherThread, activeThreads=2
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=1
Fetcher: throughput threshold: -1
-finishing thread FetcherThread, activeThreads=1
Fetcher: throughput threshold retries: 5
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=1
fetch of http://www.gioconews.it/ failed with: java.net.UnknownHostException: www.gioconews.it
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: finished at 2012-02-03 15:17:44, elapsed: 00:00:05
ParseSegment: starting at 2012-02-03 15:17:44
ParseSegment: segment: crawl-20120203151719/segments/20120203151735
ParseSegment: finished at 2012-02-03 15:17:48, elapsed: 00:00:04
CrawlDb update: starting at 2012-02-03 15:17:48
CrawlDb update: db: crawl-20120203151719/crawldb
CrawlDb update: segments: [crawl-20120203151719/segments/20120203151735]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: 404 purging: false
CrawlDb update: Merging segment data into db.
CrawlDb update: finished at 2012-02-03 15:17:53, elapsed: 00:00:05
Generator: starting at 2012-02-03 15:17:53
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 5
Generator: jobtracker is 'local', generating exactly one partition.
Generator: 0 records selected for fetching, exiting ...
Stopping at depth=1 - no more URLs to fetch.
LinkDb: starting at 2012-02-03 15:17:57
LinkDb: linkdb: crawl-20120203151719/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment: file:/C:/temp/apache-nutch-1.4-bin/runtime/local/crawl-20120203151719/segments/20120203151735
LinkDb: finished at 2012-02-03 15:18:01, elapsed: 00:00:04
SolrIndexer: starting at 2012-02-03 15:18:01
java.lang.RuntimeException: Invalid version (expected 2, but 1) or the data in not in 'javabin' format
SolrDeleteDuplicates: starting at 2012-02-03 15:18:09
SolrDeleteDuplicates: Solr url: http://127.0.0.1:8983/solr/
Exception in thread "main" java.io.IOException: org.apache.solr.client.solrj.SolrServerException: Error executing query
at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat.getSplits(SolrDeleteDuplicates.java:200)
at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
at
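The 'Invalid version (expected 2, but 1)' line is a javabin wire-format mismatch: the SolrJ 3.x client bundled with Nutch 1.4 speaks javabin v2, while a Solr 1.4.1 server answers with v1. A hedged first check that sidesteps javabin entirely is to hit the same URL with a plain-text response writer:

curl "http://127.0.0.1:8983/solr/select?q=*:*&wt=json&rows=1"
# if this returns results while Nutch's javabin request fails, the client and
# server javabin versions are incompatible, as the replies below conclude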
Re: nutch in solr
alessio crisantemi-2, I think you got it. Check the jars in the nutch lib and see if the solr and solrj jars are the same version. That could be the issue.
Re: nutch in solr
no, all run on port 8983. .. 2012/2/5 Matthew Parker mpar...@apogeeintegration.com: Doesn't tomcat run on port 8080, and not port 8983? Or did you change tomcat's default port to 8983? [...]
Re: nutch in solr
No, they don't all run on 8983. Tomcat's default port is 8080. If you're using the embedded server in Solr, you are using Jetty, which runs on port 8983. On Sun, Feb 5, 2012 at 11:54 AM, alessio crisantemi alessio.crisant...@gmail.com wrote: no, all run on port 8983. [...]
Re: nutch in solr
Looks like the solrj version on the nutch classpath is different from the solr version on the server; can you post the versions for both nutch and solr? On Sun, Feb 5, 2012 at 10:24 PM, alessio crisantemi alessio.crisant...@gmail.com wrote: no, all run on port 8983. [...]
Re: nutch in solr
If I look in the solr and nutch libs I find: apache-solr-solrj-1.4.1.jar on Solr, and solr-solrj-3.4.0.jar on Nutch. These are the only jar files with the word 'solrj'. Is that the problem?! 2012/2/5 Geek Gamer geek4...@gmail.com: looks like the solrj version on the nutch classpath is different from the solr version on the server; can you post the versions for both nutch and solr? [...]
Re: nutch in solr
solrj is the Solr Java client library, so there seem to be two versions, 1.4.1 and 3.4.0, which are incompatible. So you can do the following; refer to https://github.com/geek4377/nutch/commit/c66bf35ff4f86393413621b3b889b1c78281df4d to see how to upgrade the solr version in nutch. The above example replaces solr 1.4.0 with 3.1.0. On Sun, Feb 5, 2012 at 11:02 PM, alessio crisantemi alessio.crisant...@gmail.com wrote: if I look in the solr and nutch libs I find: apache-solr-solrj-1.4.1.jar on Solr, and solr-solrj-3.4.0.jar on Nutch. [...]
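A hedged sketch of the version-matching move in this thread's direction, i.e. making the Nutch-side client match the 1.4.1 server (paths and the runtime layout are assumptions; the linked commit shows the reverse direction, upgrading the client inside an older Nutch):

cd apache-nutch-1.4-bin/runtime/local/lib     # assumed Nutch binary layout
rm solr-solrj-3.4.0.jar
cp /path/to/apache-solr-1.4.1/dist/apache-solr-solrj-1.4.1.jar .
# either direction works as long as both sides end up on the same major version;
# running a Solr 3.4.0 server instead avoids touching Nutch at all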
Re: nutch in solr
tx, I'll try it and write back with the result asap. a. 2012/2/5 Geek Gamer geek4...@gmail.com: solrj is the Solr Java client library, so there seem to be two versions, 1.4.1 and 3.4.0, which are incompatible. [...]
Re: nutch 1.2, solr 3.3, tomcat6. java.io.IOException: Job failed! problem when building solrindex
you need to update the solrj libs to a 3.x version; the javabin format has changed. I made the change a few months back, you can pull the changes from https://github.com/geek4377/nutch/tree/geek5377-1.2.1 hope that helps, On Wed, Jul 13, 2011 at 8:58 AM, Leo Subscriptions llsub...@zudiewiener.com wrote:

I'm running 64bit Ubuntu 11.04, nutch 1.2, solr 3.3 (downloaded, not built) and tomcat6, following this (and some other) links: http://wiki.apache.org/nutch/RunningNutchAndSolr

I have added the nutch schema and can access/view this schema via the admin page. nutch also works, as I can perform successful searches. When I execute the following:

./bin/nutch solrindex http://localhost:8080/solr/core0 crawl/crawldb crawl/linkdb crawl/segments/*

I (eventually) get an io error. The above command creates the following files in /var/lib/tomcat6/solr/core0/data/index/:

---
544 -rw-r--r-- 1 tomcat6 tomcat6 557056 2011-07-13 11:09 _1.fdt
0 -rw-r--r-- 1 tomcat6 tomcat6 0 2011-07-13 11:00 _1.fdx
4 -rw-r--r-- 1 tomcat6 tomcat6 32 2011-07-13 10:59 segments_2
4 -rw-r--r-- 1 tomcat6 tomcat6 20 2011-07-13 10:59 segments.gen
0 -rw-r--r-- 1 tomcat6 tomcat6 0 2011-07-13 11:00 write.lock
---

but the hadoop.log reports the following error:

---
2011-07-13 11:09:47,665 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
2011-07-13 11:09:47,666 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter
2011-07-13 11:09:47,690 INFO solr.SolrMappingReader - source: content dest: content
2011-07-13 11:09:47,690 INFO solr.SolrMappingReader - source: site dest: site
2011-07-13 11:09:47,690 INFO solr.SolrMappingReader - source: title dest: title
2011-07-13 11:09:47,690 INFO solr.SolrMappingReader - source: host dest: host
2011-07-13 11:09:47,690 INFO solr.SolrMappingReader - source: segment dest: segment
2011-07-13 11:09:47,690 INFO solr.SolrMappingReader - source: boost dest: boost
2011-07-13 11:09:47,690 INFO solr.SolrMappingReader - source: digest dest: digest
2011-07-13 11:09:47,690 INFO solr.SolrMappingReader - source: tstamp dest: tstamp
2011-07-13 11:09:47,690 INFO solr.SolrMappingReader - source: url dest: id
2011-07-13 11:09:47,690 INFO solr.SolrMappingReader - source: url dest: url
2011-07-13 11:09:49,272 WARN mapred.LocalJobRunner - job_local_0001
java.lang.RuntimeException: Invalid version or the data in not in 'javabin' format
at org.apache.solr.common.util.JavaBinCodec.unmarshal(JavaBinCodec.java:99)
at org.apache.solr.client.solrj.impl.BinaryResponseParser.processResponse(BinaryResponseParser.java:39)
at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:466)
at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:243)
at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:105)
at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:49)
at org.apache.nutch.indexer.solr.SolrWriter.write(SolrWriter.java:64)
at org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:54)
at org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:44)
at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:440)
at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:159)
at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:50)
at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:463)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:411)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:216)
2011-07-13 11:09:49,611 ERROR solr.SolrIndexer - java.io.IOException: Job failed!
---

I'd appreciate any help with this. Thanks, Leo
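When solrindex fails like this it can help to watch both ends at once while re-running the command; a hedged sketch, with log locations that are typical for this Ubuntu/tomcat6 setup rather than confirmed in the thread:

tail -f logs/hadoop.log &                     # Nutch side, from the nutch directory
sudo tail -f /var/log/tomcat6/catalina.out    # Solr/Tomcat side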
Re: nutch 1.2, solr 3.3, tomcat6. java.io.IOException: Job failed! problem when building solrindex
Works like a charm. Thanks, Leo On Wed, 2011-07-13 at 11:31 +0530, Geek Gamer wrote: you need to update the solrj libs to a 3.x version; the javabin format has changed. I made the change a few months back, you can pull the changes from https://github.com/geek4377/nutch/tree/geek5377-1.2.1 hope that helps, [...]
Re: nutch 1.2, solr 3.3, tomcat6. java.io.IOException: Job failed! problem when building solrindex
If you're using Solr anyway, you'd better upgrade to Nutch 1.3 with Solr 3.x support. Works like a charm. Thanks, Leo

On Wed, 2011-07-13 at 11:31 +0530, Geek Gamer wrote: You need to update the solrj libs to the 3.x version; the javabin format has changed. I made the change a few months back, you can pull the changes from https://github.com/geek4377/nutch/tree/geek5377-1.2.1 Hope that helps,

On Wed, Jul 13, 2011 at 8:58 AM, Leo Subscriptions llsub...@zudiewiener.com wrote: I'm running 64bit Ubuntu 11.04, nutch 1.2, solr 3.3 (downloaded, not built) and tomcat6, following this (and some other) links: http://wiki.apache.org/nutch/RunningNutchAndSolr I have added the nutch schema and can access/view this schema via the admin page. nutch also works, as I can perform successful searches. When I execute the following:

./bin/nutch solrindex http://localhost:8080/solr/core0 crawl/crawldb crawl/linkdb crawl/segments/*

I (eventually) get an io error. The above command creates the following files in /var/lib/tomcat6/solr/core0/data/index/:
---
544 -rw-r--r-- 1 tomcat6 tomcat6 557056 2011-07-13 11:09 _1.fdt
  0 -rw-r--r-- 1 tomcat6 tomcat6      0 2011-07-13 11:00 _1.fdx
  4 -rw-r--r-- 1 tomcat6 tomcat6     32 2011-07-13 10:59 segments_2
  4 -rw-r--r-- 1 tomcat6 tomcat6     20 2011-07-13 10:59 segments.gen
  0 -rw-r--r-- 1 tomcat6 tomcat6      0 2011-07-13 11:00 write.lock
---
but the hadoop.log reports the following error:
---
2011-07-13 11:09:47,665 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
2011-07-13 11:09:47,666 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter
2011-07-13 11:09:47,690 INFO solr.SolrMappingReader - source: content dest: content
2011-07-13 11:09:47,690 INFO solr.SolrMappingReader - source: site dest: site
2011-07-13 11:09:47,690 INFO solr.SolrMappingReader - source: title dest: title
2011-07-13 11:09:47,690 INFO solr.SolrMappingReader - source: host dest: host
2011-07-13 11:09:47,690 INFO solr.SolrMappingReader - source: segment dest: segment
2011-07-13 11:09:47,690 INFO solr.SolrMappingReader - source: boost dest: boost
2011-07-13 11:09:47,690 INFO solr.SolrMappingReader - source: digest dest: digest
2011-07-13 11:09:47,690 INFO solr.SolrMappingReader - source: tstamp dest: tstamp
2011-07-13 11:09:47,690 INFO solr.SolrMappingReader - source: url dest: id
2011-07-13 11:09:47,690 INFO solr.SolrMappingReader - source: url dest: url
2011-07-13 11:09:49,272 WARN mapred.LocalJobRunner - job_local_0001
java.lang.RuntimeException: Invalid version or the data in not in 'javabin' format
    at org.apache.solr.common.util.JavaBinCodec.unmarshal(JavaBinCodec.java:99)
    at org.apache.solr.client.solrj.impl.BinaryResponseParser.processResponse(BinaryResponseParser.java:39)
    at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:466)
    at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:243)
    at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:105)
    at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:49)
    at org.apache.nutch.indexer.solr.SolrWriter.write(SolrWriter.java:64)
    at org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:54)
    at org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:44)
    at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:440)
    at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:159)
    at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:50)
    at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:463)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:411)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:216)
2011-07-13 11:09:49,611 ERROR solr.SolrIndexer - java.io.IOException: Job failed!
---
I'd appreciate any help with this. Thanks, Leo
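As Geek Gamer says, the underlying problem in this thread is a solrj/javabin mismatch: the 1.x solrj client bundled with Nutch 1.2 speaks javabin version 1, while Solr 3.x expects version 2. A rough sketch of the jar swap on the Nutch side; the paths and jar names here are assumptions, so match them against your actual Nutch and Solr trees:
---
# run from the Nutch install directory; jar versions are illustrative
mv lib/apache-solr-solrj-1.4.0.jar /tmp/
cp /path/to/apache-solr-3.3.0/dist/apache-solr-solrj-3.3.0.jar lib/
# if you run the packaged .job on Hadoop, rebuild it so the new jar is bundled
ant job
---
Upgrading to Nutch 1.3, as Leo suggests, gets you a matching client without the manual swap.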
Re: Nutch and Solr search on the fly
The parsed data is only sent to the Solr index if you tell a segment to be indexed: solrindex crawldb linkdb segment. If you did this only once after injecting and the consequent fetch, parse, update, index sequence, then you, of course, only see those URLs. If you don't index a segment after it's been parsed, you need to do it later on.

On Wednesday 09 February 2011 04:29:44 .: Abhishek :. wrote: Hi all, I am a newbie to nutch and solr. Well, relatively much newer to Solr than Nutch :) I have been using nutch for the past two weeks, and I wanted to know if I can query or search my nutch crawls on the fly (before they complete). I am asking this because the websites I am crawling are really huge and it takes around 3-4 days for a crawl to complete. I want to analyze some quick results while the nutch crawler is still crawling the URLs. Someone suggested that Solr would make this possible. I followed the steps in http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/ for this. By this process, I see only the injected URLs in the Solr search. I know I did something really foolish and the crawl never happened; I feel I am missing some information here. I think somewhere in the process there should be crawling happening and I missed it. Just wanted to see if someone could help me figure out where I went wrong in the process. Forgive my foolishness and thanks for your patience. Cheers, Abi

-- Markus Jelsma - CTO - Openindex http://www.linkedin.com/in/markus17 050-8536620 / 06-50258350
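In practice, "do it later on" can be a small script that pushes each segment to Solr as soon as it is parsed and merged into the crawldb, while the crawl keeps running. A minimal sketch, assuming the default crawl/ layout and a Solr core at http://localhost:8983/solr (both are assumptions, adjust to your setup):
---
#!/bin/sh
# index the most recently completed segment without waiting for the whole crawl
CRAWL=crawl
SEGMENT=$CRAWL/segments/$(ls $CRAWL/segments | sort | tail -1)
bin/nutch solrindex http://localhost:8983/solr $CRAWL/crawldb $CRAWL/linkdb $SEGMENT
---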
Re: Nutch and Solr search on the fly
Hi Markus, I am sorry for not being clear. I meant to say that... Suppose a url, namely www.somehost.com/gifts/greetingcard.html (which in turn contains links to a.html, b.html, c.html, d.html), is injected into seed.txt. After the whole process I was expecting a bunch of other pages crawled from this seed url. However, at the end, all I see is the content from only this page, namely www.somehost.com/gifts/greetingcard.html, and I do not see any of the other pages (here a.html, b.html, c.html, d.html) crawled from it. The crawling happens only for the URLs mentioned in seed.txt and does not proceed further from there. So I am just a bit confused: why is it not crawling the linked pages (a.html, b.html, c.html and d.html)? I get the feeling that I am missing something that the author of the blog (http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/) assumed everyone would know. Thanks, Abi

On Wed, Feb 9, 2011 at 7:09 PM, Markus Jelsma markus.jel...@openindex.io wrote: The parsed data is only sent to the Solr index if you tell a segment to be indexed ...
Re: Nutch and Solr search on the fly
WARNING: I don't do Nutch much, but could it be that your crawl depth is 1? See http://wiki.apache.org/nutch/NutchTutorial and search for depth. Best, Erick

On Wed, Feb 9, 2011 at 9:06 AM, .: Abhishek :. ab1s...@gmail.com wrote: Hi Markus, I am sorry for not being clear, I meant to say that ...
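For reference, the depth knob Erick mentions belongs to the all-in-one crawl command, not to the individual fetch step; a sketch, with the seed directory and numbers purely illustrative:
---
# follow outlinks for 3 generate/fetch/parse/update rounds instead of 1
bin/nutch crawl urls -dir crawl -depth 3 -topN 50
---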
Re: Nutch and Solr search on the fly
Are you using the depth parameter with the crawl command, or are you using the separate generate, fetch etc. commands? What does $ nutch readdb crawldb -stats return?

On Wednesday 09 February 2011 15:06:40 .: Abhishek :. wrote: Hi Markus, I am sorry for not being clear, I meant to say that ...

-- Markus Jelsma - CTO - Openindex http://www.linkedin.com/in/markus17 050-8536620 / 06-50258350
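The stats Markus asks for are a quick way to see whether any outlinks ever entered the crawldb. Assuming the default directory layout from the blog post:
---
bin/nutch readdb crawl/crawldb -stats
# if TOTAL urls only counts the seed URLs, no outlinks were discovered,
# which points at a depth/rounds problem rather than a Solr problem
---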
Re: Nutch and Solr search on the fly
Hi Erick, Thanks a bunch for the response. Could be a chance.. but all I am wondering is where to specify the depth in the whole process described at http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/? I tried specifying it during the fetcher phase but it was just ignored :( Thanks, Abi

On Wed, Feb 9, 2011 at 10:11 PM, Erick Erickson erickerick...@gmail.com wrote: WARNING: I don't do Nutch much, but could it be that your crawl depth is 1? ...
Re: Nutch and Solr search on the fly
Hi Abhishek, depth is a param of the crawl command, not the fetch command. If you are using a custom script calling the individual stages of a nutch crawl, then depth N means running that script N times. You can put a loop in the script. Thanks, Charan

On Wed, Feb 9, 2011 at 6:26 AM, .: Abhishek :. ab1s...@gmail.com wrote: Hi Erick, Thanks a bunch for the response. Could be a chance.. but all I am wondering is where to specify the depth ...
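Charan's loop, sketched as a shell script. These are the standard whole-web crawl stages from the Nutch tutorial; the directory names, topN, and Solr URL are assumptions, and moving solrindex inside the loop gives the on-the-fly searching Abhishek is after:
---
#!/bin/sh
CRAWL=crawl
DEPTH=3   # the equivalent of crawl's -depth
bin/nutch inject $CRAWL/crawldb urls
for i in $(seq 1 $DEPTH); do
  bin/nutch generate $CRAWL/crawldb $CRAWL/segments -topN 50
  SEGMENT=$CRAWL/segments/$(ls $CRAWL/segments | sort | tail -1)
  bin/nutch fetch $SEGMENT
  bin/nutch parse $SEGMENT
  bin/nutch updatedb $CRAWL/crawldb $SEGMENT
done
bin/nutch invertlinks $CRAWL/linkdb -dir $CRAWL/segments
bin/nutch solrindex http://localhost:8983/solr $CRAWL/crawldb $CRAWL/linkdb $CRAWL/segments/*
---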
Re: Nutch and Solr search on the fly
Hi Charan, Thanks for the clarifications. The link I have been referring to (http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/) does not say anything about using crawl. Do I have to do it after the last step mentioned? Thanks, Abi

On Thu, Feb 10, 2011 at 12:58 AM, charan kumar charan.ku...@gmail.com wrote: Hi Abhishek, depth is a param of the crawl command, not the fetch command ...
Re: [Nutch] and Solr integration
All, I realize that the documentation says that you crawl first and then add to Solr, but I spent several hours running the same command through Cygwin with -solrindex http://localhost:8983/solr on the command line (e.g. bin/nutch crawl urls -dir crawl -threads 10 -depth 100 -topN 50 -solrindex http://localhost:8983/solr) and it worked. Does anyone know why it's not working for me anymore? I am using the Lucid build of Solr, which is what I was using before. I neglected to write down the command line syntax, which is biting me in the arse. Any tips on this one would be great! Thanks, Adam

On Mon, Dec 20, 2010 at 4:21 PM, Anurag anurag.it.jo...@gmail.com wrote: Why are you using solrindex in the argument? It is used when we need to index the crawled data in Solr. For more, read http://wiki.apache.org/nutch/NutchTutorial . Also, for nutch-solr integration this is a very useful blog: http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/ I integrated nutch and solr and it works well. Thanks

On Tue, Dec 21, 2010 at 1:57 AM, Adam Estrada-2 [via Lucene] ml-node+2122347-622655030-146...@n3.nabble.com wrote: All, I have a couple websites that I need to crawl and the following command line used to work, I think. Solr is up and running and everything is fine there, and I can go through and index the site, but I really need the results added to Solr after the crawl. Does anyone have any idea how to make that happen, or what I'm doing wrong? These errors are being thrown from Hadoop, which I am not using at all.

$ bin/nutch crawl urls -dir crawl -threads 10 -depth 100 -topN 50 -solrindex http://localhost:8983/solr
crawl started in: crawl
rootUrlDir = http://localhost:8983/solr
threads = 10
depth = 100
indexer=lucene
topN = 50
Injector: starting at 2010-12-20 15:23:25
Injector: crawlDb: crawl/crawldb
Injector: urlDir: http://localhost:8983/solr
Injector: Converting injected urls to crawl db entries.
Exception in thread "main" java.io.IOException: No FileSystem for scheme: http
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1375)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1390)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:196)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:175)
    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:169)
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
    at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
    at org.apache.nutch.crawl.Injector.inject(Injector.java:217)
    at org.apache.nutch.crawl.Crawl.main(Crawl.java:124)

-- Kumar Anurag
Re: [Nutch] and Solr integration
BLEH! *facepalm* This is entirely possible to do in a single step, AS LONG AS YOU GET THE SYNTAX CORRECT ;-) See http://www.lucidimagination.com/blog/2010/09/10/refresh-using-nutch-with-solr/

bin/nutch crawl urls -dir crawl -threads 10 -depth 100 -topN 50 -solr http://localhost:8983/solr

The correct param is -solr, NOT -solrindex. That also explains the Hadoop error above: the crawl command didn't recognize -solrindex, so it took the Solr URL as the url directory (hence rootUrlDir = http://localhost:8983/solr in the log), and Hadoop then failed with No FileSystem for scheme: http. Cheers, Adam

On Mon, Jan 3, 2011 at 11:45 AM, Adam Estrada estrada.a...@gmail.com wrote: All, I realize that the documentation says that you crawl first and then add to Solr, but I spent several hours running the same command through Cygwin with -solrindex on the command line and it worked ...
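After the corrected -solr run finishes, it is worth confirming that documents actually reached Solr; a quick check against the standard select handler, with the URL assumed to match the command above:
---
# numFound > 0 in the response means the crawl results were indexed
curl 'http://localhost:8983/solr/select?q=*:*&rows=0&wt=json'
---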
Re: [Nutch] and Solr integration
Why are you using solrindex in the argument? It is used when we need to index the crawled data in Solr. For more, read http://wiki.apache.org/nutch/NutchTutorial . Also, for nutch-solr integration this is a very useful blog: http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/ I integrated nutch and solr and it works well. Thanks

On Tue, Dec 21, 2010 at 1:57 AM, Adam Estrada-2 [via Lucene] ml-node+2122347-622655030-146...@n3.nabble.com wrote: All, I have a couple websites that I need to crawl and the following command line used to work, I think ...

-- Kumar Anurag
Re: [Nutch] and Solr integration
bin/nutch crawl urls -dir crawl -threads 10 -depth 100 -topN 50 -solrindex http://localhost:8983/solr

I've run that command before and it worked... that's why I asked. Grab nutch from trunk and run bin/nutch, and see that it is in fact an option. It looks like Hadoop is the culprit now and I am at a loss on how to fix it. Thanks for the feedback. Adam

On Mon, Dec 20, 2010 at 4:21 PM, Anurag anurag.it.jo...@gmail.com wrote: Why are you using solrindex in the argument? It is used when we need to index the crawled data in Solr ...
Re: Nutch with SOLR
On 9/26/07, Brian Whitman [EMAIL PROTECTED] wrote: Sami has a patch in there which used an older version of the solr client. With the current solr client in the SVN tree, his patch becomes much easier. Your job would be to upgrade the patch and mail it back to him so he can update his blog, or post it as a patch for inclusion in nutch/contrib (if sami is ok with that). If you have issues with how to use the solr client api, solr-user is here to help.

I've done this. Apparently someone else has taken on the solr-nutch job and made it a bit more complicated (which is good for the long term) than sami's original patch -- https://issues.apache.org/jira/browse/NUTCH-442

That someone else is me :) NUTCH-442 is one of the issues that I really want to see resolved. Unfortunately, I haven't received many (as in, none) comments, so I haven't made further progress on it. The patch at NUTCH-442 tries to integrate SOLR as a first-class citizen (so to speak), so that you can index to solr or to lucene within the same Indexer job (or both), and retrieve search results from a solr server or from nutch's home-grown index servers in nutch's web UI (or a combination of both). And I think the patch lays the groundwork for generating summaries from solr.

But we still use a version of Sami's patch that works on both trunk nutch and trunk solr (solrj.) I sent my changes to sami when we did it, if you need it let me know... -b

-- Doğacan Güney
Re: Nutch with SOLR
On Sep 26, 2007, at 4:04 AM, Doğacan Güney wrote: NUTCH-442 is one of the issues that I really want to see resolved. Unfortunately, I haven't received many (as in, none) comments, so I haven't made further progress on it.

I am probably your target customer, but to be honest all we care about is using Solr to index, not any of the searching or summary stuff in Nutch. Is there a way to get Sami's SolrIndexer into nutch trunk (now that it's working OK) sooner rather than later, and keep working on NUTCH-442 as well? Do they conflict? -b
Re: Nutch with SOLR
[moving this thread to solr-user, as it really has nothing to do with hadoop]

Daniel Clark wrote: There's info on the website http://blog.foofactory.fi/2007/02/online-indexing-integrating-nutch-with.html, but it's not clear.

Sami has a patch in there which used an older version of the solr client. With the current solr client in the SVN tree, his patch becomes much easier. Your job would be to upgrade the patch and mail it back to him so he can update his blog, or post it as a patch for inclusion in nutch/contrib (if sami is ok with that). If you have issues with how to use the solr client api, solr-user is here to help.

The nutch specific changes are:
1. configure nutch-site.xml to add a config option that points to your solr server.
2. instead of calling the nutch 'index' command, call it like so:
bin/nutch org.apache.nutch.indexer.SolrIndexer $BASEDIR/crawldb $BASEDIR/linkdb $SEGMENT

regards Ian

~ Daniel Clark, President DAC Systems, Inc. (703) 403-0340 ~

-----Original Message----- From: Dmitry [mailto:[EMAIL PROTECTED]] Sent: Tuesday, September 25, 2007 2:56 PM To: [EMAIL PROTECTED] Subject: Re: Nutch with SOLR

Daniel, We just started to test/research the possibility of integrating Nutch and Solr, so it would be nice to hear any advice as well. Thanks, DT www.ejizn.com

- Original Message - From: Daniel Clark [EMAIL PROTECTED] To: [EMAIL PROTECTED] Sent: Tuesday, September 25, 2007 1:23 PM Subject: Nutch with SOLR

Has anyone been able to get Nutch 0.9 working with SOLR? Any help would be appreciated.

~ Daniel Clark, President DAC Systems, Inc. (703) 403-0340 ~
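A sketch of Ian's step 1, pointing Nutch at the Solr server via conf/nutch-site.xml. The property name solr.server.url is the one later Nutch releases use; Sami's original patch may read a different key, so treat the name as an assumption and check the patch you apply:
---
# assumes $NUTCH_HOME is your Nutch checkout and nutch-site.xml has no other overrides yet
cat > $NUTCH_HOME/conf/nutch-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>solr.server.url</name>
    <value>http://localhost:8983/solr</value>
  </property>
</configuration>
EOF
---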
Re: Nutch with SOLR
Sami has a patch in there which used an older version of the solr client. With the current solr client in the SVN tree, his patch becomes much easier. Your job would be to upgrade the patch and mail it back to him so he can update his blog, or post it as a patch for inclusion in nutch/contrib (if sami is ok with that). If you have issues with how to use the solr client api, solr-user is here to help.

I've done this. Apparently someone else has taken on the solr-nutch job and made it a bit more complicated (which is good for the long term) than sami's original patch -- https://issues.apache.org/jira/browse/NUTCH-442

But we still use a version of Sami's patch that works on both trunk nutch and trunk solr (solrj.) I sent my changes to sami when we did it, if you need it let me know... -b
Re: Nutch with SOLR
But we still use a version of Sami's patch that works on both trunk nutch and trunk solr (solrj.) I sent my changes to sami when we did it, if you need it let me know... I put my files up here: http://variogr.am/latest/?p=26 -b
Re: Nutch with SOLR
Thanks Brian. I'm sure this will help lots of people. Brian Whitman wrote: But we still use a version of Sami's patch that works on both trunk nutch and trunk solr (solrj.) I sent my changes to sami when we did it, if you need it let me know... I put my files up here: http://variogr.am/latest/?p=26 -b