Re: How to index PDF file stored in SQL Server 2008
Hi all, thank you very much for your kind help.

1. I have upgraded from Solr 1.4 to Solr 3.1.

2. Changed data-config-sql.xml:

<dataConfig>
  <dataSource type="JdbcDataSource" name="bsds"
              driver="com.microsoft.sqlserver.jdbc.SQLServerDriver"
              url="jdbc:sqlserver://localhost:1433;databaseName=bs_docmanager"
              user="username" password="pw"/>
  <dataSource name="docds" type="BinURLDataSource"/>
  <document name="docs">
    <entity name="doc" dataSource="bsds"
            query="select id,attachment,filename from attachment where ext='pdf' and id &lt; 30001030">
      <field column="id" name="id"/>
      <entity dataSource="docds" processor="TikaEntityProcessor"
              url="${doc.attachment}" format="text">
        <field column="attachment" name="bs_attachment"/>
      </entity>
      <field column="filename" name="title"/>
    </entity>
  </document>
</dataConfig>

3. solrconfig.xml and schema.xml are NOT changed.

However, when I access http://localhost:8080/solr/dataimport?command=full-import it still fails:

Full Import failed: org.apache.solr.handler.dataimport.DataImportHandlerException:
Unable to execute query: [B@ae1393 Processing Document # 1

Could you give me some advice? This problem is really bothering me. Thanks.

--
Best Regards,
Roy Liu

On Mon, Apr 11, 2011 at 5:16 AM, Lance Norskog goks...@gmail.com wrote:

You have to upgrade completely to the Apache Solr 3.1 release. It is worth the effort. You cannot copy any jars between Solr releases. Also, you cannot copy over jars from newer Tika releases.

On Fri, Apr 8, 2011 at 10:47 AM, Darx Oman darxo...@gmail.com wrote:

Hi again. What you are missing is the field mapping: <field column="id" name="id"/>. There is no need for TikaEntityProcessor since you are not accessing PDF files.

--
Lance Norskog
goks...@gmail.com
Clustering with grouping
Hi, we use a Solr trunk nightly (4.0). Grouping our results works with no problem, but when we try to cluster them with clustering?q=rose&group=true&group.field=site we get a 500 error:

Problem accessing /solr/clustering. Reason: null

java.lang.NullPointerException
        at org.apache.solr.handler.clustering.ClusteringComponent.process(ClusteringComponent.java:89)
        at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:231)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
        at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:245)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1290)
        at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:353)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:248)
        at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
        at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
        at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
        at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
        at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
        at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
        at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
        at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
        at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
        at org.mortbay.jetty.Server.handle(Server.java:326)
        at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
        at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928)
        at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549)
        at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
        at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
        at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:228)
        at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)

Here is the clustering part of solrconfig (element names lost, values only):

true stc true title url content,url true false dismax explicit 0.01 content^0.5 anchor^1.0 title^1.2 content^0.5 anchor^1.5 title^1.2 site^1.5 recip(date,1,1000,1000)^0.3 2<-1 5<-2 6<90% 100 *:* 100 score clustering
Indexing Best Practice
Hi guys, I'm wondering how best to configure Solr to fulfill my requirements. I'm indexing data from two sources:

1- Database
2- PDF files (password encrypted)

Every file has related information stored in the database. Both the file content and the related database fields must be indexed as one document in Solr. Among the DB data is *per-user* permissions for every document.

The file contents almost never change; on the other hand, the DB data, and especially the permissions, change very frequently, which requires me to re-index everything for every modified document. My problem is the process of decrypting the PDF files before re-indexing them, which takes too much time for a large number of documents; a full re-index could span days.

What I'm trying to accomplish is eliminating the need to re-index the PDF content if it hasn't changed, even if the DB data changed. I know this is not possible in Solr, because Solr doesn't update documents in place. So how best to accomplish this: can I use two indexes, one for PDF contents and the other for DB data, with a common id field as a link between them, *and have results treated as one document*?
Re: How to index PDF file stored in SQL Server 2008
Hi, I have copied \apache-solr-3.1.0\dist\apache-solr-dataimporthandler-extras-3.1.0.jar into \apache-tomcat-6.0.32\webapps\solr\WEB-INF\lib\.

Other errors:

Caused by: com.microsoft.sqlserver.jdbc.SQLServerException: Unclosed quotation mark after the character string 'B@3e574'.

--
Best Regards,
Roy Liu

On Mon, Apr 11, 2011 at 2:12 PM, Darx Oman darxo...@gmail.com wrote:

Hi there. The error is not clear... but did you copy apache-solr-dataimporthandler-extras-4.0-SNAPSHOT.jar to your solr\lib?
Re: Tika, Solr running under Tomcat 6 on Debian
Hi All,

I have the same issue. I have installed a Solr instance on Tomcat 6. When I try to index a PDF I run into the exception below:

11 Apr, 2011 12:11:55 PM org.apache.solr.common.SolrException log
SEVERE: java.lang.NoClassDefFoundError: org/apache/tika/exception/TikaException
        at java.lang.Class.forName0(Native Method)
        at java.lang.Class.forName(Class.java:247)
        at org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:359)
        at org.apache.solr.core.SolrCore.createInstance(SolrCore.java:413)
        at org.apache.solr.core.SolrCore.createRequestHandler(SolrCore.java:449)
        at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.getWrappedHandler(RequestHandlers.java:240)
        at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:231)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
        at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
        at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
        at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:175)
        at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128)
        at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
        at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
        at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:286)
        at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:844)
        at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:583)
        at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:447)
        at java.lang.Thread.run(Thread.java:619)
Caused by: java.lang.ClassNotFoundException: org.apache.tika.exception.TikaException
        at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:248)
        ... 22 more

I could not find any Tika jar file. Could you please help me fix this issue?

Thanks,
Mike
Re: Solr 3.1 performance compared to 1.4.1
Hi Yonik! Thanks for your reply.

I decided to switch to 3.1 and see if the performance would settle down after building up a proper index. Looking at the average response time from both installations, I can see that 3.1 is now actually performing much better than 1.4.1 (1.4.1 shows an average of 43ms, 3.1 shows 32ms). My earlier test (with new keywords) now shows that 3.1 also outperforms 1.4.1 on keywords which have not yet been queried.

For the record, the tests are run on Ubuntu 10.04 (8GB RAM, quad core, software RAID 1). I've given both installations a JVM with 1GB of RAM. I've unpacked a new installation of 3.1 beside 1.4.1, and copied in the (in my case) missing parts of the configuration (data importer, SQL XML config and schema additions).

Cheers!
Marius

2011/4/10 Yonik Seeley yo...@lucidimagination.com

On Fri, Apr 8, 2011 at 9:53 AM, Marius van Zwijndregt pionw...@gmail.com wrote:

Hello! I'm new to the list, have been using Solr for roughly 6 months and love it. Currently I'm setting up a 3.1 installation next to a 1.4.1 installation (Ubuntu server, same JVM params). I have copied the configuration from 1.4.1 to 3.1. Both versions are running fine, but one thing I've noticed is that the QTime on 3.1 is much slower for initial searches than on the (currently production) 1.4.1 installation. For example:

Searching with 3.1, http://mysite:9983/solr/select?q=grasmaaier: QTime returns 371
Searching with 1.4.1, http://mysite:8983/solr/select?q=grasmaaier: QTime returns 59

Using debugQuery=true, I can see that most of the time is spent in the query component itself (org.apache.solr.handler.component.QueryComponent). Can someone explain this, and how can I analyze it further? Does it take time to build up a decent query, and could I switch to 3.1 without having to worry?

Thanks for the report... there's no reason that anything should really be much slower, so it would be great to get to the bottom of this! Is this using the same index as the 1.4.1 server, or did you rebuild it? Are there any other query parameters (perhaps added by default, like faceting or anything else that could take up time) or is this truly just a term query? What platform are you on? I believe the Lucene Directory implementation now tries to be smarter (compared to Lucene 2.9) about picking the best default (but it may not be working out for you for some reason).

-Yonik
http://www.lucenerevolution.org -- Lucene/Solr User Conference, May 25-26, San Francisco
Re: How to index PDF file stored in SQL Server 2008
I changed data-config-sql.xml to:

<dataConfig>
  <dataSource type="JdbcDataSource" name="bsds"
              driver="com.microsoft.sqlserver.jdbc.SQLServerDriver"
              url="jdbc:sqlserver://localhost:1433;databaseName=bs_docmanager"
              user="username" password="pw" convertType="true"/>
  <document name="docs">
    <entity name="doc" dataSource="bsds"
            query="select id,filename,attachment from attachment where ext='pdf' and id=3632">
      <field column="id" name="id"/>
      <field column="filename" name="title"/>
      <field column="attachment" name="bs_attachment"/>
    </entity>
  </document>
</dataConfig>

There are no errors now, but the indexed PDF content comes out as numbers:

200 1 202 1 203 1 212 1 222 1 236 1 242 1 244 1 254 1 255

--
Best Regards,
Roy Liu
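For reference, the usual way to push a database BLOB through Tika in the DataImportHandler is a FieldStreamDataSource, which streams the binary column into the TikaEntityProcessor instead of treating the bytes as a URL or a query (which is what produces the "Unable to execute query: [B@..." error earlier in this thread). A hedged sketch, assuming FieldStreamDataSource is present in your DIH jars; dataField and format are the attribute names documented on the DIH wiki:

<dataConfig>
  <dataSource type="JdbcDataSource" name="bsds"
              driver="com.microsoft.sqlserver.jdbc.SQLServerDriver"
              url="jdbc:sqlserver://localhost:1433;databaseName=bs_docmanager"
              user="username" password="pw"/>
  <!-- streams a binary column from the parent row into Tika -->
  <dataSource name="fieldds" type="FieldStreamDataSource"/>
  <document name="docs">
    <entity name="doc" dataSource="bsds"
            query="select id,filename,attachment from attachment where ext='pdf'">
      <field column="id" name="id"/>
      <field column="filename" name="title"/>
      <entity name="tika" dataSource="fieldds" processor="TikaEntityProcessor"
              dataField="doc.attachment" format="text">
        <!-- TikaEntityProcessor exposes the extracted content under the "text" column -->
        <field column="text" name="bs_attachment"/>
      </entity>
    </entity>
  </document>
</dataConfig>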
Re: Tika, Solr running under Tomcat 6 on Debian
\apache-solr-3.1.0\contrib\extraction\lib\tika*.jar

--
Best Regards,
Roy Liu
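To make those jars visible under Tomcat, you can copy them into the webapp's WEB-INF/lib, or reference them from solrconfig.xml with lib directives, as in the stock 3.1 example config (paths are relative to the core's instanceDir and may need adjusting for your layout):

<!-- in solrconfig.xml -->
<lib dir="../../contrib/extraction/lib" regex=".*\.jar"/>
<lib dir="../../dist/" regex="apache-solr-cell-\d.*\.jar"/>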
Re: DIH: Enhance XPathRecordReader to deal with //body(FLATTEN=true) and //body/h1
Hi Lance,

you are right: XPathEntityProcessor has the attribute xsl, so I can use XSLT to generate an XML file in the form of the standard Solr update schema. I will check the performance of this.

Best regards
Karsten

btw. flatten is an attribute of the field tag, not of XPathEntityProcessor (as wrongly specified in the wiki)

Lance: There is an option somewhere to use the full XML DOM implementation for using xpaths. The purpose of the XPathEP is to be as simple and dumb as possible and handle most cases: RSS feeds and other open standards. Search for xsl(optional) http://wiki.apache.org/solr/DataImportHandler#Configuration_in_data-config.xml-1

Karsten: On Sat, Apr 9, 2011 at 5:32 AM

Hi Folks, does anyone improve DIH XPathRecordReader to deal with nested xpaths? e.g. data-config.xml with

<entity ... processor="XPathEntityProcessor" ...>
  <field column="title" xpath="//body/h1"/>
  <field column="alltext" xpath="//body" flatten="true"/>
</entity>

and an XML stream that contains /html/body/h1... will only fill field "alltext"; field "title" will be empty. This is a known issue from 2009: https://issues.apache.org/jira/browse/SOLR-1437#commentauthor_12756469_verbose

So three questions:
1. How to fill a "search over all" field without nested xpaths? (schema.xml <copyField source="*" dest="alltext"/> will not help, because we lose the original token order)
2. Does anyone try to improve XPathRecordReader to deal with nested xpaths?
3. Does anyone else need this feature?

Best regards
Karsten

http://lucene.472066.n3.nabble.com/DIH-Enhance-XPathRecordReader-to-deal-with-body-FLATTEN-true-and-body-h1-td2799005.html
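The xsl route hangs directly off XPathEntityProcessor. A minimal sketch of such an entity (the URL and stylesheet path are hypothetical; xsl and useSolrAddSchema are the documented attribute names):

<entity name="docs"
        processor="XPathEntityProcessor"
        url="http://example.com/export.xml"
        xsl="xslt/to-solr-add.xsl"
        useSolrAddSchema="true"/>
<!-- the stylesheet must emit standard <add><doc><field name="..."/></doc></add> markup -->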
RE: Solr under Tomcat
Hi All,

I have installed a Solr instance on Tomcat 6. When I tried to index a PDF file I got a successful-looking response (status 0, QTime 479) from this query:

http://localhost:8080/solr/update/extract?stream.file=D:\mike\lucene\apache-solr-1.4.1\example\exampledocs\Struts%202%20Design%20and%20Programming1.pdf&stream.contentType=application/pdf&literal.id=Struts%202%20Design%20and%20Programming1.pdf&defaultField=text&commit=true

But when I tried to search the content in the PDF I could not get any results: the response shows status 0, QTime 2, and numFound 0 for q=struts (rows=10, version=2.2).

Could you please let me know if I am doing anything wrong? It works fine with the default Jetty server, prior to integrating with Tomcat 6. I followed the installation steps from http://wiki.apache.org/solr/SolrTomcat (Tomcat on Windows, Single Solr app).

Thanks,
Mike
Re: Tika, Solr running under Tomcat 6 on Debian
Hi Roy,

Thank you for the quick reply. When I tried to index the PDF file I got a successful-looking response (status 0, QTime 479) from this query:

http://localhost:8080/solr/update/extract?stream.file=D:\mike\lucene\apache-solr-1.4.1\example\exampledocs\Struts%202%20Design%20and%20Programming1.pdf&stream.contentType=application/pdf&literal.id=Struts%202%20Design%20and%20Programming1.pdf&defaultField=text&commit=true

But when I tried to search the content in the PDF I could not get any results: the response shows status 0, QTime 2, and numFound 0 for q=struts (rows=10, version=2.2).

Could you please let me know if I am doing anything wrong? It works fine with the default Jetty server, prior to integrating with Tomcat 6. I followed the installation steps from http://wiki.apache.org/solr/SolrTomcat (Tomcat on Windows, Single Solr app).

Thanks,
Mike
Re: Extracting contents of zipped files with Tika and Solr 1.4.1
Jayendra,

Thanks for the info - I've been keeping an eye on this list in case this topic cropped up again. It's currently a background task for me, so I'll try to take a look at the patches and re-test soon.

Joey - glad you brought this issue up again. I haven't progressed any further with it. I've not yet moved to Solr 3.1 but it's on my to-do list, as is testing out the patches referenced by Jayendra. I'll post my findings on this thread - if you manage to test the patches before me, let me know how you get on.

Thanks and kind regards,
Gary.

On 11/04/2011 05:02, Jayendra Patil wrote:

The migration of Tika to the latest 0.8 version seems to have reintroduced the issue. I was able to get this working again with the following patches (Solr Cell and Data Import handler):

https://issues.apache.org/jira/browse/SOLR-2416
https://issues.apache.org/jira/browse/SOLR-2332

You can try these.

Regards,
Jayendra

On Sun, Apr 10, 2011 at 10:35 PM, Joey Hanzel phan...@nearinfinity.com wrote:

Hi Gary,

I have been experiencing the same problem... unable to extract content from archive file formats. I just tried again with a clean install of Solr 3.1.0 (using Tika 0.8) and continue to experience the same results. Did you have any success with this problem with Solr 1.4.1 or 3.1.0? I'm using this curl command to send data to Solr:

curl "http://localhost:8080/solr/update/extract?literal.id=doc1&fmap.content=attr_content&commit=true" -H "Content-Type: application/octet-stream" -F "myfile=@data.zip"

No problem extracting single rich-text documents, but archive files only result in the file names within the archive being indexed. Am I missing something else in my configuration? Solr doesn't seem to be unpacking the archive files. Based on the email chain associated with your first message, some people have been able to get this functionality to work as desired.

--
Gary Taylor
INOVEM
Tel +44 (0)1488 648 480
Fax +44 (0)7092 115 933
gary.tay...@inovem.com
www.inovem.com
INOVEM Ltd is registered in England and Wales No 4228932
Registered Office 1, Weston Court, Weston, Berkshire. RG20 8JE
Spellchecker with synonyms
Hello,

I have synonyms for city names. Sometimes there are multiple names for one city, for example: newyork, newyork city, big apple. If I search for "big apple" I get results with "new york" (via the synonym). If somebody searches for "big aple" I want a spelling suggestion like "big apple". How can I make the synonyms available to the spellchecker?
Re: Spellchecker with synonyms
Did you configure synonyms for your field at query time?

Ludovic.

-
Jouve
France.
Re: Spellchecker with synonyms
Yes, it looks like this: ... It will work at query and index time, I think.
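A minimal sketch of a query-time synonym setup like the one described, assuming the stock SynonymFilterFactory and a synonyms.txt containing the city aliases (the poster's actual field type may differ):

<fieldType name="text_city" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- expand=true maps each alias to all of its synonyms at query time -->
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>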
XML not coming through from nabble to Gmail
All:

Lately I've been seeing a lot of posts where people paste in parts of their schema.xml or solrconfig.xml and the results are... er... disappointing. None of the less-than or greater-than symbols show, and the formatting is all over the map. Since some mails would come through with the XML formatted and some would be wonky, at first I thought it was the sender, but then a pretty high percentage came through this way. So I poked around, and it seems the XML is only wonkified (tm) when it comes to Gmail from nabble; the original post on nabble has the markup and displays fine. Behavior is the same in Chrome and Firefox, BTW.

Does anyone have any insight into this? Time to complain to the nabble folks? Do others see this with non-Gmail relays?

Thanks,
Erick
Can I set up a config-based distributed search
In the Distributed Search page (http://wiki.apache.org/solr/DistributedSearch), it is documented that in order to perform a distributed search over a sharded index, I should use the shards request parameter, listing the shards to participate in the search (e.g. ?shards=localhost:8983/solr,localhost:7574/solr). I am planning a new, pretty large index (1B+ items). Say I have 100 shards: specifying the shards on the request URL becomes unrealistic due to the length of the URL. It is also redundant to do that on every request.

Is there a way to specify the list of shards in a configuration file, instead of on the query URL? I have seen references to relevant config in SolrCloud, but as I understand it that is planned to be released only in Solr 4.0.

Thanks,
Ran
Re: ArrayIndexOutOfBoundsException with facet query
Tom,

I think I see where this may be -- it looks like another 2B terms bug in Lucene (we are using an int instead of a long in the TermInfoAndOrd class inside TermInfosReader.java), only present in 3.1. I'm also mad that Test2BTerms fails to catch this!! I will go fix that test and confirm it sees this bug.

Can you build from source? If so, try this patch:

Index: lucene/src/java/org/apache/lucene/index/TermInfosReader.java
===================================================================
--- lucene/src/java/org/apache/lucene/index/TermInfosReader.java (revision 1089906)
+++ lucene/src/java/org/apache/lucene/index/TermInfosReader.java (working copy)
@@ -46,8 +46,8 @@
   // Just adds term's ord to TermInfo
   private final static class TermInfoAndOrd extends TermInfo {
-    final int termOrd;
-    public TermInfoAndOrd(TermInfo ti, int termOrd) {
+    final long termOrd;
+    public TermInfoAndOrd(TermInfo ti, long termOrd) {
       super(ti);
       this.termOrd = termOrd;
     }
@@ -245,7 +245,7 @@
       // wipe out the cache when they iterate over a large numbers
       // of terms in order
       if (tiOrd == null) {
-        termsCache.put(cacheKey, new TermInfoAndOrd(ti, (int) enumerator.position));
+        termsCache.put(cacheKey, new TermInfoAndOrd(ti, enumerator.position));
       } else {
         assert sameTermInfo(ti, tiOrd, enumerator);
         assert (int) enumerator.position == tiOrd.termOrd;
@@ -262,7 +262,7 @@
     // random-access: must seek
     final int indexPos;
     if (tiOrd != null) {
-      indexPos = tiOrd.termOrd / totalIndexInterval;
+      indexPos = (int) (tiOrd.termOrd / totalIndexInterval);
     } else {
       // Must do binary search:
       indexPos = getIndexOffset(term);
@@ -274,7 +274,7 @@
     if (enumerator.term() != null && term.compareTo(enumerator.term()) == 0) {
       ti = enumerator.termInfo();
       if (tiOrd == null) {
-        termsCache.put(cacheKey, new TermInfoAndOrd(ti, (int) enumerator.position));
+        termsCache.put(cacheKey, new TermInfoAndOrd(ti, enumerator.position));
       } else {
         assert sameTermInfo(ti, tiOrd, enumerator);
         assert (int) enumerator.position == tiOrd.termOrd;

Mike

http://blog.mikemccandless.com

On Fri, Apr 8, 2011 at 4:53 PM, Burton-West, Tom tburt...@umich.edu wrote:

The query below results in an array out of bounds exception:

select/?q=solr&version=2.2&start=0&rows=0&facet=true&facet.field=topicStr

Here is the exception:

Exception during facet.field of topicStr: java.lang.ArrayIndexOutOfBoundsException: -1931149
        at org.apache.lucene.index.TermInfosReader.seekEnum(TermInfosReader.java:201)

We are using a dev version of Solr/Lucene:

Solr Specification Version: 3.0.0.2010.11.19.16.00.54
Solr Implementation Version: 3.1-SNAPSHOT 1036094 - root - 2010-11-19 16:00:54
Lucene Specification Version: 3.1-SNAPSHOT
Lucene Implementation Version: 3.1-SNAPSHOT 1036094 - 2010-11-19 16:01:10

Just before the exception we see this entry in our tomcat logs:

Apr 8, 2011 2:01:58 PM org.apache.solr.request.UnInvertedField uninvert
INFO: UnInverted multi-valued field {field=topicStr,memSize=7675174,tindexSize=289102,time=2577,phase1=2537,nTerms=498975,bigTerms=0,termInstances=1368694,uses=0}
Apr 8, 2011 2:01:58 PM org.apache.solr.core.SolrCore execute

Is this a known bug? Can anyone provide a clue as to how we can determine what the problem is?
Tom Burton-West

Appended below is the exception stack trace:

SEVERE: Exception during facet.field of topicStr: java.lang.ArrayIndexOutOfBoundsException: -1931149
        at org.apache.lucene.index.TermInfosReader.seekEnum(TermInfosReader.java:201)
        at org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:271)
        at org.apache.lucene.index.TermInfosReader.terms(TermInfosReader.java:338)
        at org.apache.lucene.index.SegmentReader.terms(SegmentReader.java:928)
        at org.apache.lucene.index.DirectoryReader$MultiTermEnum.<init>(DirectoryReader.java:1055)
        at org.apache.lucene.index.DirectoryReader.terms(DirectoryReader.java:659)
        at org.apache.solr.search.SolrIndexReader.terms(SolrIndexReader.java:302)
        at org.apache.solr.request.NumberedTermEnum.skipTo(UnInvertedField.java:1018)
        at org.apache.solr.request.UnInvertedField.getTermText(UnInvertedField.java:838)
        at org.apache.solr.request.UnInvertedField.getCounts(UnInvertedField.java:617)
        at org.apache.solr.request.SimpleFacets.getTermCounts(SimpleFacets.java:279)
        at org.apache.solr.request.SimpleFacets.getFacetFieldCounts(SimpleFacets.java:312)
        at org.apache.solr.request.SimpleFacets.getFacetCounts(SimpleFacets.java:174)
        at org.apache.solr.handler.component.FacetComponent.process(FacetComponent.java:72)
        at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:195)
Re: Can I set up a config-based distributed search
You can add the shards parameter to your search handler:

<requestHandler name="dist-search" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="shards">host1/solr,host2/solr</str>
  </lst>
</requestHandler>

Is this what you are looking for?

Ludovic.

-
Jouve
France.
Re: Is there a way to create multiple doc using DIH and access the data pertaining to a particular doc name ?
Hi All,

I am new to Solr and want to implement Solr search. I have to implement two search buttons (1. books and 2. computers; both are in the same datasource) which are completely different; there is no relation between them. Could you please let me know how to define the entities in data-config.xml and also in schema.xml? Is it possible to do something like: ...

Thanks,
Mike
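Presumably the stripped snippet above meant two sibling entities under one document, along these lines; a hypothetical sketch (table and column names invented for illustration):

<document>
  <entity name="books" query="select id, title, author from books">
    <field column="id" name="id"/>
    <field column="title" name="title"/>
  </entity>
  <entity name="computers" query="select id, model, brand from computers">
    <field column="id" name="id"/>
    <field column="model" name="model"/>
  </entity>
</document>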
Re: Indexing Best Practice
If it's of any help, I've split the processing of PDF files from the indexing. I put the PDF content into a text file (but I guess you could load it into a database) and use that as part of the indexing. My processing of the PDF files also compares timestamps on the document and the text file, so that I'm only processing documents that have changed.

I am a newbie, so perhaps there are more sophisticated approaches. Hope that helps.

Shaun
Reloading synonyms.txt without downtime
Hi,

Apparently, when one RELOADs a core, the synonyms file is not reloaded. Is this the expected behaviour? Is it the desired behaviour?

Here's the use-case: when one is doing purely query-time synonym expansion, ideally one would be able to edit synonyms.txt and get it reloaded, so that the changes can start taking effect immediately. One might think that RELOADing a Solr core would achieve this, but apparently this doesn't happen. Should it? Are there technical reasons why RELOADing a core should not reload the synonyms file? (Other than if synonyms are used at index time, changing the synonyms would mean that one has to reindex old docs in order for the changes to apply to them.)

Issue https://issues.apache.org/jira/browse/SOLR-1307 mentions this a bit, but doesn't go into a lot of depth.

Thanks,
Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/
Re: Can I set up a config-based distributed search
I have not worked with shards/distributed, but I think you can probably specify them as defaults in your request handler in solrconfig.xml instead. Somewhere there is (or was) a wiki page on this that I can't find right now. There's a way to specify (for a particular request handler) a default parameter value, such as for 'shards', that will be used if none is given with the request. There's also a way to specify an invariant that will always be used, even if something else is passed in on the request.

Ah, found it: http://wiki.apache.org/solr/SearchHandler#Configuration
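Putting that together with the question above, a sketch of pinning a long shard list in solrconfig.xml (handler name and host list are illustrative; use defaults if clients may override the list, invariants if they may not):

<requestHandler name="/distrib" class="solr.SearchHandler">
  <lst name="invariants">
    <!-- always applied, even if the request supplies its own shards param -->
    <str name="shards">shard1:8983/solr,shard2:8983/solr,shard3:8983/solr</str>
  </lst>
</requestHandler>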
Re: Performance with search terms starting and ending with wildcards
Hi,

Perhaps you should give Lucene/Solr trunk a try and compare! The wildcard query in trunk should be much faster.

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/

- Original Message -
From: Ueland tor.henn...@gmail.com
To: solr-user@lucene.apache.org
Sent: Sun, April 10, 2011 10:44:46 AM
Subject: Performance with search terms starting and ending with wildcards

Hi!

I have been doing some testing with Solr and wildcards. Queries like:

- *foo
- foo*

complete quickly (1-2s) on a test index of about 40-50GB. But when I try to search for *foo*, the search time can easily exceed 30 seconds. Any ideas on how that issue can be worked around? One fix would be to change *foo* to (*foo OR foo* OR oof* OR *oof) (is the reverse even needed?), but that will not give the same results as *foo*, logically enough.

I have also tried to set maxTimeAllowed, but that is simply ignored. I guess that is related to either sorting or the wildcard search itself.
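Not covered in the thread, but for the leading-wildcard half of the problem, Solr (since 1.4) ships solr.ReversedWildcardFilterFactory: it also indexes each token reversed, so a query like *foo can be rewritten internally into a fast prefix query. A sketch of a field type using it, with attribute values taken from the stock example schema (it does not help the double-ended *foo* case):

<fieldType name="text_rev" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- indexes both the original token and its reverse -->
    <filter class="solr.ReversedWildcardFilterFactory" withOriginal="true"
            maxPosAsterisk="3" maxPosQuestion="2" maxFractionAsterisk="0.33"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>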
Clarifying fetchindex command
Hi,

Can one actually *force* replication of the index from the master without a commit being issued on the master since the last replication?

I do see "Force a fetchindex on slave from master command: http://slave_host:port/solr/replication?command=fetchindex" on http://wiki.apache.org/solr/SolrReplication#HTTP_API, but that feels more like "force the replication *now* instead of waiting for the slave to poll the master" than "force the replication even if there is no new commit point and no new index version on the master".

Which one is it, really?

Thanks,
Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/
RE: ArrayIndexOutOfBoundsException with facet query
Thanks Mike,

At first I thought this couldn't be related to the 2.1 billion terms issue, since the only place we have tons of terms is in the OCR field, and this is not the OCR field. But then I remembered that the total number of terms in all fields is what matters. We've had no problems with regular searches against the index or with other facet queries, only with this facet. Is TermInfoAndOrd only used for faceting?

I'll go ahead and build the patch and let you know.

Tom

p.s. Here is the field definition:

<field name="topicStr" type="string" indexed="true" stored="false" multiValued="true"/>
<fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>
RE: Problems indexing very large set of documents
I found a simpler command-line method to update the PDF files. On some documents it works perfectly: the result is a pixel-for-pixel match and none of the OCR text (all these PDFs are newspaper articles that have been passed through OCR) is lost. However, on other documents the result is considerably blurrier and some of the OCR text is lost. We've decided to skip any documents that Tika cannot index for now. As Lance stated, it's not specifically the version that causes the problem but rather quirks of different PDF writers; a few tests have confirmed this, so we can't use the version to determine which documents should be skipped.

I'm examining the XML responses from the queries, and I cannot figure out how to tell from the XML response whether or not a document was successfully indexed. The status value seems to be 0 regardless of whether indexing was successful or not. So my question is: how can I tell from the response whether or not indexing was actually successful?

~Brandon Waterloo

From: Lance Norskog [goks...@gmail.com]
Sent: Sunday, April 10, 2011 5:22 PM
To: solr-user@lucene.apache.org
Subject: Re: Problems indexing very large set of documents

There is a library called iText. It parses and writes PDFs very very well, and a simple program will let you do a batch conversion. PDFs are made by a wide range of programs, not just Adobe code. Many of these do weird things and make small mistakes that Tika does not know how to handle. In other words, there is dirty PDF just like dirty HTML. A percentage of PDFs will fail, and that's life. One site that gets press releases from zillions of sites (and thus a wide range of PDF generators) has a 15% failure rate with Tika.

Lance

On Fri, Apr 8, 2011 at 9:44 AM, Brandon Waterloo brandon.water...@matrix.msu.edu wrote:

I think I've finally found the problem. The files that work are PDF version 1.6. The files that do NOT work are PDF version 1.4. I'll look into updating all the old documents to PDF 1.6. Thanks everyone!

~Brandon Waterloo

From: Ezequiel Calderara [ezech...@gmail.com]
Sent: Friday, April 08, 2011 11:35 AM
To: solr-user@lucene.apache.org
Cc: Brandon Waterloo
Subject: Re: Problems indexing very large set of documents

Maybe those files are created with a different Adobe Format version... See this: http://lucene.472066.n3.nabble.com/PDF-parser-exception-td644885.html

On Fri, Apr 8, 2011 at 12:14 PM, Brandon Waterloo brandon.water...@matrix.msu.edu wrote:

A second test has revealed that it is something to do with the contents, and not the literal filenames, of the second set of files. I renamed one of the second-format files and tested it, and Solr still failed. However, the problem still only applies to those files of the second naming format.

From: Brandon Waterloo [brandon.water...@matrix.msu.edu]
Sent: Friday, April 08, 2011 10:40 AM
To: solr-user@lucene.apache.org
Subject: RE: Problems indexing very large set of documents

I had some time to do some research into the problems. From what I can tell, it appears Solr is tripping up over the filename. These are strictly examples, but Solr handles this filename fine:

32-130-A0-84-african_activist_archive-a0a6s3-b_12419.pdf

However, it fails with either a parsing error or an EOF exception on this filename:

32-130-A08-84-al.sff.document.nusa197102.pdf

The only significant difference is that the second filename contains multiple periods. As there are about 1700 files whose filenames are similar to the second format, it is simply not possible to change their filenames. In addition, they are being used by other applications. Is there something I can change in the Solr configs to fix this issue, or am I simply SOL until the Solr dev team can work on this (assuming I put in a ticket)?

Thanks again everyone,

~Brandon Waterloo

From: Chris Hostetter [hossman_luc...@fucit.org]
Sent: Tuesday, April 05, 2011 3:03 PM
To: solr-user@lucene.apache.org
Subject: RE: Problems indexing very large set of documents

: It wasn't just a single file, it was dozens of files all having problems
: toward the end just before I killed the process.
...
: That is by no means all the errors, that is just a sample of a few.
: You can see they all threw HTTP 500 errors. What is strange is, nearly
: every file succeeded before about the 2200-files-mark, and nearly every
: file after that failed.

..the root question is: do those files *only* fail if you have already indexed ~2200 files, or do they fail if you start up your server and index them first? There may be a resource issue.
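On the question of reading success out of the update response: the standard response format only carries a status and timing, roughly as sketched below (values illustrative). A status of 0 means the request was processed; in these versions a failed extraction typically surfaces as an HTTP 500 rather than as a non-zero status inside a 200 response, so per-document success generally has to be confirmed by checking the HTTP code or querying for the document id afterwards.

<response>
  <lst name="responseHeader">
    <!-- 0 = request processed; hard failures usually arrive as an HTTP error instead -->
    <int name="status">0</int>
    <int name="QTime">479</int>
  </lst>
</response>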
Re: ArrayIndexOutOfBoundsException with facet query
Right, it's the total number of terms across all fields... unfortunately.

This class is used to enroll a term into the terms cache that wraps the terms dictionary, so in theory you could also hit this issue during normal searching, when a term is looked up once and then looked up again (the second time will pull from the cache).

I've mod'd Test2BTerms and am running it now...

Mike

http://blog.mikemccandless.com
RE: ArrayIndexOutOfBoundsException with facet query
Thanks Mike,

With the unpatched version, the first time I run the facet query on topicStr it works fine, but the second time I get the ArrayIndexOutOfBoundsException. If I try different facets, such as language, I don't see the same symptoms. Maybe the number of facet values needs to exceed some threshold to trigger the bug?

I rebuilt lucene-core-3.1-SNAPSHOT.jar with your patch and it fixes the problem.

Tom
Lucene Revolution 2011 - Early Bird Ends April 18
A quick reminder that there's one week left on special pricing for Lucene Revolution 2011. Sign up this week and save some serious cash: - Conference Registration, now $545, a savings of $180 over the $725 late registration price - Training Package with 2-day Training plus Conference Registration now $1695, a savings of $200 over the $1895 late registration package price (and even more savings over the a la carte pricing) What can you expect at the conference? - Keynote presentations from The Guardian News and Media’s Stephen Dunn and Redmonk’s Stephen O’Grady - Session track talks on use cases, tutorials and technology strategy at leading edge, innovative companies, including: Travelocity, eBay, eHarmony, EMC, Etsy, Trulia, Intuit, Careerbuilder, AT&T, The Ladders and more - Deep internals and implementation guidance at talks by Apache Solr/Lucene committers including Grant Ingersoll, Yonik Seeley, Andrzej Bialecki, Uwe Schindler, Simon Willnauer, Erik Hatcher, Otis Gospodnetic, and more. You will also have an unmatched opportunity to network with over 400 of your peers from the open source search ecosystem, in all sectors of government, universities, start-ups, Fortune 1000 companies, and the developer and user community. Register at: http://us.ootoweb.com/luceneregistration P.S. There are also a few free tickets left for the San Francisco Giants vs. Florida Marlins game on May 24! Michael Bohlig | Lucid Imagination Enterprise Marketing p +1 650 353 4057 x132 m+1 650 703 8383 www.lucidimagination.com
Re: DIH: Enhance XPathRecordReader to deal with //body(FLATTEN=true) and //body/h1
Hi Lance, I used XPathEntityProcessor with the attribute xsl and generated an XML file in the form of the standard Solr update schema. I lost a lot of performance; it is a pity that XPathEntityProcessor only uses one thread. My tests with a collection of 350T documents: 1. use of XPathRecordReader without xslt: 28min 2. use of XPathEntityProcessor with xslt (standard solr-war / Xalan): 44min 3. use of XPathEntityProcessor with saxon-xslt: 36min Best regards Karsten Lance There is an option somewhere to use the full XML DOM implementation for using xpaths. The purpose of the XPathEP is to be as simple and dumb as possible and handle most cases: RSS feeds and other open standards. Search for xsl(optional) http://wiki.apache.org/solr/DataImportHandler#Configuration_in_data-config.xml-1 --karsten Hi Folks, has anyone improved DIH XPathRecordReader to deal with nested xpaths? e.g. data-config.xml with <entity .. processor="XPathEntityProcessor" ..> <field column="title" xpath="//body/h1"/> <field column="alltext" xpath="//body" flatten="true"/> and the XML stream contains /html/body/h1... will only fill field “alltext” but field “title” will be empty. This is a known issue from 2009 https://issues.apache.org/jira/browse/SOLR-1437#commentauthor_12756469_verbose So three questions: 1. How to fill a “search over all” field without nested xpaths? (schema.xml <copyField source="*" dest="alltext"/> will not help, because we lose the original token order) 2. Does anyone try to improve XPathRecordReader to deal with nested xpaths? 3. Does anyone else need this feature? Best regards Karsten http://lucene.472066.n3.nabble.com/DIH-Enhance-XPathRecordReader-to-deal-with-body-FLATTEN-true-and-body-h1-td2799005.html
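For reference, a minimal data-config.xml along the lines Karsten describes (the file and stylesheet paths here are placeholders, not from his setup) might look like:

<dataConfig>
  <dataSource type="FileDataSource" encoding="UTF-8"/>
  <document>
    <entity name="docs"
            processor="XPathEntityProcessor"
            url="/data/export.xml"
            xsl="xslt/to-solr-add.xsl"
            useSolrAddSchema="true"/>
  </document>
</dataConfig>

With useSolrAddSchema="true" the processor expects the XSLT output to already be in the standard <add><doc><field>... update format, so no field mappings need to be declared on the entity.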
Will Slaves Pileup Replication Requests?
What is the slave replication behavior if a replication request to pull indexes takes longer than the replication interval itself? In other words, if my replication interval is set to every 30 seconds, and my indexes are large enough to take longer than 30 seconds to transfer, is the slave smart enough to not send another replication request while one is already in progress? -Parker
Re: Will Slaves Pileup Replication Requests?
Yes. It will wait whatever the replication interval is after the most recent replication completes before attempting again. On Apr 11, 2011, at 2:42 PM, Parker Johnson wrote: What is the slave replication behavior if a replication request to pull indexes takes longer than the replication interval itself? In other words, if my replication interval is set to every 30 seconds, and my indexes are large enough to take longer than 30 seconds to transfer, is the slave smart enough to not send another replication request while one is already in progress? -Parker
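For reference, the interval in this scenario is the pollInterval setting in the slave's solrconfig.xml (the master hostname below is a placeholder); a typical slave section looks something like:

<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <str name="masterUrl">http://master-host:8983/solr/replication</str>
    <!-- HH:mm:ss between polls; the clock restarts after a pull completes -->
    <str name="pollInterval">00:00:30</str>
  </lst>
</requestHandler>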
Re: Exact match on a field with stemming
Hi, Using "quoted" means "use this as a phrase", not "use this as a literal". :) I think copying to an unstemmed field is the only/common work-around. Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original Message From: Pierre-Luc Thibeault pierre-luc.thibea...@wantedtech.com To: solr-user@lucene.apache.org Sent: Mon, April 11, 2011 2:55:04 PM Subject: Exact match on a field with stemming Hi all, Is there a way to perform an exact match query on a field that has stemming enabled by using the standard /select handler? I thought that putting a word inside double-quotes would enable this behaviour but if I query my field with a single word like “manager” I am receiving results containing the word “management”. I know I can use a CopyField with different types but that would double the size of my index… Is there an alternative? Thanks
Re: Will Slaves Pileup Replication Requests?
Thanks Larry. -Parker On 4/11/11 12:14 PM, Green, Larry (CMG - Digital) larry.gr...@cmgdigital.com wrote: Yes. It will wait whatever the replication interval is after the most recent replication completes before attempting again. On Apr 11, 2011, at 2:42 PM, Parker Johnson wrote: What is the slave replication behavior if a replication request to pull indexes takes longer than the replication interval itself? In other words, if my replication interval is set to every 30 seconds, and my indexes are large enough to take longer than 30 seconds to transfer, is the slave smart enough to not send another replication request while one is already in progress? -Parker
Question on Dismax plugin
All, I have a question on the Dismax plugin for the search handler. I have two test instances of Solr. In one I am using the default search handler. In this case, the fields that I am working with (slug and story) are indexed via the all_text field and the searches are done on the all_text field. For the other one I have configured a search handler using the dismax plugin as shown below. <requestHandler name="mydismax" class="solr.SearchHandler"> <lst name="defaults"> <str name="defType">dismax</str> <str name="echoParams">explicit</str> <float name="tie">0.01</float> <str name="qf">story^3.0 slug^0.2</str> <int name="ps">100</int> <str name="q.alt">*:*</str> </lst> </requestHandler> To make testing easier, I only have 4 (same) documents in both indexes with the word Obama appearing inside as described below. File 1:: The word Obama appears zero times in slug field and four times in story field File 2:: The word Obama appears zero times in slug field and thrice in story field File 3:: The word Obama appears zero times in slug field and two times in story field File 4:: The word Obama appears One time in slug field and one time in story field Here is the order of the documents in the order of decreasing scores from the search results Dismax Search Handler (steadily decreasing scores): * File 1:: The word Obama appears zero times in slug field and four times in story field * File 4:: The word Obama appears One time in slug field and one time in story field * File 2:: The word Obama appears zero times in slug field and thrice in story field * File 3:: The word Obama appears zero times in slug field and two times in story field Standard Search handler: * File 1:: The word Obama appears zero times in slug field and four times in story field * File 2:: The word Obama appears zero times in slug field and thrice in story field (same score as File 4 score below) * File 4:: The word Obama appears One time in slug field and one time in story field (same score as File 2 score above) * File 3:: The word Obama appears zero times in slug field and two times in story field My question, why is dismax showing File 4:: The word Obama appears One time in slug field and one time in story field ahead of File 2:: The word Obama appears zero times in slug field and thrice in story field given that I have boosted these fields as shown below. <str name="qf">story^3.0 slug^0.2</str> I would have thought that the File 4:: The word Obama appears One time in slug field and one time in story field would have gone all the way down in the result list. Any help is appreciated Thanks much in advance Raj
Re: Question on Dismax plugin
Hi Raj, I'm guessing your slug field is much shorter and thus a match in that field has more weight than a match in a much longer story field. If you omit norms for those fields in the schema (and reindex), I believe you will see File 4 drop to position #4. Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original Message From: Nemani, Raj raj.nem...@turner.com To: solr-user@lucene.apache.org Sent: Mon, April 11, 2011 4:12:52 PM Subject: Question on Dismax plugin All, I have a question on the Dismax plugin for the search handler. I have two test instances of Solr. In one I am using the default search handler. In this case, the fields that I am working with (slug and story) are indexed via the all_text field and the searches are done on the all_text field. For the other one I have configured a search handler using the dismax plugin as shown below. <requestHandler name="mydismax" class="solr.SearchHandler"> <lst name="defaults"> <str name="defType">dismax</str> <str name="echoParams">explicit</str> <float name="tie">0.01</float> <str name="qf">story^3.0 slug^0.2</str> <int name="ps">100</int> <str name="q.alt">*:*</str> </lst> </requestHandler> To make testing easier, I only have 4 (same) documents in both indexes with the word Obama appearing inside as described below. File 1:: The word Obama appears zero times in slug field and four times in story field File 2:: The word Obama appears zero times in slug field and thrice in story field File 3:: The word Obama appears zero times in slug field and two times in story field File 4:: The word Obama appears One time in slug field and one time in story field Here is the order of the documents in the order of decreasing scores from the search results Dismax Search Handler (steadily decreasing scores): * File 1:: The word Obama appears zero times in slug field and four times in story field * File 4:: The word Obama appears One time in slug field and one time in story field * File 2:: The word Obama appears zero times in slug field and thrice in story field * File 3:: The word Obama appears zero times in slug field and two times in story field Standard Search handler: * File 1:: The word Obama appears zero times in slug field and four times in story field * File 2:: The word Obama appears zero times in slug field and thrice in story field (same score as File 4 score below) * File 4:: The word Obama appears One time in slug field and one time in story field (same score as File 2 score above) * File 3:: The word Obama appears zero times in slug field and two times in story field My question, why is dismax showing File 4:: The word Obama appears One time in slug field and one time in story field ahead of File 2:: The word Obama appears zero times in slug field and thrice in story field given that I have boosted these fields as shown below. <str name="qf">story^3.0 slug^0.2</str> I would have thought that the File 4:: The word Obama appears One time in slug field and one time in story field would have gone all the way down in the result list. Any help is appreciated Thanks much in advance Raj
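If it helps, norms are omitted per field in schema.xml (the field definitions below are illustrative; only the omitNorms attribute is the point), followed by a full reindex:

<field name="slug" type="text" indexed="true" stored="true" omitNorms="true"/>
<field name="story" type="text" indexed="true" stored="true" omitNorms="true"/>

With norms gone, field length no longer feeds into the score, so the short slug match loses its built-in advantage.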
Re: Mongo REST interface and full data import
Thank you guys for your answers. I didn't realise it would be so easy to do, and the example from http://wiki.apache.org/solr/UpdateJSON#Example works perfectly for me. Regards, Andrew -- View this message in context: http://lucene.472066.n3.nabble.com/Mongo-REST-interface-and-full-data-import-tp2774479p2808507.html Sent from the Solr - User mailing list archive at Nabble.com.
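For anyone else landing on this thread, that wiki example boils down to posting JSON documents to the JSON update handler; a minimal sketch (the field names and values here are made up):

curl 'http://localhost:8983/solr/update/json?commit=true' \
  -H 'Content-type:application/json' \
  -d '[{"id":"doc1","title":"Hello JSON"}]'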
Re: MoreLikeThis match
Does anyone have any thoughts on this one? On Fri, Apr 8, 2011 at 9:26 AM, Brian Lamb brian.l...@journalexperts.com wrote: I've looked at both wiki pages and none really clarify the difference between these two. If I copy and paste an existing index value for field and do an mlt search, it shows up under match but not results. What is the difference between these two? On Thu, Apr 7, 2011 at 2:24 PM, Brian Lamb brian.l...@journalexperts.com wrote: Actually, what is the difference between match and response? It seems that match always returns one result but I've thrown a few cases at it where the score of the highest response is higher than the score of match. And then there are cases where the match score dwarfs the highest response score. On Thu, Apr 7, 2011 at 1:30 PM, Brian Lamb brian.l...@journalexperts.com wrote: Hi all, I've been using MoreLikeThis for a while through select: http://localhost:8983/solr/select/?q=field:"more like this"&mlt=true&mlt.fl=field&rows=100&fl=*,score I was looking over the wiki page today and saw that you can also do this: http://localhost:8983/solr/mlt/?q=field:"more like this"&mlt=true&mlt.fl=field&rows=100 which seems to run faster and do a better job overall. When the results are returned, they are formatted like this: <response> <lst name="responseHeader"> <int name="status">0</int> <int name="QTime">1</int> </lst> <result name="match" numFound="24" start="0" maxScore="3.0438285"> <doc> <float name="score">3.0438285</float> <str name="id">5</str> </doc> </result> <result name="response" numFound="4077" start="0" maxScore="0.12775186"> <doc> <float name="score">0.1125823</float> <str name="id">3</str> </doc> <doc> <float name="score">0.10231556</float> <str name="id">8</str> </doc> ... </result> </response> It seems that it always returns just 1 response under match and response is set by the rows parameter. How can I get more than one result under match? What I'm trying to do here is whatever is set for field:, I would like to return the top 100 records that match that search based on more like this. Thanks, Brian Lamb
Too many open files exception related to solrj getServer too often?
Hi, I get this solrj error in a development environment. org.apache.solr.client.solrj.SolrServerException: java.net.SocketException: Too many open files At the time there was no reindexing or any write to the index. There were only different queries generated using solrj to hit the solr server: CommonsHttpSolrServer server = new CommonsHttpSolrServer(url); server.setSoTimeout(1000); // socket read timeout server.setConnectionTimeout(1000); server.setDefaultMaxConnectionsPerHost(100); server.setMaxTotalConnections(100); ... QueryResponse rsp = server.query(solrQuery); I did NOT share a reference to the solrj CommonsHttpSolrServer among requests. So every http request will obtain a solrj SolrServer instance and run the query on it. The questions are: 1. Should the solrj client share one instance of CommonsHttpSolrServer? Why? Is every CommonsHttpSolrServer matched to one solr/lucene reader? From the source code, it just shows it is related to one apache http client. 2. Is the TooManyOpenFiles exception related to my possibly wrong usage of CommonsHttpSolrServer? 3. server.query(solrQuery) throws SolrServerException. How can concurrent solr queries trigger a Too many open files exception? Look forward to your input. Thanks, cy -- View this message in context: http://lucene.472066.n3.nabble.com/Too-many-open-files-exception-related-to-solrj-getServer-too-often-tp2808718p2808718.html Sent from the Solr - User mailing list archive at Nabble.com.
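The usual guidance here (a sketch, not a diagnosis of this particular stack trace) is that CommonsHttpSolrServer wraps a pooled Apache HttpClient and is thread-safe, so one instance per Solr URL should be shared rather than created per request; a minimal holder, reusing the same timeouts as above:

import java.net.MalformedURLException;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

public class SharedSolrServer {
    // One process-wide instance; creating one per request can leave sockets
    // lingering in TIME_WAIT and exhaust file descriptors.
    private static CommonsHttpSolrServer server;

    public static synchronized CommonsHttpSolrServer get(String url)
            throws MalformedURLException {
        if (server == null) {
            server = new CommonsHttpSolrServer(url);
            server.setSoTimeout(1000);
            server.setConnectionTimeout(1000);
            server.setDefaultMaxConnectionsPerHost(100);
            server.setMaxTotalConnections(100);
        }
        return server;
    }
}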
RE: Exact match on a field with stemming
I'm curious to know why Solr is not respecting the phrase. If it considers "manager" as a phrase... shouldn't it return only documents containing that phrase? -Original Message- From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com] Sent: April-11-11 3:42 PM To: solr-user@lucene.apache.org Subject: Re: Exact match on a field with stemming Hi, Using "quoted" means "use this as a phrase", not "use this as a literal". :) I think copying to an unstemmed field is the only/common work-around. Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original Message From: Pierre-Luc Thibeault pierre-luc.thibea...@wantedtech.com To: solr-user@lucene.apache.org Sent: Mon, April 11, 2011 2:55:04 PM Subject: Exact match on a field with stemming Hi all, Is there a way to perform an exact match query on a field that has stemming enabled by using the standard /select handler? I thought that putting a word inside double-quotes would enable this behaviour but if I query my field with a single word like “manager” I am receiving results containing the word “management”. I know I can use a CopyField with different types but that would double the size of my index… Is there an alternative? Thanks
FW: Exact match on a field with stemming
I'm curious to know why Solr is not respecting the phrase. If it considers "manager" as a phrase... shouldn't it return only documents containing that phrase? A phrase means to solr (or rather to the lucene and dismax query parsers, which are what understand double-quoted phrases) "these tokens in exactly this order". So a phrase of one token, "manager", is exactly the same as if you didn't use the double quotes. It's only one token, so "all the tokens in this phrase in exactly the order specified" is, well, just the same as one token without phrase quotes. If you've set up a stemmed field at indexing time, then manager and management are stemmed IN THE INDEX, probably to something like "manag". There is no longer any information in the index (at least in that field) on what the original literal was; it's been stemmed in the index. So there's no way possible for it to only match certain un-stemmed versions -- at least using that field. And when you enter either 'manager' or 'management' at query time, it is analyzed and stemmed to match that stemmed something-like "manag" in the index either way. If it didn't analyze and stem at query time, then instead the query would just match NOTHING, because neither 'manager' nor 'management' are in the index at all, only the stemmed versions. So, yes, double quotes are interpreted as a phrase, and only documents containing that phrase are returned, you got it. -Original Message- From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com] Sent: April-11-11 3:42 PM To: solr-user@lucene.apache.org Subject: Re: Exact match on a field with stemming Hi, Using "quoted" means "use this as a phrase", not "use this as a literal". :) I think copying to an unstemmed field is the only/common work-around. Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original Message From: Pierre-Luc Thibeault pierre-luc.thibea...@wantedtech.com To: solr-user@lucene.apache.org Sent: Mon, April 11, 2011 2:55:04 PM Subject: Exact match on a field with stemming Hi all, Is there a way to perform an exact match query on a field that has stemming enabled by using the standard /select handler? I thought that putting a word inside double-quotes would enable this behaviour but if I query my field with a single word like “manager” I am receiving results containing the word “management”. I know I can use a CopyField with different types but that would double the size of my index… Is there an alternative? Thanks
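A minimal schema.xml sketch of the copy-to-unstemmed-field work-around mentioned above (all field and type names here are invented):

<fieldType name="text_exact" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<field name="body" type="text" indexed="true" stored="true"/>
<field name="body_exact" type="text_exact" indexed="true" stored="false"/>
<copyField source="body" dest="body_exact"/>

A query like body_exact:"manager" then matches only the literal token, while body keeps the stemmed recall.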
Re: when to change rows param?
Paul: can you elaborate a little bit on what exactly your problem is? - what is the full component list you are using? - how are you changing the param value (ie: what does the code look like) - what isn't working the way you expect? : I've been using my own QueryComponent (that extends the search one) : successfully to rewrite web-received parameters that are sent from the : (ExtJS-based) javascript client. This allows a degree of : query-rewriting, which is good. I tried to change the rows parameter there : (which is limit in the query, as per the underpinnings of ExtJS) but : it seems that this is not enough. : : Which component should I subclass to change the rows parameter? -Hoss
Re: Deduplication questions
: Q1. Is it possible to pass *analyzed* content to the : : public abstract class Signature { No, analysis happens as the documents are being written to the lucene index, well after the UpdateProcessors have had a chance to interact with the values. : Q2. Method calculate() is using concatenated fields from <str name="fields">name,features,cat</str> : Is there any mechanism I could use to build field-dependent signatures? At the moment the Signature API is fairly minimal, but it could definitely be improved by adding more methods (that have sensible defaults in the base class) that would give the impl more control over the resulting signature ... we just need people to propose good suggestions with example use cases. : Is the idea to make two UpdateProcessors and chain them OK? (It is ugly, but : would work) I don't know that what you describe is really intentional or not, but it should work -Hoss
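To make the field-dependent idea concrete, here is a toy Signature subclass written against the API as this thread quotes it (an abstract class with a calculate(String) method); the class name and the logic inside are hypothetical:

import org.apache.solr.update.processor.Signature;

// Toy example: fingerprint only the first whitespace-delimited token of the
// concatenated field content the dedupe processor hands in.
public class FirstTokenSignature extends Signature {
    @Override
    public String calculate(String content) {
        if (content == null || content.trim().isEmpty()) {
            return "";
        }
        String first = content.trim().split("\\s+")[0];
        return Integer.toHexString(first.hashCode());
    }
}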
Re: XML not coming through from nabble to Gmail
I see the same problem (missing markup) in Thunderbird. Seems like Nabble might be the culprit? -Mike On 4/11/2011 8:13 AM, Erick Erickson wrote: All: Lately I've been seeing a lot of posts where people paste in parts of their schema.xml or solrconfig.xml and the results are...er...disappointing. None of the less-than or greater-than symbols show and the formatting is all over the map. Since some mails would come through with the XML formatted and some would be wonky, at first I thought it was the sender, but then a pretty high percentage came through this way. So I poked around and it seems to only be the case that the XML is wonkified (tm) when it comes to Gmail from nabble; the original post on nabble has the markup and displays fine. Behavior is the same in Chrome and Firefox BTW. Does anyone have any insight into this? Time to complain to the nabble folks? Do others see this with non-Gmail relays? Thanks, Erick
Re: Solr 1.4.1 compatible with Lucene 3.0.1?
Hi, I only read the short story. :) Note that you should post questions like this on the solr-user@lucene list, which is where I'm replying now. Since you are just starting with Solr, why not grab the recently released 3.1? That way you'll get the latest Lucene and the latest Solr. Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original Message From: RichSimon richard_si...@hms.harvard.edu To: solr-...@lucene.apache.org Sent: Mon, April 11, 2011 10:36:46 AM Subject: Solr 1.4.1 compatible with Lucene 3.0.1? Short story: I am using Lucene 3.0.1, and I'm trying to run Solr 1.4.1. I get an error starting the embedded Solr server that says it cannot find the method FSDirectory.getDirectory. The release notes for Solr 1.4.1 say it is compatible with Lucene 2.9.3, and I see that Lucene 3.0.1 does not have the FSDirectory.getDirectory method any more. Downgrading Lucene to 2.9.x is not an option for me. What version of Solr should I use for Lucene 3.0.1? (We're just starting with Solr, so changing that version is not hard.) Or, do I have to upgrade both Solr and Lucene? Thanks, -Rich Here's the long story: I am using Lucene 3.0.1, and I'm trying to run Solr 1.4.1. I have not used any other version of Lucene. We have an existing project using Lucene 3.0.1, and we want to start using Solr. When I try to initialize an embedded Solr server, like so: String solrHome = PATH_TO_SOLR_HOME; File home = new File(solrHome); File solrXML = new File(home, "solr.xml"); coreContainer = new CoreContainer(); coreContainer.load(solrHome, solrXML); embeddedSolr = new EmbeddedSolrServer(coreContainer, SOLR_CORE); [04-08 11:48:39] ERROR CoreContainer [main]: java.lang.NoSuchMethodError: org.apache.lucene.store.FSDirectory.getDirectory(Ljava/lang/String;)Lorg/apache/lucene/store/FSDirectory; at org.apache.solr.spelling.AbstractLuceneSpellChecker.initIndex(AbstractLuceneSpellChecker.java:186) at org.apache.solr.spelling.AbstractLuceneSpellChecker.init(AbstractLuceneSpellChecker.java:101) at org.apache.solr.spelling.IndexBasedSpellChecker.init(IndexBasedSpellChecker.java:56) at org.apache.solr.handler.component.SpellCheckComponent.inform(SpellCheckComponent.java:274) at org.apache.solr.core.SolrResourceLoader.inform(SolrResourceLoader.java:508) at org.apache.solr.core.SolrCore.<init>(SolrCore.java:588) at org.apache.solr.core.CoreContainer.create(CoreContainer.java:428) at org.apache.solr.core.CoreContainer.load(CoreContainer.java:278) Looking at Google posts about this, it seemed that this can be caused by a version mismatch between the Lucene version in use and the one Solr tries to use. I noticed a Lucene version tag in the example solrconfig.xml that I'm modifying: <luceneMatchVersion>LUCENE_40</luceneMatchVersion> I tried changing it to LUCENE_301, then to LUCENE_30, and commenting it out, but I still get the same error. Using LucenePackage.get().getImplementationVersion() shows this as the Lucene version: Lucene version: 3.0.1 912433 - 2010-02-21 23:51:22 I also printed my classpath and found the following lucene jars: lucene-analyzers-3.0.1.jar lucene-core-3.0.1.jar lucene-highlighter-3.0.1.jar lucene-memory-3.0.1.jar lucene-misc-2.9.3.jar lucene-queries-2.9.3.jar lucene-snowball-2.9.3.jar lucene-spellchecker-2.9.3.jar The FSDirectory class is in lucene-core. I decompiled the class file in the jar, and did not see a getDirectory method.
Also, I used a ClassLoader statement to get an instance of the FSDirectory class my code is using, and printed out the methods; no getDirectory method. I gather from the Lucene Javadoc that the getDirectory method is in FSDirectory for 2.4.0 and for 2.9.0, but is gone in 3.0.1 (the version I'm using). Is Lucene 3.0.1 completely incompatible with Solr 1.4.1? Is there some way to use the luceneMatchVersion tag to tell Solr what version I want to use? -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-1-4-1-compatible-with-Lucene-3-0-1-tp2806828p2806828.html Sent from the Solr - Dev mailing list archive at Nabble.com. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: partial optimize does not reduce the segment number to maxNumSegments
: I have a core with 120+ segment files and I tried partial optimize specifying : maxNumSegments=10; after the optimize the segment files were reduced to 64 files; a) the option you want to specify is maxSegments .. not maxNumSegments http://wiki.apache.org/solr/UpdateXmlMessages#Optional_attributes_for_.22optimize.22 b) i can't reproduce this ... i just created an index with 200 segments and when i hit the example url from the wiki... curl 'http://localhost:8983/solr/update?optimize=true&maxSegments=10&waitFlush=false' ...my index was correctly optimized down to 10 segments. is it possible that you just didn't wait long enough and you were observing the number of segments while the optimize was still taking place? -Hoss
Re: XML not coming through from nabble to Gmail
: I see the same problem (missing markup) in Thunderbird. Seems like Nabble : might be the culprit? if someone can cite some specific examples (by email message-id, or subject, or date+sender, or url from nabble, or url from any public archive, or anything more specific than posts from nabble containing xml) we can check the official apache mail archive which contains the raw message as received by ezmlm, such as: http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201104.mbox/raw/%3cbanlktimcpthzalstrwhn3rtzpxdzkbo...@mail.gmail.com%3E -Hoss
Re: Extracting contents of zipped files with Tika and Solr 1.4.1
Awesome. Thanks Jayendra. I hadn't caught these patches yet. I applied the SOLR-2416 patch to the solr-3.1 release tag. This resolved the problem of archive files not being unpacked and indexed with Solr Cell. Thanks for the FYI. https://issues.apache.org/jira/browse/SOLR-2416 On Mon, Apr 11, 2011 at 12:02 AM, Jayendra Patil jayendra.patil@gmail.com wrote: The migration of Tika to the latest 0.8 version seems to have reintroduced the issue. I was able to get this working again with the following patches. (Solr Cell and Data Import handler) https://issues.apache.org/jira/browse/SOLR-2416 https://issues.apache.org/jira/browse/SOLR-2332 You can try these. Regards, Jayendra On Sun, Apr 10, 2011 at 10:35 PM, Joey Hanzel phan...@nearinfinity.com wrote: Hi Gary, I have been experiencing the same problem... Unable to extract content from archive file formats. I just tried again with a clean install of Solr 3.1.0 (using Tika 0.8) and continue to experience the same results. Did you have any success with this problem with Solr 1.4.1 or 3.1.0? I'm using this curl command to send data to Solr. curl "http://localhost:8080/solr/update/extract?literal.id=doc1&fmap.content=attr_content&commit=true" -H "Content-Type: application/octet-stream" -F "myfile=@data.zip" No problem extracting single rich text documents, but archive files only result in the file names within the archive being indexed. Am I missing something else in my configuration? Solr doesn't seem to be unpacking the archive files. Based on the email chain associated with your first message, some people have been able to get this functionality to work as desired. On Mon, Jan 31, 2011 at 8:27 AM, Gary Taylor g...@inovem.com wrote: Can anyone shed any light on this, and whether it could be a config issue? I'm now using the latest SVN trunk, which includes the Tika 0.8 jars. When I send a ZIP file (containing two txt files, doc1.txt and doc2.txt) to the ExtractingRequestHandler, I get the following log entry (formatted for ease of reading) : SolrInputDocument[ { ignored_meta=ignored_meta(1.0)={ [stream_source_info, file, stream_content_type, application/octet-stream, stream_size, 260, stream_name, solr1.zip, Content-Type, application/zip] }, ignored_=ignored_(1.0)={ [package-entry, package-entry] }, ignored_stream_source_info=ignored_stream_source_info(1.0)={file}, ignored_stream_content_type=ignored_stream_content_type(1.0)={application/octet-stream}, ignored_stream_size=ignored_stream_size(1.0)={260}, ignored_stream_name=ignored_stream_name(1.0)={solr1.zip}, ignored_content_type=ignored_content_type(1.0)={application/zip}, docid=docid(1.0)={74}, type=type(1.0)={5}, text=text(1.0)={ doc2.txtdoc1.txt} } ] So, the data coming back from Tika when parsing a ZIP file does not include the file contents, only the names of the files contained therein. I've tried forcing stream.type=application/zip in the CURL string, but that makes no difference. If I specify an invalid stream.type then I get an exception response, so I know it's being used.
When I send one of those txt files individually to the ExtractingRequestHandler, I get: SolrInputDocument[ { ignored_meta=ignored_meta(1.0)={ [stream_source_info, file, stream_content_type, text/plain, stream_size, 30, Content-Encoding, ISO-8859-1, stream_name, doc1.txt] }, ignored_stream_source_info=ignored_stream_source_info(1.0)={file}, ignored_stream_content_type=ignored_stream_content_type(1.0)={text/plain}, ignored_stream_size=ignored_stream_size(1.0)={30}, ignored_content_encoding=ignored_content_encoding(1.0)={ISO-8859-1}, ignored_stream_name=ignored_stream_name(1.0)={doc1.txt}, docid=docid(1.0)={74}, type=type(1.0)={5}, text=text(1.0)={The quick brown fox } } ] and we see the file contents in the text field. I'm using the following requestHandler definition in solrconfig.xml: <!-- Solr Cell: http://wiki.apache.org/solr/ExtractingRequestHandler --> <requestHandler name="/update/extract" class="org.apache.solr.handler.extraction.ExtractingRequestHandler" startup="lazy"> <lst name="defaults"> <!-- All the main content goes into text... if you need to return the extracted text or do highlighting, use a stored field. --> <str name="fmap.content">text</str> <str name="lowernames">true</str> <str name="uprefix">ignored_</str> <!-- capture link hrefs but ignore div attributes --> <str name="captureAttr">true</str> <str name="fmap.a">links</str> <str name="fmap.div">ignored_</str> </lst> </requestHandler> Is there any further debug or diagnostic I can get out of Tika to help me work out why it's only returning the file names and not the file contents when parsing a ZIP file? Thanks and kind regards,
RE: Exact match on a field with stemming
Thanks for the clarification. This makes sense. -Original Message- From: Jonathan Rochkind [mailto:rochk...@jhu.edu] Sent: April-11-11 7:54 PM To: solr-user@lucene.apache.org Subject: FW: Exact match on a field with stemming I'm curious to know why Solr is not respecting the phrase. If it considers "manager" as a phrase... shouldn't it return only documents containing that phrase? A phrase means to solr (or rather to the lucene and dismax query parsers, which are what understand double-quoted phrases) "these tokens in exactly this order". So a phrase of one token, "manager", is exactly the same as if you didn't use the double quotes. It's only one token, so "all the tokens in this phrase in exactly the order specified" is, well, just the same as one token without phrase quotes. If you've set up a stemmed field at indexing time, then manager and management are stemmed IN THE INDEX, probably to something like "manag". There is no longer any information in the index (at least in that field) on what the original literal was; it's been stemmed in the index. So there's no way possible for it to only match certain un-stemmed versions -- at least using that field. And when you enter either 'manager' or 'management' at query time, it is analyzed and stemmed to match that stemmed something-like "manag" in the index either way. If it didn't analyze and stem at query time, then instead the query would just match NOTHING, because neither 'manager' nor 'management' are in the index at all, only the stemmed versions. So, yes, double quotes are interpreted as a phrase, and only documents containing that phrase are returned, you got it. -Original Message- From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com] Sent: April-11-11 3:42 PM To: solr-user@lucene.apache.org Subject: Re: Exact match on a field with stemming Hi, Using "quoted" means "use this as a phrase", not "use this as a literal". :) I think copying to an unstemmed field is the only/common work-around. Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original Message From: Pierre-Luc Thibeault pierre-luc.thibea...@wantedtech.com To: solr-user@lucene.apache.org Sent: Mon, April 11, 2011 2:55:04 PM Subject: Exact match on a field with stemming Hi all, Is there a way to perform an exact match query on a field that has stemming enabled by using the standard /select handler? I thought that putting a word inside double-quotes would enable this behaviour but if I query my field with a single word like manager I am receiving results containing the word management. I know I can use a CopyField with different types but that would double the size of my index. Is there an alternative? Thanks
Re: MoreLikeThis match
Match is the document that's the top result of the query (q param) that you specify. Response is the list of documents that are similar to the 'match' document. -Mike On Mon, Apr 11, 2011 at 4:55 PM, Brian Lamb brian.l...@journalexperts.com wrote: Does anyone have any thoughts on this one? On Fri, Apr 8, 2011 at 9:26 AM, Brian Lamb brian.l...@journalexperts.com wrote: I've looked at both wiki pages and none really clarify the difference between these two. If I copy and paste an existing index value for field and do an mlt search, it shows up under match but not results. What is the difference between these two? On Thu, Apr 7, 2011 at 2:24 PM, Brian Lamb brian.l...@journalexperts.com wrote: Actually, what is the difference between match and response? It seems that match always returns one result but I've thrown a few cases at it where the score of the highest response is higher than the score of match. And then there are cases where the match score dwarfs the highest response score. On Thu, Apr 7, 2011 at 1:30 PM, Brian Lamb brian.l...@journalexperts.com wrote: Hi all, I've been using MoreLikeThis for a while through select: http://localhost:8983/solr/select/?q=field:"more like this"&mlt=true&mlt.fl=field&rows=100&fl=*,score I was looking over the wiki page today and saw that you can also do this: http://localhost:8983/solr/mlt/?q=field:"more like this"&mlt=true&mlt.fl=field&rows=100 which seems to run faster and do a better job overall. When the results are returned, they are formatted like this: <response> <lst name="responseHeader"> <int name="status">0</int> <int name="QTime">1</int> </lst> <result name="match" numFound="24" start="0" maxScore="3.0438285"> <doc> <float name="score">3.0438285</float> <str name="id">5</str> </doc> </result> <result name="response" numFound="4077" start="0" maxScore="0.12775186"> <doc> <float name="score">0.1125823</float> <str name="id">3</str> </doc> <doc> <float name="score">0.10231556</float> <str name="id">8</str> </doc> ... </result> </response> It seems that it always returns just 1 response under match and response is set by the rows parameter. How can I get more than one result under match? What I'm trying to do here is whatever is set for field:, I would like to return the top 100 records that match that search based on more like this. Thanks, Brian Lamb
Re: DIH: Enhance XPathRecordReader to deal with //body(FLATTEN=true) and //body/h1
The DIH has multi-threading. You can have one thread fetching files and then give them to different threads. On Mon, Apr 11, 2011 at 11:40 AM, karsten-s...@gmx.de wrote: Hi Lance, I used XPathEntityProcessor with the attribute xsl and generated an XML file in the form of the standard Solr update schema. I lost a lot of performance; it is a pity that XPathEntityProcessor only uses one thread. My tests with a collection of 350T documents: 1. use of XPathRecordReader without xslt: 28min 2. use of XPathEntityProcessor with xslt (standard solr-war / Xalan): 44min 3. use of XPathEntityProcessor with saxon-xslt: 36min Best regards Karsten Lance There is an option somewhere to use the full XML DOM implementation for using xpaths. The purpose of the XPathEP is to be as simple and dumb as possible and handle most cases: RSS feeds and other open standards. Search for xsl(optional) http://wiki.apache.org/solr/DataImportHandler#Configuration_in_data-config.xml-1 --karsten Hi Folks, has anyone improved DIH XPathRecordReader to deal with nested xpaths? e.g. data-config.xml with <entity .. processor="XPathEntityProcessor" ..> <field column="title" xpath="//body/h1"/> <field column="alltext" xpath="//body" flatten="true"/> and the XML stream contains /html/body/h1... will only fill field “alltext” but field “title” will be empty. This is a known issue from 2009 https://issues.apache.org/jira/browse/SOLR-1437#commentauthor_12756469_verbose So three questions: 1. How to fill a “search over all” field without nested xpaths? (schema.xml <copyField source="*" dest="alltext"/> will not help, because we lose the original token order) 2. Does anyone try to improve XPathRecordReader to deal with nested xpaths? 3. Does anyone else need this feature? Best regards Karsten http://lucene.472066.n3.nabble.com/DIH-Enhance-XPathRecordReader-to-deal-with-body-FLATTEN-true-and-body-h1-td2799005.html -- Lance Norskog goks...@gmail.com
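Presumably this refers to the entity-level threads attribute in then-current DIH builds (check that your version supports it; the paths, count, and forEach expression below are all illustrative):

<entity name="files" processor="FileListEntityProcessor"
        baseDir="/data/xml" fileName=".*\.xml"
        rootEntity="false" threads="4">
  <entity name="doc" processor="XPathEntityProcessor"
          url="${files.fileAbsolutePath}" forEach="/record">
    <field column="title" xpath="/record/title"/>
  </entity>
</entity>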
Re: Solr under Tomcat
Hi Mike- Please start a new thread for this. On Mon, Apr 11, 2011 at 2:47 AM, Mike satish01sud...@gmail.com wrote: Hi All, I have installed a solr instance on tomcat6. When I tried to index a PDF file I was able to see the response: 0 479 Query: http://localhost:8080/solr/update/extract?stream.file=D:\mike\lucene\apache-solr-1.4.1\example\exampledocs\Struts%202%20Design%20and%20Programming1.pdf&stream.contentType=application/pdf&literal.id=Struts%202%20Design%20and%20Programming1.pdf&defaultField=text&commit=true But when I tried to search the content in the pdf I could not get any results: 0 2 − on 0 struts 10 2.2 Could you please let me know if I am doing anything wrong. It worked fine when I tried with the default jetty server prior to integrating with tomcat6. I have followed the installation steps from http://wiki.apache.org/solr/SolrTomcat (Tomcat on Windows Single Solr app). Thanks, Mike -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-under-Tomcat-tp2613501p2805970.html Sent from the Solr - User mailing list archive at Nabble.com. -- Lance Norskog goks...@gmail.com
Re: Indexing Best Practice
SOLR-1499 is a plug-in for the DIH that uses Solr as a DataSource. This means that you can read the database and PDFs separately. You could index all of the PDF content in one DIH script. Then, when there's a database update, you have a separate DIH script that reads the old row from Solr, pulls the stripped text from the PDF, and re-indexes the whole thing. This would cut out the need to reparse the PDF; see the sketch after this message. Lance On Mon, Apr 11, 2011 at 8:48 AM, Shaun Campbell campbell.sh...@gmail.com wrote: If it's of any help, I've split the processing of PDF files from the indexing. I put the PDF content into a text file (but I guess you could load it into a database) and use that as part of the indexing. My processing of the PDF files also compares timestamps on the document and the text file so that I'm only processing documents that have changed. I am a newbie so perhaps there are more sophisticated approaches. Hope that helps. Shaun On 11 April 2011 07:20, Darx Oman darxo...@gmail.com wrote: Hi guys I'm wondering how to best configure solr to fulfill my requirements. I'm indexing data from 2 data sources: 1- Database 2- PDF files (password encrypted) Every file has related information stored in the database. Both the file content and the related database fields must be indexed as one document in solr. Among the DB data are *per-user* permissions for every document. The file contents nearly never change; on the other hand, the DB data and especially the permissions change very frequently, which requires me to re-index everything for every modified document. My problem is in the process of decrypting the PDF files before re-indexing them, which takes too much time for a large number of documents; it could span days in a full re-indexing. What I'm trying to accomplish is eliminating the need to re-index the PDF content if it hasn't changed, even if the DB data changed. I know this is not possible in solr, because solr doesn't update documents. So how to best accomplish this: Can I use 2 indexes, one for PDF contents and the other for DB data, and have a common id field for both as a link between them, *and results are treated as one Document*?
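A rough sketch of that wiring (the processor's attribute names follow SOLR-1499 as later committed; the SQL, field names, and URL are all placeholders, and the patch version you apply may differ):

<entity name="db" dataSource="jdbc"
        query="select id, title from attachment where modified &gt; '${dataimporter.last_index_time}'">
  <!-- pull the previously extracted PDF text back out of Solr -->
  <entity name="old" processor="SolrEntityProcessor"
          url="http://localhost:8983/solr" query="id:${db.id}" fl="pdf_text"/>
</entity>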
Re: Tika, Solr running under Tomcat 6 on Debian
Ah! Did you set the UTF-8 parameter in Tomcat? On Mon, Apr 11, 2011 at 2:49 AM, Mike satish01sud...@gmail.com wrote: Hi Roy, Thank you for the quick reply. When I tried to index the PDF file I was able to see the response: 0 479 Query: http://localhost:8080/solr/update/extract?stream.file=D:\mike\lucene\apache-solr-1.4.1\example\exampledocs\Struts%202%20Design%20and%20Programming1.pdf&stream.contentType=application/pdf&literal.id=Struts%202%20Design%20and%20Programming1.pdf&defaultField=text&commit=true But when I tried to search the content in the pdf I could not get any results: 0 2 − on 0 struts 10 2.2 Could you please let me know if I am doing anything wrong. It worked fine when I tried with the default jetty server prior to integrating with tomcat6. I have followed the installation steps from http://wiki.apache.org/solr/SolrTomcat (Tomcat on Windows Single Solr app). Thanks, Mike -- View this message in context: http://lucene.472066.n3.nabble.com/Tika-Solr-running-under-Tomcat-6-on-Debian-tp993295p2805974.html Sent from the Solr - User mailing list archive at Nabble.com. -- Lance Norskog goks...@gmail.com
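For the archives: the Tomcat parameter in question is presumably the URIEncoding attribute on the HTTP connector in server.xml, per the SolrTomcat wiki page Mike followed (the rest of this connector element is stock):

<Connector port="8080" protocol="HTTP/1.1"
           connectionTimeout="20000"
           redirectPort="8443"
           URIEncoding="UTF-8"/>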
Re: Solr 3.1 performance compared to 1.4.1
Marius: I have copied the configuration from 1.4.1 to the 3.1. Does the Directory implementation show up in the JMX beans? In admin/statistics.jsp? Or the Solr startup logs? (Sorry, don't have a Solr available.) Yonik: What platform are you on? I believe the Lucene Directory implementation now tries to be smarter (compared to lucene 2.9) about picking the best default (but it may not be working out for you for some reason) Lance On Sun, Apr 10, 2011 at 12:46 PM, Yonik Seeley yo...@lucidimagination.com wrote: On Fri, Apr 8, 2011 at 9:53 AM, Marius van Zwijndregt pionw...@gmail.com wrote: Hello! I'm new to the list, have been using SOLR for roughly 6 months and love it. Currently I'm setting up a 3.1 installation, next to a 1.4.1 installation (Ubuntu server, same JVM params). I have copied the configuration from 1.4.1 to the 3.1. Both versions are running fine, but one thing I've noticed is that the QTime on 3.1 is much slower for initial searches than on the (currently production) 1.4.1 installation. For example: Searching with 3.1: http://mysite:9983/solr/select?q=grasmaaier: QTime returns 371 Searching with 1.4.1: http://mysite:8983/solr/select?q=grasmaaier: QTime returns 59 Using debugQuery=true, I can see that the main time is spent in the query component itself (org.apache.solr.handler.component.QueryComponent). Can someone explain this, and how can I analyze this further? Does it take time to build up a decent query, so could I switch to 3.1 without having to worry? Thanks for the report... there's no reason that anything should really be much slower, so it would be great to get to the bottom of this! Is this using the same index as the 1.4.1 server, or did you rebuild it? Are there any other query parameters (that are perhaps added by default, like faceting or anything else that could take up time) or is this truly just a term query? What platform are you on? I believe the Lucene Directory implementation now tries to be smarter (compared to lucene 2.9) about picking the best default (but it may not be working out for you for some reason). -Yonik http://www.lucenerevolution.org -- Lucene/Solr User Conference, May 25-26, San Francisco -- Lance Norskog goks...@gmail.com
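On the Directory question: in that era of Solr the factory can be pinned in solrconfig.xml rather than left to the automatic choice; the example config ships with a line like the following, and substituting a specific factory class is one way to rule the default picker in or out as the cause:

<directoryFactory name="DirectoryFactory"
                  class="${solr.directoryFactory:solr.StandardDirectoryFactory}"/>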
Indexing Flickr and Panoramio
Has anyone tried doing this? Got any tips for someone getting started? Thanks, Adam Sent from my iPhone
Re: Clarifying fetchindex command
Looking at the code, issuing a fetchindex will cause the fetch to occur right away, with no respect for polling. - Mark On Apr 11, 2011, at 12:37 PM, Otis Gospodnetic wrote: Hi, Can one actually *force* replication of the index from the master without a commit being issued on the master since the last replication? I do see Force a fetchindex on slave from master command: http://slave_host:port/solr/replication?command=fetchindex; on http://wiki.apache.org/solr/SolrReplication#HTTP_API, but that feels more like force the replication *now* instead of waiting for the slave to poll the master than force the replication even if there is no new commit point and no new index version on the master. Which one is it, really? Thanks, Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Mark Miller lucidimagination.com Lucene/Solr User Conference May 25-26, San Francisco www.lucenerevolution.org