Invalid character in search results
Hi, I use a Solr 1.1 application for indexing Russian documents. Sometimes I get documents with invalid characters back as search results. For example, I indexed иго but the search returned и��о. It's strange, because something changed 2 bytes into 6 bytes: иго = D0 B8 D0 B3 D0 BE, while и��о = D0 B8 EF BF BD EF BF BD D0 BE. This field is indexed as a verbatim string: <fieldtype name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>. After reindexing, the documents with invalid characters are fixed. Does anybody have an idea where the problem is? Maciek
Issues using keyword searching and facet search together in a search operation
Hi, When I use both the keyword search and the facet search together in the same search operation, I don't get any results, whereas if I perform them separately, I do get results back. Is this a constraint from the SOLR point of view? Thanks in advance. Regards, Dilip TS
Re: Issues using keyword searching and facet search together in a search operation
I can't answer the question, but I *can* guarantee that the people who can will give you *much* better responses if you include some details: which analyzers you use, how you submit the query, samples of the two queries that work and the one that doesn't. Imagine you're on the receiving end of this question and ask whether there is enough info here to make a meaningful analysis <G>... Best Erick On Dec 4, 2007 5:39 AM, Dilip.TS [EMAIL PROTECTED] wrote: Hi, When I use both the keyword search and the facet search together in the same search operation, I don't get any results, whereas if I perform them separately, I do get results back. Is this a constraint from the SOLR point of view? Thanks in advance. Regards, Dilip TS
RE: Issues using keyword searching and facet search together in a search operation
Hi, Consider the following scenario: I need to use keyword search on the fields title and description with the keyword typed as testing, along with a search on the fields price, publisher and tag, the fields publisher and tag being selected for the facet searching. The query string constructed for the above scenario is something like this:

facet.limit=-1&rows=100&start=0&facet=true&facet.mincount=1&facet.field=tag&facet.field=publisher&q=title:testing+OR+description:testing+AND+title:lucene;title+asc,score+asc&qt=standard

I'm using the following analyzers:

<analyzer type="index">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
  <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
  <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
<analyzer type="query">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
  <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
  <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
  <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>

Regards Dilip -----Original Message----- From: Erick Erickson [mailto:[EMAIL PROTECTED]] Sent: Tuesday, December 04, 2007 5:30 PM To: solr-user@lucene.apache.org; [EMAIL PROTECTED] Subject: Re: Issues using keyword searching and facet search together in a search operation I can't answer the question, but I can guarantee that the people who can will give you much better responses if you include some details: which analyzers you use, how you submit the query, samples of the two queries that work and the one that doesn't. Best Erick On Dec 4, 2007 5:39 AM, Dilip.TS [EMAIL PROTECTED] wrote: Hi, When I use both the keyword search and the facet search together in the same search operation, I don't get any results, whereas if I perform them separately, I do get results back. Is this a constraint from the SOLR point of view? Thanks in advance. Regards, Dilip TS
Field separator for highlighting multi-valued fields
Hi, The default field separator seems to be a '.' when highlighting multi-value fields. Can this be overridden in 1.2 to another character? Thanks! harry
Re: Issues using keyword searching and facet search together in a search operation
On Dec 4, 2007 5:39 AM, Dilip.TS [EMAIL PROTECTED] wrote: When I use both the keyword search and the facet search together in the same search operation, I don't get any results, whereas if I perform them separately, I do get results back. Add debugQuery=on to your requests (and change rows to something small, like 5), and then post the results of both URLs here. -Yonik
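A request with debugging enabled might look like this (host, port, and field names here are placeholders, not taken from the thread):

http://localhost:8983/solr/select?q=title:testing&facet=true&facet.field=tag&rows=5&debugQuery=on

The debug section of the response includes the parsed query, which usually shows why a combined keyword-plus-facet request matches nothing.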
Re: Invalid character in search results
On Dec 4, 2007 5:02 AM, Maciej Szczytowski [EMAIL PROTECTED] wrote: Hi, I use a Solr 1.1 application for indexing Russian documents. Sometimes I get documents with invalid characters back as search results. For example, I indexed иго but the search returned и��о. It's strange, because something changed 2 bytes into 6 bytes: иго = D0 B8 D0 B3 D0 BE, while и��о = D0 B8 EF BF BD EF BF BD D0 BE. This field is indexed as a verbatim string: <fieldtype name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>. After reindexing, the documents with invalid characters are fixed. Does anybody have an idea where the problem is? Probably an issue with the charset not being set correctly (or the character encoding not matching the charset declaration) when it was first indexed. -Yonik
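As a minimal sketch of the failure mode described above (the windows-1251 source encoding is an assumption for illustration): bytes that are not valid UTF-8, when decoded as UTF-8, get replaced with U+FFFD, which re-encodes as exactly the EF BF BD sequences seen in the report.

import java.nio.charset.StandardCharsets;

public class ReplacementCharDemo {
    public static void main(String[] args) {
        // "иго" encoded as windows-1251 (a hypothetical wrong source encoding)
        byte[] cp1251Bytes = {(byte) 0xE8, (byte) 0xE3, (byte) 0xEE};

        // Decoding these bytes as UTF-8 fails; Java substitutes U+FFFD
        String mangled = new String(cp1251Bytes, StandardCharsets.UTF_8);

        // Re-encoding as UTF-8 turns each U+FFFD into the byte sequence EF BF BD
        byte[] reencoded = mangled.getBytes(StandardCharsets.UTF_8);
        for (byte b : reencoded) {
            System.out.printf("%02X ", b); // prints EF BF BD for each replacement char
        }
        System.out.println();
    }
}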
Re: out of heap space, every day
For faceting and sorting, yes. For normal search, no. Interesting you mention that, because one of the other changes since last week, besides the index growing, is that we added a sort on an sint field to the queries. Is it reasonable that an sint sort would require over 2.5GB of heap on an 8M-doc index? Is there any empirical data on how much RAM that will need?
Re: out of heap space, every day
On Dec 4, 2007 10:59 AM, Brian Whitman [EMAIL PROTECTED] wrote: For faceting and sorting, yes. For normal search, no. Interesting you mention that, because one of the other changes since last week, besides the index growing, is that we added a sort on an sint field to the queries. Is it reasonable that an sint sort would require over 2.5GB of heap on an 8M-doc index? Is there any empirical data on how much RAM that will need? int[maxDoc()] + String[nTerms()] + size_of_all_unique_terms. Then double that to allow for a warming searcher. One can decrease this memory usage by using an integer instead of an sint field if you don't need range queries. The memory usage would then drop to a straight int[maxDoc()] (4 bytes per document). -Yonik
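For reference, a sketch of the two field types being compared; these declarations follow the Solr 1.2 example schema.xml, so verify against your own schema:

<!-- supports range queries; sorting loads a String[] of all unique terms -->
<fieldType name="sint" class="solr.SortableIntField" sortMissingLast="true" omitNorms="true"/>

<!-- plain int: no usable range queries, but sorting needs only int[maxDoc()], 4 bytes/doc -->
<fieldType name="integer" class="solr.IntField" omitNorms="true"/>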
Re: out of heap space, every day
On Dec 4, 2007 10:46 AM, Brian Whitman [EMAIL PROTECTED] wrote: Are there 'native' memory requirements for solr as a function of index size? For faceting and sorting, yes. For normal search, no. -Yonik
out of heap space, every day
This may be more of a general Java question than a Solr one, but I'm a bit confused. We have a largish Solr index, about 8M documents; the data dir is about 70G. We're getting about 500K new docs a week, as well as about 1 query/second. Recently (when we crossed about the 6M threshold) resin has been stopping with the following: /usr/local/resin/log/stdout.log:[12:08:21.749] [28304] HTTP/1.1 500 Java heap space /usr/local/resin/log/stdout.log:[12:08:21.749] java.lang.OutOfMemoryError: Java heap space Only a restart of resin will get it going again, and then it'll crash again within 24 hours. It's a 4GB machine and we run it with args=-J-mx2500m -J-ms2000m. We can't really raise this any higher on the machine. Are there 'native' memory requirements for Solr as a function of index size? Does a 70GB index require some minimum amount of wired RAM? Or is there some misconfiguration with resin or Solr or my system? I don't really know Java well, but it seems strange that the VM can't page RAM out to disk or do something else besides stopping the server.
Re: out of heap space, every day
Hello, I am also fighting with heap exhaustion, though during the indexing step. I was able to minimize, but not fix, the problem by setting the thread stack size to 64k with -Xss64k. The minimum size is OS-specific, but the VM will tell you if you set the size too small. You can try it; it may help. Brian Brian Whitman schrieb: This may be more of a general Java question than a Solr one, but I'm a bit confused. We have a largish Solr index, about 8M documents; the data dir is about 70G. [...]
Cache use
Hello... we have a 110M-record index under Solr. Some queries take a while, but we need sub-second results. I guess the only solution is caching (or is there something else?). We use the standard LRUCache. The docs say (as far as I understood) that it loads a view of the index into memory and afterwards works from memory instead of the hard drive. So, my question: hypothetically, we could have the whole index in memory if we had enough memory, right? In that case results should come up very fast. We have very rare updates, so I think this could be a solution. How should I configure the cache to achieve this? Thanks for any advice. Gene
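For background, the caches mentioned above are configured in solrconfig.xml; below is a sketch with illustrative (not recommended) sizes. Note that these caches hold filter sets, result lists, and stored documents; they do not load the raw index files into RAM (the OS disk cache handles that):

<filterCache class="solr.LRUCache" size="16384" initialSize="4096" autowarmCount="4096"/>
<queryResultCache class="solr.LRUCache" size="16384" initialSize="4096" autowarmCount="1024"/>
<documentCache class="solr.LRUCache" size="16384" initialSize="16384"/>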
Re: Cache use
One way to do this, if you are running on Linux, is to create a tmpfs (which is RAM) and then mount the filesystem in RAM. Your index then behaves normally to the application but is essentially served from RAM. This is how we serve the Nutch Lucene indexes on our web search engine (www.visvo.com), which is ~100M pages. Below is how you can achieve this, assuming your indexes are in /path/to/indexes:

mv /path/to/indexes /path/to/indexes.dist
mkdir /path/to/indexes
cd /path/to
mount -t tmpfs -o size=2684354560 none /path/to/indexes
rsync --progress -aptv indexes.dist/* indexes/
chown -R user:group indexes

This would of course be limited by the amount of RAM you have on the machine. But with this approach most searches are sub-second. Dennis Kubes Evgeniy Strokin wrote: Hello... we have a 110M-record index under Solr. Some queries take a while, but we need sub-second results. I guess the only solution is caching (or is there something else?). We use the standard LRUCache. The docs say (as far as I understood) that it loads a view of the index into memory and afterwards works from memory instead of the hard drive. So, my question: hypothetically, we could have the whole index in memory if we had enough memory, right? In that case results should come up very fast. We have very rare updates, so I think this could be a solution. How should I configure the cache to achieve this? Thanks for any advice. Gene
Tomcat6 env-entry
It works excellently in Tomcat 6. The toughest thing I had to deal with was discovering that the environment entry in web.xml for solr/home is essential. If you skip that step, it won't come up.

<env-entry>
  <env-entry-name>solr/home</env-entry-name>
  <env-entry-type>java.lang.String</env-entry-type>
  <env-entry-value>F:\Tomcat-6.0.14\webapps\solr</env-entry-value>
</env-entry>

----- Original Message ----- From: Charlie Jackson [EMAIL PROTECTED] To: solr-user@lucene.apache.org Sent: Monday, December 03, 2007 11:35 AM Subject: RE: Tomcat6? $CATALINA_HOME/conf/Catalina/localhost doesn't exist by default, but you can create it and it will work exactly the same way it did in Tomcat 5. It's not created by default because it's not needed by the manager webapp anymore. -----Original Message----- From: Matthew Runo [mailto:[EMAIL PROTECTED]] Sent: Monday, December 03, 2007 10:15 AM To: solr-user@lucene.apache.org Subject: Re: Tomcat6? In context.xml, I added:

<Environment name="solr/home" value="/Users/mruno/solr-src/example/solr" type="java.lang.String" />

I think that's all I did to get it working in Tomcat 6. --Matthew Runo On Dec 3, 2007, at 7:58 AM, Jörg Kiegeland wrote: The Solr wiki does not describe how to install Solr on Tomcat 6, and I have not managed it myself :( The chapter Configuring Solr Home with JNDI mentions the directory $CATALINA_HOME/conf/Catalina/localhost, which does not exist in Tomcat 6. Alternatively I tried the folder $CATALINA_HOME/work/Catalina/localhost, but with no success (I can query the top-level page, but the Solr Admin link then does not work). Can anybody help? -- Dipl.-Inf. Jörg Kiegeland ikv++ technologies ag Bernburger Strasse 24-25, D-10963 Berlin e-mail: [EMAIL PROTECTED], web: http://www.ikv.de phone: +49 30 34 80 77 18, fax: +49 30 34 80 78 0 = Handelsregister HRB 81096; Amtsgericht Berlin-Charlottenburg board of directors: Dr. Olaf Kath (CEO); Dr. Marc Born (CTO) supervising board: Prof. Dr. Bernd Mahr (chairman)
Re: Cache use
The first step is to look at which searches are taking too long, and see if there is a way to structure them so they don't take as long. The whole index doesn't have to be in memory to get good search performance, but 100M documents on a single server is big. We are working on distributed search (SOLR-303) so that an index can be split across multiple servers. -Yonik On Dec 4, 2007 11:43 AM, Evgeniy Strokin [EMAIL PROTECTED] wrote: Hello... we have a 110M-record index under Solr. Some queries take a while, but we need sub-second results. [...]
SOLR 1.3 trunk error
Hello! I'm trying to make use of SOLR 1.3, svn trunk, and get the following error:

SEVERE: java.lang.NoSuchMethodError: org.apache.solr.search.QParser.getSort(Z)Lorg/apache/solr/search/QueryParsing$SortSpec;
 at org.apache.solr.handler.component.QueryComponent.prepare(QueryComponent.java:66)
 at org.apache.solr.handler.SearchHandler.handleRequestBody(SearchHandler.java:93)
 at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:117)
 at org.apache.solr.core.SolrCore.execute(SolrCore.java:826)
 at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:206)
 at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:174)
 at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
 at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
 at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
 at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:175)
 at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128)
 at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
 at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
 at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:263)
 at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:844)
 at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:584)
 at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:447)
 at java.lang.Thread.run(Thread.java:619)

--Matthew
Re: SOLR 1.3 trunk error
Oops, I get this error when I try to search an index with a few documents in it, i.e.:

http://dev14.zappos.com:8080/solr/select/?q=*%3A*&version=2.2&start=0&rows=10&indent=on

caching : true
numDocs : 5
maxDoc : 5
readerImpl : MultiReader
readerDir : org.apache.lucene.store.FSDirectory@/opt/solr/data/index
indexVersion : 1196707950551
openedAt : Tue Dec 04 10:14:58 PST 2007
registeredAt : Tue Dec 04 10:14:58 PST 2007

On Dec 4, 2007, at 10:19 AM, Matthew Runo wrote: Hello! I'm trying to make use of SOLR 1.3, svn trunk, and get the following error. SEVERE: java.lang.NoSuchMethodError: org.apache.solr.search.QParser.getSort(Z)Lorg/apache/solr/search/QueryParsing$SortSpec; at org.apache.solr.handler.component.QueryComponent.prepare(QueryComponent.java:66) [...] --Matthew
Re: out of heap space, every day
On 4-Dec-07, at 8:10 AM, Brian Carmalt wrote: Hello, I am also fighting with heap exhaustion, however during the indexing step. I was able to minimize, but not fix the problem by setting the thread stack size to 64k with -Xss64k. The minimum size is os specific, but the VM will tell you if you set the size too small. You can try it, it may help This seems surprising unless you are positively hammering Solr with tons of different threads during indexing. It's probably not worth using more than # processors + a few. -Mike
Re: SOLR 1.3 trunk error
did you try 'ant clean' before running 'ant dist'? the method signature for SortSpec changed recently Matthew Runo wrote: Oops, I get this error when I try to search an index with a few documents in it, i.e.: http://dev14.zappos.com:8080/solr/select/?q=*%3A*&version=2.2&start=0&rows=10&indent=on caching : true numDocs : 5 maxDoc : 5 readerImpl : MultiReader readerDir : org.apache.lucene.store.FSDirectory@/opt/solr/data/index indexVersion : 1196707950551 openedAt : Tue Dec 04 10:14:58 PST 2007 registeredAt : Tue Dec 04 10:14:58 PST 2007 On Dec 4, 2007, at 10:19 AM, Matthew Runo wrote: Hello! I'm trying to make use of SOLR 1.3, svn trunk, and get the following error. SEVERE: java.lang.NoSuchMethodError: org.apache.solr.search.QParser.getSort(Z)Lorg/apache/solr/search/QueryParsing$SortSpec; at org.apache.solr.handler.component.QueryComponent.prepare(QueryComponent.java:66) [...] --Matthew
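For reference, the suggested rebuild sequence would be something like the following, run from the root of the trunk checkout:

ant clean
ant dist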
Re: SOLR 1.3 trunk error
Wow. So I feel stupid. Sorry to waste your time =p --Matthew On Dec 4, 2007, at 10:36 AM, Ryan McKinley wrote: did you try 'ant clean' before running 'ant dist'? the method signature for SortSpec changed recently [...]
Re: Cache use
On 4-Dec-07, at 8:43 AM, Evgeniy Strokin wrote: Hello... we have a 110M-record index under Solr. Some queries take a while, but we need sub-second results. I guess the only solution is caching (or is there something else?). [...] How big is the index on disk (the most important files are .frq, and .prx if you do phrase queries)? How big, and what exactly, is a record in your system? Do you do faceting/sorting? How much memory do you have? What does a typical query look like? Performance is a tricky subject. It is hard to give any kind of useful answer that applies in general. The one thing I can say is that 110M is a _lot_ of docs for one system, especially if these are normal-sized documents. regards, -Mike
Re: Cache use
Any suggestions are helpful to me, even general ones. Here is the info about my index: How big is the index on disk (the most important files are .frq, and .prx if you do phrase queries)? - The total index folder size is 30.7 Gb; .frq is 12.2 Gb and .prx is 6 Gb. How big and what exactly is a record in your system? - A record is a document with 100 fields indexed and 10 of them stored. Approximately 60% of the fields contain data. Do you do faceting/sorting? - Yes, I'm planning to do both. How much memory do you have? - I have 8Gb of RAM; I could get up to 16Gb. What does a typical query look like? - I don't know yet. We are in prototype mode; we try everything possible. In general we are able to get results in sub-second time, but some queries take long, for example TOWN:L*. I know this is a very broad query, and probably the worst one, but we could need such queries, for example, to count the towns whose names start with L. Cache helps a little: after this query, if I run TOWN:La* I get the result in milliseconds. But what puzzles me is this: if I run a query like TOWN:L* OR STREET:S*, I'd guess it should cache all data of this set. Yet if I then run just TOWN:L*, which is a subset of the first query, it still takes time to get the result back, as if it's not cached. - Original Message From: Mike Klaas [EMAIL PROTECTED] To: solr-user@lucene.apache.org Sent: Tuesday, December 4, 2007 2:33:24 PM Subject: Re: Cache use [...]
Re: Cache use
Thanks, this is a very interesting idea. But my index folder is about 30Gb, and the most RAM I could get is probably 16Gb. The rest could live in swap, but I think that would kill the whole idea. Maybe it would be useful to put just some files from the index folder into RAM, if that is possible at all... - Original Message From: Dennis Kubes [EMAIL PROTECTED] To: solr-user@lucene.apache.org Sent: Tuesday, December 4, 2007 12:00:55 PM Subject: Re: Cache use One way to do this, if you are running on Linux, is to create a tmpfs (which is RAM) and then mount the filesystem in RAM. [...]
RE: How to delete records that don't contain a field?
Oops, I should explain: *:* means all records. This trick puts a positive query in front of your negative query, and that allows it to work. Lance -----Original Message----- From: Rob Casson [mailto:[EMAIL PROTECTED]] Sent: Tuesday, December 04, 2007 7:44 AM To: solr-user@lucene.apache.org Subject: Re: How to delete records that don't contain a field? i'm using this: <delete><query>*:* -[* TO *]</query></delete> which is what Lance suggested... works just fine. fyi: https://issues.apache.org/jira/browse/SOLR-381 On Dec 3, 2007 8:09 PM, Norskog, Lance [EMAIL PROTECTED] wrote: Wouldn't this be: *:* AND negative query -----Original Message----- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Yonik Seeley Sent: Monday, December 03, 2007 2:23 PM To: solr-user@lucene.apache.org Subject: Re: How to delete records that don't contain a field? On Dec 3, 2007 5:22 PM, Jeff Leedy [EMAIL PROTECTED] wrote: I was wondering if there was a way to post a delete query using curl to delete all records that do not contain a certain field, something like this: curl http://localhost:8080/solr/update --data-binary '<delete><query>-_title:[* TO *]</query></delete>' -H 'Content-type:text/xml; charset=utf-8' The minus syntax seems to return the correct list of ids (that is, all records that do not contain the _title field) when I use the Solr administrative console to do the above query, so I'm wondering if Solr just doesn't support this type of delete. Not yet... it makes sense to support this in the future though. -Yonik
RE: out of heap space, every day
Thanks! I've seen a few formulae like this go by over the months. Can someone please make a wiki page for memory and processing estimation with locality properties? Or is there a Lucene page we can use? Lance -----Original Message----- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Yonik Seeley Sent: Tuesday, December 04, 2007 8:06 AM To: solr-user@lucene.apache.org Subject: Re: out of heap space, every day On Dec 4, 2007 10:59 AM, Brian Whitman [EMAIL PROTECTED] wrote: Is it reasonable that an sint sort would require over 2.5GB of heap on an 8M-doc index? Is there any empirical data on how much RAM that will need? int[maxDoc()] + String[nTerms()] + size_of_all_unique_terms. Then double that to allow for a warming searcher. One can decrease this memory usage by using an integer instead of an sint field if you don't need range queries. The memory usage would then drop to a straight int[maxDoc()] (4 bytes per document). -Yonik
Re: out of heap space, every day
int[maxDoc()] + String[nTerms()] + size_of_all_unique_terms. Then double that to allow for a warming searcher. This is great, but can you help me parse this? Assume 8M docs, and that I'm sorting on an int field that is unix time (seconds since the epoch). For the purposes of the experiment assume every doc was indexed at a unique time. So: (int[8M] + String[8M], with each term 16 chars, + 8M*4) * 2. That's 384MB by my calculation. Is that right?
RE: out of heap space, every day
String[nTerms()]: Does this mean that you compare the first term, then the second, etc.? Otherwise I don't understand how to compare multiple terms in two records. Lance -----Original Message----- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Yonik Seeley Sent: Tuesday, December 04, 2007 8:06 AM To: solr-user@lucene.apache.org Subject: Re: out of heap space, every day On Dec 4, 2007 10:59 AM, Brian Whitman [EMAIL PROTECTED] wrote: Is it reasonable that an sint sort would require over 2.5GB of heap on an 8M-doc index? Is there any empirical data on how much RAM that will need? int[maxDoc()] + String[nTerms()] + size_of_all_unique_terms. Then double that to allow for a warming searcher. One can decrease this memory usage by using an integer instead of an sint field if you don't need range queries. The memory usage would then drop to a straight int[maxDoc()] (4 bytes per document). -Yonik
Re: out of heap space, every day
On Dec 4, 2007 3:11 PM, Norskog, Lance [EMAIL PROTECTED] wrote: String[nTerms()]: Does this mean that you compare the first term, then the second, etc.? Otherwise I don't understand how to compare multiple terms in two records. Lucene sorting only supports a single term per document for a field. The String array stores the values of all the unique terms (so nTerms() above should be numberUniqueTerms). See Lucene's FieldCache.StringIndex. -Yonik
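For reference, the structure is roughly the following (a simplified sketch of Lucene's FieldCache.StringIndex, not the verbatim source):

// For a sorted field: one int per document plus one String per unique term.
public static class StringIndex {
    /** All unique terms of the field, in sort order. */
    public final String[] lookup;
    /** For each document, the index into lookup[] of its term. */
    public final int[] order;

    public StringIndex(int[] order, String[] lookup) {
        this.order = order;
        this.lookup = lookup;
    }
}

Sorting then compares order[docA] against order[docB], so only one term per document is ever consulted.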
Re: out of heap space, every day
See Lucene's FieldCache.StringIndex To understand just what's getting stored for each string field, you may also want to look at the createValue() method of the inner Cache object instantiated as stringsIndexCache in FieldCacheImpl.java (line 399 in HEAD): http://svn.apache.org/viewvc/lucene/java/trunk/src/java/org/apache/lucene/search/FieldCacheImpl.java?view=markup -Charlie
Re: Cache use
Thanks for the suggestion, Dennis. I implemented this as you described on my collection of about 400,000 documents, but I did not get the results I expected. Prior to putting the indexes on a tmpfs, I did a bit of benchmarking and found that it usually takes a little under two seconds for each facet query. After moving my indexes from disk to a tmpfs filesystem, I seem to get about the same result from facet queries: about two seconds. Does anyone have any insight into this? Doesn't it seem odd that my response times are about the same? Thanks for the help. Matt Phillips Dennis Kubes wrote: One way to do this, if you are running on Linux, is to create a tmpfs (which is RAM) and then mount the filesystem in RAM. [...]
Re: out of heap space, every day
It seems to me that another way to write the formula -- borrowing Python syntax -- is: 4 * numDocs + 38 * len(uniqueTerms) + 2 * sum([len(t) for t in uniqueTerms]) That's 4 bytes per document, plus 38 bytes per term, plus 2 bytes * the sum of the lengths of the terms. (Numbers taken from http://martin.nobilitas.com/java/sizeof.html) Does that seem right? -Charlie On Dec 4, 2007 12:31 PM, Charles Hornberger [EMAIL PROTECTED] wrote: See Lucene's FieldCache.StringIndex To understand just what's getting stored for each string field, you may also want to look at the createValue() method of the inner Cache object instantiated as stringsIndexCache in FieldCacheImpl.java (line 399 in HEAD): http://svn.apache.org/viewvc/lucene/java/trunk/src/java/org/apache/lucene/search/FieldCacheImpl.java?view=markup -Charlie
Re: synonyms
Hi, I had to work with this kind of side effect regarding multiword synonyms. We installed Solr on a project that uses synonyms extensively, with a big list that sometimes produces a wrong match, like the one noticed by Anuvenk: for instance, with dui => drunk driving defense or dui,drunk driving defense,drunk driving law, a query for dui matches both the dui => drunk driving defense mapping and the dui,drunk driving defense,drunk driving law one. In order to prevent this behavior, I gave every synonym family (that is, a single line in the file) a unique identifier, so the list looks like:

dui => HIER_FAMILIY_01
drunk driving defense => HIER_FAMILIY_01
SYN_FAMILY_01,dui,drunk driving defense,drunk driving law

I also set the synonyms filter at index time with expand=false, and at query time with expand=false. This way, matched synonyms (multiword or single-word) in documents are replaced with their family identifier, not with all the possibilities. Indexing with expand=true would add words to documents that could then be matched alone, ignoring the fact that they belong to a multiword expression, and this can end up as a wrong match (a mix of synonyms) at query time. So a query for dui will be changed by the synonym filter at query time into HIER_FAMILIY_01 or SYN_FAMILY_01, and documents that contain only single words like drunk, driving or law will not be matched, since only a document with the phrase drunk driving law would have been indexed with SYN_FAMILY_01. The approach worked pretty well on our project and we did not notice any side effects on searches; it only removes matched documents that were considered noise from the synonym-mix issue. I think it could be useful to add this kind of approach to the Solr synonyms filter section of the wiki. Cheers Laurent On Dec 2, 2007 3:41 PM, Otis Gospodnetic [EMAIL PROTECTED] wrote: Hi (changing to solr-user list) Yes it is, especially if the terms left of => contain multiple words. Check out the Wiki; one page there explains this nicely. Otis - Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: anuvenk [EMAIL PROTECTED] To: [EMAIL PROTECTED] Sent: Saturday, December 1, 2007 1:21:49 AM Subject: Re: synonyms Ideally, would it be a good idea to pass the index data through the synonyms filter while indexing? Also, say I have this mapping: dui => drunk driving defense or dui,drunk driving defense,drunk driving law. Will matches for dui also bring up matches for drunk driving law (the whole phrase), or does it also bring up all matches for 'drunk', 'driving', 'law'? Yonik Seeley wrote: On Nov 30, 2007 5:39 PM, anuvenk [EMAIL PROTECTED] wrote: Should data be re-indexed every time synonyms like word1,word2 or word1 => word2 are added to synonyms.txt? Yes, if it changes the index (if it's used in the index analyzer as opposed to just the query analyzer). -Yonik -- View this message in context: http://www.nabble.com/synonyms-tf4925232.html#a14100346 Sent from the Solr - Dev mailing list archive at Nabble.com.
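A sketch of the filter configuration this approach implies (the file name and surrounding analyzer elements are assumptions, not taken from the message above):

<analyzer type="index">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <!-- expand=false replaces a matched entry with the first item of its line,
       i.e. the family identifier -->
  <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
</analyzer>
<analyzer type="query">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
</analyzer>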
SOLR sorting - question
Do I need to select the fields in the query that I am trying to sort on? For example, if I want to sort on update date, do I need to select that field? Thanks,
Re: SOLR sorting - question
I don't think you have to. Just try the query on the REST interface and you will know. On Dec 5, 2007 9:56 AM, Kasi Sankaralingam [EMAIL PROTECTED] wrote: Do I need to select the fields in the query that I am trying to sort on?, for example if I want sort on update date then do I need to select that field? Thanks, -- Regards, Cuong Hoang
Re: SOLR sorting - question
Kasi Sankaralingam wrote: Do I need to select the fields in the query that I am trying to sort on? For example, if I want to sort on update date, do I need to select that field? I don't think so... are you getting an error? I run queries like: /select?q=*:*&fl=name&sort=added desc without problem. ryan
solr + maven?
Is anyone managing solr projects with maven? I see: https://issues.apache.org/jira/browse/SOLR-19 but that is 1 year old. If someone has a current pom.xml, can you post it on SOLR-19? I just started messing with maven, so I don't really know what I am doing yet. thanks ryan
RE: SOLR sorting - question
Thanks a ton, that worked. -----Original Message----- From: Ryan McKinley [mailto:[EMAIL PROTECTED]] Sent: Tuesday, December 04, 2007 3:08 PM To: solr-user@lucene.apache.org Subject: Re: SOLR sorting - question Kasi Sankaralingam wrote: Do I need to select the fields in the query that I am trying to sort on? For example, if I want to sort on update date, do I need to select that field? I don't think so... are you getting an error? I run queries like: /select?q=*:*&fl=name&sort=added desc without problem. ryan
Re: LowerCaseFilterFactory and spellchecker
: It does make some sense, but I'm not sure that it should be blindly analyzed : without adding logic to handle certain cases (like the QueryParser does). : What happens if the analyzer produces two tokens? The spellchecker has to : deal with this appropriately. Spell checkers should be able to reverse : analyze the suggestions as well, so Pyhton gets corrected to Python and : not python. Similarly, ad-hco should probably suggest ad-hoc and not : adhoc. These all seem like arguments in favor of using the query analyzer for the source field ... yes, the person making the schema has to think carefully about what the analyzer does, but they already have to be equally careful about what the indexing analyzer does. Bottom line: if the indexing analyzer is used to build the dictionary, the query analyzer should be used before looking up entries in the dictionary. Python is only a good suggestion for Pyhton if searching for Python is going to return something; python might be a better suggestion. Likewise Python might be a good suggestion for python if it's always capitalized in the source field. -Hoss
Re: Distribution without SSH?
: I recently set up Solr with distribution on a couple of servers. I just : learned that our network policies do not permit us to use SSH with : passphraseless keys, and the snappuller script uses SSH to examine the master : Solr instance's state before it pulls the newest index via rsync. you may want to question/clarify this policy ... while it's generally a good idea to have a policy like this for *users*, there's very little reason for it when you're dealing with role users ... accounts that exist solely to execute specific applications and have limited permissions. if you have a solruser with a passphraseless key, which only works on the specific machines running solr, and solruser can only read/write the specific files it needs for replication, there's very little downside. : scripts, as required) to eliminate this dependency on SSH. I thought I'd ask the : list in case anyone has experience with this same situation or any insights : into the reasoning behind requiring SSH access to the master instance. i haven't looked at those scripts in a while, but i believe it's twofold: 1) get the name of the most current snapshot 2) notify the master which snapshot is being used (for the status page) -Hoss
Re: 1.2 commit script chokes on 1.2 response format
: It's a trivial fix, and it seems like it's already been done in trunk: : : http://svn.apache.org/viewvc/lucene/solr/trunk/src/scripts/commit?r1=543259&r2=555612&view=patch : : The change has not been applied to 1.2. It might be nice if it were. i'm not sure what you mean by "applied to 1.2" ... releases are static: once published they are never changed. in the event of serious bugs (ie: security holes or crash related bugs) then point releases may be published (ie: solr-1.2.1) but most bugs don't warrant this. -Hoss
Re: Tomcat6 env-entry
: It works excellently in Tomcat 6. The toughest thing I had to deal with is : discovering that the environment variable in web.xml for solr/home is : essential. If you skip that step, it won't come up. no, there's no reason why you should need to edit the web.xml file ... the solr/home property can be set in a Context configuration using an Environment directive without ever opening the solr.war. See this section of the tomcat docs for more details... http://tomcat.apache.org/tomcat-6.0-doc/config/context.html#Environment%20Entries

: <env-entry>
:   <env-entry-name>solr/home</env-entry-name>
:   <env-entry-type>java.lang.String</env-entry-type>
:   <env-entry-value>F:\Tomcat-6.0.14\webapps\solr</env-entry-value>
: </env-entry>

-Hoss
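For example, a context fragment dropped into $CATALINA_HOME/conf/Catalina/localhost/solr.xml might look like this (the paths are placeholders):

<Context docBase="/path/to/solr.war" debug="0" crossContext="true">
  <Environment name="solr/home" type="java.lang.String" value="/path/to/solr/home" override="true"/>
</Context>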
Re: Tomcat6 env-entry
Tomcat unpacks the jar into the webapps directory based off the context name anyway... What was the original thinking behind not having solr/home set in the web.xml? It seems like an easier way to deal with this. I would imagine most people are more familiar with setting params in web.xml than with manually creating Contexts for their webapp... In fact I would take it a step further and have a default value of /opt/solr (or whatever...), and if a specific user wants to change it they can just edit their web.xml. This would simplify the documentation: instead of "configure your stuff in the Context", it becomes "this is the default; copy example/solr to /opt/solr (or we have a script do it) and deploy the .war". - Original Message - From: Chris Hostetter [EMAIL PROTECTED] To: solr-user@lucene.apache.org Sent: Tuesday, December 4, 2007 6:34:55 PM (GMT-0800) America/Los_Angeles Subject: Re: Tomcat6 env-entry no, there's no reason why you should need to edit the web.xml file ... the solr/home property can be set in a Context configuration using an Environment directive without ever opening the solr.war. [...] -Hoss
single word Vs multiple word search
Hi, Consider the scenario: I have indexed a document with field1 having the value Test solr search (multiple words). When I perform a keyword search for Test solr search I get results, whereas when I search for Test, I don't get any results. Any quick inputs would be of great help... Thanks in advance. Regards, Dilip TS Starmark Services Pvt. Ltd.
RE: single word Vs multiple word search
Hi, This is in continuation of my previous mail. I am using SolrInputDocument to perform the index operation. So my question: if a field to be indexed contains multiple values, does SolrInputDocument index each word of that field separately, or the set of words as a whole? Thanks in advance. Regards, Dilip TS -----Original Message----- From: Dilip.TS [mailto:[EMAIL PROTECTED]] Sent: Wednesday, December 05, 2007 10:48 AM To: SOLR Subject: single word Vs multiple word search Hi, Consider the scenario: I have indexed a document with field1 having the value Test solr search (multiple words). When I perform a keyword search for Test solr search I get results, whereas when I search for Test, I don't get any results. Any quick inputs would be of great help... Thanks in advance. Regards, Dilip TS Starmark Services Pvt. Ltd.