Dismax, Sharding and Elevation
Hi all,

I have discovered a strange thing with Dismax and Elevation and hope someone can enlighten me about what to do. Whenever I search for something using the elevation request handler, the hits come from a normal Lucene query (with elevated results if the search term was defined in elevate.xml), and elevation works. But with dismax it is different: whenever I search using the dismax handler with elevated terms, elevation only works if I turn off sharding. Using shards results in an exception (IndexOutOfBoundsException). The complete message is listed below.

Is this a bug, or did I miss something to switch in the configuration? I also tried adding <str name="defType">dismax</str> to the elevation request handler in solrconfig.xml, but that didn't help. The elevator component is integrated into the dismax search handler in <arr name="last-components">. Any hints appreciated!

Thank you in advance
Oliver

My Solr configuration for the elevation request handler and elevation search component looks like this:

<searchComponent name="elevator" class="solr.QueryElevationComponent">
  <!-- pick a fieldType to analyze queries -->
  <str name="queryFieldType">text</str>
  <str name="config-file">elevate.xml</str>
</searchComponent>

<requestHandler name="/elevate" class="solr.SearchHandler" startup="lazy">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
  </lst>
  <arr name="last-components">
    <str>elevator</str>
    <str>debug</str>
  </arr>
</requestHandler>

The complete exception message I get from searching with dismax, elevation and sharding:

java.lang.IndexOutOfBoundsException: Index: 1, Size: 0
        at java.util.ArrayList.RangeCheck(ArrayList.java:547)
        at java.util.ArrayList.get(ArrayList.java:322)
        at org.apache.solr.common.util.NamedList.getVal(NamedList.java:137)
        at org.apache.solr.handler.component.ShardFieldSortedHitQueue$ShardComparator.sortVal(ShardDoc.java:195)
        at org.apache.solr.handler.component.ShardFieldSortedHitQueue$2.compare(ShardDoc.java:233)
        at org.apache.solr.handler.component.ShardFieldSortedHitQueue.lessThan(ShardDoc.java:134)
        at org.apache.lucene.util.PriorityQueue.upHeap(PriorityQueue.java:270)
        at org.apache.lucene.util.PriorityQueue.put(PriorityQueue.java:129)
        at org.apache.lucene.util.PriorityQueue.insertWithOverflow(PriorityQueue.java:171)
        at org.apache.lucene.util.PriorityQueue.insert(PriorityQueue.java:156)
        at org.apache.solr.handler.component.QueryComponent.mergeIds(QueryComponent.java:445)
        at org.apache.solr.handler.component.QueryComponent.handleResponses(QueryComponent.java:298)
        at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:290)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
        at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
        at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1088)
        at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:360)
        at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
        at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
        at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:729)
        at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
        at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:206)
        at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
        at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
        at org.mortbay.jetty.Server.handle(Server.java:324)
        at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:505)
        at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:829)
        at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:513)
        at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:211)
        at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:380)
        at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:395)
        at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:488)

--
Oliver Marahrens
TU Hamburg-Harburg / Universitätsbibliothek / Digitale Dienste
Denickestr. 22
21071 Hamburg-Harburg
Tel. +49 (0)40 / 428 78 - 32 91
eMail o.marahr...@tu-harburg.de
--
GPG/PGP key: http://www.tub.tu-harburg.de/keys/Oliver_Marahrens_pub.asc
--
Projekt DISCUS http://discus.tu-harburg.de
Projekt TUBdok http://doku.b.tu-harburg.de
Re: spell suggest response
I have done a similar kind of work earlier by using the spellcheck component combined with autosuggest. Autosuggest provides the words starting with the query term, and spellcheck returns the words similar to it. I combined both sets of suggestions into a single list for display.

- Grijesh
Re: Dismax, Sharding and Elevation
As I have seen in the code for QueryElevationComponent, there is no support for distributed search, i.e. query elevation does not work with shards.

- Grijesh
range queries in solr
Hi, I am sorry to ask this silly question, but I could not find the documentation about this and I am very new to Lucene/Solr. I want to run a range query on one of the multivalued fields. E.g. I have a point, say [10,20], which is the point of intersection of the diagonals of a rectangle. Now I want to run a Solr query which gives me all the points within the rectangle whose vertices are at {[8,20], [12,20], [10,18], [10,22]}. (A sketch of the kind of query I have in mind is below.) Any help would be highly appreciated.

Thanks,
Urlop
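For concreteness, assuming the two coordinates were indexed as separate numeric fields with hypothetical names x and y, the bounding box of those vertices could be fetched with a pair of range queries:

  http://localhost:8983/solr/select?q=x:[8 TO 12] AND y:[18 TO 22]

Note that those vertices actually form a rotated square, so a bounding-box query like this would also return some points outside it, which would need a second filtering pass on the client.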
Solr + Hadoop
Hi,

I'm trying to build a Solr index with MapReduce (Hadoop) and I'm using https://issues.apache.org/jira/browse/SOLR-1301, but I have a problem with the Hadoop version and this patch. Compiling the patch against Hadoop 0.21.0 works without any problem, but when I try to run my job in Hadoop (0.21.0) I get an error like this:

Exception in thread "main" java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.JobContext, but class was expected
        at org.apache.solr.hadoop.SolrOutputFormat.checkOutputSpecs(SolrOutputFormat.java:147)
        at org.apache.hadoop.mapreduce.JobSubmitter.checkSpecs(JobSubmitter.java:373)
        at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:334)
        at org.apache.hadoop.mapreduce.Job.submit(Job.java:960)
        at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:976)

I tried to replace this method:

public void checkOutputSpecs(JobContext job) throws IOException {
  super.checkOutputSpecs(job);
  if (job.getConfiguration().get(SETUP_OK) == null) {
    throw new IOException("Solr home cache not set up!");
  }
}

with:

public void checkOutputSpecs(Job job) throws IOException {
  super.checkOutputSpecs(job);
  if (job.getConfiguration().get(SETUP_OK) == null) {
    throw new IOException("Solr home cache not set up!");
  }
}

but I continue to receive an error:

java.lang.RuntimeException: java.lang.ClassNotFoundException: org.apache.solr.hadoop.SolrOutputFormat
        at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:1128)
        at org.apache.hadoop.mapreduce.task.JobContextImpl.getOutputFormatClass(JobContextImpl.java:203)
        at org.apache.hadoop.mapred.Task.initialize(Task.java:487)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:311)
        at org.apache.hadoop.mapred.Child$4.run(Child.java:217)
        at java.security.AccessController.doPrivileged(Native Method)

Is someone using this patch with Hadoop version 0.21.0? Can someone help me?

Thanks,
Joan
Re: Question on deleting all rows for an index
If this is a one-time cleanup, not something you need to do programmatically, you could delete the index directory (solrDir/data/index). In my case I have to stop Tomcat, delete .\index, and restart Tomcat. It is very fast and starts me out with a fresh, empty index. I noticed you are multi-core and I'm not, so this could be bogus information for you... but thought I'd toss it out just in case.
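For the programmatic route, the standard delete-by-query should also wipe everything without touching the filesystem (a sketch against a default single-core setup; a multi-core setup needs the core name in the URL path):

  curl 'http://localhost:8983/solr/update' -H 'Content-Type: text/xml' --data-binary '<delete><query>*:*</query></delete>'
  curl 'http://localhost:8983/solr/update' -H 'Content-Type: text/xml' --data-binary '<commit/>'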
Re: basic document crud in an index
A/ You have to update all the fields; if you leave one off, it won't be in the document anymore. I have my 'persisted' data stored outside of Solr, so on update I get the stored data, modify it, and update Solr with every field (even if only one changed). You could also do a Query/Modify/Update directly in Solr, just remember to send all fields in the update. There isn't (in 1.4 anyway) a way to update specific fields only.

B/ When you update, it is my understanding that, yes, the old doc is still there, marked deleted, and a new doc is in its place. You can't get to the old one, however, and it will go away at the next optimize. I've never used it, but when you commit you can send an optional parameter 'expungeDeletes' that should remove deleted docs as well (an example is below).

C/ Not that I'm aware of

D/ don't know

E/ That is my understanding, but I'm admittedly a little weak on that part. I just have a job that runs in the middle of the night and runs Optimize once each night; I don't dig deeper than that into what goes on.
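For reference, the expungeDeletes variant of commit mentioned in B looks something like this when posted to the update handler (a sketch; URL assumes a default setup):

  curl 'http://localhost:8983/solr/update' -H 'Content-Type: text/xml' --data-binary '<commit expungeDeletes="true"/>'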
Solr boolean operators
Hi, with the Lucene query syntax, is a AND (a OR b) equivalent to a (absorption)?
Re: basic document crud in an index
To fill the gaps:

b. the old version remains on disk but is flagged for deletion
d. optimize equals merging; the difference is how many segments come out
e. yes

On Thursday 13 January 2011 15:21:54 kenf_nc wrote:
A/ You have to update all the fields; if you leave one off, it won't be in the document anymore. I have my 'persisted' data stored outside of Solr, so on update I get the stored data, modify it, and update Solr with every field (even if only one changed). You could also do a Query/Modify/Update directly in Solr, just remember to send all fields in the update. There isn't (in 1.4 anyway) a way to update specific fields only.
B/ When you update, it is my understanding that, yes, the old doc is still there, marked deleted, and a new doc is in its place. You can't get to the old one, however, and it will go away at the next optimize. I've never used it, but when you commit you can send an optional parameter 'expungeDeletes' that should remove deleted docs as well.
C/ Not that I'm aware of
D/ don't know
E/ That is my understanding, but I'm admittedly a little weak on that part. I just have a job that runs in the middle of the night and runs Optimize once each night; I don't dig deeper than that into what goes on.

--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
Re: Solr boolean operators
To my understanding: in terms of the results that will be matched by your query, it's the same. In terms of the score of the results, no: with the first query, the documents that match both the a and the b terms will score higher than the ones matching just the a term. (A sketch of the parsed forms is below.)

On Thu, Jan 13, 2011 at 3:29 PM, Xavier Schepler xavier.schep...@sciences-po.fr wrote:
Hi, with the Lucene query syntax, is a AND (a OR b) equivalent to a (absorption)?
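For reference, with the default OR operator the two forms should parse along these lines (a sketch from memory; debugQuery=true will show the real thing):

  a AND (a OR b)   ->   +a +(a b)
  a                ->   a

Both require a, so the matched set is the same; the required (a b) clause in the first form is already satisfied whenever a matches, and b only contributes extra score.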
Re: Solr boolean operators
Ok, thanks. That's what I expected :D

From: dante stroe dante.st...@gmail.com
Sent: Thu Jan 13 15:56:33 CET 2011
To: solr-user@lucene.apache.org
Subject: Re: Solr boolean operators

To my understanding: in terms of the results that will be matched by your query, it's the same. In terms of the score of the results, no: with the first query, the documents that match both the a and the b terms will score higher than the ones matching just the a term.

On Thu, Jan 13, 2011 at 3:29 PM, Xavier Schepler xavier.schep...@sciences-po.fr wrote:
Hi, with the Lucene query syntax, is a AND (a OR b) equivalent to a (absorption)?
Get nearby words?
Hi, Is there a way to get the relevant nearby words in the entire index given a single word? I want to know all the relevance ranked words before and after the queried word. thanks for any tips. Darren
Re: Multi-word exact keyword case-insensitive search suggestions
Hi, the following seems to work pretty well:

<fieldType name="text_ws" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.ShingleFilterFactory" maxShingleSize="4" outputUnigrams="true" outputUnigramIfNoNgram="false"/>
  </analyzer>
</fieldType>

<!-- A text field that uses WordDelimiterFilter to enable splitting and matching of
     words on case-change, alpha numeric boundaries, and non-alphanumeric chars, so
     that a query of "wifi" or "wi fi" could match a document containing "Wi-Fi".
     Synonyms and stopwords are customized by external files, and stemming is enabled.
     The attribute autoGeneratePhraseQueries="true" (the default) causes words that
     get split to form phrase queries. For example, WordDelimiterFilter splitting
     text:pdp-11 will cause the parser to generate text:"pdp 11" rather than
     (text:PDP OR text:11). NOTE: autoGeneratePhraseQueries="true" tends to not work
     well for non whitespace delimited languages. -->
<fieldType name="text" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- in this example, we will only use synonyms at query time
    <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
    -->
    <!-- Case insensitive stop word removal. Add enablePositionIncrements=true in both
         the index and query analyzers to leave a 'gap' for more accurate phrase queries. -->
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>

<copyField source="cat" dest="text"/>
<copyField source="subject" dest="text"/>
<copyField source="summary" dest="text"/>
<copyField source="cause" dest="text"/>
<copyField source="status" dest="text"/>
<copyField source="urgency" dest="text"/>

I ingest the source fields as text_ws (I know I've changed it a bit) and then copy the field to text. This seems to do what you are asking for.

Adam

On Thu, Jan 13, 2011 at 12:05 AM, Chamnap Chhorn chamnapchh...@gmail.com wrote:
Hi all, I've been stuck on exact keyword search for several days. Hope you guys can help me. Here is the scenario:
1. It needs to match a multi-word keyword, case-insensitively
2. Partial-word or single-word matching on this field is not allowed
I want to know the field type definition for this field and a sample Solr query. I need to combine this search with my full text search, which uses a dismax query. Thanks
--
Chhorn Chamnap
http://chamnapchhorn.blogspot.com/
Re: segment gets corrupted (after background merge ?)
I understand less and less what is happening to my Solr. I did a checkIndex (without -fix) and there was an error. So I did another checkIndex with -fix, and then the error was gone; the segment was alright.

During checkIndex I do not shut down the Solr server, I just make sure no client connects to the server. Should I shut down the Solr server during checkIndex?

First checkIndex:

  4 of 17: name=_phe docCount=264148
    compound=false
    hasProx=true
    numFiles=9
    size (MB)=928.977
    diagnostics = {optimize=false, mergeFactor=10, os.version=2.6.26-2-amd64, os=Linux, mergeDocStores=true, lucene.version=2.9.3 951790 - 2010-06-06 01:30:55, source=merge, os.arch=amd64, java.version=1.6.0_20, java.vendor=Sun Microsystems Inc.}
    has deletions [delFileName=_phe_p3.del]
    test: open reader.........OK [44824 deleted docs]
    test: fields..............OK [51 fields]
    test: field norms.........OK [51 fields]
    test: terms, freq, prox...ERROR [term post_id:562 docFreq=1 != num docs seen 0 + num docs deleted 0]
java.lang.RuntimeException: term post_id:562 docFreq=1 != num docs seen 0 + num docs deleted 0
        at org.apache.lucene.index.CheckIndex.testTermIndex(CheckIndex.java:675)
        at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:530)
        at org.apache.lucene.index.CheckIndex.main(CheckIndex.java:903)
    test: stored fields.......OK [7206878 total field count; avg 32.86 fields per doc]
    test: term vectors........OK [0 total vector count; avg 0 term/freq vector fields per doc]
FAILED
    WARNING: fixIndex() would remove reference to this segment; full exception:
java.lang.RuntimeException: Term Index test failed
        at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:543)
        at org.apache.lucene.index.CheckIndex.main(CheckIndex.java:903)

A few minutes later:

  4 of 18: name=_phe docCount=264148
    compound=false
    hasProx=true
    numFiles=9
    size (MB)=928.977
    diagnostics = {optimize=false, mergeFactor=10, os.version=2.6.26-2-amd64, os=Linux, mergeDocStores=true, lucene.version=2.9.3 951790 - 2010-06-06 01:30:55, source=merge, os.arch=amd64, java.version=1.6.0_20, java.vendor=Sun Microsystems Inc.}
    has deletions [delFileName=_phe_p4.del]
    test: open reader.........OK [44828 deleted docs]
    test: fields..............OK [51 fields]
    test: field norms.........OK [51 fields]
    test: terms, freq, prox...OK [3200899 terms; 26804334 terms/docs pairs; 28919124 tokens]
    test: stored fields.......OK [7206764 total field count; avg 32.86 fields per doc]
    test: term vectors........OK [0 total vector count; avg 0 term/freq vector fields per doc]

On 12/01/2011 16:50, Michael McCandless wrote:
Curious... is it always a "docFreq=1 != num docs seen 0 + num docs deleted 0"? It looks like new deletions were flushed against the segment (del file changed from _ncc_22s.del to _ncc_24f.del). Are you hitting any exceptions during indexing?

Mike

On Wed, Jan 12, 2011 at 10:33 AM, Stéphane Delprat stephane.delp...@blogspirit.com wrote:
I got another corruption. It sure looks like it's the same type of error (on a different field). It's also not linked to a merge, since the segment size did not change.
*** good segment:

  1 of 9: name=_ncc docCount=1841685
    compound=false
    hasProx=true
    numFiles=9
    size (MB)=6,683.447
    diagnostics = {optimize=false, mergeFactor=10, os.version=2.6.26-2-amd64, os=Linux, mergeDocStores=true, lucene.version=2.9.3 951790 - 2010-06-06 01:30:55, source=merge, os.arch=amd64, java.version=1.6.0_20, java.vendor=Sun Microsystems Inc.}
    has deletions [delFileName=_ncc_22s.del]
    test: open reader.........OK [275881 deleted docs]
    test: fields..............OK [51 fields]
    test: field norms.........OK [51 fields]
    test: terms, freq, prox...OK [17952652 terms; 174113812 terms/docs pairs; 204561440 tokens]
    test: stored fields.......OK [45511958 total field count; avg 29.066 fields per doc]
    test: term vectors........OK [0 total vector count; avg 0 term/freq vector fields per doc]

a few hours later:

*** broken segment:

  1 of 17: name=_ncc docCount=1841685
    compound=false
    hasProx=true
    numFiles=9
    size (MB)=6,683.447
    diagnostics = {optimize=false, mergeFactor=10, os.version=2.6.26-2-amd64, os=Linux, mergeDocStores=true, lucene.version=2.9.3 951790 - 2010-06-06 01:30:55, source=merge, os.arch=amd64, java.version=1.6.0_20, java.vendor=Sun Microsystems Inc.}
    has deletions [delFileName=_ncc_24f.del]
    test: open reader.........OK [278167 deleted docs]
    test: fields..............OK [51 fields]
    test: field norms.........OK [51 fields]
    test: terms, freq, prox...ERROR [term post_id:1599104 docFreq=1 != num docs seen 0 + num docs deleted 0]
java.lang.RuntimeException: term post_id:1599104 docFreq=1 != num docs seen 0 + num docs deleted 0
        at
Re: StopFilterFactory and qf containing some fields that use it and some that do not
It's a known 'issue' in dismax (really an inherent part of dismax's design with no clear way to do anything about it) that qf over fields with different stop word definitions will produce odd results for a query with a stopword. Here's my understanding of what's going on:
http://bibwild.wordpress.com/2010/04/14/solr-stop-wordsdismax-gotcha/

On 1/12/2011 6:48 PM, Markus Jelsma wrote:
Here's another thread on the subject:
http://lucene.472066.n3.nabble.com/Dismax-Minimum-Match-Stopwords-Bug-td493483.html
And slightly off topic: you might also want to look at using common grams; they are really useful for phrase queries that contain stopwords.
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.CommonGramsFilterFactory

Here is what debug says each of these queries parse to:
1. q=life&defType=edismax&qf=Title ... returns 277,635 results
2. q=the life&defType=edismax&qf=Title ... returns 277,635 results
3. q=life&defType=edismax&qf=Title Contributor ... returns 277,635
4. q=the life&defType=edismax&qf=Title Contributor ... returns 0 results

1. +DisjunctionMaxQuery((Title:life))
2. +((DisjunctionMaxQuery((Title:life)))~1)
3. +DisjunctionMaxQuery((CTBR_SEARCH:life | Title:life))
4. +((DisjunctionMaxQuery((Contributor:the)) DisjunctionMaxQuery((Contributor:life | Title:life)))~2)

I see what's going on here. Because "the" is a stop word for Title, it gets removed from the first part of the expression. This means that Contributor is required to contain "the". dismax does the same thing too. I guess I should have run debug before asking the mailing list! It looks like the only workaround I have is to either filter out the stopwords in the client when this happens, or enable stop words for all the fields that are used in qf with stopword-enabled fields. Unless... someone has a better idea??

James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311

-----Original Message-----
From: Markus Jelsma [mailto:markus.jel...@openindex.io]
Sent: Wednesday, January 12, 2011 4:44 PM
To: solr-user@lucene.apache.org
Cc: Jayendra Patil
Subject: Re: StopFilterFactory and qf containing some fields that use it and some that do not

Have used edismax and stopword filters as well, but usually use the fq parameter, e.g. fq=title:"the life", and never had any issues. That is because filter queries are not relevant for the mm parameter, which is being used for the main query. Can you turn on debugQuery and check what query is formed for all the combinations you mentioned?

Regards,
Jayendra

On Wed, Jan 12, 2011 at 5:19 PM, Dyer, James james.d...@ingrambook.com wrote:
I'm running into a problem with StopFilterFactory in conjunction with (e)dismax queries that have a mix of fields, only some of which use StopFilterFactory. It seems that if even 1 field on the qf parameter does not use StopFilterFactory, then stop words are not removed when searching any fields. Here's an example of what I mean:
- I have 2 fields indexed: Title is textStemmed, which includes StopFilterFactory (see below). Contributor is textSimple, which does not include StopFilterFactory (see below).
- "The" is a stop word in stopwords.txt
- q=life&defType=edismax&qf=Title ... returns 277,635 results
- q=the life&defType=edismax&qf=Title ... returns 277,635 results
- q=life&defType=edismax&qf=Title Contributor ... returns 277,635 results
- q=the life&defType=edismax&qf=Title Contributor ... returns 0 results
It seems as if the stop words are not being stripped from the query because qf contains a field that doesn't use StopFilterFactory.
I did testing with combining stemmed fields with non-stemmed fields in qf, and it seems as if stemming gets applied regardless. But stop words do not. Does anyone have ideas on what is going on? Is this a feature or possibly a bug? Any known workarounds? Any advice is appreciated.

James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311

<fieldType name="textSimple" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<fieldType name="textStemmed" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="0" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0" splitOnNumerics="0" stemEnglishPossessive="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
Re: Tuning StatsComponent
What field type do you recommend for a float stats.field for optimal Solr 1.4.1 StatsComponent performance: float, pfloat, or tfloat? Do you recommend indexing the field?

2011/1/12 stockii st...@shopgate.com
my field type is double. maybe sint is better? but i need double ... =(
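For reference, the trie-based types in the 1.4.1 example schema are declared along these lines (copied from memory, so check your own schema; tfloat is the float analogue of tdouble):

  <fieldType name="tfloat" class="solr.TrieFloatField" precisionStep="8" omitNorms="true" positionIncrementGap="0"/>
  <fieldType name="tdouble" class="solr.TrieDoubleField" precisionStep="8" omitNorms="true" positionIncrementGap="0"/>

As far as I know the stats field does need to be indexed, since StatsComponent reads values through the FieldCache, which is built from indexed terms. precisionStep mainly speeds up range queries, so I would not expect it to change StatsComponent performance much either way.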
RE: StopFilterFactory and qf containing some fields that use it and some that do not
I appreciate the reply and the blog posting. For now, I just enabled stopwords for all the fields on qf. We have a very short list anyhow, and our legacy search engine didn't even allow field-by-field configuration (stopwords are global on that system).

I do wonder... what if (e)dismax had a flag you could set that would tell it that if any analyzers removed a term, then that term would become optional for any fields in which it remained? I'm not sure what the development effort would be, but perhaps it would be a nice way to circumvent this problem in a future release...

James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311

-----Original Message-----
From: Jonathan Rochkind [mailto:rochk...@jhu.edu]
Sent: Thursday, January 13, 2011 9:54 AM
To: solr-user@lucene.apache.org; markus.jel...@openindex.io
Cc: Dyer, James
Subject: Re: StopFilterFactory and qf containing some fields that use it and some that do not

It's a known 'issue' in dismax (really an inherent part of dismax's design with no clear way to do anything about it) that qf over fields with different stop word definitions will produce odd results for a query with a stopword. Here's my understanding of what's going on:
http://bibwild.wordpress.com/2010/04/14/solr-stop-wordsdismax-gotcha/

On 1/12/2011 6:48 PM, Markus Jelsma wrote:
Here's another thread on the subject:
http://lucene.472066.n3.nabble.com/Dismax-Minimum-Match-Stopwords-Bug-td493483.html
And slightly off topic: you might also want to look at using common grams; they are really useful for phrase queries that contain stopwords.
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.CommonGramsFilterFactory

Here is what debug says each of these queries parse to:
1. q=life&defType=edismax&qf=Title ... returns 277,635 results
2. q=the life&defType=edismax&qf=Title ... returns 277,635 results
3. q=life&defType=edismax&qf=Title Contributor ... returns 277,635
4. q=the life&defType=edismax&qf=Title Contributor ... returns 0 results

1. +DisjunctionMaxQuery((Title:life))
2. +((DisjunctionMaxQuery((Title:life)))~1)
3. +DisjunctionMaxQuery((CTBR_SEARCH:life | Title:life))
4. +((DisjunctionMaxQuery((Contributor:the)) DisjunctionMaxQuery((Contributor:life | Title:life)))~2)

I see what's going on here. Because "the" is a stop word for Title, it gets removed from the first part of the expression. This means that Contributor is required to contain "the". dismax does the same thing too. I guess I should have run debug before asking the mailing list! It looks like the only workaround I have is to either filter out the stopwords in the client when this happens, or enable stop words for all the fields that are used in qf with stopword-enabled fields. Unless... someone has a better idea??

James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311

-----Original Message-----
From: Markus Jelsma [mailto:markus.jel...@openindex.io]
Sent: Wednesday, January 12, 2011 4:44 PM
To: solr-user@lucene.apache.org
Cc: Jayendra Patil
Subject: Re: StopFilterFactory and qf containing some fields that use it and some that do not

Have used edismax and stopword filters as well, but usually use the fq parameter, e.g. fq=title:"the life", and never had any issues. That is because filter queries are not relevant for the mm parameter, which is being used for the main query. Can you turn on debugQuery and check what query is formed for all the combinations you mentioned?

Regards,
Jayendra

On Wed, Jan 12, 2011 at 5:19 PM, Dyer, James james.d...@ingrambook.com wrote:
I'm running into a problem with StopFilterFactory in conjunction with (e)dismax queries that have a mix of fields, only some of which use StopFilterFactory. It seems that if even 1 field on the qf parameter does not use StopFilterFactory, then stop words are not removed when searching any fields. Here's an example of what I mean:
- I have 2 fields indexed: Title is textStemmed, which includes StopFilterFactory (see below). Contributor is textSimple, which does not include StopFilterFactory (see below).
- "The" is a stop word in stopwords.txt
- q=life&defType=edismax&qf=Title ... returns 277,635 results
- q=the life&defType=edismax&qf=Title ... returns 277,635 results
- q=life&defType=edismax&qf=Title Contributor ... returns 277,635 results
- q=the life&defType=edismax&qf=Title Contributor ... returns 0 results
It seems as if the stop words are not being stripped from the query because qf contains a field that doesn't use StopFilterFactory. I did testing with combining stemmed fields with non-stemmed fields in qf and it seems as if stemming gets applied regardless. But stop words do not. Does anyone have ideas on what is going on? Is this a feature or possibly a bug? Any known workarounds? Any advice is appreciated.

James Dyer
E-Commerce Systems
Ingram Content Group
(615)
Re: Improving Solr performance
On the one hand, I found the comments about the reasons for sharding really interesting. The documentation agrees with you about why to split an index into several shards (big-size problems), but I don't find any explanation of the drawbacks, such as with an Access Control List. I guess there should be some, and they can be critical in this design. Any example?

On the other hand, the performance problems. I have configured big caches and I launch a test of simultaneous requests (with the same query) without committing during the test. The caches are initially empty, and after the test:

name: queryResultCache
stats:
  lookups: 1129
  hits: 1120
  hitratio: 0.99
  inserts: 16
  evictions: 0
  size: 9
  warmupTime: 0
  cumulative_lookups: 1129
  cumulative_hits: 1120
  cumulative_hitratio: 0.99
  cumulative_inserts: 16
  cumulative_evictions: 0

name: documentCache
stats:
  lookups: 6750
  hits: 6440
  hitratio: 0.95
  inserts: 310
  evictions: 0
  size: 310
  warmupTime: 0
  cumulative_lookups: 6750
  cumulative_hits: 6440
  cumulative_hitratio: 0.95
  cumulative_inserts: 310
  cumulative_evictions: 0

Although most of the queries are cache hits, the performance is still dependent on the number of simultaneous queries:

1 simultaneous query: 3437 ms (cache misses)
2 simultaneous queries: 594, 954 ms
10 simultaneous queries: 1047, 1313, 1438, 1797, 1922, 2094, 2250, 2500, 2938, 3000 ms
50 simultaneous queries: 1203, 1453, 1453, 1437, 1625, 1953, 5688, 12938, 14953, 16281, 15984, 16453, 15812, 16469, 16563, 16844, 17703, 16843, 17359, 16828, 18235, 18219, 18172, 18203, 17672, 17344, 17453, 18484, 18157, 18531, 18297, 18359, 18063, 18516, 18125, 17516, 18562, 18016, 18187, 18610, 18703, 18672, 17829, 18344, 18797, 18781, 18265, 18875, 18250, 18812 ms
100 simultaneous queries: 1297, 1531, 1969, 2203, 2375, 2891, 3937, 4109, 4703, 4890, 5047, 5312, 5563, 6422, 6437, 7063, 7093, 7391, 7594, 7672, 8172, 8547, 8750, 8984, 9265, 9609, 9907, 10344, 11406, 11484, 11484, 11500, 11547, 11703, 11797, 11875, 11922, 12328, 12375, 12875, 12922, 13187, 13219, 13407, 13500, 13562, 13719, 13828, 13875, 14016, 14078, 14672, 15922, 16328, 16625, 16953, 17282, 18172, 18484, 18985, 20594, 20625, 20860, 21281, 21469, 21625, 21875, 21875, 22141, 22157, 22172, 23125, 23125, 23141, 23203, 23203, 23328, 24625, 24641, 24672, 24797, 24985, 25031, 25188, 25844, 25937, 26016, 26437, 26453, 26437, 26485, 28297, 28687, 31782, 31985, 31969, 32016, 32031, 32016, 32281 ms

Is this an expected situation? Is there any technique for not being so dependent on the number of simultaneous queries? (Due to economic reasons, replication onto more servers is not an option.) Thanks in advance (and also thanks for the previous comments).
Re: Term frequency across multiple documents
So you are interested in collection frequency of words. TermsComponent gives you document frequency of terms; you can modify it to give collection frequency info (an example request is sketched below). http://search-lucene.com/m/of5Fn1PUOHU/

--- On Wed, 1/12/11, Juan Grande juan.gra...@gmail.com wrote:
From: Juan Grande juan.gra...@gmail.com
Subject: Re: Term frequency across multiple documents
To: solr-user@lucene.apache.org
Date: Wednesday, January 12, 2011, 6:56 PM

Maybe there is a better solution, but I think that you can solve this problem using facets. You will get the number of documents where each term appears. Also, you can filter a specific set of terms by entering a query like +field:term1 OR +field:term2 OR ..., or using the facet.query parameter.

Regards,
Juan Grande

On Wed, Jan 12, 2011 at 11:08 AM, Aaron Bycoffe abyco...@sunlightfoundation.com wrote:
I'm attempting to calculate term frequency across multiple documents in Solr. I've been able to use TermVectorComponent to get this data on a per-document basis but have been unable to find a way to do it for multiple documents -- that is, get a list of terms appearing in the documents and how many times each one appears. I'd also like to be able to filter the list of terms to be able to see how many times a specific term appears, though this is less important. Is there a way to do this in Solr?

Aaron
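For illustration, a TermsComponent request that returns those per-term document frequencies (field name hypothetical; assumes the /terms handler from the example solrconfig):

  http://localhost:8983/solr/terms?terms.fl=text&terms.limit=100

For counting just a few specific terms, the facet approach Juan describes is simpler, e.g.:

  http://localhost:8983/solr/select?q=*:*&rows=0&facet=true&facet.query=text:term1&facet.query=text:term2

Keep in mind both give document counts, not total occurrences; total occurrences (collection frequency) is what needs the modification described above.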
Adding a new site to existing solr configuration
I still have the default Solr example config running on Jetty, and I use Cygwin to start my current site. I have already fully configured one Solr instance with these files:

\example\example-DIH\solr\db\conf\my-data-config.xml
\example\example-DIH\solr\db\conf\schema.xml
\example\example-DIH\solr\db\conf\solrconfig.xml

Now I wish to add ANOTHER site to my already running sites. This site of course has a different data-config, but the question is: what files can/should I add to the already existing directories? What I have now is that I just added the data-config:

\example\example-DIH\solr\db\conf\data-config-site2.xml

But should I change anything in the schema.xml/solrconfig.xml for this to work and to be able to run both sites simultaneously with the same web server instance?
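If the second site needs its own schema and config, the usual route is a separate core per site. A minimal sketch of a multi-core solr.xml (core names hypothetical):

  <solr persistent="true">
    <cores adminPath="/admin/cores">
      <core name="site1" instanceDir="site1"/>
      <core name="site2" instanceDir="site2"/>
    </cores>
  </solr>

Each instanceDir gets its own conf/ directory with schema.xml, solrconfig.xml, and that site's data-config, and both cores run under the same Jetty instance at /solr/site1 and /solr/site2.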
Re: verifying that an index contains ONLY utf-8
Take a look also into icu4j, which is one of the contrib projects. Converting on the fly is not supported by Solr, but should be relatively easy in Java. Scanning is also relatively simple (accept only a range). Detection too: http://www.mozilla.org/projects/intl/chardet.html

We've created an index from a number of different documents that are supplied by third parties. We want the index to only contain UTF-8 encoded characters. I have a couple questions about this:
1) Is there any way to be sure during indexing (by setting something in the solr configuration?) that the documents that we index will always be stored in utf-8? Can solr convert documents that need converting on the fly, or can solr reject documents containing illegal characters?
2) Is there a way to scan the existing index to find any string containing non-utf8 characters? Or is there another way that I can discover if any crept into my index?

--
http://jetwick.com open twitter search
DataimportHandler development issue
We're just getting started with Solr and are very interested in using Solr for search applications. I've got the rss example working (1.4.1 didn't work out of the box, but we figured it out, then found fixes in the svn). Anyway, we are learning how to load rss/atom feeds into the Solr index. We are trying to modify the rss-data-config.xml file so that we can import atom feeds also, but for some reason they don't load. Here is what we have for the configuration.

We've been using the DataImportHandler development console (http://localhost:8983/solr/rss/admin/dataimport.jsp?handler=/rssimport) to look at the status and the DocsNum, but only the rss feed works. If we remove the whole slashdot rss entity, the atom example still doesn't work. We've tried creating a separate atom-data-config.xml file and adding the proper entry to solrconfig.xml to support the extra dataimport. That gave us the same results.

<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">1</int>
  </lst>
  <lst name="initArgs">
    <lst name="defaults">
      <str name="config">atom-data-config.xml</str>
    </lst>
  </lst>
  <str name="command">status</str>
  <str name="status">idle</str>
  <str name="importResponse"/>
  <lst name="statusMessages">
    <str name="Total Requests made to DataSource">1</str>
    <str name="Total Rows Fetched">0</str>
    <str name="Total Documents Skipped">0</str>
    <str name="Full Dump Started">2011-01-13 08:42:53</str>
    <str name="Total Documents Processed">0</str>
    <str name="Time taken">0:0:0.519</str>
  </lst>
  <str name="WARNING">This response format is experimental. It is likely to change in the future.</str>
</response>

It's not clear why it's not working. Advice? Also, is this the best way to load data? We intend to load several thousand DocBook documents once we understand how this all works. We stuck with the rss/atom example since we didn't want to deal with schema changes yet.
Thanks,
Derek

example-DIH/solr/rss/conf/rss-data-config.xml, modified source:

<dataConfig>
  <dataSource type="URLDataSource"/>
  <document>
    <entity name="slashdot" pk="link"
            url="http://twitter.com/statuses/user_timeline/existdb.rss"
            processor="XPathEntityProcessor"
            forEach="/rss/channel | /rss/channel/item"
            transformer="DateFormatTransformer">
      <field column="source" xpath="/rss/channel/title" commonField="true"/>
      <field column="source-link" xpath="/rss/channel/link" commonField="true"/>
      <field column="subject" xpath="/rss/channel/subject" commonField="true"/>
      <field column="title" xpath="/rss/channel/item/title"/>
      <field column="link" xpath="/rss/channel/item/link"/>
      <field column="description" xpath="/rss/channel/item/description"/>
      <field column="creator" xpath="/rss/channel/item/creator"/>
      <field column="item-subject" xpath="/rss/channel/item/subject"/>
      <field column="date" xpath="/rss/channel/item/date" dateTimeFormat="yyyy-MM-dd'T'hh:mm:ss"/>
      <field column="slash-department" xpath="/rss/channel/item/department"/>
      <field column="slash-section" xpath="/rss/channel/item/section"/>
      <field column="slash-comments" xpath="/rss/channel/item/comments"/>
    </entity>
    <entity name="twitter" pk="link"
            url="http://twitter.com/statuses/user_timeline/ctg_ualbany.atom"
            processor="XPathEntityProcessor"
            forEach="/feed | /feed/entry"
            transformer="DateFormatTransformer">
      <field column="source" xpath="/feed/title" commonField="true"/>
      <field column="source-link" xpath="/feed/link" commonField="true"/>
      <field column="subject" xpath="/feed/subtitle" commonField="true"/>
      <field column="title" xpath="/feed/entry/title"/>
      <field column="link" xpath="/feed/entry/link"/>
      <field column="description" xpath="/feed/entry/description"/>
      <field column="creator" xpath="/feed/entry/creator"/>
      <field column="item-subject" xpath="/feed/entry/subject"/>
      <field column="date" xpath="/rss/channel/item/date" dateTimeFormat="yyyy-MM-dd'T'hh:mm:ss"/>
      <field column="slash-department" xpath="/feed/entry/department"/>
      <field column="slash-section" xpath="/feed/entry/section"/>
      <field column="slash-comments" xpath="/feed/entry/comments"/>
    </entity>
  </document>
</dataConfig>
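Once the config parses, a full import can be kicked off with the command parameter (a sketch, assuming the handler is registered as /rssimport as in the console URL above):

  http://localhost:8983/solr/rss/rssimport?command=full-import

and progress can be checked with the same URL using command=status.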
Re: verifying that an index contains ONLY utf-8
Scanning for only 'valid' utf-8 is definitely not simple. You can eliminate some obviously invalid utf-8 by byte ranges, but you can't confirm valid utf-8 by byte ranges alone: some bytes are only valid utf-8 when they come after or before certain other bytes. There is no good way to do what you're doing; once you've lost track of what encoding something is in, you are reduced to applying heuristics to text strings to guess what encoding they are meant to be. There is no cheap way to do this to an entire Solr index: you're going to have to fetch every single stored field (indexed fields are pretty much lost to you) and apply heuristic algorithms to it.

Keep in mind that Solr really shouldn't ever be used as your canonical _store_ of data; Solr isn't a 'store', it's an index. So you really ought to have this stuff stored somewhere else if you want to be able to examine it or modify it like this, and deal with it in that somewhere else. This isn't really a Solr question at all, even if you are querying Solr on stored fields to try to guess their char encodings.

There are various packages of such heuristic algorithms to guess char encoding; I wouldn't try to write my own. icu4j might include such an algorithm, not sure.

On 1/13/2011 1:12 PM, Peter Karich wrote:
Take a look also into icu4j, which is one of the contrib projects. Converting on the fly is not supported by Solr, but should be relatively easy in Java. Scanning is also relatively simple (accept only a range). Detection too: http://www.mozilla.org/projects/intl/chardet.html

We've created an index from a number of different documents that are supplied by third parties. We want the index to only contain UTF-8 encoded characters. I have a couple questions about this:
1) Is there any way to be sure during indexing (by setting something in the solr configuration?) that the documents that we index will always be stored in utf-8? Can solr convert documents that need converting on the fly, or can solr reject documents containing illegal characters?
2) Is there a way to scan the existing index to find any string containing non-utf8 characters? Or is there another way that I can discover if any crept into my index?
Re: segment gets corrupted (after background merge ?)
Generally it's not safe to run CheckIndex if a writer is also open on the index. It's not safe because CheckIndex could hit FNFEs on opening files, or, if you use -fix, CheckIndex will change the index out from under your other IndexWriter (which will then cause other kinds of corruption).

That said, I don't think the corruption that CheckIndex is detecting in your index would be caused by having a writer open on the index. Your first CheckIndex has a different deletes file (_phe_p3.del, with 44824 deleted docs) than the 2nd time you ran it (_phe_p4.del, with 44828 deleted docs), so it must somehow have to do with that change.

One question: if you have a corrupt index and run CheckIndex on it several times in a row, does it always fail in the same way (i.e. the same term hits the below exception)?

Is there any way I could get a copy of one of your corrupt cases? I can then dig...

Mike

On Thu, Jan 13, 2011 at 10:52 AM, Stéphane Delprat stephane.delp...@blogspirit.com wrote:
I understand less and less what is happening to my Solr. I did a checkIndex (without -fix) and there was an error. So I did another checkIndex with -fix, and then the error was gone; the segment was alright. During checkIndex I do not shut down the Solr server, I just make sure no client connects to the server. Should I shut down the Solr server during checkIndex?

First checkIndex:

  4 of 17: name=_phe docCount=264148
    compound=false
    hasProx=true
    numFiles=9
    size (MB)=928.977
    diagnostics = {optimize=false, mergeFactor=10, os.version=2.6.26-2-amd64, os=Linux, mergeDocStores=true, lucene.version=2.9.3 951790 - 2010-06-06 01:30:55, source=merge, os.arch=amd64, java.version=1.6.0_20, java.vendor=Sun Microsystems Inc.}
    has deletions [delFileName=_phe_p3.del]
    test: open reader.........OK [44824 deleted docs]
    test: fields..............OK [51 fields]
    test: field norms.........OK [51 fields]
    test: terms, freq, prox...ERROR [term post_id:562 docFreq=1 != num docs seen 0 + num docs deleted 0]
java.lang.RuntimeException: term post_id:562 docFreq=1 != num docs seen 0 + num docs deleted 0
        at org.apache.lucene.index.CheckIndex.testTermIndex(CheckIndex.java:675)
        at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:530)
        at org.apache.lucene.index.CheckIndex.main(CheckIndex.java:903)
    test: stored fields.......OK [7206878 total field count; avg 32.86 fields per doc]
    test: term vectors........OK [0 total vector count; avg 0 term/freq vector fields per doc]
FAILED
    WARNING: fixIndex() would remove reference to this segment; full exception:
java.lang.RuntimeException: Term Index test failed
        at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:543)
        at org.apache.lucene.index.CheckIndex.main(CheckIndex.java:903)

A few minutes later:

  4 of 18: name=_phe docCount=264148
    compound=false
    hasProx=true
    numFiles=9
    size (MB)=928.977
    diagnostics = {optimize=false, mergeFactor=10, os.version=2.6.26-2-amd64, os=Linux, mergeDocStores=true, lucene.version=2.9.3 951790 - 2010-06-06 01:30:55, source=merge, os.arch=amd64, java.version=1.6.0_20, java.vendor=Sun Microsystems Inc.}
    has deletions [delFileName=_phe_p4.del]
    test: open reader.........OK [44828 deleted docs]
    test: fields..............OK [51 fields]
    test: field norms.........OK [51 fields]
    test: terms, freq, prox...OK [3200899 terms; 26804334 terms/docs pairs; 28919124 tokens]
    test: stored fields.......OK [7206764 total field count; avg 32.86 fields per doc]
    test: term vectors........OK [0 total vector count; avg 0 term/freq vector fields per doc]

On 12/01/2011 16:50, Michael McCandless wrote:
Curious...
is it always a "docFreq=1 != num docs seen 0 + num docs deleted 0"? It looks like new deletions were flushed against the segment (del file changed from _ncc_22s.del to _ncc_24f.del). Are you hitting any exceptions during indexing?

Mike

On Wed, Jan 12, 2011 at 10:33 AM, Stéphane Delprat stephane.delp...@blogspirit.com wrote:
I got another corruption. It sure looks like it's the same type of error (on a different field). It's also not linked to a merge, since the segment size did not change.

*** good segment:

  1 of 9: name=_ncc docCount=1841685
    compound=false
    hasProx=true
    numFiles=9
    size (MB)=6,683.447
    diagnostics = {optimize=false, mergeFactor=10, os.version=2.6.26-2-amd64, os=Linux, mergeDocStores=true, lucene.version=2.9.3 951790 - 2010-06-06 01:30:55, source=merge, os.arch=amd64, java.version=1.6.0_20, java.vendor=Sun Microsystems Inc.}
    has deletions [delFileName=_ncc_22s.del]
    test: open reader.........OK [275881 deleted docs]
    test: fields..............OK [51 fields]
    test: field norms.........OK [51 fields]
    test: terms, freq, prox...OK [17952652 terms; 174113812 terms/docs pairs; 204561440 tokens]
    test: stored fields.......OK
Re: verifying that an index contains ONLY utf-8
The tokens that Lucene sees (pre-4.0) are char[] based (i.e., UTF16), so the first place where invalid UTF-8 is detected/corrected/etc. is during your analysis process, which takes your raw content and produces char[] based tokens.

Second, during indexing, Lucene ensures that the incoming char[] tokens are valid UTF16. If an invalid char sequence is hit, e.g. a naked (unpaired) surrogate or an invalid surrogate pair, the behavior is undefined, but, today, Lucene will replace such invalid char/s with the unicode character U+FFFD. So you could iterate all terms looking for that replacement char (a sketch of such a scan is below).

Mike

On Wed, Jan 12, 2011 at 5:16 PM, Paul p...@nines.org wrote:
We've created an index from a number of different documents that are supplied by third parties. We want the index to only contain UTF-8 encoded characters. I have a couple questions about this:
1) Is there any way to be sure during indexing (by setting something in the solr configuration?) that the documents that we index will always be stored in utf-8? Can solr convert documents that need converting on the fly, or can solr reject documents containing illegal characters?
2) Is there a way to scan the existing index to find any string containing non-utf8 characters? Or is there another way that I can discover if any crept into my index?
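A rough, untested sketch of that scan over every indexed term, against a Lucene 2.9/3.x index (the index path is passed as the first argument):

  import java.io.File;
  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.index.Term;
  import org.apache.lucene.index.TermEnum;
  import org.apache.lucene.store.FSDirectory;

  public class FindReplacementChars {
    public static void main(String[] args) throws Exception {
      // open the index read-only so a concurrent writer is not disturbed
      IndexReader reader = IndexReader.open(FSDirectory.open(new File(args[0])), true);
      try {
        TermEnum terms = reader.terms();   // enumerate every term in every field
        while (terms.next()) {
          Term t = terms.term();
          // U+FFFD is what Lucene substitutes for invalid char sequences
          if (t.text().indexOf('\uFFFD') >= 0) {
            System.out.println(t.field() + ":" + t.text());
          }
        }
        terms.close();
      } finally {
        reader.close();
      }
    }
  }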
Variable datasources
I have several similar databases that I'd like to import from, 14 to be exact. There is also a 15th database where I can get a listing of the 14 databases. I'm trying to do a variable datasource such as:

<dataSource url="jdbc:mysql://localhost/${local.code}" name="content"/>
<dataSource url="jdbc:mysql://localhost/master" name="master"/>

Then my import query looks like this:

<document name="items">
  <entity dataSource="master" name="local" query="select code from locals" rootEntity="false">
    <entity dataSource="content" name="item" query="select *, '${local.code}' as code from item"/>
  </entity>
</document>

The above configuration works, but the ${local.code} variable is ONLY resolved the first time. It loops through the correct number of times, and I can see ${local.code} being resolved in each of the item queries, but the data source never changes. I also tried creating datasources for each local and then using a variable datasource in the entity, such as:

<dataSource url="jdbc:mysql://localhost/aaa" name="content_aaa"/>
<dataSource url="jdbc:mysql://localhost/bbb" name="content_bbb"/>
<dataSource url="jdbc:mysql://localhost/ccc" name="content_ccc"/>
<dataSource url="jdbc:mysql://localhost/master" name="master"/>

and then the document as:

<document name="items">
  <entity dataSource="master" name="local" query="select code from locals" rootEntity="false">
    <entity dataSource="content_${local.code}" name="item" query="select *, '${local.code}' as code from item"/>
  </entity>
</document>

but the ${local.code} variable is not resolved, and it attempts to connect to the literal source content_${local.code}. Any ideas how I can get all of the items imported for all of the locals at once? (A workaround sketch I'm considering is below.)
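One workaround sketch, assuming all 14 databases live on the same MySQL server and the DIH user can read across them: use a single datasource and qualify the table name with the database in the query instead of varying the connection:

<dataSource url="jdbc:mysql://localhost/master" name="master"/>
<document name="items">
  <entity dataSource="master" name="local" query="select code from locals" rootEntity="false">
    <!-- ${local.code}.item selects from the per-local database on the same server -->
    <entity dataSource="master" name="item"
            query="select *, '${local.code}' as code from ${local.code}.item"/>
  </entity>
</document>

This is untested, but it sidesteps the unresolved-datasource-name problem entirely since only the query text varies.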
start value in queries zero or one based?
Do I even need a body for this message? ;-)

Dennis Gearon

Signature Warning
It is always a good idea to learn from your own mistakes. It is usually a better idea to learn from others’ mistakes, so you do not have to make them yourself.
from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036'
EARTH has a Right To Life, otherwise we all die.
Re: verifying that an index contains ONLY utf-8
On Thu, Jan 13, 2011 at 2:05 PM, Jonathan Rochkind rochk...@jhu.edu wrote:
There are various packages of such heuristic algorithms to guess char encoding; I wouldn't try to write my own. icu4j might include such an algorithm, not sure.

it does: http://icu-project.org/apiref/icu4j/com/ibm/icu/text/CharsetDetector.html
this takes a sample of the file and makes a guess.

also, in general keep in mind that java CharsetDecoders tend to silently replace or skip illegal chars, rather than throw exceptions. If you want to instead be paranoid about these things, instead of opening InputStreamReader with a Charset, open it with something like:

charset.newDecoder().onMalformedInput(CodingErrorAction.REPORT).onUnmappableCharacter(CodingErrorAction.REPORT)

Then if the decoder ends up in some illegal state/byte sequence, instead of silently replacing with U+FFFD, it will throw an exception.

Of course, as Jonathan says, you cannot confirm that something is UTF-8. But many times you can confirm it's definitely not: see https://issues.apache.org/jira/browse/SOLR-2003 for an example practical use of this; we throw an exception if we can detect that your stopwords or synonyms file is definitely wrongly-encoded.
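Spelled out, that strict-decoder setup looks like this (a small sketch; any read through the returned Reader throws CharacterCodingException on bad bytes instead of substituting U+FFFD):

  import java.io.InputStream;
  import java.io.InputStreamReader;
  import java.io.Reader;
  import java.nio.charset.Charset;
  import java.nio.charset.CharsetDecoder;
  import java.nio.charset.CodingErrorAction;

  public class StrictUtf8 {
    // Returns a Reader that reports malformed UTF-8 instead of
    // silently replacing or skipping illegal byte sequences.
    public static Reader reader(InputStream in) {
      CharsetDecoder decoder = Charset.forName("UTF-8").newDecoder()
          .onMalformedInput(CodingErrorAction.REPORT)
          .onUnmappableCharacter(CodingErrorAction.REPORT);
      return new InputStreamReader(in, decoder);
    }
  }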
Re: start value in queries zero or one based?
On Jan 13, 2011, at 1:28 PM, Dennis Gearon wrote:
Do I even need a body for this message? ;-)
Dennis Gearon

Are you asking "is it" or "should it be"? If the latter, we can also discuss Emacs and vi.

wunder
--
Walter Underwood
K6WRU
Re: Solr + Hadoop
Hi Joan,

I am not sure whether it applies, but are you really using Solr 1.4 (not 1.4.1), and are you also using the Hadoop jars provided by this patch (0.20.1, not 0.21.0)? I ask because I had some other issues with other classes that were related to different package definitions etc. In short: some import organization failed, and my IDE did not notice that when I built the files. However, this is just a guess.

Regards
Re: start value in queries zero or one based?
Perhaps it would be more useful to RTFM instead of messing around on the mailing list: http://wiki.apache.org/solr/CommonQueryParameters#start

Please, read every wiki page you can find and write notes.

Do I even need a body for this message? ;-)
Dennis Gearon
Signature Warning
It is always a good idea to learn from your own mistakes. It is usually a better idea to learn from others’ mistakes, so you do not have to make them yourself. from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036'
EARTH has a Right To Life, otherwise we all die.
RE: start value in queries zero or one based?
Please, read every wiki page you can find and write notes. NO!!! Once you start down this road, there is no turning back! Soon you will feel the need to turn your notes into a new wiki page or a blog post, and people will read those and write notes, and the process will repeat, ad infinitum: a Vicious Circle of Writing (VCoW). Please, please, please: Don't have a VCoW, man!
Re: start value in queries zero or one based?
I'm migrating to CTO/CEO status in life due to building a small company. I find I don't have too much time for theory; I work with what is. So: what is it, not what should it be?

Dennis Gearon

Signature Warning
It is always a good idea to learn from your own mistakes. It is usually a better idea to learn from others’ mistakes, so you do not have to make them yourself. from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036'
EARTH has a Right To Life, otherwise we all die.

- Original Message -
From: Walter Underwood wun...@wunderwood.org
To: solr-user@lucene.apache.org
Sent: Thu, January 13, 2011 1:38:26 PM
Subject: Re: start value in queries zero or one based?

On Jan 13, 2011, at 1:28 PM, Dennis Gearon wrote:
Do I even need a body for this message? ;-)
Dennis Gearon

Are you asking "is it" or "should it be"? If the latter, we can also discuss Emacs and vi.

wunder
--
Walter Underwood
K6WRU
Re: verifying that an index contains ONLY utf-8
Thanks for all the responses. CharsetDetector does look promising.

Unfortunately, we aren't allowed to keep the original of much of our data, so the Solr index is the only place it exists (to us). I do have a Java app that reindexes, i.e., it reads all documents out of one index, does some transform on them, then writes them to a second index. So I already have a place where I see all the data in the index stream by. I wanted to make sure there wasn't some built-in way of doing what I need.

I know that it is possible to fool the algorithm, but I'll check first whether the string is a possible utf-8 string and leave it alone if so. Then I won't be introducing more errors, and maybe I can detect a large percentage of the non-utf-8 strings. (The detection step I have in mind is sketched below.)

On Thu, Jan 13, 2011 at 4:36 PM, Robert Muir rcm...@gmail.com wrote:
it does: http://icu-project.org/apiref/icu4j/com/ibm/icu/text/CharsetDetector.html
this takes a sample of the file and makes a guess.
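Roughly, with icu4j (class names from the CharsetDetector API linked in Robert's reply; the confidence threshold is my own guess and would need tuning):

  import com.ibm.icu.text.CharsetDetector;
  import com.ibm.icu.text.CharsetMatch;

  public class EncodingGuess {
    // True if the bytes plausibly decode as UTF-8 according to ICU's detector.
    public static boolean looksLikeUtf8(byte[] bytes) {
      CharsetDetector detector = new CharsetDetector();
      detector.setText(bytes);
      // detectAll() returns candidate charsets ordered by confidence (0-100)
      for (CharsetMatch match : detector.detectAll()) {
        if ("UTF-8".equals(match.getName()) && match.getConfidence() > 50) {
          return true;
        }
      }
      return false;
    }
  }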
RE: verifying that an index contains ONLY utf-8
So you're allowed to put the entire original document in a stored field in Solr, but you aren't allowed to stick it in, say, a Redis or CouchDB too? Ah, bureaucracy. But there's no reason what you are doing won't work, as you of course already know from doing it. If you actually know the charset of a document when indexing it, you might want to consider putting THAT in a stored field: it's easier to keep track of an encoding you know than to try to guess it again later. From: Paul [p...@nines.org] Sent: Thursday, January 13, 2011 6:21 PM To: solr-user@lucene.apache.org Subject: Re: verifying that an index contains ONLY utf-8 [...]
RE: start value in queries zero or one based?
You could have tried it and seen for yourself on any Solr server in your possession in less time than it took to have this thread. And if you don't have a Solr server, then why do you care? But the answer is 0. http://wiki.apache.org/solr/CommonQueryParameters#start ("The default value is 0.") Since the default start is 0, leaving start out means you don't skip anything; if you DO want to skip the first item of your result set, start=1 will do it. From: Dennis Gearon [gear...@sbcglobal.net] Sent: Thursday, January 13, 2011 6:04 PM To: solr-user@lucene.apache.org Subject: Re: start value in queries zero or one based? [...]
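To make it concrete, paging is just the start and rows parameters together; a sketch against a hypothetical local server:

    http://localhost:8983/solr/select?q=solr&start=0&rows=10    (first page: results 1-10)
    http://localhost:8983/solr/select?q=solr&start=10&rows=10   (second page: results 11-20)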
Searchers and Warmups
I'm trying to understand the mechanics behind warming up: when new searchers are registered, and what the costs are. A quick Google didn't point me in the right direction, so I'm hoping for some pointers here. -- David Cramer
Re: Solr + Hadoop
Joan, make sure that you are running the job on a Hadoop 0.21 cluster. (It looks like you have compiled the apache-solr-hadoop jar with Hadoop 0.21 but are using it on a 0.20 cluster.) -Alexander
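A quick way to confirm what a cluster node is actually running (assuming the hadoop launcher script is on the PATH) is:

    hadoop version

which prints the Hadoop version the node's jars come from.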
[sfield] Missing in Spatial Search
According to the documentation here: http://wiki.apache.org/solr/SpatialSearch the field that identifies the spatial point data is sfield. See the console output below.

Jan 13, 2011 6:49:40 PM org.apache.solr.core.SolrCore execute
INFO: [] webapp=/solr path=/select params={spellcheck=true&f.jtype.facet.mincount=1&facet=true&f.cat.facet.mincount=1&f.cause.facet.mincount=1&f.urgency.facet.mincount=1&rows=10&start=0&q=*:*&f.status.facet.mincount=1&facet.field=cat&facet.field=jtype&facet.field=status&facet.field=cause&facet.field=urgency?=fq={!type%3Dgeofilt+pt%3D39.0914154052734,-84.517822265625+sfield%3Dcoords+d%3D300}text:} hits=113 status=0 QTime=1

Jan 13, 2011 6:51:51 PM org.apache.solr.common.SolrException log
SEVERE: org.apache.solr.common.SolrException: missing sfield for spatial request

Any ideas on this one? Thanks in advance, Adam
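For comparison, the filter syntax that the SpatialSearch wiki page documents looks like this when sent as its own fq parameter (field name and point taken from the log above, host is a placeholder):

    http://localhost:8983/solr/select?q=*:*&fq={!geofilt pt=39.0914154052734,-84.517822265625 sfield=coords d=300}

One thing worth checking in the log above: the geofilt clause appears fused onto the urgency parameter as "urgency?=fq={...}", which looks like the fq may have been appended to the URL with a second "?" instead of an "&", so Solr never sees an fq (or an sfield) at all.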
Re: Multi-word exact keyword case-insensitive search suggestions
Thanks for your reply. However, it doesn't work for my case at all. I think it's a problem with the query parser or something else. It forces me to put double quotes around the search query in order to get results:

<str name="rawquerystring">sim 010</str>
<str name="querystring">sim 010</str>
<str name="parsedquery">+DisjunctionMaxQuery((keyphrase:sim 010)) ()</str>
<str name="parsedquery_toString">+(keyphrase:sim 010) ()</str>

<str name="rawquerystring">smart mobile</str>
<str name="querystring">smart mobile</str>
<str name="parsedquery">+((DisjunctionMaxQuery((keyphrase:smart)) DisjunctionMaxQuery((keyphrase:mobile)))~2) ()</str>
<str name="parsedquery_toString">+(((keyphrase:smart) (keyphrase:mobile))~2) ()</str>

The intent here is to do a full-text search, and part of that is to search the keyword field, so I can't put quotes around it.

On Thu, Jan 13, 2011 at 10:30 PM, Adam Estrada estrada.adam.gro...@gmail.com wrote: Hi, the following seems to work pretty well.

<fieldType name="text_ws" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.ShingleFilterFactory" maxShingleSize="4" outputUnigrams="true" outputUnigramIfNoNgram="false"/>
  </analyzer>
</fieldType>

<!-- A text field that uses WordDelimiterFilter to enable splitting and matching of words on case-change, alphanumeric boundaries, and non-alphanumeric chars, so that a query of "wifi" or "wi fi" could match a document containing "Wi-Fi". Synonyms and stopwords are customized by external files, and stemming is enabled. The attribute autoGeneratePhraseQueries="true" (the default) causes words that get split to form phrase queries. For example, WordDelimiterFilter splitting text:pdp-11 will cause the parser to generate text:"pdp 11" rather than (text:PDP OR text:11). NOTE: autoGeneratePhraseQueries="true" tends to not work well for non-whitespace-delimited languages. -->
<fieldType name="text" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- in this example, we will only use synonyms at query time
    <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
    -->
    <!-- Case insensitive stop word removal. Add enablePositionIncrements=true in both the index and query analyzers to leave a 'gap' for more accurate phrase queries. -->
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>

<copyField source="cat" dest="text"/>
<copyField source="subject" dest="text"/>
<copyField source="summary" dest="text"/>
<copyField source="cause" dest="text"/>
<copyField source="status" dest="text"/>
<copyField source="urgency" dest="text"/>

I ingest the source fields as text_ws (I know I've changed it a bit) and then copy the field to text. This seems to do what you are asking for. Adam

On Thu, Jan 13, 2011 at 12:05 AM, Chamnap Chhorn chamnapchh...@gmail.com wrote: Hi all, I've been stuck on exact keyword matching for several days. Hope you guys can help me. Here is the scenario: 1. It needs to match a multi-word keyword, case-insensitively. 2. Partial-word or single-word matching against this field is not allowed. I want to know the field type definition for this field and a sample Solr query. I need to combine this search with my full-text search, which uses a dismax query. Thanks -- Chhorn Chamnap http://chamnapchhorn.blogspot.com/
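For what it's worth, the usual recipe for whole-value, case-insensitive exact matching is KeywordTokenizer plus LowerCaseFilter, without the shingles. A sketch (the type name keyword_ci is mine, not from this thread):

    <fieldType name="keyword_ci" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>
    <field name="keyphrase" type="keyword_ci" indexed="true" stored="true"/>

The catch visible in the debug output above remains, though: dismax splits the query string on whitespace before the field analyzer runs, so a multi-word value only reaches such a field as a single token when it is quoted as a phrase.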
Re: segment gets corrupted (after background merge ?)
1) CheckIndex is not supposed to change a corrupt segment, only remove it. 2) Are you using local hard disks, or do you run on a common SAN or remote file server? I have seen corruption errors on SANs, where existing files have random changes.

On Thu, Jan 13, 2011 at 11:06 AM, Michael McCandless luc...@mikemccandless.com wrote: Generally it's not safe to run CheckIndex if a writer is also open on the index. It's not safe because CheckIndex could hit FNFEs on opening files, or, if you use -fix, CheckIndex will change the index out from under your other IndexWriter (which will then cause other kinds of corruption). That said, I don't think the corruption that CheckIndex is detecting in your index would be caused by having a writer open on the index. Your first CheckIndex run has a different deletes file (_phe_p3.del, with 44824 deleted docs) than the second time you ran it (_phe_p4.del, with 44828 deleted docs), so it must somehow have to do with that change. One question: if you have a corrupt index and run CheckIndex on it several times in a row, does it always fail in the same way? (I.e., does the same term hit the exception below?) Is there any way I could get a copy of one of your corrupt cases? I can then dig... Mike

On Thu, Jan 13, 2011 at 10:52 AM, Stéphane Delprat stephane.delp...@blogspirit.com wrote: I understand less and less what is happening to my Solr. I did a checkIndex (without -fix) and there was an error... So I did another checkIndex with -fix, and then the error was gone. The segment was alright. During checkIndex I do not shut down the Solr server; I just make sure no clients connect to the server. Should I shut down the Solr server during checkIndex?

First checkIndex:
4 of 17: name=_phe docCount=264148
  compound=false
  hasProx=true
  numFiles=9
  size (MB)=928.977
  diagnostics = {optimize=false, mergeFactor=10, os.version=2.6.26-2-amd64, os=Linux, mergeDocStores=true, lucene.version=2.9.3 951790 - 2010-06-06 01:30:55, source=merge, os.arch=amd64, java.version=1.6.0_20, java.vendor=Sun Microsystems Inc.}
  has deletions [delFileName=_phe_p3.del]
  test: open reader.........OK [44824 deleted docs]
  test: fields..............OK [51 fields]
  test: field norms.........OK [51 fields]
  test: terms, freq, prox...ERROR [term post_id:562 docFreq=1 != num docs seen 0 + num docs deleted 0]
java.lang.RuntimeException: term post_id:562 docFreq=1 != num docs seen 0 + num docs deleted 0
  at org.apache.lucene.index.CheckIndex.testTermIndex(CheckIndex.java:675)
  at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:530)
  at org.apache.lucene.index.CheckIndex.main(CheckIndex.java:903)
  test: stored fields.......OK [7206878 total field count; avg 32.86 fields per doc]
  test: term vectors........OK [0 total vector count; avg 0 term/freq vector fields per doc]
FAILED
  WARNING: fixIndex() would remove reference to this segment; full exception:
java.lang.RuntimeException: Term Index test failed
  at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:543)
  at org.apache.lucene.index.CheckIndex.main(CheckIndex.java:903)

A few minutes later:
4 of 18: name=_phe docCount=264148
  compound=false
  hasProx=true
  numFiles=9
  size (MB)=928.977
  diagnostics = {optimize=false, mergeFactor=10, os.version=2.6.26-2-amd64, os=Linux, mergeDocStores=true, lucene.version=2.9.3 951790 - 2010-06-06 01:30:55, source=merge, os.arch=amd64, java.version=1.6.0_20, java.vendor=Sun Microsystems Inc.}
  has deletions [delFileName=_phe_p4.del]
  test: open reader.........OK [44828 deleted docs]
  test: fields..............OK [51 fields]
  test: field norms.........OK [51 fields]
  test: terms, freq, prox...OK [3200899 terms; 26804334 terms/docs pairs; 28919124 tokens]
  test: stored fields.......OK [7206764 total field count; avg 32.86 fields per doc]
  test: term vectors........OK [0 total vector count; avg 0 term/freq vector fields per doc]

On 12/01/2011 16:50, Michael McCandless wrote: Curious... is it always a docFreq=1 != num docs seen 0 + num docs deleted 0? It looks like new deletions were flushed against the segment (del file changed from _ncc_22s.del to _ncc_24f.del). Are you hitting any exceptions during indexing? Mike

On Wed, Jan 12, 2011 at 10:33 AM, Stéphane Delprat stephane.delp...@blogspirit.com wrote: I got another corruption. It sure looks like it's the same type of error (on a different field). It's also not linked to a merge, since the segment size did not change. *** good segment: 1 of 9: name=_ncc docCount=1841685 compound=false hasProx=true numFiles=9 size (MB)=6,683.447 diagnostics = {optimize=false, mergeFactor=10, os.version=2.6.26-2-amd64, os=Linux, mergeDocStores=true, lucene.version=2.9.3 951790 - 2010-06-06 01:30:55, source=merge, os.arch=amd64, java.version=1.6.0_20,
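For anyone else following along: CheckIndex lives in the Lucene core jar and is run from the command line. A typical invocation (the jar name and index path are placeholders; back up the index first, since -fix permanently drops unrecoverable segments):

    java -cp lucene-core-2.9.3.jar org.apache.lucene.index.CheckIndex /path/to/index
    java -cp lucene-core-2.9.3.jar org.apache.lucene.index.CheckIndex /path/to/index -fix

The first form only reports; only the second modifies the index.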
Re: Multi-word exact keyword case-insensitive search suggestions
Ahhh...the fun of open source software ;-). Requires a ton of trial and error! I found what worked for me and figured it was worth passing it along. If you don't mind...when you sort everything out on your end, please post results for the rest of us to take a gander at. Cheers, Adam On Jan 13, 2011, at 9:08 PM, Chamnap Chhorn chamnapchh...@gmail.com wrote: [...]
use of schema.xml
I'm going to buy the book for Solr, since it looks like I need to do more of the work than I thought I would. But from looking at it, the schema file only says:
A/ What types of data can be in the 'fields' of the documents
B/ Whether there are any dynamically assigned fields
C/ What parsers are available
D/ Other stuff
And what it DOESN'T do is set the 'schema' for the index, right? (Like DDL for a database does.) Dennis Gearon
Re: Solr 4.0 = Spatial Search - How to
Spatial does not support separate fields: you don't need lat/lng fields, only 'coord'. To get latitude/longitude into the coord field from the DIH, you need to use a transformer in the DIH script. It would populate a field 'coord' with a text string made from the lat and lng fields: http://wiki.apache.org/solr/DataImportHandler?#TemplateTransformer

On Wed, Jan 12, 2011 at 5:47 PM, Adam Estrada estrada.adam.gro...@gmail.com wrote: In my case, I am getting data from a database and am able to concatenate the lat/long as a coordinate pair to store in my coords field. To test this, I randomized the lat/long values and generated about 6000 documents. Adam

On Wed, Jan 12, 2011 at 8:29 PM, caman aboxfortheotherst...@gmail.com wrote: Adam, thanks. Yes that helps, but how does the coords field get populated? All I have is:
<field name="lat" type="tdouble" indexed="true" stored="true"/>
<field name="lng" type="tdouble" indexed="true" stored="true"/>
<field name="coord" type="location" indexed="true" stored="true"/>
Fields 'lat' and 'lng' get populated by the DataImportHandler, but 'coord', I am not sure? Thanks

-- Lance Norskog goks...@gmail.com
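A sketch of what that transformer wiring can look like in the DIH config; the entity name, column names, and query here are assumptions, not from the thread:

    <entity name="item" transformer="TemplateTransformer"
            query="select id, lat, lng from locations">
      <!-- build a "lat,lng" string that the location field type can parse -->
      <field column="coord" template="${item.lat},${item.lng}"/>
    </entity>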
Re: use of schema.xml
Correct. Solr and Lucene do not store or enforce the schema. You're on your own :) On Thu, Jan 13, 2011 at 8:09 PM, Dennis Gearon gear...@sbcglobal.net wrote: [...] -- Lance Norskog goks...@gmail.com
Re: use of schema.xml
Wait- it does enforce the schema names. What it does not enforce is field contents when you change the schema. Since Lucene does not have field replacement, it is not practical to remove or add a field to all existing documents when you change the schema. On Thu, Jan 13, 2011 at 8:15 PM, Lance Norskog goks...@gmail.com wrote: [...] -- Lance Norskog goks...@gmail.com
Re: use of schema.xml
So I could put 1-10,000 fields in any one document, as long as they are told what type they are, or they are dynamically matched by dynamic fields relative to what's in the schema.xml file? It's very much like Google's 'BigTable' or 'ElasticSearch' that way, right? It's up to me to enforce any field names or quantities and to assign field types during insert/update? Dennis Gearon - Original Message From: Lance Norskog goks...@gmail.com To: solr-user@lucene.apache.org Sent: Thu, January 13, 2011 8:16:54 PM Subject: Re: use of schema.xml [...]
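As an aside for other readers, the dynamic-field matching Dennis mentions is declared in schema.xml with wildcard names, along the lines of the stock example schema:

    <!-- any field whose name ends in _s is indexed and stored as a plain string -->
    <dynamicField name="*_s" type="string" indexed="true" stored="true"/>
    <!-- any field whose name ends in _i becomes an integer field -->
    <dynamicField name="*_i" type="int" indexed="true" stored="true"/>

A document field matching a pattern gets that pattern's type; a field matching no declared field or pattern is rejected at index time.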
Re: Improving Solr performance
On Thu, Jan 13, 2011 at 10:10 PM, supersoft elarab...@gmail.com wrote: On the one hand, I found really interesting those comments about the reasons for sharding. The documentation agrees with you about why to split an index into several shards (problems with big index sizes), but I don't find any explanation of the drawbacks, such as an Access Control List. I guess there should be some, and they can be critical in this design. Any example? [...]

Can I ask what might be a stupid question? How are you measuring the numbers below, and what do they mean? As your hit ratio is close to 1 (i.e., everything after the first query is coming from the cache), these numbers seem a little strange. Are these really the times for each of the N simultaneous queries? They seem to be monotonically increasing (though with a couple of strange exceptions), which leads me to suspect that they are some kind of cumulative times; e.g., by this interpretation, for the case of the 10 simultaneous queries, the first one takes 1047 ms, the second 268 ms, the third 125 ms, and so on.

We have run performance tests with pg_bench on an index of size 40 GB on a single Solr server with about 6 GB of RAM allocated to Solr, and see what I would think of as expected behaviour, i.e., for every fresh query term, the first query takes the longest, and the time for subsequent queries with the same term goes down dramatically, as the result is coming out of the cache. This is at odds with what you describe here, so I have to go back and check that we did not miss something important.

1 simultaneous query: 3437 ms (cache fails)
2 simultaneous queries: 594, 954 ms
10 simultaneous queries: 1047, 1313, 1438, 1797, 1922, 2094, 2250, 2500, 2938, 3000 ms
50 simultaneous queries: 1203, 1453, 1453, 1437, 1625, 1953, 5688, 12938, 14953, 16281, 15984, 16453, 15812, 16469, 16563, 16844, 17703, 16843, 17359, 16828, 18235, 18219, 18172, 18203, 17672, 17344, 17453, 18484, 18157, 18531, 18297, 18359, 18063, 18516, 18125, 17516, 18562, 18016, 18187, 18610, 18703, 18672, 17829, 18344, 18797, 18781, 18265, 18875, 18250, 18812 ms
100 simultaneous queries: 1297, 1531, 1969, 2203, 2375, 2891, 3937, 4109, 4703, 4890, 5047, 5312, 5563, 6422, 6437, 7063, 7093, 7391, 7594, 7672, 8172, 8547, 8750, 8984, 9265, 9609, 9907, 10344, 11406, 11484, 11484, 11500, 11547, 11703, 11797, 11875, 11922, 12328, 12375, 12875, 12922, 13187, 13219, 13407, 13500, 13562, 13719, 13828, 13875, 14016, 14078, 14672, 15922, 16328, 16625, 16953, 17282, 18172, 18484, 18985, 20594, 20625, 20860, 21281, 21469, 21625, 21875, 21875, 22141, 22157, 22172, 23125, 23125, 23141, 23203, 23203, 23328, 24625, 24641, 24672, 24797, 24985, 25031, 25188, 25844, 25937, 26016, 26437, 26453, 26437, 26485, 28297, 28687, 31782, 31985, 31969, 32016, 32031, 32016, 32281 ms

[...] Regards, Gora
Re: Variable datasources
On Fri, Jan 14, 2011 at 1:02 AM, tjpoe tanner.post...@gmail.com wrote: [...] I also tried creating datasources for each local and then using a variable datasource in the entity, such as:
<datasource url="jdbc:mysql://localhost/aaa" name="content_aaa"/>
<datasource url="jdbc:mysql://localhost/bbb" name="content_bbb"/>
<datasource url="jdbc:mysql://localhost/ccc" name="content_ccc"/>
<datasource url="jdbc:mysql://localhost/master" name="master"/>
and then the document as:
<document name="items">
  <entity datasource="master" name="local" query="select code from locals" rootEntity="false">
    <entity datasource="content_${local.code}" name="item" query="select *, ${local.code} as code from item"/>
  </entity>
</document>
but the ${local.code} variable is not resolved and it attempts to connect to the literal source content_${local.code}. [...]

As you have discovered, the datasource attribute is not variable-resolved. There was a thread on this subject a couple of days ago, and apparently Alexei has resolved the issue. Please see: http://www.mail-archive.com/solr-user@lucene.apache.org/msg45407.html Regards, Gora
Re: Adding a new site to existing solr configuration
On Thu, Jan 13, 2011 at 10:47 PM, PeterKerk vettepa...@hotmail.com wrote: I still have the default Solr example config running on Jetty. I use Cygwin to start my current site. Now I already have fully configured one Solr instance with these files:
\example\example-DIH\solr\db\conf\my-data-config.xml
\example\example-DIH\solr\db\conf\schema.xml
\example\example-DIH\solr\db\conf\solrconfig.xml
Now I wish to add ANOTHER site to my already running sites. This site of course has a different data-config, but the question is: what files can/should I add to the already existing directories? [...]

If I understand your requirements correctly, the easiest way would be to do the following:
* Copy the entire directory example/example-DIH/solr/db to a new one, say example/example-DIH/solr/test
* As this is running a multi-core setup, add the new site as a different core instance in example/example-DIH/solr/solr.xml. Thus, just before the </cores> line, add: <core default="false" instanceDir="test" name="test"></core>
* example/example-DIH/solr/test/conf/solrconfig.xml is already set up to use db-data-config.xml as the DIH configuration file, so you can make any changes there. Else, change the name of db-data-config.xml, and modify the config attribute of the /dataimport RequestHandler in solrconfig.xml.
* Make any desired changes to schema.xml, e.g., if you have different fields, or if they are of different types.
* Start Solr, and run it as usual, as per example/example-DIH/README. E.g., a dataimport would be initiated by loading http://localhost:8983/solr/test/dataimport?command=full-import
Regards, Gora
Re: Solr 4.0 = Spatial Search - How to
I have used that type of location search, but not spatial search; I wrote the logic at the application end. I cache the location ids and their lat/long. When a query comes in for some location, say New Delhi, my location-search logic at the application end calculates the distance from New Delhi to the other locations in my cache and shortlists only the locations within my radius; then I go to Solr to search on all the locations I got from this logic. It works fast because it operates on only a little data, around 500 locations, whereas in spatial search that calculation is done over every document we have. So this workaround does not hurt performance as my index grows, but spatial search does. - Grijesh
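For readers who want to try the same workaround, the per-location distance step is just the haversine formula. A minimal Java sketch (the method name and radius constant are mine):

    // Great-circle (haversine) distance in kilometers between two lat/long points.
    static double distanceKm(double lat1, double lon1, double lat2, double lon2) {
        final double EARTH_RADIUS_KM = 6371.0;
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                 + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                 * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(a));
    }

Locations whose distance is within the radius go into the shortlist, and their ids become a filter on the Solr query.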
Re: Solr 4.0 = Spatial Search - How to
Thanks. Here was the issue: concatenating the two floats (lat, lng) on the MySQL end converted the result to a BLOB, and indexing would fail when storing a BLOB in a 'location' type field. After the BLOB issue was resolved, all worked OK. Thank you all for your help
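In case it saves someone the same detour: casting the concatenated pair to CHAR on the MySQL side is one way to keep it from arriving as a BLOB. A sketch, with table and column names assumed:

    SELECT id, CAST(CONCAT(lat, ',', lng) AS CHAR) AS coord FROM locations;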