another spellchecker question
hi :)

I've noticed that (with solr 1.2) the returned order (as well as the actual matched set) is affected by the number of matches you ask for:

  q=hanna&suggestionCount=1  suggestions:[Yanna]
  q=hanna&suggestionCount=2  suggestions:[Manna, Yanna]
  q=hanna&suggestionCount=5  suggestions:[Manna, Nanna, Sanna, Vanna, Shanna]

note how the #1 result is completely missing from the top 5... or at least that's how I _used_ to think about the sets :)

unfortunately, extendedresults seems to be a 1.3-only option, so I can't see what's going on here. but I guess I'm asking if this is expected behavior.

--Geoff
Re: another spellchecker question
Hi Geoffrey,

Yes, this is a caveat in the Lucene contrib spellchecker, which Solr uses. From the Lucene spell checker javadocs:

 * <p>As the Lucene similarity that is used to fetch the most relevant n-grammed terms
 * is not the same as the edit distance strategy used to calculate the best
 * matching spell-checked word from the hits that Lucene found, one usually has
 * to retrieve a couple of numSug's in order to get the true best match.
 *
 * <p>I.e. if numSug == 1, don't count on that suggestion being the best one.
 * Thus, you should set this value to <b>at least</b> 5 for a good suggestion.

Therefore what you're seeing is by design. Probably we should change the default number of suggestions when querying the Lucene spellchecker to 5 and give back the top result if the user asks for only one suggestion from Solr.

On Wed, Apr 23, 2008 at 5:58 PM, Geoffrey Young wrote:
> I've noticed that (with solr 1.2) the returned order (as well as the actual matched set) is affected by the number of matches you ask for [...] I guess I'm asking if this is expected behavior.

--
Regards,
Shalin Shekhar Mangar.
Re: another spellchecker question
Shalin Shekhar Mangar wrote:
> Yes, this is a caveat in the lucene contrib spellchecker which Solr uses. [...] Therefore what you're seeing is by design. Probably we should change the default number of suggestions when querying lucene spellchecker to 5 and give back the top result if the user asks for only one suggestion from solr.

great, thanks for all that - I'm still trying to figure out where all the relevant docs live. you've been really helpful.

--Geoff
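The idea discussed in this thread (ask for several candidates, then pick the closest one yourself) can be sketched as follows. This is a minimal, standalone illustration and not the Lucene or Solr implementation: the class `SuggestionRanker`, its method names, and the candidate words are all made up for the example, and plain case-insensitive Levenshtein distance stands in for whatever distance measure the spellchecker actually applies.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Lucene or Solr.
final class SuggestionRanker {

    /** Case-insensitive Levenshtein distance, computed with two rolling rows. */
    static int editDistance(String a, String b) {
        String s = a.toLowerCase(Locale.ROOT);
        String t = b.toLowerCase(Locale.ROOT);
        int[] prev = new int[t.length() + 1];
        int[] curr = new int[t.length() + 1];
        for (int j = 0; j <= t.length(); j++) {
            prev[j] = j; // distance from the empty prefix of s
        }
        for (int i = 1; i <= s.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= t.length(); j++) {
                int cost = s.charAt(i - 1) == t.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[t.length()];
    }

    /** Re-ranks an over-fetched candidate list and keeps the closest `wanted` entries. */
    static List<String> best(String word, List<String> candidates, int wanted) {
        List<String> sorted = new ArrayList<>(candidates);
        // List.sort is stable, so ties keep the order the candidates arrived in.
        sorted.sort(Comparator.comparingInt((String c) -> editDistance(word, c)));
        return new ArrayList<>(sorted.subList(0, Math.min(wanted, sorted.size())));
    }

    public static void main(String[] args) {
        // Made-up candidates, standing in for a request with suggestionCount=5 or more.
        List<String> fetched = Arrays.asList("Shannon", "Hannah", "Banana", "Hanna");
        System.out.println(best("hanna", fetched, 1));
    }
}
```

On the client side this means requesting five or more suggestions even when only one is displayed, then trimming the list after re-ranking.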
Solr multicore admin JSP problem on tomcat
I have successfully set up a Solr multicore configuration on Apache Tomcat 5.5 (Solaris 9, JDK 5), using the 4/21/2008 nightly build. At present I have two cores defined, and I can index and search documents on both of them using the Java client.

I'm having a minor issue with the Admin interface, and I think I may have missed a configuration step. Here is a description of the error:

1. I use the following URL to successfully browse to the Admin interface of one of the cores: http://devbox:8080/solr/solrtest/admin/
2. On the resulting page, I click on the [SCHEMA] link.
3. This results in a 404 error. The link to this page is http://devbox:8080/solr/solrtest/admin/file/?file=schema.xml
4. If I change the link to http://devbox:8080/solr/solrtest/admin/get-file.jsp?file=schema.xml, the schema XML is displayed properly.

The same problem happens for the [CONFIG] link. Can someone please advise me on how to fix the issue?

Thanks
Suman
SOLR-470 default value in schema with NOW
So I just ran into this bug: https://issues.apache.org/jira/browse/SOLR-470 and read about this related one: https://issues.apache.org/jira/browse/SOLR-544

Here is the relevant trace:

Apr 22, 2008 10:59:01 PM org.apache.solr.common.SolrException log
SEVERE: java.lang.RuntimeException: java.text.ParseException: Unparseable date: 2008-04-03T22:42:13Z
    at org.apache.solr.schema.DateField.toObject(DateField.java:173)
    at org.apache.solr.schema.DateField.toObject(DateField.java:83)
    at org.apache.solr.update.DocumentBuilder.loadStoredFields(DocumentBuilder.java:285)
    ...
Caused by: java.text.ParseException: Unparseable date: 2008-04-03T22:42:1
    at java.text.DateFormat.parse(Unknown Source)

The root cause (I believe; I am going to confirm tonight) is that I have multiple index files I'm uploading into this column in the schema:

    <field name="timestamp_created" type="date" indexed="true" stored="true" required="true" multiValued="false" default="NOW" />

Here is my typedef for 'date':

    <fieldType name="date" class="solr.DateField" sortMissingLast="true" omitNorms="true"/>

What I came to realize is that most of my index files specify this column value consistently, but one of my files does not contain the column at all. Because I specified a default value, I am reliant on the SOLR default for NOW being in the same format (no millis, .0, .00, .000, etc.) as the values I pass in my feed. As you can see from the exception, my feed does not contain any millis, which is a valid format according to SOLR-544 and the documentation I've read.

Now finally, my problem. The format for NOW doesn't seem to be documented, so I have no idea what I need to 'match' in order to take advantage of the default value feature and mix that with data from my streams (nor, outside these 2 bugs, does the documentation say that matching is necessary at all). I can see from here that it isn't the 'no millis' form, since a discrepancy is triggering this bug.

Solutions?

A) Should I create a format normalizer and configure that into my typedef for 'date', so that I am agnostic of these differences in terms of input and ensure the indexed format is consistent? I believe this would be an <analyzer type="index"><filter .../></analyzer>. I'm not concerned about the presence or absence of millis on the output. Would this approach work? Based on the presence of the filter in the fieldType, it feels like a hack.

B) Should I remove the default value and just ensure all my streams have this value specified consistently, and so not trigger the bug? It seems to me that SOLR should be robust in this respect, but reading SOLR-544 I can see that this isn't an opinion that is held by all.

C) Should I apply one of the existing SOLR-470 patch files and move on?

D) Should I take a stab at https://issues.apache.org/jira/browse/SOLR-440 as an alternative 'class' for my 'date' type?

Thanks,
Brian
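One way to make the feed values consistent before they reach Solr is to normalize them on the client side. The sketch below is only an illustration of that idea, under stated assumptions: `SolrDateNormalizer` is a hypothetical class name, it relies on `java.time` (Java 8 or later, so newer than the JDK mentioned in these threads), and it is not a Solr analyzer or filter. It does nothing about the format Solr uses for `NOW`, so it is not confirmed to avoid SOLR-470; it only guarantees that every value you send has the same no-millis shape.

```java
import java.time.Instant;
import java.time.format.DateTimeFormatter;
import java.time.temporal.ChronoUnit;

// Hypothetical client-side helper; not part of Solr.
final class SolrDateNormalizer {

    /**
     * Accepts a UTC timestamp with or without fractional seconds
     * (for example ...13Z, ...13.0Z, ...13.000Z) and re-emits it
     * truncated to whole seconds, without a fraction.
     */
    static String normalize(String raw) {
        Instant parsed = Instant.parse(raw.trim());
        return DateTimeFormatter.ISO_INSTANT.format(parsed.truncatedTo(ChronoUnit.SECONDS));
    }

    public static void main(String[] args) {
        System.out.println(normalize("2008-04-03T22:42:13.000Z"));
    }
}
```

Truncating (rather than rounding) is a design choice made here for simplicity; if sub-second precision matters for the index, the fraction would have to be kept and padded consistently instead.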
Re: Highlighted field gets truncated
On 22-Apr-08, at 6:00 PM, Christian Wittern wrote:
> Mike Klaas wrote:
>> On 19-Apr-08, at 3:02 AM, Christian Wittern wrote:
>>> So it could be that the match is not part of the fragment? This sounds a bit strange. Is there a way to make sure the fragment contains the match other than returning the whole field and doing the fragmenting myself?
>>
>> [...] As you can see, only fragments containing a match are returned (note that there are very often multiple matches--you seemed to assume only one).
>
> Mike, thank you for the clarification. Now I understand what went wrong in the example I looked at. I am querying ngram indexed data (Chinese text). A user enters two or three characters and expects them to be matched more or less as a substring match. The fragment I looked at contained only one of the characters (the other was cut off at the end); this is what made me wonder. From what you say, even adding quotation marks around the query will not prevent this from happening (in this case, it would simply obscure the match).
>
> Are there any plans to improve the algorithm for fragmentation? Or are there other workarounds?

LUCENE-794 contains an implementation that solves this problem. My plan is to eventually integrate this into Solr, but I don't see myself having time for this in the short or medium term. Contributions welcome :)

-Mike
MoreLikeThis patch to support boost factor
This is a patch I made to be able to boost the terms by a specific factor, in addition to the relevancy returned by MoreLikeThis. This is helpful when there is more than one MoreLikeThis in the query, so that words in field A (e.g. Title) can be boosted more than words in field B (e.g. Description).

Any feedback?

Jonathan

Index: /home/developer/workspace/lucene/contrib/queries/src/java/org/apache/lucene/search/similar/MoreLikeThis.java
===
--- /home/developer/workspace/lucene/contrib/queries/src/java/org/apache/lucene/search/similar/MoreLikeThis.java (revision 651048)
+++ /home/developer/workspace/lucene/contrib/queries/src/java/org/apache/lucene/search/similar/MoreLikeThis.java (working copy)
@@ -284,6 +284,11 @@
     private final IndexReader ir;
 
     /**
+     * Boost factor to use when boosting the terms
+     */
+    private int boostFactor = 1;
+
+    /**
      * Constructor requiring an IndexReader.
      */
     public MoreLikeThis(IndexReader ir) {
@@ -574,7 +579,7 @@
                 }
                 float myScore = ((Float) ar[2]).floatValue();
 
-                tq.setBoost(myScore / bestScore);
+                tq.setBoost(boostFactor * myScore / bestScore);
             }
 
             try {
@@ -921,6 +926,22 @@
                 x = 1;
             }
         }
+
+    /**
+     * Returns the boost factor used when boosting terms
+     * @return the boost factor used when boosting terms
+     */
+    public int getBoostFactor() {
+        return boostFactor;
+    }
+
+    /**
+     * Sets the boost factor to use when boosting terms
+     * @param boostFactor
+     */
+    public void setBoostFactor(int boostFactor) {
+        this.boostFactor = boostFactor;
+    }
 
 
 }
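To see what the changed `setBoost` line in the patch does numerically, here is a small standalone sketch. It only reproduces the arithmetic `boostFactor * myScore / bestScore` from the diff; the class `BoostFactorSketch`, the method `termBoost`, and the sample scores and factors are invented for illustration and do not come from Lucene.

```java
// Hypothetical illustration of the patched expression; not Lucene code.
final class BoostFactorSketch {

    /** Same expression as the patched line: boostFactor * myScore / bestScore. */
    static float termBoost(int boostFactor, float myScore, float bestScore) {
        return boostFactor * myScore / bestScore;
    }

    public static void main(String[] args) {
        float bestScore = 4.0f; // score of the top-ranked term (made-up value)
        float myScore = 2.0f;   // score of the current term (made-up value)

        // The same term, once from a field boosted by 3 and once from an unboosted field.
        System.out.println("factor 3: " + termBoost(3, myScore, bestScore));
        System.out.println("factor 1: " + termBoost(1, myScore, bestScore));
    }
}
```

With the default factor of 1 the expression reduces to the original `myScore / bestScore`, so existing callers that never call `setBoostFactor` should see unchanged boosts.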
Re: MoreLikeThis patch to support boost factor
Hi Jonathan,

Could you put this in a new JIRA issue? Do you also have a unit test one could run to see how/that this works?

Thanks,
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message -----
From: Jonathan Ariel
To: solr-user@lucene.apache.org
Sent: Wednesday, April 23, 2008 4:52:19 PM
Subject: MoreLikeThis patch to support boost factor

> This is a patch I made to be able to boost the terms with a specific factor beside the relevancy returned by MoreLikeThis. [...] Any feedback?
>
> Jonathan
Re: Got parseException when search keyword AND on a text field
Otis,

Thanks for the reply. Is there a list of words that have special meaning?

Thanks
Xuesong

Re: Got parseException when search keyword AND on a text field
Otis Gospodnetic
Fri, 18 Apr 2008 18:39:45 -0700

> Xuesong,
>
> AND has a special meaning - it is a boolean AND when capitalized. That is why you are getting an error - the query parser doesn't know what to do with just AND for a query.
>
> Otis
Re: MoreLikeThis patch to support boost factor
Yes. Sure. I'll do that. Just wanted some feedback before posting it. As soon as I do it I'll post the issue number.

Thanks!

On Wed, Apr 23, 2008 at 6:39 PM, Otis Gospodnetic wrote:
> Hi Jonathan,
>
> Could you put this in a new JIRA issue? Do you also have a unit test one could run to see how/that this works?
>
> Thanks,
> Otis
Re: Got parseException when search keyword AND on a text field
Not documented in one place. The place to look is the query parsers, but things like AND, OR, NOT, and TO are the ones to look out for.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message -----
From: Xuesong Luo
To: solr-user@lucene.apache.org
Sent: Wednesday, April 23, 2008 8:45:24 PM
Subject: Re: Got parseException when search keyword AND on a text field

> Otis,
>
> Thanks for the reply. Is there a list of words that have special meaning?
>
> Thanks
> Xuesong
Re: Got parseException when search keyword AND on a text field
Oh come on Otis, give our Solr wiki and Lucene documentation some kudos here! :)

I think this stuff is pretty well documented starting here: http://wiki.apache.org/solr/SolrQuerySyntax

Not to mention that dusty ol' book on Lucene...

Erik

On Apr 23, 2008, at 9:28 PM, Otis Gospodnetic wrote:
> Not in one place and documented. The place to look are query parsers, but things like AND OR NOT TO are the ones to look out for.
>
> Otis
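As noted in this thread, the operator words only have their special meaning when capitalized. If user input should be treated as plain keywords, one possible client-side workaround is to lowercase those words before building the query. The sketch below assumes exactly that use case: `OperatorWordGuard` and `neutralize` are hypothetical names, the word list is just the four words mentioned in the thread (AND, OR, NOT, TO), and it deliberately disables boolean operators, so it is unsuitable when users are meant to type them. It does not handle other special syntax such as `+`, `-`, quotes, or brackets.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical pre-processing helper; not part of Solr or Lucene.
final class OperatorWordGuard {

    // The capitalized words named in the thread as ones to look out for.
    private static final Set<String> OPERATORS =
            new HashSet<>(Arrays.asList("AND", "OR", "NOT", "TO"));

    /** Lowercases standalone operator words so they are parsed as ordinary terms. */
    static String neutralize(String userInput) {
        StringBuilder out = new StringBuilder();
        for (String token : userInput.trim().split("\\s+")) {
            if (out.length() > 0) {
                out.append(' ');
            }
            // Only exact, fully capitalized matches are touched; "Andy" or "TOP" pass through.
            out.append(OPERATORS.contains(token) ? token.toLowerCase(Locale.ROOT) : token);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(neutralize("AND"));
        System.out.println(neutralize("cats AND dogs"));
    }
}
```

Whether the lowercased word then matches anything depends on the field's analyzer, for example on whether "and" is configured as a stop word.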