another spellchecker question

2008-04-23 Thread Geoffrey Young

hi :)

I've noticed that (with solr 1.2) the returned order (as well as the 
actual matched set) is affected by the number of matches you ask for:


  q=hanna&suggestionCount=1
  suggestions: [Yanna]

  q=hanna&suggestionCount=2
  suggestions: [Manna, Yanna]

  q=hanna&suggestionCount=5
  suggestions: [Manna, Nanna, Sanna, Vanna, Shanna]

note how the #1 result is completely missing from the top 5... or at 
least that's how I _used_ to think about the sets :)


unfortunately, extendedresults seems to be a 1.3-only option, so I can't 
see what's going on here.  but I guess I'm asking if this is expected 
behavior.


--Geoff


Re: another spellchecker question

2008-04-23 Thread Shalin Shekhar Mangar
Hi Geoffrey,
Yes, this is a caveat in the Lucene contrib spellchecker, which Solr uses.
From the Lucene spell checker javadocs:

   * <p>As the Lucene similarity that is used to fetch the most relevant n-grammed terms
   * is not the same as the edit distance strategy used to calculate the best
   * matching spell-checked word from the hits that Lucene found, one usually has
   * to retrieve a couple of numSug's in order to get the true best match.
   *
   * <p>I.e. if numSug == 1, don't count on that suggestion being the best one.
   * Thus, you should set this value to <b>at least</b> 5 for a good suggestion.

Therefore what you're seeing is by design. We should probably change the
default number of suggestions requested from the Lucene spellchecker to 5 and
return only the top result when the user asks Solr for a single suggestion.
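
The over-fetch-then-re-rank idea can be sketched as follows. This is a minimal illustration, not the Lucene SpellChecker code: the class name, the helper methods, and the candidate words are all hypothetical, and plain Levenshtein distance stands in for whatever distance the real spellchecker applies.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper: re-ranks an over-fetched candidate list by edit distance.
class SuggestionReranker {

    // Classic dynamic-programming Levenshtein distance between two strings.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Sorts the candidates by distance to the query word and keeps the best `wanted`.
    // List.sort is stable, so ties keep the order in which the candidates arrived.
    static List<String> best(String word, List<String> candidates, int wanted) {
        String query = word.toLowerCase();
        List<String> sorted = new ArrayList<>(candidates);
        sorted.sort(Comparator.comparingInt(c -> editDistance(query, c.toLowerCase())));
        return sorted.subList(0, Math.min(wanted, sorted.size()));
    }

    public static void main(String[] args) {
        // Made-up candidates standing in for what an n-gram lookup might return.
        List<String> fetched = Arrays.asList("shannon", "hannah", "anna", "hanna");
        System.out.println(best("hana", fetched, 1));
    }
}
```

In this sketch the closest word arrives last in the fetched list, so asking the lookup for a single candidate would have missed it. Fetching several and re-ranking before trimming to one avoids that.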





-- 
Regards,
Shalin Shekhar Mangar.


Re: another spellchecker question

2008-04-23 Thread Geoffrey Young





great, thanks for all that - I'm still trying to figure out where all 
the relevant docs live.  you've been really helpful.


--Geoff


Solr multicore admin JSP problem on tomcat

2008-04-23 Thread Suman Ghosh
I have successfully set up a Solr multicore configuration on Apache Tomcat
5.5 (Solaris 9, JDK 5). I used the 4/21/2008 nightly build for this purpose.
At present, I have two cores defined. I can index and search documents on
both these cores using the java client.

I'm having a minor issue on the Admin interface and I think I might have
missed some configuration steps causing this error. Here is the description
of the error:

1. I use the following URL to successfully browse to the Admin interface of
one of the cores:

http://devbox:8080/solr/solrtest/admin/

2. On the resulting page, I click on the link [SCHEMA]

3. This results in a 404 error. The link to this page is
http://devbox:8080/solr/solrtest/admin/file/?file=schema.xml

4. If I change the link to
http://devbox:8080/solr/solrtest/admin/get-file.jsp?file=schema.xml, the
schema xml is displayed properly.

The same problem happens for the [CONFIG] link.

Can someone please advise me how to fix the issue?

Thanks
Suman


SOLR-470 default value in schema with NOW

2008-04-23 Thread Brian Johnson
So I just ran into this bug:
https://issues.apache.org/jira/browse/SOLR-470

and read about this related one:
https://issues.apache.org/jira/browse/SOLR-544

Here is the relevant trace:

Apr 22, 2008 10:59:01 PM org.apache.solr.common.SolrException log
SEVERE: java.lang.RuntimeException: java.text.ParseException: Unparseable date: 2008-04-03T22:42:13Z
    at org.apache.solr.schema.DateField.toObject(DateField.java:173)
    at org.apache.solr.schema.DateField.toObject(DateField.java:83)
    at org.apache.solr.update.DocumentBuilder.loadStoredFields(DocumentBuilder.java:285)
    ...
Caused by: java.text.ParseException: Unparseable date: 2008-04-03T22:42:1
    at java.text.DateFormat.parse(Unknown Source)

The root cause (I believe, am going to confirm tonight) is that I have multiple 
index files I'm uploading into this column in the schema:
   <field name="timestamp_created" type="date" indexed="true" stored="true"
          required="true" multiValued="false" default="NOW" />

Here is my typedef for 'date':
   <fieldType name="date" class="solr.DateField" sortMissingLast="true"
              omitNorms="true"/>


What I came to realize is that my index files contain this column value 
consistently specified, but one of my files does not contain the column at all. 
Due to my indication of a default value, I am reliant on the SOLR default for 
NOW being in the same format (no millis, .0, .00, .000, etc) as I have passed 
in my feed. As you can see from the exception, my feed does not contain any 
millis which is a valid format according to 544 and the documentation I've 
read. 

Now finally, my problem. The format for NOW doesn't seem to be documented so I 
have no idea what I need to 'match' (or even that matching is necessary from 
the documentation outside these 2 bugs) in order to take advantage of the 
default value feature and mix that with data from my streams. I can see from 
here that it isn't the 'no millis' form since a discrepancy is triggering this 
bug. 

Solutions?

A) Should I create a format normalizer and configure that into my typedef for 
'date' so that I am agnostic of these differences in terms of input and ensure 
the indexed format is consistent? I believe this would be an 
<analyzer type="index"><filter .../></analyzer>. I'm not concerned about the 
presence or absence of millis on the output. Would this approach work? Based on 
the presence of the filter in the fieldType, it feels like a hack.

B) Should I remove the default value and just ensure all my streams have this 
value specified consistently and not trigger the bug? It seems to me that SOLR 
should be robust in this respect, but reading SOLR-544 I can see that this 
isn't an opinion that is held by all.

C) Should I apply one of the existing SOLR-470 patch files and move on?

D) Should I take a stab at https://issues.apache.org/jira/browse/SOLR-440 as an 
alternative 'class' for my 'date' type?
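
For option B, one way to keep a feed consistent is to normalize every timestamp on the client before posting it. The sketch below is an assumption-laden illustration using only the JDK. It is not Solr's DateField logic, it does not claim to reproduce the format that NOW expands to, and the class and method names are made up.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Hypothetical client-side helper: rewrites UTC timestamps into one fixed form.
class DateNormalizer {

    // Always emits exactly three fractional digits and a literal trailing Z.
    private static final DateTimeFormatter FIXED_MILLIS =
            DateTimeFormatter.ofPattern("uuuu-MM-dd'T'HH:mm:ss.SSS'Z'").withZone(ZoneOffset.UTC);

    // Accepts ISO-8601 instants with or without a fractional part.
    static String normalize(String raw) {
        Instant instant = Instant.parse(raw.trim());
        return FIXED_MILLIS.format(instant);
    }

    public static void main(String[] args) {
        System.out.println(normalize("2008-04-03T22:42:13Z"));
        System.out.println(normalize("2008-04-03T22:42:13.5Z"));
    }
}
```

Whether the millisecond form or the no-millis form is the right target depends on which one the installed DateField accepts, so the output pattern here is only a placeholder to adjust.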

Thanks,

Brian





Re: Highlighted field gets truncated

2008-04-23 Thread Mike Klaas

On 22-Apr-08, at 6:00 PM, Christian Wittern wrote:

> Mike Klaas wrote:
>
>> On 19-Apr-08, at 3:02 AM, Christian Wittern wrote:
>>> So it could be that the match is not part of the fragment?  This
>>> sounds a bit strange.  Is there a way to make sure the fragment
>>> contains the match other than returning the whole field and doing
>>> the fragmenting myself?
>>
>> [...]
>> As you can see, only fragments containing a match are returned
>> (note that there are very often multiple matches--you seemed to
>> assume only one).
>
> Mike, thank you for the clarification.  Now I understand what went
> wrong in the example I looked at.  I am querying ngram-indexed
> data (Chinese text).  A user enters two or three characters and
> expects them to be matched more or less as a substring match.  The
> fragment I looked at contained only one of the characters (the
> other was cut off at the end), which is what made me wonder.
> From what you say, even adding quotation marks around the query will
> not prevent this from happening (in this case, it would simply
> obscure the match).
> Are there any plans to improve the fragmentation algorithm?  Or
> are there other workarounds?


LUCENE-794 contains an implementation that solves this problem.  My  
plan is to integrate this into Solr eventually, but I don't see  
myself having time for it in the short or medium term.


Contributions welcome :)

-Mike


MoreLikeThis patch to support boost factor

2008-04-23 Thread Jonathan Ariel
This is a patch I made to be able to boost the terms by a specific factor
in addition to the relevancy returned by MoreLikeThis. This is helpful when
having more than one MoreLikeThis in the query, so that words in field A
(e.g. Title) can be boosted more than words in field B (e.g. Description).

Any feedback?

Jonathan
Index: /home/developer/workspace/lucene/contrib/queries/src/java/org/apache/lucene/search/similar/MoreLikeThis.java
===================================================================
--- /home/developer/workspace/lucene/contrib/queries/src/java/org/apache/lucene/search/similar/MoreLikeThis.java	(revision 651048)
+++ /home/developer/workspace/lucene/contrib/queries/src/java/org/apache/lucene/search/similar/MoreLikeThis.java	(working copy)
@@ -284,6 +284,11 @@
 private final IndexReader ir;
 
 /**
+ * Boost factor to use when boosting the terms
+ */
+private int boostFactor = 1;
+
+/**
  * Constructor requiring an IndexReader.
  */
 public MoreLikeThis(IndexReader ir) {
@@ -574,7 +579,7 @@
 }
 float myScore = ((Float) ar[2]).floatValue();
 
-tq.setBoost(myScore / bestScore);
+tq.setBoost(boostFactor * myScore / bestScore);
 }
 
 try {
@@ -921,6 +926,22 @@
 x = 1;
 }
 }
+
+/**
+ * Returns the boost factor used when boosting terms
+ * @return the boost factor used when boosting terms
+ */
+   public int getBoostFactor() {
+   return boostFactor;
+   }
+
+   /**
+* Sets the boost factor to use when boosting terms
+* @param boostFactor
+*/
+   public void setBoostFactor(int boostFactor) {
+   this.boostFactor = boostFactor;
+   }
 
 
 }
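
The effect of the patched expression can be seen without Lucene on the classpath. The sketch below only mirrors the arithmetic `boostFactor * myScore / bestScore` from the diff. The class, the helper methods, and the term scores are hypothetical, and taking the maximum score as `bestScore` is an assumption made for the illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for the boost arithmetic in the patch; not Lucene code.
class BoostSketch {

    // Same expression as the patched line: boostFactor * myScore / bestScore.
    static float termBoost(int boostFactor, float myScore, float bestScore) {
        return boostFactor * myScore / bestScore;
    }

    // Scales each term of one field relative to that field's highest score.
    static Map<String, Float> boostsFor(Map<String, Float> termScores, int boostFactor) {
        float best = 0f;
        for (float score : termScores.values()) {
            best = Math.max(best, score);
        }
        Map<String, Float> boosts = new LinkedHashMap<>();
        for (Map.Entry<String, Float> e : termScores.entrySet()) {
            boosts.put(e.getKey(), termBoost(boostFactor, e.getValue(), best));
        }
        return boosts;
    }

    public static void main(String[] args) {
        // Made-up term scores for two fields with identical relevancy.
        Map<String, Float> scores = new LinkedHashMap<>();
        scores.put("solr", 4.0f);
        scores.put("lucene", 2.0f);

        // Title terms get factor 3, description terms keep the default factor 1.
        System.out.println("title:       " + boostsFor(scores, 3));
        System.out.println("description: " + boostsFor(scores, 1));
    }
}
```

With the default factor of 1 the boosts stay in the 0..1 range, as before the patch. A larger factor lifts every term of that field by the same multiple, which is what lets one field outweigh another when several MoreLikeThis queries are combined.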


Re: MoreLikeThis patch to support boost factor

2008-04-23 Thread Otis Gospodnetic
Hi Jonathan,
Could you put this in a new JIRA issue?  Do you also have a unit test one could 
run to see how/that this works?

Thanks,
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch





Re: Got parseException when search keyword AND on a text field

2008-04-23 Thread Xuesong Luo
Otis, Thanks for the reply. Is there a list of words that have special
meaning? 

Thanks
Xuesong  
 


Re: Got parseException when search keyword AND on a text field
Otis Gospodnetic
Fri, 18 Apr 2008 18:39:45 -0700

Xuesong,

AND has a special meaning - it is a boolean AND when capitalized.  That is
why you are getting an error - the query parser doesn't know what to do with
just AND for a query.

Otis 


Re: MoreLikeThis patch to support boost factor

2008-04-23 Thread Jonathan Ariel
Yes. Sure. I'll do that. Just wanted some feedback before posting it. As
soon as I do it I'll post the issue number.
Thanks!




Re: Got parseException when search keyword AND on a text field

2008-04-23 Thread Otis Gospodnetic
Not documented in one place.  The places to look are the query parsers, but 
words like AND, OR, NOT and TO are the ones to look out for.
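
If user input should always be treated as plain text, one workaround is to quote those words before they reach the parser. This is a minimal sketch under stated assumptions: the word list is just the four words named above, the class and method are hypothetical, and it relies on the parser treating double-quoted text as a phrase rather than an operator.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical pre-processing helper; not part of Solr or Lucene.
class OperatorWords {

    // Illustrative list taken from the thread, not an exhaustive set.
    private static final Set<String> RESERVED =
            new HashSet<>(Arrays.asList("AND", "OR", "NOT", "TO"));

    // Wraps bare, capitalized operator words in double quotes; leaves other tokens alone.
    static String quoteReserved(String userInput) {
        StringBuilder out = new StringBuilder();
        for (String token : userInput.trim().split("\\s+")) {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(RESERVED.contains(token) ? "\"" + token + "\"" : token);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(quoteReserved("AND"));
        System.out.println(quoteReserved("salt and pepper"));
    }
}
```

This deliberately disables boolean operators for the rewritten input, so it only suits search boxes where users are not expected to type query syntax. A stopword filter in the field's analyzer may still drop the quoted word at search time.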


Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch





Re: Got parseException when search keyword AND on a text field

2008-04-23 Thread Erik Hatcher
Oh come on Otis, give our Solr wiki and Lucene documentation some  
kudos here! :)  I think this stuff is pretty well documented starting  
here:


http://wiki.apache.org/solr/SolrQuerySyntax

Not to mention that dusty ol' book on Lucene...

Erik
