[jira] Commented: (SOLR-81) Add Query Spellchecker functionality

2007-07-12 Thread Thomas Peuss (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-81?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12511989
 ] 

Thomas Peuss commented on SOLR-81:
--

Hello Otis!

What happened to the TokenFilters included in the patch? They are in the patch 
but in trunk I don't see them.

CU
Thomas

 Add Query Spellchecker functionality
 

 Key: SOLR-81
 URL: https://issues.apache.org/jira/browse/SOLR-81
 Project: Solr
  Issue Type: New Feature
  Components: search
Reporter: Otis Gospodnetic
Priority: Minor
 Attachments: hoss.spell.patch, SOLR-81-edgengram-ngram.patch, 
 SOLR-81-ngram-schema.patch, SOLR-81-ngram.patch, SOLR-81-ngram.patch, 
 SOLR-81-ngram.patch, SOLR-81-ngram.patch, SOLR-81-spellchecker.patch, 
 SOLR-81-spellchecker.patch, SOLR-81-spellchecker.patch


 Use the simple approach of n-gramming outside of Solr and indexing n-gram 
 documents.  For example:
 <doc>
   <field name="word">lettuce</field>
   <field name="start3">let</field>
   <field name="gram3">let ett ttu tuc uce</field>
   <field name="end3">uce</field>
   <field name="start4">lett</field>
   <field name="gram4">lett ettu ttuc tuce</field>
   <field name="end4">tuce</field>
 </doc>
 See:
 http://www.mail-archive.com/solr-user@lucene.apache.org/msg01254.html
 Java clients: SOLR-20 (add delete commit optimize), SOLR-30 (search)
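
As a rough illustration of the document layout described above, the per-word fields could be derived along the following lines. This is only a sketch: the class and method names (NGramDocSketch, grams, fields) are invented for the example and do not come from any patch attached to this issue.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper; mirrors the field layout in the issue description. */
final class NGramDocSketch {

    /** All contiguous substrings of length n, in left-to-right order. */
    static List<String> grams(String word, int n) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + n <= word.length(); i++) {
            out.add(word.substring(i, i + n));
        }
        return out;
    }

    /** Field name -> value for one word, for gram sizes minGram..maxGram. */
    static Map<String, String> fields(String word, int minGram, int maxGram) {
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("word", word);
        for (int n = minGram; n <= maxGram; n++) {
            List<String> g = grams(word, n);
            if (g.isEmpty()) {
                continue; // word is shorter than n, so no fields for this size
            }
            doc.put("start" + n, g.get(0));
            doc.put("gram" + n, String.join(" ", g));
            doc.put("end" + n, g.get(g.size() - 1));
        }
        return doc;
    }

    public static void main(String[] args) {
        fields("lettuce", 3, 4).forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

Calling fields("lettuce", 3, 4) produces the word, start3/gram3/end3 and start4/gram4/end4 entries in that order, which a client would then post to Solr as one document.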

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-81) Add Query Spellchecker functionality

2007-03-23 Thread Ryan McKinley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-81?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12483795
 ] 

Ryan McKinley commented on SOLR-81:
---

  * can't do relative path to dataDir, because we can't getdataDir,
because SolrCore isn't done initializing yet.

With SOLR-182, SolrCore gets initialized first, so we could use relative paths 
during handler initialization.





[jira] Commented: (SOLR-81) Add Query Spellchecker functionality

2007-03-23 Thread Otis Gospodnetic (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-81?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12483804
 ] 

Otis Gospodnetic commented on SOLR-81:
--

I haven't applied this and tried it, but I looked at the patch, and like the 
changes.  The only issues I could spot are 3 typos that we can clean up later.




[jira] Commented: (SOLR-81) Add Query Spellchecker functionality

2007-03-23 Thread Otis Gospodnetic (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-81?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12483805
 ] 

Otis Gospodnetic commented on SOLR-81:
--

Hoss, another possibly interesting and useful addition:

Make use of public RAMDirectory(Directory dir) and allow one to specify that 
even though the spellchecker index exists in FS, use it only to pull it into a 
RAMDir-based index.  Might not be a huge win because most spellchecker indices 
are probably pretty small and easily fit in RAM already, even when they are 
FSDir-based, but I thought I'd mention it anyway.







[jira] Commented: (SOLR-81) Add Query Spellchecker functionality

2007-03-16 Thread Otis Gospodnetic (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-81?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12481738
 ] 

Otis Gospodnetic commented on SOLR-81:
--

This is in SVN now, but I'm going to leave this open for another week, in case 
Hoss, Adam, or anyone else finds any issues.





[jira] Commented: (SOLR-81) Add Query Spellchecker functionality

2007-03-16 Thread Otis Gospodnetic (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-81?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12481759
 ] 

Otis Gospodnetic commented on SOLR-81:
--

There is a useless (I think) static IndexReader in there:
private static IndexReader reader = null;

If we set this to some real IndexReader, we can get the SpellChecker to act as 
follows (from its coffeedocs):


   * @param ir the indexReader of the user index (can be null see field param)
   * @param field String the field of the user index: if field is not null, the
   * suggested words are restricted to the words present in this field.
   * @param morePopular boolean return only the suggest words that are more
   * frequent than the searched word
   * (only if restricted mode = (indexReader!=null and field!=null)

  public String[] suggestSimilar(String word, int numSug, IndexReader ir,
  String field, boolean morePopular) throws IOException {

So, should we do this on init:
  reader = req.getSearcher().getReader();
?
Or maybe add a new param to solrconfig.xml's declaration of the 
SpellCheckerRequestHandler that turns this on/off?

Thoughts?
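
To make the "restricted mode" from the quoted docs concrete, here is a small model of the filtering they describe. It is a sketch under assumptions, not Lucene's implementation: a plain Map of document frequencies stands in for the IndexReader plus field pair, and the names MorePopularSketch and restrict are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical model of the documented behaviour; not the Lucene code. */
final class MorePopularSketch {

    /**
     * @param candidates   suggestions already found by the spellchecker
     * @param word         the word the user searched for
     * @param fieldDocFreq term -> document frequency in the user index field;
     *                     null models "no reader/field given" (unrestricted)
     * @param morePopular  keep only candidates more frequent than word
     */
    static List<String> restrict(List<String> candidates, String word,
                                 Map<String, Integer> fieldDocFreq, boolean morePopular) {
        if (fieldDocFreq == null) {
            return new ArrayList<>(candidates); // unrestricted mode
        }
        int wordFreq = fieldDocFreq.getOrDefault(word, 0);
        List<String> kept = new ArrayList<>();
        for (String c : candidates) {
            int freq = fieldDocFreq.getOrDefault(c, 0);
            if (freq < 1) {
                continue; // not present in the field at all
            }
            if (morePopular && freq <= wordFreq) {
                continue; // not more frequent than the searched word
            }
            kept.add(c);
        }
        return kept;
    }
}
```

With a real reader wired in, the same switch could be exposed as a handler parameter, which is roughly what the on/off option in solrconfig.xml would control.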





[jira] Commented: (SOLR-81) Add Query Spellchecker functionality

2007-03-13 Thread Otis Gospodnetic (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-81?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12480409
 ] 

Otis Gospodnetic commented on SOLR-81:
--

Adam:
Have you started making the changes that Hoss proposed here?  Please let me 
know (today, if you can).  If you have not started, I'll make the changes.  If 
you've started, I'll hold off.

Hoss & Adam:

1) out with tokenizer factories - right, they are no longer needed.

2) I'll stick to the absolute path for now, get that in SVN, and then we can 
add support for other things... unless you show me an example of how easy it is 
to support other paths/locations

3) merging the handlers sounds ok:
  to get suggestions: ...?qt=spellchecker&cmd=suggest 
  to completely rebuild: ...?qt=spellchecker&cmd=rebuild
OK?
The use-case here is to rebuild the index every once in a while, *not* on every 
change of the main index.

4) I'll leave that for later, as I don't completely understand you there.

5) ok, no static SpellChecker

6) ok, sounds like we just need to remove the wrapping <lst name="invariants"> 
element

7) I actually liked having a separate example doc for demonstrating just the 
spellchecker functionality -- you don't have to know about those other 
documents/fields/values.  But if both Adam and Hoss think differently, we 
should go with the majority's opinion.
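
For point 3, a merged handler that switches on a cmd parameter might look roughly like this. It is a sketch only: SpellCmdSketch is not a Solr class, the "q" parameter name and the default of "suggest" when cmd is missing are assumptions, and the prefix matcher is a placeholder for the real SpellChecker lookup.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a merged spellchecker handler; not Solr's API. */
final class SpellCmdSketch {
    private final List<String> dictionary = new ArrayList<>();

    /**
     * Dispatches on the "cmd" parameter. Defaulting to "suggest" when cmd is
     * absent is an assumption made for this sketch.
     */
    String handle(Map<String, String> params, List<String> sourceTerms) {
        String cmd = params.getOrDefault("cmd", "suggest");
        switch (cmd) {
            case "rebuild":
                // Full rebuild: discard the old dictionary and reload it.
                dictionary.clear();
                dictionary.addAll(sourceTerms);
                return "rebuilt:" + dictionary.size();
            case "suggest":
                return String.join(",", suggest(params.getOrDefault("q", "")));
            default:
                return "error:unknown cmd " + cmd;
        }
    }

    /** Placeholder matcher: words sharing the query's first three letters. */
    private List<String> suggest(String q) {
        List<String> out = new ArrayList<>();
        if (q.isEmpty()) {
            return out;
        }
        String prefix = q.substring(0, Math.min(3, q.length()));
        for (String w : dictionary) {
            if (!w.equals(q) && w.startsWith(prefix)) {
                out.add(w);
            }
        }
        return out;
    }
}
```

Keeping rebuild as an explicit command matches the use-case above: the spellchecker index is refreshed on demand rather than on every change to the main index.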





[jira] Commented: (SOLR-81) Add Query Spellchecker functionality

2007-03-13 Thread Hoss Man (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-81?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12480660
 ] 

Hoss Man commented on SOLR-81:
--

Otis: haven't had a chance to look at your newest patch yet, but just to 
clarify my comment#4... In the last patch i looked at, LuceneDictionary could 
be used to build the dictionary based on a field name from the index -- but 
this will only work for simple String or TextFields.

Theoretically, someone could write a ROT132FieldType that munges up the field 
values stored in it.  If you were to try and build a SpellChecker index from 
this field, nothing good would come of it just using LuceneDictionary (because 
of the way it uses the raw TermEnum) .. but since we have the IndexSchema, we 
can get the FieldType for the field name we want to use, and then the 
indexedToReadable method on each indexed term will tell you the plain text 
version.

it's a minor thing, but it's a good thing to take into account.

Alternately, we can just document that it doesn't make sense to use any field 
type except StrField (even TextField doesn't really make sense since we can't 
anticipate what the Analyzer might have done)
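
The idea can be modelled without Solr on the classpath. In the sketch below a ROT13 function plays the part of a munging field type, and a UnaryOperator stands in for the field type's indexedToReadable method; ReadableTermsSketch and dictionaryWords are made-up names, so treat this as an illustration of the conversion step rather than the actual Solr classes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

/** Hypothetical model: decode raw indexed terms before feeding a dictionary. */
final class ReadableTermsSketch {

    /** ROT13 on ASCII letters; applying it twice gives back the input. */
    static String rot13(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (char c : s.toCharArray()) {
            if (c >= 'a' && c <= 'z') {
                sb.append((char) ('a' + (c - 'a' + 13) % 26));
            } else if (c >= 'A' && c <= 'Z') {
                sb.append((char) ('A' + (c - 'A' + 13) % 26));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    /**
     * Converts every raw indexed term to its readable form. The operator is
     * a stand-in for the field type's indexedToReadable method.
     */
    static List<String> dictionaryWords(List<String> indexedTerms,
                                        UnaryOperator<String> indexedToReadable) {
        List<String> words = new ArrayList<>();
        for (String term : indexedTerms) {
            words.add(indexedToReadable.apply(term));
        }
        return words;
    }
}
```

Feeding the raw terms straight into the spellchecker would index the munged strings; running them through the conversion first yields the plain words a user would actually type.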





[jira] Commented: (SOLR-81) Add Query Spellchecker functionality

2007-03-07 Thread Hoss Man (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-81?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12478893
 ] 

Hoss Man commented on SOLR-81:
--

looking over both Otis's patches and Adam's patches for the first time i find 
myself really confused.

As previously discussed in email, there are two completely different approaches 
that could be taken to achieve spell correction using Solr:

1) Use something like the Lucene SpellChecker contrib to make suggestions 
based on the data in the main solr index (defined by the solr schema) ... adding 
hooks to Solr to keep the SpellChecker system aware of changes to the main 
index, and hooks to allow requesthandlers to return suggestions with each query

2) use the main solr index (defined by the schema) to store the dictionary of 
words, turning the entire solr instance into one giant SpellChecker.  In this 
case there would be a recommended schema.xml for users who want to setup a 
SpellChecker Solr instance and possibly a custom RequestHandler that assumes 
you are using this schema.


These two patches both seem to be dealing with case#1, but they have hints of 
approach#2 ... for example i don't entirely understand why they include the 
NGram tokenfilter factories, since they don't seem to need the fields of the 
solr index to be tokenized in any special way (since the lucene SpellChecker 
controls the format of its dictionary).   It's also not clear to me what the 
purpose of the SpellCheckerRequestHandler is ... if the main index is storing 
real user records, then wouldn't a helper method that existing request 
handlers (like dismax and standard) can optionally call to get the SpellChecker 
data be more useful?




[jira] Commented: (SOLR-81) Add Query Spellchecker functionality

2007-03-06 Thread Adam Hiatt (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-81?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12478302
 ] 

Adam Hiatt commented on SOLR-81:


BTW updated patch added.




[jira] Commented: (SOLR-81) Add Query Spellchecker functionality

2007-03-05 Thread Otis Gospodnetic (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-81?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12478277
 ] 

Otis Gospodnetic commented on SOLR-81:
--

Adam:

I can merge our patches to produce a unified one.

NOTE:
The SpellCheckerCommitRequestHandler assumes that:
  a) one wants to populate the spellchecker index with data from another Lucene 
index.
  b) the Lucene index to be used for populating is available on the same box 
where the spellchecker service is running.

I think both a) and b) are good - let those who want this functionality have it.
However, some may not be able to live with these assumptions (e.g. one may want 
to have a server dedicated to spellchecker service, and may not want to push 
the source Lucene index to the spellchecker box.)  For those people, the 
approach that includes schema.xml modifications will be required, unless I'm 
missing something.  Am I?

Also, I think this is a mistake:

accuracy = p.getFloat(accuracy, DEFAULT_NUM_SUGGESTIONS);

You probably wanted DEFAULT_ACCURACY there, but that doesn't exist yet, so I'll 
fix that.





[jira] Commented: (SOLR-81) Add Query Spellchecker functionality

2007-03-02 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-81?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12477547
 ] 

Yonik Seeley commented on SOLR-81:
--

Is spelling check normally going to be integrated into the main index, or 
will it normally be a separate index?
If the latter, does it make more sense for some of this (the field definitions 
& handler) to be in contrib instead of core?

Any other way to avoid cluttering the current schema.xml?

If spelling check is to be a core feature (that one can turn on for any field 
in any index), it seems like it needs to be easier to configure.  Having the 
user define all the ngram fields, fieldTypes, and copyField statements doesn't 
seem ideal.

If, however, this is more of a configuration of solr used for spell-checking, 
it might make more sense for contrib.




[jira] Commented: (SOLR-81) Add Query Spellchecker functionality

2007-02-16 Thread Otis Gospodnetic (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-81?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12473813
 ] 

Otis Gospodnetic commented on SOLR-81:
--

Adam:
Please look at LUCENE-759.  That incorporates your patch, fixes a bug I found 
in it, and introduces a new bug, so we are not too bored with bug-free code.  
Any idea how to extract that last n-gram when using Side.BACK?





[jira] Commented: (SOLR-81) Add Query Spellchecker functionality

2007-02-16 Thread Adam Hiatt (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-81?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12473848
 ] 

Adam Hiatt commented on SOLR-81:


What was the bug? I couldn't tell from the Lucene issue description.








[jira] Commented: (SOLR-81) Add Query Spellchecker functionality

2007-02-05 Thread Otis Gospodnetic (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-81?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12470295
 ] 

Otis Gospodnetic commented on SOLR-81:
--

Adam,

I took a look at your patch.  It looks like you brought over (copied) various 
n-gram tokenizer classes and their unit tests that I put in Lucene's 
contrib/analyzers/ .  Did you do this on purpose?  I intentionally put 
those n-gram tokenizers under Lucene's contrib, as they are generic and not 
Solr-specific.  Thus, the only classes my patch has are classes that are 
Solr-specific:

src/java/org/apache/solr/analysis/EdgeNGramTokenizerFactory.java
src/java/org/apache/solr/analysis/NGramTokenizerFactory.java
src/java/org/apache/solr/analysis/BaseTokenizerFactory.java

And instead of copying the source classes from Lucene's contrib/analyzers/ 
it adds the new jar built from those sources:
lib/lucene-analyzers-2.1-dev.jar

Plus:
lib/lucene-spellchecker-2.1-dev.jar
example/solr/conf/schema.xml

I have some locally modified code for this issue, that was not a part of the 
first patch.  I wanted to attach the updated patch assuming you didn't really 
want those few generic tokenizer classes copied from Lucene over to Solr, but 
because changes are now in two places, so to speak, let's do this to unify our 
work:

Could you please:
- open a new LUCENE issue or just reopen the one where I originally attached 
this code and post your patch to the Lucene tokenizers there.
- prepare a new patch for this issue and make sure it only contains 
Solr-specific classes (see above), plus those 2 Jars.  

I'll upload my patch for schema.xml, so you can see my config (your patch 
didn't have this), and make sure your changes to the code are in sync with that.

Finally, are you making use of this code somehow already?
One thing that is completely missing from this patch is the RequestHandler that 
knows how to take the input (a query string), and get suggestions for 
alternative spellings via a SpellChecker instance.  I have some 
NGramRequestHandler code locally, but the code is unfinished.





Re: [jira] Commented: (SOLR-81) Add Query Spellchecker functionality

2006-12-23 Thread Chris Hostetter

: Yeah, I've used the Lucene-based spellchecker before, I just never had
: to hook it up with Solr.  At this point I'm not interested in the fancy
: stuff (cache, RAMDir...), I just want to figure out how to configure it
: via schema.xml...

But the crux of the issue is that if you are maintaining a second index
inside your base Solr installation for the purposes of the Spellchecker
class, then you don't want or need to configure it in schema.xml -- it
lives outside the schema space.

I pointed this out the last time spellchecking came up, there are two
extremely different approaches involved when you talk about implementing
a spelling/suggestion service with Solr...

In the first approach, the main Solr index *is* the suggestion index ...
each Document represents a suggested word, with one stored field telling
you what the word is, and indexed fields containing the ngrams.  you could
populate this index from any initial source: a dictionary, logs of popular
query terms, or a dump of all terms in your corpus.  At query time, your
application would query this index separately from querying your main
Solr index containing your domain specific data.

The second approach is to have the spelling/suggestion index live inside
of your Solr index side by side with your main domain specific index, so
your Request Handler can talk to it directly, and it can be populated
directly using the terms in your corpus -- this sounds like the
approach you are taking, but in this approach there is no need for your
schema.xml to know anything about the index .. just use the SpellChecker
class as is: construct it with an empty RAMDirectory and call
indexDictionary on a LuceneDictionary pointed at your main Solr index.
The only code you really need to write is something to run clearIndex and
indexDictionary as a newSearcher hook  (the easiest way probably being to
hang your Spellchecker instance off of a single element Solr cache and
write a Regenerator)

But like i said: you don't need to worry about making the schema know
about your ngrams -- you do that if you're going for the first approach.



-Hoss



[jira] Commented: (SOLR-81) Add Query Spellchecker functionality

2006-12-22 Thread Otis Gospodnetic (JIRA)
[ 
http://issues.apache.org/jira/browse/SOLR-81?page=comments#action_12460405 ] 

Otis Gospodnetic commented on SOLR-81:
--

This patch contains 3 new classes for org.apache.solr.analysis:
1. NGramTokenizerFactory
2. NGramTokenizer
3. NGramTokenizerTest (all tests pass)
+ 1 modified class:
4. BaseTokenizerFactory

I *think* the above can be configured in schema.xml as follows:

<fieldtype name="gram1" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <tokenizer class="solr.NGramTokenizerFactory" minGram="1" maxGram="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldtype>
<fieldtype name="gram2" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <tokenizer class="solr.NGramTokenizerFactory" minGram="2" maxGram="2"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldtype>
<fieldtype name="gram3" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <tokenizer class="solr.NGramTokenizerFactory" minGram="3" maxGram="3"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldtype>

And I *believe* the following fields would have to be defined (to match the 
fields in Spellchecker.java):

<field name="word" type="string" indexed="true" stored="true" multiValued="false"/>
<field name="start1" type="string" indexed="true" stored="true" multiValued="false"/> **
<field name="end1" type="string" indexed="true" stored="true" multiValued="false"/> **
<field name="start2" type="string" indexed="true" stored="true" multiValued="false"/> **
<field name="end2" type="string" indexed="true" stored="true" multiValued="false"/> **
<field name="start3" type="string" indexed="true" stored="true" multiValued="false"/> **
<field name="end3" type="string" indexed="true" stored="true" multiValued="false"/> **
<field name="start4" type="string" indexed="true" stored="true" multiValued="false"/> **
<field name="end4" type="string" indexed="true" stored="true" multiValued="false"/> **
<field name="gram1" type="gram1" indexed="true" stored="true" multiValued="false"/>
<field name="gram2" type="gram2" indexed="true" stored="true" multiValued="false"/>
<field name="gram3" type="gram3" indexed="true" stored="true" multiValued="false"/>
<field name="gram4" type="gram4" indexed="true" stored="true" multiValued="false"/>

c.f. http://wiki.apache.org/jakarta-lucene/SpellChecker
I am not sure how to configure the fields marked with  ** above.
Maybe I don't even need startN/endN fields.  I am not sure how endN fields 
would be useful.  The startN are probably useful because those can get an extra 
boost.

I *think* the above config (except for ** fields, which I don't know how to 
handle) will do the following.
If the input (query string) is pork, my ngrammer may generate the following 
uni- and bi-gram tokens:

  p o r k po or rk

And this is how I think they will get mapped to fields and indexed:
word: pork
gram1: p o r k
gram2: po or rk
start1: p **
start2: po **
end1: k **
end2: rk **

Again, not sure how to achieve **.
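
One way to get the ** values is to compute them outside of Solr, as the
issue description suggests, by treating startN as the first n-gram of the
word and endN as the last.  The Java sketch below does that with plain
string operations; the class and method names are made up for
illustration, and it is not taken from the patch or from
Spellchecker.java.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SpellFieldsSketch {

    // Build the word/gramN/startN/endN values for one word,
    // for every n between minGram and maxGram inclusive.
    static Map<String, String> fieldsFor(String word, int minGram, int maxGram) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("word", word);
        for (int n = minGram; n <= maxGram; n++) {
            if (word.length() < n) {
                continue; // word too short to have any n-grams of this size
            }
            List<String> grams = new ArrayList<>();
            for (int i = 0; i + n <= word.length(); i++) {
                grams.add(word.substring(i, i + n));
            }
            fields.put("gram" + n, String.join(" ", grams));
            fields.put("start" + n, grams.get(0));                // first n-gram
            fields.put("end" + n, grams.get(grams.size() - 1));   // last n-gram
        }
        return fields;
    }

    public static void main(String[] args) {
        // Print one "field: value" line per generated field for the word "pork".
        for (Map.Entry<String, String> e : fieldsFor("pork", 1, 2).entrySet()) {
            System.out.println(e.getKey() + ": " + e.getValue());
        }
    }
}
```

The resulting map could then be turned into the field elements of an add
document and posted to Solr, with the startN/endN fields left as plain
string fields.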

I haven't actually tried this.  I am only modifying my local 
example/solr/conf/schema.xml for now, and I haven't actually indexed anything 
with the above config.

Thoughts/comments?

 Add Query Spellchecker functionality
 

 Key: SOLR-81
 URL: http://issues.apache.org/jira/browse/SOLR-81
 Project: Solr
  Issue Type: New Feature
  Components: search
Reporter: Otis Gospodnetic
Priority: Minor
 Attachments: SOLR-81-ngram.patch


 Use the simple approach of n-gramming outside of Solr and indexing n-gram 
 documents.  For example:
 <doc>
 <field name="word">lettuce</field>
 <field name="start3">let</field>
 <field name="gram3">let ett ttu tuc uce</field>
 <field name="end3">uce</field>
 <field name="start4">lett</field>
 <field name="gram4">lett ettu ttuc tuce</field>
 <field name="end4">tuce</field>
 </doc>
 See:
 http://www.mail-archive.com/solr-user@lucene.apache.org/msg01254.html
 Java clients: SOLR-20 (add delete commit optimize), SOLR-30 (search)

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Commented: (SOLR-81) Add Query Spellchecker functionality

2006-12-21 Thread Otis Gospodnetic (JIRA)
[ 
http://issues.apache.org/jira/browse/SOLR-81?page=comments#action_12460331 ] 

Otis Gospodnetic commented on SOLR-81:
--

Ogün - yes, that Spellchecker class in Lucene's contrib/spellchecker has 1.0f 
defined as the boost for the last n-gram.  I'm not even sure if that's needed.  
I talked to Bob Carpenter (alias-i.com) about it recently, and he said boosting 
the end ngram doesn't make sense, if I remember correctly.  I'm inclined to go 
remove that from the source completely.  Thoughts?

I'm unsure about how to integrate the Lucene spellchecker code into Solr, 
though.  There is no n-gram tokenizer per se in the spellchecker extension, 
so I can't really point NGramFilter config in Solr's schema.xml to anything in 
that spellchecker library.  I can write my own n-gram Filter, that's not a 
problem, but you said you made use of the Lucene spellchecker code, and I can't 
see how to do that.

Did you simply create your own NGramFilter that creates the same ngrams as 
Spellchecker.java, and then used the Spellchecker.suggest(String word) method 
*only* for fetching/getting alternative spelling suggestions?






[jira] Commented: (SOLR-81) Add Query Spellchecker functionality

2006-12-13 Thread Otis Gospodnetic (JIRA)
[ 
http://issues.apache.org/jira/browse/SOLR-81?page=comments#action_12458052 ] 

Otis Gospodnetic commented on SOLR-81:
--

Something like this, then?

<fieldtype name="queryString" class="solr.TextField" positionIncrementGap="1">
  <analyzer>
    <tokenizer class="solr.NGramTokenizerFactory"/>  <!-- Or maybe just make an NGramAnalyzer? -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldtype>

Plus:

<copyField source="word" dest="word_start1"/>
<copyField source="word" dest="word_end1"/>
<copyField source="word" dest="word_start2"/>
<copyField source="word" dest="word_end2"/>
<copyField source="word" dest="word_start3"/>
<copyField source="word" dest="word_end3"/>
<copyField source="word" dest="word_gram1"/>
<copyField source="word" dest="word_gram2"/>
<copyField source="word" dest="word_gram3"/>
<copyField source="word" dest="word_gram4"/>

I'd probably also want to give those word_start* n-grams some boost, though I 
don't see how to do that in schema.xml yet.






[jira] Commented: (SOLR-81) Add Query Spellchecker functionality

2006-12-12 Thread JIRA
[ 
http://issues.apache.org/jira/browse/SOLR-81?page=comments#action_12457871 ] 

Ogün Bilge commented on SOLR-81:


I have created an NGramFilter for generating those gram fields based on the 
word field.
It is configurable in the schema.xml file simply by defining fieldtypes and 
using the copyField directive.
The generated documents can be used with the Lucene spellchecker extension to 
fetch a suggested word.
Unfortunately I made this extension during my working hours, so I can't 
apply the ASF license to it, but if you have questions please ask.

