Re: German language specific problem (automatic Spelling correction, automatic Synonyms ?)

2011-08-04 Thread thomas
Concerning the downtime, we found a solution that works well for us. We
had already implemented an update mechanism, so that when authors change
content in the CMS, the index entry for that piece of content is updated
as well (deleted, then indexed again).

All we had to do is:
1. Change the schema.xml to support the PhoneticFilter in certain fieldtypes
2. Write a script that finds all individual content items
3. Start the update mechanism for each content item, one after
another.
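
The three steps above can be sketched roughly as follows. This is only an
illustration: rolling_reindex and the reindex_one hook are hypothetical names
standing in for the script and the existing CMS update mechanism, and the
optional pause is an assumption, not something described in this message.

```python
import time
from typing import Callable, Iterable, List


def rolling_reindex(
    content_ids: Iterable[str],
    reindex_one: Callable[[str], None],
    pause_seconds: float = 0.0,
) -> List[str]:
    """Re-run the per-item update for every content item, one after another.

    Returns the ids whose update raised an exception, so they can be retried.
    """
    failed = []
    for content_id in content_ids:
        try:
            # The hook is expected to delete the item and index it again,
            # which picks up whatever schema.xml is active at that moment.
            reindex_one(content_id)
        except Exception:
            failed.append(content_id)
        if pause_seconds:
            time.sleep(pause_seconds)  # optional throttle between items
    return failed


if __name__ == "__main__":
    # Stand-in for the real CMS update hook (hypothetical).
    rolling_reindex(["page-1", "page-2"], lambda cid: print("reindexed", cid))
```

Collecting failures instead of aborting keeps the run going if a single item
cannot be indexed; the failed ids can be fed back into the same function.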

So the index slowly moves from the old state to the new phonetic state without
any noticeable downtime for users of the search function. It's just that
they get somewhat mixed results during the transition. Sure, it takes
some time, but CMS users can keep working with content the whole time. If
they create or update content during the transition, it will be indexed or
reindexed following the new schema.xml anyway.

If we need to rollback we just replace the schema.xml with the old version
and start the update process again. 

So far this is working, thanks for your support!

--
View this message in context: 
http://lucene.472066.n3.nabble.com/German-language-specific-problem-automatic-Spelling-correction-automatic-Synonyms-tp3216278p3225223.html
Sent from the Solr - User mailing list archive at Nabble.com.


German language specific problem (automatic Spelling correction, automatic Synonyms ?)

2011-08-01 Thread thomas
Hi,
we have several entries in our database that our customer would like to find
even when the search string does not match exactly. The problem is somewhat
related to spelling correction and synonyms. But instead of single entries in
synonyms.txt we would like an automatic solution for this group of problems:

When searching for the name schmid, we also want to find documents that
contain the name schmidt. There are analogous names like hildebrand and
hildebrandt, and more. That is the reason we'd like to find an automatic
solution for this group of words.

We already use the following filters in our index chain:
<filter class="solr.DictionaryCompoundWordTokenFilterFactory" dictionary="dictionary_de.txt"/>
<filter class="solr.SnowballPorterFilterFactory" language="German2" protected="protwords.txt"/>

Unfortunately, the German stemmer does not handle such cases. Nor is this
a problem related to compound words.

Does anyone know of a solution? Maybe it's possible to set up a filter rule
in the query chain that automatically extends words ending in the letter d
with the letter t? Or, in the other direction, a rule in the index chain
that removes a t following a d.
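
To make the two directions concrete, here is a small sketch of the rule
itself, outside of Solr. The function names are made up for illustration;
this is not an existing Solr filter.

```python
import re

# Matches a word-final "dt" so it can be collapsed to "d".
_TRAILING_DT = re.compile(r"dt$")


def normalize_dt(token: str) -> str:
    """Index-side direction: remove the t after a final d."""
    return _TRAILING_DT.sub("d", token.lower())


def expand_d(token: str) -> list:
    """Query-side direction: search for both the "d" and the "dt" spelling."""
    token = token.lower()
    if token.endswith("dt"):
        return [token[:-1], token]
    if token.endswith("d"):
        return [token, token + "t"]
    return [token]
```

Note that such a rule is blind to meaning: normalize_dt("Stadt") gives "stad",
so it would probably have to be limited to name fields.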

Thanks a lot
Thomas

--
View this message in context: 
http://lucene.472066.n3.nabble.com/German-language-specific-problem-automatic-Spelling-correction-automatic-Synonyms-tp3216278p3216278.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: German language specific problem (automatic Spelling correction, automatic Synonyms ?)

2011-08-01 Thread Alexei Martchenko
I'd try solr.PhoneticFilterFactory; it usually evens out these slight
differences. schmidt, smith and schmid will all become something like XMDT.

-- 

*Alexei Martchenko* | *CEO* | Superdownloads
ale...@superdownloads.com.br | ale...@martchenko.com.br | (11)
5083.1018/5080.3535/5080.3533


Re: German language specific problem (automatic Spelling correction, automatic Synonyms ?)

2011-08-01 Thread Paul Libbrecht
Thomas,

An alternative would be to use a Kölner Phonetik (Cologne phonetics) filter
factory. There was a recent discussion about it.
But all of this needs some programming.
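
For readers who have not met Kölner Phonetik: the sketch below follows the
commonly published rule table (vowels to 0, D/T to 2, S/Z to 8 and so on). It
is a simplified illustration with a made-up function name, not the encoder
from any particular library, and edge cases may differ from production
implementations.

```python
def cologne_phonetic(word: str) -> str:
    """Simplified Koelner Phonetik code of a single word."""
    # upper() already turns the sharp s into "SS"; fold the umlauts by hand.
    folded = word.upper().replace("Ä", "A").replace("Ö", "O").replace("Ü", "U")
    letters = [ch for ch in folded if "A" <= ch <= "Z"]

    codes = []
    for i, ch in enumerate(letters):
        prev = letters[i - 1] if i > 0 else ""
        nxt = letters[i + 1] if i + 1 < len(letters) else ""
        if ch in "AEIJOUY":
            code = "0"
        elif ch == "H":
            continue  # H carries no code of its own
        elif ch == "B":
            code = "1"
        elif ch == "P":
            code = "3" if nxt == "H" else "1"
        elif ch in "DT":
            code = "8" if nxt in {"C", "S", "Z"} else "2"
        elif ch in "FVW":
            code = "3"
        elif ch in "GKQ":
            code = "4"
        elif ch == "C":
            if i == 0:
                code = "4" if nxt in set("AHKLOQRUX") else "8"
            elif prev in {"S", "Z"}:
                code = "8"
            else:
                code = "4" if nxt in set("AHKOQUX") else "8"
        elif ch == "X":
            code = "8" if prev in {"C", "K", "Q"} else "48"
        elif ch == "L":
            code = "5"
        elif ch in "MN":
            code = "6"
        elif ch == "R":
            code = "7"
        else:  # S and Z
            code = "8"
        codes.append(code)

    # Collapse runs of the same digit, then drop every 0 except a leading one.
    collapsed = []
    for digit in "".join(codes):
        if not collapsed or collapsed[-1] != digit:
            collapsed.append(digit)
    result = "".join(collapsed)
    return result[:1] + result[1:].replace("0", "")
```

With these rules schmid and schmidt both encode to 862, and hildebrand and
hildebrandt both encode to 0521762, which is the behaviour asked for in the
original question.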

paul


Re: German language specific problem (automatic Spelling correction, automatic Synonyms ?)

2011-08-01 Thread thomas
Thanks Alexei,
Thanks Paul,

I played with the solr.PhoneticFilterFactory. Analysing my query in the Solr
admin backend showed me that it works, and how. My major problem is
that this filter needs to be applied to the index chain as well as to the
query chain to generate matches for our search. We have a huge index at this
point and I am not really happy about reindexing all content.

Is there maybe a more subtle solution that works by manipulating
the query chain only?

Otherwise I need to back up the whole index and try to reindex overnight when
CMS users are sleeping.

I will have a look into the ColognePhonetic encoder. I'm just afraid I'll have
to reindex the whole content there as well.

Thomas

--
View this message in context: 
http://lucene.472066.n3.nabble.com/German-language-specific-problem-automatic-Spelling-correction-automatic-Synonyms-tp3216278p3216414.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: German language specific problem (automatic Spelling correction, automatic Synonyms ?)

2011-08-01 Thread Jonathan Rochkind
Any changes you make related to stemming or normalization are likely
going to require a re-index. That's just how it goes, just how Solr/Lucene
works. What you can do just by normalizing at query time is limited;
almost any good solution to this type of problem is going to require
normalization at index time.
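
A toy illustration of why query-time normalization alone falls short. This is
a hand-rolled inverted index, not Lucene, and fold_dt is a hypothetical
normalizer used only for the demonstration.

```python
from collections import defaultdict


def build_index(docs, analyze):
    """Map each analyzed term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[analyze(token)].add(doc_id)
    return index


def search(index, query, analyze):
    """Look up the analyzed query term; terms must match exactly."""
    return index.get(analyze(query.lower()), set())


def fold_dt(token):
    # Hypothetical normalizer: treat a trailing "dt" like "d".
    return token[:-1] if token.endswith("dt") else token


def identity(token):
    return token


docs = {1: "Schmidt", 2: "Schmid"}

old_index = build_index(docs, identity)  # built before the rule existed
new_index = build_index(docs, fold_dt)   # rebuilt with the rule applied
```

Searching old_index for "schmidt" with fold_dt applied only to the query finds
document 2 but no longer document 1, because the stored term is still
"schmidt". Against new_index the same query finds both.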


If you're going to be fiddling with a production Solr, it pays to figure
out a workflow that lets you introduce indexing changes without
downtime. This is not the last time you'll have to do it.



Re: German language specific problem (automatic Spelling correction, automatic Synonyms ?)

2011-08-01 Thread Paul Libbrecht

On 1 August 2011 at 18:35, thomas wrote:

> Thanks Alexei,
> Thanks Paul,
>
> I played with the solr.PhoneticFilterFactory. Analysing my query in solr
> admin backend showed me how and that it is working. My major problem is,
> that this filter needs to be applied to the index chain as well as to the
> query chain to generate matches for our search. We have a huge index at this
> point and I am not really happy to reindex all content.

I doubt there's a way out.

> Is there maybe a more subtle solution which is working by just manipulating
> the query chain only?

You'd need to program it... it's not ruled out.

> Otherwise i need to backup the whole index and try to reindex overnight when
> cms users are sleeping.

With some work you can do this using an extra Solr that just pulls everything,
then swaps the indexes (that needs a bit of downtime), then re-indexes the
things changed during the night.
I feel this should be a standard feature of Solr...

> I will have a look into the ColognePhonetic encoder. I'm just afraid I'll have
> to reindex the whole content there as well.

Sure, absolutely.
Also note that using phonetics really needs a separate field with query
expansion (which is easy with dismax).
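
As a rough sketch of what that could look like from the client side: the
request below asks dismax to search an exact field and a phonetic copy of it.
The field names name and name_phonetic and the boost value are assumptions
for illustration; defType and qf are the standard dismax parameters.

```python
from urllib.parse import urlencode


def dismax_params(user_query: str) -> str:
    """Build the query string for a dismax search over both fields."""
    params = {
        "q": user_query,
        "defType": "dismax",
        # Exact matches are boosted above phonetic ones (names are assumed).
        "qf": "name^2.0 name_phonetic",
    }
    return urlencode(params)
```

Keeping the phonetic tokens in their own field means exact matches can still
rank first, instead of every similar-sounding name scoring the same.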

paul

Re: German language specific problem (automatic Spelling correction, automatic Synonyms ?)

2011-08-01 Thread Jonathan Rochkind

On 8/1/2011 12:42 PM, Paul Libbrecht wrote:
>> Otherwise i need to backup the whole index and try to reindex
>> overnight when cms users are sleeping.
>
> With some work you can do this using an extra solr that just pulls everything,
> then swaps the indexes (that needs a bit of downtime), then re-indexes the
> things changed during the night.
> I feel this should be a standard feature of SOLR...



It sort of is, in the sense that you can do it with replication, with no 
downtime. (Although you'll need enough disk and RAM in the slave to warm 
the replicated index while still serving queries from the older index, 
for no downtime).


Reindex to a separate Solr (or separate core), then have the actual
production core set up as a slave, and have it replicate from master
when the re-indexing is done.  You can have your relevant conf files
(schema or solrconfig) set up to replicate too, so you get those new
ones in production exactly when you get the new indexes they go with.


The replication feature isn't exactly set up for this, so it gets a bit
confusing. I set up the 'slave' with NO polling.  It still needs to be
set up with config saying it's a slave though. And it still needs to
have a 'master' URL in there: even though you can also supply/override
the master URL with a manual replicate command, Solr will refuse to
start up if there's no master URL at all.   So I config the master URL, but
without any polling for changes. Then I manually issue an HTTP replicate
command to the slave only when I have a rebuilt index in master that I want
to swap in. It seems to be working.
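
For reference, a sketch of how the manual replicate command could be built.
It assumes the replication handler is mounted at /replication on both cores
and uses its fetchindex command with the optional masterUrl parameter; the
host names in the usage note are placeholders.

```python
from urllib.parse import urlencode


def fetchindex_url(slave_core_url: str, master_core_url: str) -> str:
    """URL that asks the slave's replication handler to pull the index once."""
    query = urlencode({
        "command": "fetchindex",
        # Points at the master's replication handler (path is an assumption).
        "masterUrl": master_core_url.rstrip("/") + "/replication",
    })
    return slave_core_url.rstrip("/") + "/replication?" + query
```

For example, fetchindex_url("http://slave:8983/solr", "http://master:8983/solr")
gives a URL that can be requested with any HTTP client once the rebuilt index
on the master is ready.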


Re: German language specific problem (automatic Spelling correction, automatic Synonyms ?)

2011-08-01 Thread Mike Sokolov
If you want to avoid re-indexing, you could consider building a synonym
file that is generated using your rule set, and then using that to
expand your queries.  You'd need to get a list of all terms in your
index and then process them to generate synonyms.  Actually, I don't
know how to get a list of all the terms without Java programming,
though: is there a way?
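
Assuming the term list is available somehow, generating the synonym entries
for the d/dt rule could look roughly like this. The function name is made up,
and the output uses the comma-separated form of synonyms.txt.

```python
def dt_synonym_lines(terms):
    """Build synonyms.txt lines for terms that differ only by a final t after d."""
    vocabulary = {t.lower() for t in terms}
    lines = []
    for term in sorted(vocabulary):
        # Only pair up spellings that both actually occur in the index.
        if term.endswith("d") and term + "t" in vocabulary:
            lines.append(f"{term}, {term}t")
    return lines
```

Restricting the output to pairs that both occur in the index keeps the file
small, but it has to be regenerated whenever new names are indexed.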


-Mike


Re: German language specific problem (automatic Spelling correction, automatic Synonyms ?)

2011-08-01 Thread Jonathan Rochkind

On 8/1/2011 1:40 PM, Mike Sokolov wrote:
> If you want to avoid re-indexing, you could consider building a
> synonym file that is generated using your rule set, and then using
> that to expand your queries.  You'd need to get a list of all terms in
> your index and then process them to generate synonyms.  Actually, I
> don't know how to get a list of all the terms without Java
> programming, though: is there a way?


The terms component will give you a list of all terms, I think.
http://wiki.apache.org/solr/TermsComponent
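
A sketch of fetching the terms over HTTP rather than with Java. It assumes a
request handler for the component is mounted at /terms and that the JSON
response lists each field as a flat [term, count, term, count, ...] array;
both depend on the Solr version and configuration, so check them against an
actual response.

```python
import json
from urllib.parse import urlencode


def terms_url(core_url: str, field: str) -> str:
    """Build a TermsComponent request for every term in one field."""
    query = urlencode({
        "terms": "true",
        "terms.fl": field,
        "terms.limit": -1,  # a negative limit lifts the default cap of 10
        "wt": "json",
    })
    return core_url.rstrip("/") + "/terms?" + query


def parse_terms(response_text: str, field: str):
    """Extract the term strings, assuming the flat term/count layout."""
    flat = json.loads(response_text)["terms"][field]
    return flat[0::2]
```

The resulting list could then be fed into whatever script generates the
synonym file.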


But this is getting awfully hacky and hard to maintain simply to avoid 
doing a re-index. I still think doing a re-index is a normal part of 
evolving your Solr configuration, and better to just get used to it (and 
figure out how to do it in production with no or minimal downtime) now.