I don't quite follow everything here (examples?), but I believe the IDF of a term is not a per-field value but an index-wide one. Does that change the arguments for this proposal?
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
> From: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>
> To: solr-user@lucene.apache.org
> Sent: Friday, February 29, 2008 11:52:07 AM
> Subject: RE: Proposition of a new feature: Dynamic Field Types
>
> Thanks for your response Grant.
>
> You are right, depending on the language we could index the text in a
> specific field. At request time, we would then run the query against all
> of those fields.
>
> I see however a few possible problems with this approach. In order of
> decreasing importance:
>
> - Influence on relevance
>
> I assume the idf is calculated on a field by field basis? In the context of
> one field per language, the documents whose language is the least present in
> the index will receive an unusual boost for cross-lingual tokens. This
> situation can be quite frequent as the distribution of languages in the
> index is usually heterogeneous. Even if it were homogeneous, we would have
> the problem with rare text in one language citing words in another.
>
> On the other hand, you are right in the sense that the idf of
> language-specific words is also altered. In the context of one field for all
> languages, the idf could be very low for a word if it is a common word in
> another language. For example, the word "thé" in French is quite rare, but
> its idf would be greatly altered by the word "the" in English.
>
> We have a dilemma here...
>
> - Performance
>
> Queries are in O(log n) if I'm not mistaken? Then a disjunction query on x
> language fields would be nearly x times slower, no?
>
> - Verbose configuration
>
> Not an important point, but with the dynamic field type, you configure all
> the languages only once. Otherwise, you must do so for each text field.
>
> The query handler configuration would also be much more verbose. We usually
> use the dismax handler and the qf could become very long.
>
> - Highlight
>
> Not an important point either, but a bit of work needs to be done to
> aggregate the results.
>
> In conclusion, the choice is not so clear to me. Your remark on
> relevance made me think a bit more about multilingual problems. There may be
> a way to tune the idf of some fields depending on others?
>
> Another idea would be to boost documents in the language of the request.
> This may actually be much simpler.
>
> If you have any ideas on the subject, I'm very interested!
>
> Nicolas
>
>
> -----Original Message-----
> From: Grant Ingersoll [mailto:[EMAIL PROTECTED]
> Sent: Friday, February 29, 2008 14:06
> To: solr-user@lucene.apache.org
> Subject: Re: Proposition of a new feature: Dynamic Field Types
>
> Why can't you choose the proper field in your application and keep
> separate fields per language? Putting them all in the same field,
> regardless of language, is not a good idea in my opinion because it is
> more than likely going to skew your statistics and lower your relevance.
>
> That being said, the dynamic field type is still an interesting idea.
>
> -Grant
>
> On Feb 29, 2008, at 5:56 AM, [EMAIL PROTECTED] wrote:
>
> > Dynamic field types are field types that act as proxies to other field
> > types. The choice of the field type to use is made on a per-document
> > basis and depends on the values of the document's fields.
> >
> > The use case that led us to this feature is the indexing of documents in
> > different languages. We use a specific analyzer for each language but
> > want to index semantic information that is not specific to the language.
> >
> > For example, we would add to the index the semantic tag {co:Paris} for
> > the expressions "Paris", "capital city of France", "the city of lights"
> > in English and "Paris", "capitale de la France", "la ville lumière" in
> > French. This allows us to provide advanced functionalities such as
> > semantic and cross-lingual search.
> >
> > To do so in SOLR, we chose to index texts written in different languages
> > in the same field, while analyzing them with different analyzers. Hence
> > the proposition of a new feature that responds to this need: Dynamic
> > Field Types.
> >
> > The idea of this new field type is to act as a proxy to other field
> > types. Depending on the values of some fields of the document to index,
> > it chooses the correct field type to use. In our situation, we use it to
> > choose the correct language-dependent field type based on the value of
> > the field named "language". It is configured with a config similar to
> > the following:
> >
> > ...
> >
> > ...
> >
> > name="french_ft"/>
> >
> > name="english_ft"/>
> >
> > The last condition is used as a catch-all if the preceding conditions are
> > not met.
> >
> > What do you think of this feature?
> >
> > Best regards,
> > Nicolas Dessaigne
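
To make the idf dilemma discussed in this thread concrete, here is a minimal Python sketch. It is not Solr or Lucene code. It assumes that document frequency is counted per (field, term) pair, that idf has the classic Lucene form 1 + ln(numDocs / (docFreq + 1)) with numDocs taken over the whole index, and that the analyzers fold accents so that "thé" and "the" end up as the same token. The names ANALYZERS, DEFAULT_ANALYZER, doc_freq and the toy documents are hypothetical and only serve the illustration.

```python
import math
import unicodedata


def fold(text):
    """Toy analyzer: lowercase, strip accents, split on whitespace."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.split()


# Hypothetical stand-in for the proposed proxy: the analyzer is picked from
# the document's "language" value, with a catch-all for unknown languages.
ANALYZERS = {"en": fold, "fr": fold}
DEFAULT_ANALYZER = fold

# Toy index: (language, text). English dominates, as in an uneven index.
DOCS = [
    ("en", "the cat"),
    ("en", "the dog"),
    ("en", "the tea"),
    ("en", "the city of lights"),
    ("fr", "le thé vert"),
    ("fr", "la ville lumière"),
]


def idf(num_docs, doc_freq_value):
    """Classic Lucene-style idf (assumed form)."""
    return 1.0 + math.log(num_docs / (doc_freq_value + 1))


def doc_freq(docs, term, field_for):
    """Count documents containing `term`, grouped by target field.

    `field_for` maps a language code to the field the text is indexed in.
    """
    counts = {}
    for lang, text in docs:
        analyzer = ANALYZERS.get(lang, DEFAULT_ANALYZER)
        if term in analyzer(text):
            field = field_for(lang)
            counts[field] = counts.get(field, 0) + 1
    return counts


# Layout A: one shared field for every language (the dynamic field type case).
shared = doc_freq(DOCS, "the", lambda lang: "text")
# Layout B: one field per language (Grant's suggestion).
split = doc_freq(DOCS, "the", lambda lang: "text_" + lang)

n = len(DOCS)
for layout, counts in (("shared", shared), ("split", split)):
    for field, df in sorted(counts.items()):
        print(layout, field, "df =", df, "idf =", round(idf(n, df), 3))
```

With these toy documents the shared field gives the folded token "the" an idf of 1.0, because the English article drags down the rare French word. In the split layout the French field scores the same token noticeably higher than the English field does, which is the boost for the under-represented language that Nicolas describes.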