Re: English and French documents together / analysis, indexing, searching

2005-01-23 Thread Otis Gospodnetic
That would be a partial solution.  Accents will no longer be a problem,
but if you use an Analyzer that stems tokens, they will not really
be stemmed properly.  Searches will probably work, but if you look at
the index you will see that some terms were not analyzed properly.  Still,
it may be sufficient for your needs, so try accent removal alone first.
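
For illustration, a minimal sketch of that accent-removal step in plain
Java (the class and method names here are made up; modern JDKs provide
java.text.Normalizer, and in Lucene this logic would live inside a
custom TokenFilter rather than a standalone helper):

```java
import java.text.Normalizer;

// Sketch of the accent-removal idea: decompose to NFD and strip the
// combining marks, so "é" and "è" are both indexed as plain "e".
public class AccentFolder {
    public static String fold(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        // \p{M} matches the combining diacritical marks left behind by NFD
        return decomposed.replaceAll("\\p{M}", "");
    }

    public static void main(String[] args) {
        System.out.println(fold("éléphant")); // elephant
        System.out.println(fold("être"));     // etre
    }
}
```

Applying the same folding at both index time and query time keeps the
two sides consistent, which is the point of the suggestion above.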

Otis


--- "[EMAIL PROTECTED]" <[EMAIL PROTECTED]> wrote:

> Morus Walter said the following on 1/21/2005 2:14 AM:
> 
> > No. You could do a ( ( french-query ) or ( english-query ) )
> construct 
> > using
> >
> >one query. So query construction would be a bit more complex but
> querying
> >itself wouldn't change.
> >
> >The first thing I'd do in your case would be to look at the
> differences
> >in the output of english and french snowball stemmer.
> >I don't speak any french, but probably you might even use both
> stemmers
> >on all texts.
> >
> >Morus
> >
> 
> I've done some thinking afterwards, and instead of messing with
> complex 
> queries, would it make sense to
> replace all "special" characters such as "é", "è" with "e" during 
> indexing (I suppose write a custom analyzer)
> and then during searching parse the query and replace all occurrences
> of 
> special characters (if any) with their
> normal latin equivalents?
> 
> This should produce the required results, no? Since the index would
> not 
> contain any French characters and
> searching for French words would return them since they were indexed
> as 
> normal words.
> 
> -pedja
> 
> 
> 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: English and French documents together / analysis, indexing, searching

2005-01-23 Thread [EMAIL PROTECTED]
Morus Walter said the following on 1/21/2005 2:14 AM:
No. You could do a ( ( french-query ) or ( english-query ) ) construct 
using

one query. So query construction would be a bit more complex but querying
itself wouldn't change.
The first thing I'd do in your case would be to look at the differences
in the output of english and french snowball stemmer.
I don't speak any french, but probably you might even use both stemmers
on all texts.
Morus
I've done some thinking afterwards, and instead of messing with complex 
queries, would it make sense to
replace all "special" characters such as "é", "è" with "e" during 
indexing (I suppose write a custom analyzer)
and then during searching parse the query and replace all occurrences of 
special characters (if any) with their
normal latin equivalents?

This should produce the required results, no? Since the index would not 
contain any French characters and
searching for French words would return them since they were indexed as 
normal words.

-pedja



Re: English and French documents together / analysis, indexing, searching

2005-01-20 Thread Morus Walter
[EMAIL PROTECTED] writes:
> 
> > you could try to create a more complex query and expand it into both 
> > languages using different analyzers. Would this solve your problem ?
> >
> Would that mean I would have to actually conduct two searches (one in 
> English and one in French) then merge the results and display them to 
> the user?
No. You could do a ( ( french-query ) or ( english-query ) ) construct using
one query. So query construction would be a bit more complex but querying
itself wouldn't change.
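
As a rough sketch of that single combined query (the field names
text_en/text_fr are assumptions for illustration; in real Lucene code
you would parse each sub-query with the matching analyzer and join the
two in a BooleanQuery rather than build a string):

```java
// Hypothetical helper: expand one user query into a single boolean
// query that covers both language fields, as described above.
public class BilingualQuery {
    public static String expand(String userQuery) {
        return "(text_en:(" + userQuery + ")) OR (text_fr:(" + userQuery + "))";
    }

    public static void main(String[] args) {
        System.out.println(expand("cheval"));
        // (text_en:(cheval)) OR (text_fr:(cheval))
    }
}
```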

The first thing I'd do in your case would be to look at the differences
in the output of english and french snowball stemmer.
I don't speak any french, but probably you might even use both stemmers
on all texts.

Morus




Re: English and French documents together / analysis, indexing, searching

2005-01-20 Thread Bernhard Messer

you could try to create a more complex query and expand it into both 
languages using different analyzers. Would this solve your problem ?

Would that mean I would have to actually conduct two searches (one in 
English and one in French) then merge the results and display them to 
the user?
It sounds to me like a long way around, so writing an 
analyzer that has a language guesser might be a better solution in 
the long run?
It's no problem to guess the language based on the document corpus. But 
how do you want to guess the language of a simple Term Query ? What if 
your users are searching for names like "George Bush" ? You can't guess 
the language of such a query and you have to expand it into both 
languages. I don't see an easier way for solving that problem.
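
A toy sketch of the kind of guesser being discussed (the stopword lists
are illustrative, not a real detection library): count hits against
small English and French stopword sets and pick the larger. On a short
query like "George Bush" it finds no evidence either way, which is
exactly the problem described above.

```java
import java.util.Set;

// Naive stopword-vote language guesser; works tolerably on whole
// documents, fails on short queries with no function words.
public class NaiveLanguageGuesser {
    private static final Set<String> EN = Set.of("the", "and", "of", "is", "with");
    private static final Set<String> FR = Set.of("le", "la", "et", "de", "est", "avec");

    public static String guess(String text) {
        int en = 0, fr = 0;
        for (String w : text.toLowerCase().split("\\W+")) {
            if (EN.contains(w)) en++;
            if (FR.contains(w)) fr++;
        }
        if (en == fr) return "unknown";
        return en > fr ? "en" : "fr";
    }

    public static void main(String[] args) {
        System.out.println(guess("the quick fox and the hound")); // en
        System.out.println(guess("le chat est sur la table"));    // fr
        System.out.println(guess("George Bush"));                 // unknown
    }
}
```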


This behaviour is implemented in StandardTokenizer, which is used by 
StandardAnalyzer. Look at the documentation of StandardTokenizer:

Many applications have specific tokenizer needs.  If this tokenizer 
does not suit your application, please consider copying this source code
directory to your project and maintaining your own grammar-based 
tokenizer.

Hmm I feel this is beyond my abilities at the moment, writing my own 
tokenizer, without more in-depth knowledge of everything else.
Perhaps I'll try taking the StandardTokenizer and expanding or changing 
it based on other tokenizers available in Lucene, such as 
WhitespaceTokenizer.
What about using the WhitespaceAnalyzer directly? Maybe this fits 
your requirement better, and you could use it for both languages.
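
A plain-Java illustration of why whitespace tokenization helps with
identifiers (this is a sketch of the behaviour, not Lucene code):
splitting on whitespace alone leaves "ABC-1234" as a single token,
whereas a grammar-based tokenizer may break it on the hyphen.

```java
import java.util.Arrays;
import java.util.List;

// Whitespace-only splitting keeps hyphenated document identifiers intact.
public class WhitespaceSplit {
    public static List<String> tokenize(String text) {
        return Arrays.asList(text.trim().split("\\s+"));
    }

    public static void main(String[] args) {
        System.out.println(tokenize("see document ABC-1234 for details"));
        // [see, document, ABC-1234, for, details]
    }
}
```

One caveat: WhitespaceAnalyzer also does no lowercasing, so searches
become case-sensitive unless you add a lowercasing filter yourself.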

Bernhard


Re: English and French documents together / analysis, indexing, searching

2005-01-20 Thread [EMAIL PROTECTED]

you could try to create a more complex query and expand it into both 
languages using different analyzers. Would this solve your problem ?

Would that mean I would have to actually conduct two searches (one in 
English and one in French) then merge the results and display them to 
the user?
It sounds to me like a long way around, so writing an 
analyzer that has a language guesser might be a better solution in the 
long run?

This behaviour is implemented in StandardTokenizer, which is used by 
StandardAnalyzer. Look at the documentation of StandardTokenizer:

Many applications have specific tokenizer needs.  If this tokenizer 
does not suit your application, please consider copying this source code
directory to your project and maintaining your own grammar-based 
tokenizer.
Hmm I feel this is beyond my abilities at the moment, writing my own 
tokenizer, without more in-depth knowledge of everything else.
Perhaps I'll try taking the StandardTokenizer and expanding or changing it 
based on other tokenizers available in Lucene, such as WhitespaceTokenizer.

thanks
-pedja


Re: English and French documents together / analysis, indexing, searching

2005-01-20 Thread Bernhard Messer

Right now I am using StandardAnalyzer but the results are not what I'd 
hope for. Also since my understanding is that we should use the same 
analyzer for searching that was used for indexing,
even if I can manage to guess the language during indexing and apply 
the SnowBall analyzer, I wouldn't be able to use SnowBall for 
searching, because users want to search through both
English and French and I suppose I would not get the same results if 
used with StandardAnalyzer?
you could try to create a more complex query and expand it into both 
languages using different analyzers. Would this solve your problem ?


Another problem with StandardAnalyzer is that it breaks up some words 
that should not be broken (in our case document identifiers such as 
ABC-1234 etc) but that's a secondary issue...
This is a behaviour is implemented in StandardTokenizer used by 
StandardAnalyzer. Look at the documentation of StandardTokenizer:

Many applications have specific tokenizer needs.  If this tokenizer does
not suit your application, please consider copying this source code
directory to your project and maintaining your own grammar-based tokenizer.
Bernhard


Re: English and French documents together / analysis, indexing, searching

2005-01-20 Thread [EMAIL PROTECTED]
Right now I am using StandardAnalyzer but the results are not what I'd 
hope for. Also since my understanding is that we should use the same 
analyzer for searching that was used for indexing,
even if I can manage to guess the language during indexing and apply 
the SnowBall analyzer, I wouldn't be able to use SnowBall for searching, 
because users want to search through both
English and French and I suppose I would not get the same results if 
used with StandardAnalyzer?

Another problem with StandardAnalyzer is that it breaks up some words 
that should not be broken (in our case document identifiers such as 
ABC-1234 etc) but that's a secondary issue...

thanks
-pedja

Bernhard Messer said the following on 1/20/2005 1:05 PM:
I think the easiest way is to use Lucene's StandardAnalyzer. If you 
want to use the snowball stemmers, you have to add a language guesser 
to get the language for the particular document before creating the 
analyzer.

regards
Bernhard
[EMAIL PROTECTED] schrieb:
Greetings everyone
I wonder whether there is a solution for analyzing both English and French 
documents using the same analyzer.
The reason is that we have predominantly English documents, but 
there are some French ones, yet it all has to go into the same index
and be searchable from the same location during any particular 
search. Is there a way to analyze both types of documents with
the same analyzer (and which one)?

I've looked around and I see there's a SnowBall analyzer but you have 
to specify the language of analysis, and I do not know that
ahead of time during indexing nor do I know it most of the time 
during searching (users would like to search in both document types).

There's also the issue of letter accents in French words and 
searching for them (how are they indexed in the first place, even)?
Has anyone dealt with this before and how did you solve the problem?

thanks
-pedja



Re: English and French documents together / analysis, indexing, searching

2005-01-20 Thread Bernhard Messer
I think the easiest way is to use Lucene's StandardAnalyzer. If you 
want to use the snowball stemmers, you have to add a language guesser to 
get the language for the particular document before creating the analyzer.

regards
Bernhard
[EMAIL PROTECTED] schrieb:
Greetings everyone
I wonder whether there is a solution for analyzing both English and French 
documents using the same analyzer.
The reason is that we have predominantly English documents, but there 
are some French ones, yet it all has to go into the same index
and be searchable from the same location during any particular search. 
Is there a way to analyze both types of documents with
the same analyzer (and which one)?

I've looked around and I see there's a SnowBall analyzer but you have 
to specify the language of analysis, and I do not know that
ahead of time during indexing nor do I know it most of the time during 
searching (users would like to search in both document types).

There's also the issue of letter accents in French words and searching 
for them (how are they indexed in the first place, even)?
Has anyone dealt with this before and how did you solve the problem?

thanks
-pedja



English and French documents together / analysis, indexing, searching

2005-01-20 Thread [EMAIL PROTECTED]
Greetings everyone
I wonder whether there is a solution for analyzing both English and French 
documents using the same analyzer.
The reason is that we have predominantly English documents, but there 
are some French ones, yet it all has to go into the same index
and be searchable from the same location during any particular search. 
Is there a way to analyze both types of documents with
the same analyzer (and which one)?

I've looked around and I see there's a SnowBall analyzer but you have to 
specify the language of analysis, and I do not know that
ahead of time during indexing nor do I know it most of the time during 
searching (users would like to search in both document types).

There's also the issue of letter accents in French words and searching 
for them (how are they indexed in the first place, even)?
Has anyone dealt with this before and how did you solve the problem?

thanks
-pedja
