Re: WordNet code updated, now with query expansion -- Re: SYNONYM + GOOGLE

2005-01-25 Thread Pierrick Brihaye
Hi,
David Spencer wrote:
Do you plan to add expansion on other WordNet relationships? 
Hypernyms and hyponyms would be a good starting point for thesaurus-like 
search, wouldn't they?
Good point, I hadn't considered this. But how would it work? Just 
treat these two relationships as synonyms (thus easier to use), or keep 
them separate (too academic)?
Well... the ideal case would be (easy) customization :-), from an 
external text (XML?) file. Depending on the kind of relationship, the 
boost factor could be adjusted when the query is expanded. The same goes 
for relationship depth.

For example, a father hypernym could have a boost factor of 0.8, a 
grandfather 0.4, and a great-grandfather 0.2. Well, I wonder whether a 
logarithmic scale makes better sense than a linear one, but this 
should/would be customizable...
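For what it's worth, the 0.8/0.4/0.2 series above is just a halving at each level. A hypothetical sketch of both scales (the class and constants are mine, not from any released WordNet code):

```java
// Hypothetical sketch: depth-based boosts for expanded hypernyms.
// The constants mirror the example above (0.8, then halved per level);
// nothing here comes from the actual Lucene WordNet contribution.
public class DepthBoost {

    // Exponential decay: 0.8 at depth 1, 0.4 at depth 2, 0.2 at depth 3...
    static float boost(int depth) {
        return (float) (0.8 * Math.pow(0.5, depth - 1));
    }

    // A linear alternative, for comparison: subtract a fixed step per level.
    static float linearBoost(int depth) {
        return Math.max(0f, 1f - 0.3f * depth);
    }

    public static void main(String[] args) {
        for (int depth = 1; depth <= 3; depth++) {
            System.out.printf("depth %d: exponential=%.2f linear=%.2f%n",
                    depth, boost(depth), linearBoost(depth));
        }
    }
}
```

The exponential version decays faster at shallow depths, which matches the intuition that a grandfather concept is already much less relevant than a father one.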

However, I'm afraid that this kind of feature would require 
refactoring, probably based on WordNet-dedicated libraries. JWNL 
(http://jwordnet.sourceforge.net/) may be a good candidate for this.
Good point, should leverage existing code.
One thing you can also easily get from this library is WordNet's 
exceptions, often irregular plurals (mouse/mice, addendum/addenda...): 
a very basic yet efficient kind of stemming. These should be expanded 
with the same boost factor as the original term.
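A minimal sketch of that idea (the map and the query syntax are illustrative; a real version would read WordNet's exception files through JWNL):

```java
import java.util.Map;

// Hypothetical sketch: expand a query term with its irregular base form
// at the same boost, i.e. with no boost suffix at all in Lucene query syntax.
public class ExceptionExpand {

    // A tiny stand-in for WordNet's exception lists.
    static final Map<String, String> EXCEPTIONS =
            Map.of("mice", "mouse", "addenda", "addendum");

    static String expand(String term) {
        String base = EXCEPTIONS.get(term);
        return base == null ? term : term + " OR " + base;
    }

    public static void main(String[] args) {
        System.out.println(expand("mice"));  // mice OR mouse
        System.out.println(expand("sky"));   // sky
    }
}
```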

Well, there are many other relationships in WordNet. Take a look at :
http://jws-champo.ac-toulouse.fr:8080/treebolic-wordnet/
The legend is here:
http://treebolic.sourceforge.net/en/browserwn.htm
Cheers,
--
Pierrick Brihaye, informaticien
Service régional de l'Inventaire
DRAC Bretagne
mailto:[EMAIL PROTECTED]
+33 (0)2 99 29 67 78
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: (Offtopic) The unicode name for a character

2004-12-22 Thread Pierrick Brihaye
Hi,
Morus Walter a écrit :
If you cannot find that list somewhere I can mail you a copy.
ICU4J's copy is here:
http://oss.software.ibm.com/cvs/icu4j/icu4j/src/com/ibm/icu/dev/data/unicode/UnicodeData.txt?rev=1.7&content-type=text/x-cvsweb-markup
See also Unicode's own:
http://www.unicode.org/Public/UNIDATA/UnicodeData.txt
http://pistos.pe.kr/javadocs/etc/icu4j2_4/doc/com/ibm/icu/lang/UCharacter.html#getName(int) 
should also help you.

However, I don't think the names are consistent enough to permit a 
generic use of regular expressions. What Daniel is trying to achieve 
looks interesting anyway.

Good luck,
--
Pierrick Brihaye, informaticien
Service régional de l'Inventaire
DRAC Bretagne
mailto:[EMAIL PROTECTED]
+33 (0)2 99 29 67 78


Re: Aramorph Analyzer

2004-12-20 Thread Pierrick Brihaye
Hi,
Sorry, I (the aramorph maintainer ;-) was absent from the office...
Daniel Naber wrote:
Analyzers that provide ambiguous terms (i.e. a token with more than one term 
at the same position) don't work in Lucene 1.4.
That is the correct answer. I've filed a bug about this: 
http://issues.apache.org/bugzilla/show_bug.cgi?id=23307

This feature has only 
recently been added to CVS.
... and I thank you very much for this commit.
Notice, however, that you may experience some problems with the query 
parser, because Buckwalter's Arabic transliteration uses the standard * 
wildcard character as a representation of dhal.

Notice also that aramorph has a mailing list for such questions:
http://lists.nongnu.org/mailman/listinfo/aramorph-users
Cheers,
--
Pierrick Brihaye, informaticien
Service régional de l'Inventaire
DRAC Bretagne
mailto:[EMAIL PROTECTED]
+33 (0)2 99 29 67 78


Re: Arabic analyzer

2004-10-19 Thread Pierrick Brihaye
Hi,
Scott Smith wrote:
Is anyone aware of an open source (non-GPL, i.e. free for commercial
use) Arabic analyzer for Lucene?
Unfortunately (for you), my Arabic Analyzer for Java 
(http://savannah.nongnu.org/projects/aramorph) is GPL-ed.

Does Arabic really require a stemmer as well? (Some of the reading I've 
seen on the web suggests that a stemmer is almost a necessity with 
Arabic to get anything useful, where it is not with other languages.)
IMHO, stemming *is* a necessity in Arabic, since the language involves 
prefixing, suffixing and infixing, as well as a few but very frequent 
written word agglutinations.

Good luck,
--
Pierrick Brihaye
mailto:[EMAIL PROTECTED]


Re: Using Russian analyzer in Luke

2004-01-24 Thread Pierrick Brihaye
Hi Ivan,

   I tried Luke to search in my Lucene database and discovered
   that when I try to select the Russian analyzer it shows me this error:
   --
 java.lang.NoSuchMethodException:
org.apache.lucene.analysis.ru.RussianAnalyzer.<init>()
 at java.lang.Class.getConstructor0(Unknown Source)
 at java.lang.Class.getConstructor(Unknown Source)
 at luke.Luke.createQueryParser(Luke.java:809)

Remember that Luke can be launched in two ways:

1)
A standalone JAR, containing Luke and Lucene 1.3-final: lukeall.jar

2)
As two separate JARs, one containing Luke and the other pristine Lucene
1.3-final JAR (just signed, so that it can be used with Java WS)
...
Remember to put both JARs on your classpath, e.g.: java -classpath
luke.jar;lucene.jar luke.Luke

It looks like you are using the second one and that your lucene.jar does
not contain the org.apache.lucene.analysis.ru.RussianAnalyzer class.

Cheers,

p.b.






Re: multiple tokens from a single input token

2003-11-10 Thread Pierrick Brihaye
Hi,

MOYSE Gilles (Cetelem) wrote:

I experienced the same problem, and I used the following solution (maybe not
the best one, but it works, and not too slowly).
The problem was to detect synonyms. I used a synonyms file, made up of lines
like:
a b c
d e f
Mmmh... 1 for 1. The question was deliberately about 1-to-N tokenization. 
Anyway...

I used a FIFO stack to solve that.
Yes: the token stack does the trick. My code was actually a token 
stack too, but... less beautiful (and more generic) than the code provided 
just a bit later :-)

When the filter receives a token, it checks whether the stack is empty or
not. If it is, then it returns the received token. If it is not empty, then
it returns the popped value from the stack (i.e. the first that was pushed;
it's better to use a FIFO stack to keep a correct order).
When you receive the 'null' token, indicating the end of the stream, you
continue returning popped values from your stack until it is empty. Then
you return 'null'.
That's it.
That's it.

Please do notice that the stack is necessarily declared outside of the 
next() method, i.e. it is an instance variable. Maybe Peter 
Keegan missed this point?
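The FIFO pattern described above, sketched with plain strings standing in for Lucene Tokens so that it runs on its own (the class name and synonym table are mine):

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Sketch of the queue-based filter logic: pop pending synonyms first,
// otherwise pull from the underlying stream; null marks end of stream.
public class SynonymExpander {

    private final Iterator<String> input;
    private final Map<String, List<String>> synonyms;
    // Declared as an instance field, *outside* next() -- that's the point.
    private final Deque<String> queue = new ArrayDeque<String>();

    SynonymExpander(Iterator<String> input, Map<String, List<String>> synonyms) {
        this.input = input;
        this.synonyms = synonyms;
    }

    String next() {
        if (!queue.isEmpty()) return queue.poll();  // drain pending synonyms
        if (!input.hasNext()) return null;          // end of stream
        String token = input.next();
        List<String> syns = synonyms.get(token);
        if (syns != null) queue.addAll(syns);       // push for later emission
        return token;
    }

    public static void main(String[] args) {
        SynonymExpander f = new SynonymExpander(
                List.of("big", "cat").iterator(),
                Map.of("big", List.of("large", "huge")));
        for (String t = f.next(); t != null; t = f.next()) {
            System.out.println(t);  // big, large, huge, cat
        }
    }
}
```

In a real Lucene TokenFilter the queue would hold Token objects and the synonyms would be emitted with setPositionIncrement(0), but the consume/drain logic is the same.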

Cheers,

--
Pierrick Brihaye, informaticien
Service régional de l'Inventaire
DRAC Bretagne
mailto:[EMAIL PROTECTED]


Re: positional token info

2003-10-21 Thread Pierrick Brihaye
Hi,

Erik Hatcher wrote:

Is anyone doing anything interesting with the Token.setPositionIncrement 
during analysis?
I think so :-) Well... my Arabic analyzer is based on this functionality.

The basic idea is to have several tokens at the same position (i.e. 
setPositionIncrement(0)) which are different possible stems for the same 
word.

But it's practically impossible to formulate a Query that can take 
advantage of this. A PhraseQuery, for instance, because Terms don't 
have positional info (only the transient tokens do).
Correct !

I've made a dirty patch for the QueryParser which is able to handle 
tokens with positionIncrement equal to 0 or 1 (see bug #23307). It still 
needs some work, but it fits my needs :-)

I certainly see the benefit of putting tokens into zero-increment 
positions, but are increments of 2 or more at all useful?
Who knows? It may be interesting to keep track of the *presence* of 
stopwords, e.g. "[the] sky [is] blue", "[the] sky [is] [really] 
blue", "[the] sky [is] [that] [really] blue". The traditional reduction 
to "sky blue" is maybe over-simplistic in some cases...

Well, just an idea.
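To make it concrete, here is a small sketch (mine, not from any Lucene code) of how increments translate into absolute positions: an increment of 0 stacks a term on the previous position, while an increment of 2 leaves a hole where a stopword was dropped.

```java
// Sketch: mapping position increments to absolute positions for
// "the sky is blue" with "the"/"is" dropped and "skies" stacked on "sky".
public class PositionDemo {

    static int[] positions(int[] increments) {
        int[] out = new int[increments.length];
        int pos = -1;  // Lucene starts one before the first position
        for (int i = 0; i < increments.length; i++) {
            pos += increments[i];
            out[i] = pos;
        }
        return out;
    }

    public static void main(String[] args) {
        String[] terms = { "sky", "skies", "blue" };
        int[] incs = { 2, 0, 2 };  // gaps where "the" and "is" used to be
        int[] pos = positions(incs);
        for (int i = 0; i < terms.length; i++) {
            System.out.println(terms[i] + " -> position " + pos[i]);
        }
        // sky and skies share position 1; blue sits at 3, not 2
    }
}
```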

Cheers,

--
Pierrick Brihaye, informaticien
Service régional de l'Inventaire
DRAC Bretagne
mailto:[EMAIL PROTECTED]


Re: derive tokens from single token

2003-09-29 Thread Pierrick Brihaye
Hi,

Hackl, Rene wrote:

I tried to extend TokenFilter, but 
all I get is either oobar or obar, depending on when 'return' is called. 

How could I add such extra tokens to the TokenStream? Any thoughts on this
appreciated.
Adapted from my... Arabic analyzer:

import java.io.IOException;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

public class ChemicalOrSomethingFilter extends TokenFilter {

  private Token receivedToken = null;
  private StringBuffer receivedText = new StringBuffer();

  public ChemicalOrSomethingFilter(TokenStream input) {
    super(input);
  }

  //Returns the next whitespace-delimited chunk of the received token,
  //consuming it from the buffer.
  private String getNextTruncation() {
    StringBuffer emittedText = new StringBuffer();
    //left trim the token
    while (receivedText.length() > 0
        && Character.isWhitespace(receivedText.charAt(0))) {
      receivedText.deleteCharAt(0);
    }
    //keep the good stuff
    while (receivedText.length() > 0
        && !Character.isWhitespace(receivedText.charAt(0))) {
      emittedText.append(receivedText.charAt(0));
      receivedText.deleteCharAt(0);
    }
    //right trim the token
    while (receivedText.length() > 0
        && Character.isWhitespace(receivedText.charAt(0))) {
      receivedText.deleteCharAt(0);
    }
    return emittedText.toString();
  }

  public final Token next() throws IOException {
    while (true) {
      String emittedText;
      int positionIncrement = 0;
      //New token ?
      if (receivedText.length() == 0) {
        receivedToken = input.next();
        if (receivedToken == null) return null;
        receivedText.append(receivedToken.termText());
        positionIncrement = 1;
      }
      emittedText = getNextTruncation();
      //Warning : all tokens are emitted with the *same* offsets
      if (emittedText.length() > 0) {
        Token emittedToken = new Token(emittedText,
            receivedToken.startOffset(), receivedToken.endOffset());
        emittedToken.setPositionIncrement(positionIncrement);
        return emittedToken;
      }
    }
  }
}

Not tested at all: it is a quick copy of my WhiteSpaceFilter (that's 
why trimming is so important up there ;-), which is not the best-designed 
class.

This should work for indexing. Querying is another matter, 
especially if you want to use the QueryParser.

Keep us informed.

Cheers,

--
Pierrick Brihaye, informaticien
Service régional de l'Inventaire
DRAC Bretagne
mailto:[EMAIL PROTECTED]


Re: derive tokens from single token

2003-09-29 Thread Pierrick Brihaye
Hi,

MOYSE Gilles (Cetelem) wrote:

isn't this one more secure?

   //New token ?
   if (receivedToken == null) return null;
   if (receivedText.length() == 0) {
     receivedToken = input.next();
     receivedText.append(receivedToken.termText());
     positionIncrement = 1;
   }
I don't think so. The aim of this method is to substream the main 
stream :-), i.e. to output several tokens when just one is received (see 
the thread's subject).

In other words, we must not consume a new token until the current token is 
itself entirely consumed, i.e. until receivedText.length() == 0.

When the current token is consumed, we shall immediately return null if 
we receive a null Token (i.e. end of stream). That's why this statement is 
*inside* a successful test for current-token consumption.

I must acknowledge that using a string buffer is maybe not the best 
way to do it. I must also acknowledge that I have to be *very* confident in 
the getNextTruncation() method :-)

Well, my code snippet was meant to demonstrate:

1) how a substream can be handled (remember: "I tried to extend 
TokenFilter, but all I get is either oobar or obar, depending on when 
'return' is called");

2) how these tokens will be emitted at the same position, thus permitting 
efficient queries.

Cheers,

--
Pierrick Brihaye, informaticien
Service régional de l'Inventaire
DRAC Bretagne
mailto:[EMAIL PROTECTED]


Re: Announce : arabic Stemmer/Analyzer for Lucene

2003-09-28 Thread Pierrick Brihaye
Hi,

 We could put this in the Lucene sandbox CVS perhaps.

Why not ?

 Could you package
 it similarly to the other contributions there with a build file

Yes... but you'll have to wait :-)

 and
 convert your command-line tests to JUnit tests that run from the build
 file?

And also on this point. The two CLI programs are demonstration
programs rather than real test cases that could exercise the current pending
issues.

 I took a quick look and looks like you did a fair bit of work and have
 the ASL in the source files.

Yes... at least in the source files that are based on my own work.

 The question, though, is whether your
 basing it on GPL code is acceptable.  Did you copy code from it?

As I said, it is based on Tim Buckwalter's work: the original Perl program as
well as those precious dictionary files.

 We can have no GPL code in Apache's CVS.

:-/ What can we do, then? Shall I split the packages in two parts? No
problem for the Lucene bindings, but there could be one for the aramorph
package (a Java port of the original work), which is based on work originally
released under the GPL...

Cheers,

p.b.







Re: Announce : arabic Stemmer/Analyzer for Lucene

2003-09-28 Thread Pierrick Brihaye
Hi,

 Is it possible to contact Tim,

I did, soon after I posted the announcement.

 and ask if he will allow you to license
 his code under an Apache style license?  Many authors are accomodating
 with licensing software under different licenses.

It's true but...

 I have personal worries about including GPL code in any commercial
 application (even dynamically linked).

... so do I :-)

Thanks for the advice (more to come on Monday, I presume). I think it will
help me make my decision.

Cheers,

p.b.


