Re: WordNet code updated, now with query expansion -- Re: SYNONYM + GOOGLE
Hi,

David Spencer wrote:

>> Do you plan to add expansion on other WordNet relationships? Hypernyms
>> and hyponyms would be a good starting point for thesaurus-like search,
>> wouldn't it?
> Good point, I hadn't considered this - but how would it work? Just
> consider these 2 relationships synonyms (thus easier to use) or make it
> separate (too academic?)

Well... the ideal case would be (easy) customization :-), from an external text (XML?) file. Depending on the kind of relationship, the boost factor could be adjusted when the query is expanded. The same goes for the relationships' depths. For example, a father hypernym could have a boost factor of 0.8, a grandfather a boost factor of 0.4, a great-grandfather a boost factor of 0.2. Well, I wonder whether a logarithmic scale makes better sense than a linear scale, but this should/would be customizable...

However, I'm afraid that this kind of feature would require refactoring, probably based on WordNet-dedicated libraries. JWNL (http://jwordnet.sourceforge.net/) may be a good candidate for this.

> Good point, should leverage existing code.

One thing you can also easily get from this library is WordNet's exceptions, often irregular plurals (mouse/mice, addendum/addenda...). A very basic yet efficient kind of stemming, which should be expanded with the same boost factor as the original term.

Well, there are many other relationships in WordNet. Take a look at:
http://jws-champo.ac-toulouse.fr:8080/treebolic-wordnet/
The legends are here:
http://treebolic.sourceforge.net/en/browserwn.htm

Cheers,

--
Pierrick Brihaye, informaticien
Service régional de l'Inventaire
DRAC Bretagne
mailto:[EMAIL PROTECTED] +33 (0)2 99 29 67 78

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
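The depth-based boost idea above can be sketched in plain Java. This is only an illustration of the scales under discussion, not JWNL or Lucene API; the class and method names are made up:

```java
// Sketch of depth-dependent boosts for expanded WordNet terms.
// The halving scale matches the example in the mail: father = 0.8,
// grandfather = 0.4, great-grandfather = 0.2.
public class ExpansionBoost {

    // Halving scale: 0.8, 0.4, 0.2, ... (depth starts at 1)
    public static double halvingBoost(int depth) {
        return 0.8 / (1 << (depth - 1));
    }

    // A linear alternative: 0.8, 0.6, 0.4, ... floored at 0.
    public static double linearBoost(int depth) {
        return Math.max(0.0, 0.8 - 0.2 * (depth - 1));
    }

    public static void main(String[] args) {
        for (int d = 1; d <= 3; d++) {
            System.out.println("depth " + d
                    + ": halving=" + halvingBoost(d)
                    + " linear=" + linearBoost(d));
        }
    }
}
```

Making the per-relationship scale and the base factor configurable from an external file, as suggested above, would then only be a matter of reading these two numbers per relationship type.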
Re: (Offtopic) The unicode name for a character
Hi,

Morus Walter wrote:

> If you cannot find that list somewhere I can mail you a copy.

ICU4J's one is here:
http://oss.software.ibm.com/cvs/icu4j/icu4j/src/com/ibm/icu/dev/data/unicode/UnicodeData.txt?rev=1.7&content-type=text/x-cvsweb-markup

See also Unicode's one:
http://www.unicode.org/Public/UNIDATA/UnicodeData.txt

http://pistos.pe.kr/javadocs/etc/icu4j2_4/doc/com/ibm/icu/lang/UCharacter.html#getName(int) should also help you.

However, I don't think that the names are consistent enough to permit a generic use of regular expressions. What Daniel is trying to achieve looks interesting anyway.

Good luck,

--
Pierrick Brihaye, informaticien
Service régional de l'Inventaire
DRAC Bretagne
mailto:[EMAIL PROTECTED] +33 (0)2 99 29 67 78
Re: Aramorph Analyzer
Hi,

Sorry, I (the aramorph maintainer ;-) was absent from the office...

Daniel Naber wrote:

> Analyzers that provide ambiguous terms (i.e. a token with more than one
> term at the same position) don't work in Lucene 1.4.

That is the correct answer. I've filed a bug about this:
http://issues.apache.org/bugzilla/show_bug.cgi?id=23307

> This feature has only recently been added to CVS.

... and I thank you very much for this commit. Notice however that you may experience some problems with the query parser, because Buckwalter's Arabic transliteration uses the standard * wildcard character as a representation for dhal.

Notice also that aramorph has a mailing list for such questions:
http://lists.nongnu.org/mailman/listinfo/aramorph-users

Cheers,

--
Pierrick Brihaye, informaticien
Service régional de l'Inventaire
DRAC Bretagne
mailto:[EMAIL PROTECTED] +33 (0)2 99 29 67 78
Re: Arabic analyzer
Hi,

Scott Smith wrote:

> Is anyone aware of an open source (non-GPL, i.e. free for commercial
> use) Arabic analyzer for Lucene?

Unfortunately (for you), my Arabic analyzer for Java (http://savannah.nongnu.org/projects/aramorph) is GPL-ed.

> Does Arabic really require a stemmer as well? Some of the reading I've
> seen on the web would suggest that a stemmer is almost a necessity with
> Arabic to get anything useful, where it is not with other languages.

IMHO, stemming *is* a necessity in Arabic, since this language involves prefixing, suffixing and infixing, as well as a few, yet very frequent, written word aggregations.

Good luck,

--
Pierrick Brihaye
mailto:[EMAIL PROTECTED]
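To see why prefixing matters so much, here is a toy sketch of peeling agglutinated particles off a word. This is NOT the AraMorph algorithm; the particle list and the Buckwalter-like transliteration are purely illustrative:

```java
import java.util.ArrayList;
import java.util.List;

// Toy illustration of Arabic prefix stripping: very frequent particles
// are written attached to the following word, so an exact-match search
// for the bare word would miss the agglutinated forms.
public class ToyArabicPrefixStripper {

    // A few agglutinated particles: wa- ("and"), bi- ("with"), Al- ("the").
    private static final String[] PREFIXES = { "wa", "bi", "Al" };

    // Return every plausible stem obtained by peeling known prefixes,
    // keeping the surface form itself as the first candidate.
    public static List<String> candidateStems(String word) {
        List<String> stems = new ArrayList<>();
        stems.add(word);
        String rest = word;
        boolean stripped = true;
        while (stripped) {
            stripped = false;
            for (String p : PREFIXES) {
                // only strip if a plausible remainder (> 1 char) is left
                if (rest.startsWith(p) && rest.length() > p.length() + 1) {
                    rest = rest.substring(p.length());
                    stems.add(rest);
                    stripped = true;
                    break;
                }
            }
        }
        return stems;
    }

    public static void main(String[] args) {
        // "waAlkitAb" ~ "and-the-book": candidates include "kitAb"
        System.out.println(candidateStems("waAlkitAb"));
    }
}
```

A real analyzer additionally has to validate each candidate against a dictionary (as AraMorph does), since naive stripping over-generates; infixed patterns cannot be handled this way at all.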
Re: Using Russian analyzer in Luke
Hi Ivan,

> I tried Luke to search in my Lucene database and discovered that when I
> try to select the Russian analyzer it shows me the following error:
>
> java.lang.NoSuchMethodException: org.apache.lucene.analysis.ru.RussianAnalyzer.<init>()
>         at java.lang.Class.getConstructor0(Unknown Source)
>         at java.lang.Class.getConstructor(Unknown Source)
>         at luke.Luke.createQueryParser(Luke.java:809)

Remember that Luke can be launched in two ways:

1) As a standalone JAR containing both Luke and Lucene 1.3-final: lukeall.jar
2) As two separate JARs, one containing Luke and the other the pristine Lucene 1.3-final JAR (just signed, so that it can be used with Java Web Start). Remember to put both JARs on your classpath, e.g.:

java -classpath luke.jar;lucene.jar luke.Luke

It looks like you are using the second one, and that your lucene.jar does not contain the org.apache.lucene.analysis.ru.RussianAnalyzer class.

Cheers,

p.b.
Re: multiple tokens from a single input token
Hi,

MOYSE Gilles (Cetelem) wrote:

> I experienced the same problem, and I used the following solution
> (maybe not the best one, but it works, and not too slowly). The problem
> was to detect synonyms. I used a synonyms file, made up of lines like:
> a b c d e f

Mmmh... that's 1 for 1. The question was deliberately about 1-to-N tokenization. Anyway...

> I used a FIFO stack to solve that.

Yes: the token stack does the trick. My code was actually a token stack too, but... less beautiful (and more generic) than the code provided just a bit later :-)

> When the filter receives a token, it checks whether the stack is empty
> or not. If it is, then it returns the received token. If it is not
> empty, then it returns the popped value from the stack (i.e. the first
> one that was pushed; it's better to use a FIFO stack to keep a correct
> order). When you receive the 'null' token, indicating the end of the
> stream, you continue returning the popped values from your stack until
> it is empty. Then you return 'null'. That's it.

Please do notice that the stack is necessarily declared outside of the next() method, i.e. it is an instance variable. Maybe Peter Keegan missed this point?

Cheers,

--
Pierrick Brihaye, informaticien
Service régional de l'Inventaire
DRAC Bretagne
mailto:[EMAIL PROTECTED]
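The FIFO approach described above can be sketched without Lucene at all: below, a plain Iterator stands in for the upstream TokenStream and a Deque plays the role of the "token stack". The class and method names are mine, not Lucene's:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Iterator;
import java.util.Map;

// Lucene-free sketch of the FIFO "token stack" idea: when one input
// token maps to N output tokens, queue the extras and drain the queue
// before pulling the next input token.
public class SynonymExpander {

    private final Iterator<String> input;          // stands in for TokenStream.next()
    private final Map<String, String[]> synonyms;  // 1-to-N expansion table
    private final Deque<String> pending = new ArrayDeque<>(); // the FIFO,
    // declared as an instance variable, outside next() (the key point above)

    public SynonymExpander(Iterator<String> input, Map<String, String[]> synonyms) {
        this.input = input;
        this.synonyms = synonyms;
    }

    // Returns the next token, or null at end of stream (Lucene 1.x style).
    public String next() {
        if (!pending.isEmpty()) return pending.poll(); // drain the FIFO first
        if (!input.hasNext()) return null;             // end of stream
        String token = input.next();
        String[] extra = synonyms.get(token);
        if (extra != null) {
            for (String s : extra) pending.add(s);     // queue the expansions
        }
        return token;
    }
}
```

Feeding it the stream "a big cat" with big -> {large, huge} yields a, big, large, huge, cat, then null; a real Lucene filter would emit the expansions with setPositionIncrement(0) so they share the original token's position.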
Re: positional token info
Hi,

Erik Hatcher wrote:

> Is anyone doing anything interesting with Token.setPositionIncrement
> during analysis?

I think so :-) Well... my Arabic analyzer is based on this functionality. The basic idea is to have several tokens at the same position (i.e. setPositionIncrement(0)) which are different possible stems for the same word.

> But it's practically impossible to formulate a Query that can take
> advantage of this. A PhraseQuery, because Terms don't have positional
> info (only the transient tokens).

Correct! I've made a dirty patch for the QueryParser which is able to handle tokens with a positionIncrement equal to 0 or 1 (see bug #23307). It still needs some work, but it fits my needs :-)

> I certainly see the benefit of putting tokens into zero-increment
> positions, but are increments of 2 or more at all useful?

Who knows? It may be interesting to keep track of the *presence* of stop words, e.g. "[the] sky [is] blue", "[the] sky [is] [really] blue", "[the] sky [is] [that] [really] blue". The traditional reduction to "sky blue" is maybe over-simplistic for some cases... Well, just an idea.

Cheers,

--
Pierrick Brihaye, informaticien
Service régional de l'Inventaire
DRAC Bretagne
mailto:[EMAIL PROTECTED]
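Both ideas above (zero increments for stacked stems, increments of 2 for removed stop words) follow from one rule: a token's absolute position is the running sum of the increments. A tiny Lucene-free sketch (illustrative names, not Lucene API):

```java
import java.util.ArrayList;
import java.util.List;

// Illustration of Lucene-style position increments: absolute position
// is the running sum of per-token increments. Increment 0 stacks a
// token on the previous one; increment 2 leaves a gap, e.g. for a
// removed stop word such as "the" or "is".
public class PositionDemo {

    public static List<String> positions(String[] terms, int[] increments) {
        List<String> out = new ArrayList<>();
        int pos = 0;
        for (int i = 0; i < terms.length; i++) {
            pos += increments[i];
            out.add(terms[i] + "@" + pos);
        }
        return out;
    }

    public static void main(String[] args) {
        // "[the] sky [is] blue": stop words removed, gaps preserved
        System.out.println(positions(
                new String[]{"sky", "blue"}, new int[]{2, 2}));
        // two candidate stems stacked at the same position
        System.out.println(positions(
                new String[]{"surface", "stem"}, new int[]{1, 0}));
    }
}
```

With the gaps preserved, an exact PhraseQuery for "sky blue" would no longer match "[the] sky [is] blue", which is precisely the distinction the plain "sky blue" reduction throws away.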
Re: derive tokens from single token
Hi,

Hackl, Rene wrote:

> I tried to extend TokenFilter, but all I get is either "oobar" or
> "obar", depending on when 'return' is called. How could I add such
> extra tokens to the tokenStream? Any thoughts on this appreciated.

Adapted from my... Arabic analyzer:

import java.io.IOException;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

public class ChemicalOrSomethingFilter extends TokenFilter {

    private Token receivedToken = null;
    private StringBuffer receivedText = new StringBuffer();

    public ChemicalOrSomethingFilter(TokenStream input) {
        super(input);
    }

    private String getNextTruncation() {
        StringBuffer emittedText = new StringBuffer();
        // left-trim the token
        while (true) {
            if (receivedText.length() == 0) break;
            char c = receivedText.charAt(0);
            if (!Character.isWhitespace(c)) break;
            receivedText.deleteCharAt(0);
        }
        // keep the good stuff
        while (true) {
            if (receivedText.length() == 0) break;
            char c = receivedText.charAt(0);
            if (Character.isWhitespace(c)) break;
            emittedText.append(receivedText.charAt(0));
            receivedText.deleteCharAt(0);
        }
        // right-trim the token
        while (true) {
            if (receivedText.length() == 0) break;
            char c = receivedText.charAt(0);
            if (!Character.isWhitespace(c)) break;
            receivedText.deleteCharAt(0);
        }
        return emittedText.toString();
    }

    public final Token next() throws IOException {
        while (true) {
            String emittedText;
            int positionIncrement = 0;
            // new token?
            if (receivedText.length() == 0) {
                receivedToken = input.next();
                if (receivedToken == null) return null;
                receivedText.append(receivedToken.termText());
                positionIncrement = 1;
            }
            emittedText = getNextTruncation();
            // warning: all tokens are emitted with the *same* offsets
            if (emittedText.length() > 0) {
                Token emittedToken = new Token(emittedText,
                        receivedToken.startOffset(), receivedToken.endOffset());
                emittedToken.setPositionIncrement(positionIncrement);
                return emittedToken;
            }
        }
    }
}

Not tested at all: it is a quick copy of my WhitespaceFilter (that's why trimming is so important up there ;-), which is not the best-designed class. This should work for indexing.
For querying, it's another matter, especially if you want to use the QueryParser. Keep us informed.

Cheers,

--
Pierrick Brihaye, informaticien
Service régional de l'Inventaire
DRAC Bretagne
mailto:[EMAIL PROTECTED]
Re: derive tokens from single token
Hi,

MOYSE Gilles (Cetelem) wrote:

> Isn't this one more secure?
>
>     // new token?
>     if (receivedToken == null) return null;
>     if (receivedText.length() == 0) {
>         receivedToken = input.next();
>         receivedText.append(receivedToken.termText());
>         positionIncrement = 1;
>     }

I don't think so. The aim of this method is to substream the main stream :-), i.e. to output several tokens when just one is received (see the thread's subject). In other terms, we shall not consume a new token until the current token is itself entirely consumed, i.e. receivedText.length() == 0. When the current token is consumed, we shall immediately return null if we receive a null token (i.e. end of stream). That's why this statement is *inside* a successful test for current token consumption.

I must recognize that the use of a string buffer is maybe not the best way to do it. I must also recognize that I have to be *very* confident in the getNextTruncation() method :-)

Well, my code snippet was meant to demonstrate:
1) how a substream can be handled (remember: "I tried to extend TokenFilter, but all I get is either 'oobar' or 'obar', depending on when 'return' is called");
2) how these tokens will be emitted at the same position, thus permitting efficient queries.

Cheers,

--
Pierrick Brihaye, informaticien
Service régional de l'Inventaire
DRAC Bretagne
mailto:[EMAIL PROTECTED]
Re: Announce: Arabic Stemmer/Analyzer for Lucene
Hi,

> We could put this in the Lucene sandbox CVS perhaps.

Why not?

> Could you package it similarly to the other contributions there, with a
> build file

Yes... but you'll have to wait :-)

> and convert your command-line tests to JUnit tests that run from the
> build file?

And also on this point. The 2 CLI programs are demonstration programs rather than real test cases that could demonstrate the current pending issues.

> I took a quick look and it looks like you did a fair bit of work and
> have the ASL in the source files.

Yes... at least in the source files that are based on my own work.

> The question, though, is whether your basing it on GPL code is
> acceptable. Did you copy code from it?

As I said, it is based on Tim Buckwalter's work: the original Perl program as well as those precious dictionary files.

> We can have no GPL code in Apache's CVS.

:-/ What can we do, then? Shall I split the packages into two parts? No problem for the Lucene bindings. But there could be one for the aramorph package (the Java port of the original work), which is based on work originally governed by the GPL...

Cheers,

p.b.
Re: Announce: Arabic Stemmer/Analyzer for Lucene
Hi,

> Is it possible to contact Tim,

I did it soon after I posted the announcement.

> and ask if he will allow you to license his code under an Apache-style
> license? Many authors are accommodating about licensing software under
> different licenses.

It's true but...

> I have personal worries about including GPL code in any commercial
> application (even dynamically linked).

... so do I :-)

Thanks for the advice (more to come on Monday, I presume). I think it will help me make my decision.

Cheers,

p.b.