RE: Lock obtain timed out

2003-12-16 Thread MOYSE Gilles (Cetelem)
Hi.

I obtained this exception when I had more than one thread trying to create
an IndexWriter.
I solved it by placing the code using the IndexWriter in a synchronized
method.
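
For illustration, a minimal sketch of that approach (the class and method names are illustrative, not my actual code):

    // Serialize all index writes through one synchronized method so that two
    // threads never try to open an IndexWriter on the same index at once.
    import java.io.IOException;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexWriter;

    public class SynchronizedIndexer {
        private String indexPath;

        public SynchronizedIndexer(String indexPath) {
            this.indexPath = indexPath;
        }

        // Only one thread at a time can enter this method, so the
        // "Lock obtain timed out" contention between threads goes away.
        public synchronized void addDocument(Document doc) throws IOException {
            // false = open an existing index; pass true to create a new one
            IndexWriter writer = new IndexWriter(indexPath, new StandardAnalyzer(), false);
            try {
                writer.addDocument(doc);
            } finally {
                writer.close();
            }
        }
    }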

Hope it will help,

Gilles.

-Original Message-
From: Hohwiller, Joerg [mailto:[EMAIL PROTECTED]
Sent: Tuesday, December 16, 2003 11:37
To: [EMAIL PROTECTED]
Subject: Lock obtain timed out


Hi there,

I have not yet got any response about my problem.

While debugging into the depths of Lucene (really hard to read that deep inside) I
discovered that it is possible to disable the locks using a system property.

When I start my application with -DdisableLuceneLocks=true, 
I do not get the error anymore.

I just wonder if this is legal and won't cause other trouble???
As far as I could understand the source, proper thread
synchronization is done using locks on Java objects, and
the index-store locks seem to be required only if multiple
Lucene instances (in different VMs) work on the same index.
In my situation there is only one Java-VM running and only one
lucene is working on one index. 

Am I safe disabling the locking???
Can anybody tell me where to get documentation about the Locking
strategy (I still would like to know why I have that problem) ???

Or does anybody know where to get an official example of how to
handle concurrent index modification and searches?

Thank you so much
  Jörg



RE: Tokenizing text custom way

2003-11-26 Thread MOYSE Gilles (Cetelem)
Do you want to define expressions, i.e. a set of terms that must be
interpreted as a whole?
For instance, when the Analyzer catches "time" followed by "out", it returns
"time_out"?


-Original Message-
From: Dragan Jotanovic [mailto:[EMAIL PROTECTED]
Sent: Wednesday, November 26, 2003 12:12
To: Lucene Users List
Subject: Re: Tokenizing text custom way


 You will need to write a custom analyzer.  Don't worry, though it's
 quite straightforward.  You will also need to write a Tokenizer, but
 Lucene helps you a lot here.

Wouldn't I achieve the same result if I index "time out" as "time_out"
using StandardAnalyzer, and later if I search for "time out" (inside quotes)
I should get a proper result, but if I search for "time" I shouldn't get a
result. Is this right?






RE: Tokenizing text custom way

2003-11-25 Thread MOYSE Gilles (Cetelem)
Hi.

You should define expressions.
To define expressions, you first have to define an expression file.
An expression file contains one expression per line.
For instance :
time_out
expert_system
...
You can use any character to specify the expression link. Here, I use the
underscore (_).

Then, you have to build an expression loader. You can store expressions in
recursive HashMaps.
Such a HashMap must be built so that HashMap.get("word1") = HashMap, and
(HashMap.get("word1")).get("word2") = null, if you want to code the
expression "word1_word2".
In other words, 'HashMap.get(a_word)' returns a HashMap containing all the
successors of the word 'a_word'.

So, if your expression file looks like this:
time_out
expert_system
expert_in_information

you'll have to build a loader which returns a HashMap H so that:
H.keySet() = {"time", "expert"}
((HashMap)H.get("time")).keySet() = {"out"}
((HashMap)H.get("time")).get("out") = null // null indicates the end of the expression
((HashMap)H.get("expert")).keySet() = {"system", "in"}
((HashMap)H.get("expert")).get("system") = null
((HashMap)((HashMap)H.get("expert")).get("in")).keySet() = {"information"}
((HashMap)((HashMap)H.get("expert")).get("in")).get("information") = null

These recursive HashMaps encode the following tree:

    time ---- out ---- null
    expert -- system - null
           |- in ----- information -- null

Such an expression loader may be designed this way :

public static HashMap getExpressionMap(File wordfile) {
    HashMap result = new HashMap();

    try {
        String line = null;
        LineNumberReader in = new LineNumberReader(new FileReader(wordfile));
        HashMap hashToAdd = null;

        while ((line = in.readLine()) != null) {
            if (line.startsWith(FILE_COMMENT_CHARACTER))
                continue;

            if (line.trim().length() == 0)
                continue;

            StringTokenizer stok = new StringTokenizer(line, " \t_");
            String curTok = "";
            HashMap currentHash = result;

            // Test whether the expression contains at least 2 words or not
            if (stok.countTokens() < 2) {
                System.err.println("Warning : '" + line + "' in file '"
                        + wordfile.getAbsolutePath() + "' line " + in.getLineNumber()
                        + " is not an expression.\n\tA valid expression contains at least 2 words.");
                continue;
            }

            while (stok.hasMoreTokens()) {
                curTok = stok.nextToken();
                // if comment at the end of the line, break
                if (curTok.startsWith(FILE_COMMENT_CHARACTER))
                    break;
                if (stok.hasMoreTokens())
                    hashToAdd = new HashMap(6);
                else
                    hashToAdd = (HashMap) null; // null marks the end of the expression

                if (!(currentHash.containsKey(curTok)))
                    currentHash.put(curTok, hashToAdd);

                currentHash = (HashMap) currentHash.get(curTok);
            }
        }
        in.close();
        return result;
    }
    // On error, use an empty table
    catch (Exception e) {
        System.err.println("While processing '" + wordfile.getAbsolutePath()
                + "' : " + e.getMessage());
        e.printStackTrace();
        return new HashMap();
    }
}


Then, you must build a filter with 2 FIFO stacks: one is the expression
stack, the other is the default stack.
Then, you define a 'curMap' variable, initially pointing to the HashMap
returned by the expression file loader.

When you receive a token, you check whether it is null or not.
If it is, you check whether the standard stack is empty or not.
        If it is not, you pop a token from the default stack and you return it.
        If it is, you return null.
If it is not (the token is not null), you check whether it

RE: Document ID's and duplicates

2003-11-19 Thread MOYSE Gilles (Cetelem)
Hi.

You just have to add a field to your document object before adding it to the
index.
The field should be of keyword type. You can use code of this kind:

	IndexWriter writer = new IndexWriter(path_to_your_index, your_analyzer_object, create_flag); // create_flag: true to create the index, false to open an existing one

	Document doc = new Document();
	doc.add(Field.Keyword("id", (String) pkey)); // add an "id" field containing the pkey value (received from the db, for instance)
	// you can add other fields here

	writer.addDocument(doc);
	writer.optimize();
	writer.close();

Gilles.
-Original Message-
From: jt oob [mailto:[EMAIL PROTECTED]
Sent: Wednesday, November 19, 2003 15:43
To: Lucene-Users-List
Subject: Document ID's and duplicates


Hi folks,

I've got a feeling the answer to this has either been posted on here
recently, or is on the site somewhere - but I can't find it. Apologies
if I'm going over old ground.

What is the best way to force documents to be indexed only once?

Is it a case of having a field with a unique value for the document and
searching the index for that field before adding?

If that is the way to do it, would it be a good idea to add an
additional field type which would take care of this behind the scenes?
Many people move to lucene after discovering the downfalls of text
searching in Databases (like me), and would love a primary key type
field.

Regards,
jt





RE: Document ID's and duplicates

2003-11-19 Thread MOYSE Gilles (Cetelem)
If you want to replace a possibly existing document, you've got to:

- check whether it exists or not:
	Let's assume your Lucene primary-key field is called "id".
	First, open an IndexReader:
		IndexReader ir = IndexReader.open(your_index_path);

	Then, check the TermDocs enumeration associated with the value of the primary
	key you're looking for (let's assume it is called pkey):
		TermDocs terms = ir.termDocs(new org.apache.lucene.index.Term(LUCENE_FIELD_ID, pkey));

	If your keys are really primary, the enumeration will be empty or contain one element.
	So:
		if (terms.next()) // terms is not empty => the pkey has been found => the document already exists => we'll have to delete it before adding it again
			ir.delete(terms.doc()); // delete the current document in the TermDocs enumeration

	// Here, you perform the normal addition

Gilles


-Original Message-
From: Don Kaiser [mailto:[EMAIL PROTECTED]
Sent: Wednesday, November 19, 2003 18:14
To: Lucene Users List
Subject: RE: Document ID's and duplicates


If you do this will the old version of the document be replaced by the new
one?

-don

 -Original Message-
 From: MOYSE Gilles (Cetelem) [mailto:[EMAIL PROTECTED]
 Sent: Wednesday, November 19, 2003 6:57 AM
 To: 'Lucene Users List'
 Subject: RE: Document ID's and duplicates
 
 
 Hi.
 
 You just have to add a field in your document object before 
 adding it to the
 index.
 The field should be of keyword type. You can use a code of 
 that kind :
 
   IndexWriter writer = new IndexWriter(path_to_your_index, your_analyzer_object, create_flag); // create_flag: true to create the index, false to open an existing one
 
   Document doc = new Document();
   doc.add(Field.Keyword("id", (String) pkey)); // add an "id" field containing the pkey value (received from the db, for instance)
   // you can add other fields here
 
   writer.addDocument(doc);
   writer.optimize();
   writer.close();
 
 Gilles.
 -Original Message-
 From: jt oob [mailto:[EMAIL PROTECTED]
 Sent: Wednesday, November 19, 2003 15:43
 To: Lucene-Users-List
 Subject: Document ID's and duplicates
 
 
 Hi folks,
 
 I've got a feeling the answer to this has either been posted on here
 recently, or is on the site somewhere - but i can't find it. Apologies
 if i'm going over old ground.
 
 What is the best way force documents to be only indexed once?
 
 Is it a case of having a field with a unique value for the 
 document and
 searching the index for that field before adding?
 
 If that is the way to do it, would it be a good idea to add an
 additional field type which would take care of this behind the scenes?
 Many people move to lucene after discovering the downfalls of text
 searching in Databases (like me), and would love a primary key type
 field.
 
 Regards,
 jt
 
 
 



Boost in Query Parser

2003-11-12 Thread MOYSE Gilles (Cetelem)
Hello.

I've made a Filter which recognizes special words and returns them in a
boosted form, in the QueryParser sense.
For instance, when the filter receives "special_word", it returns
"special_word^3", so as to boost it.
The problem is that the QueryParser understands the boost syntax when the
string is given as an argument to the parse function, but does not get it
when it is generated by a filter in the Analyzer.
So, when my filter transforms "special_word" into "special_word^3", the
QueryParser does not create a Query object with "special_word" as the value to
look for and a boost of 3, but one with "special_word^3" to search for and a boost of 1.
Of course, it does not match anything.
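
For reference: the "^boost" syntax is only interpreted by QueryParser.parse();
whatever an Analyzer emits is treated as literal term text. One possible direction
(just a sketch against the Lucene 1.x API, with illustrative names, and not
necessarily the right fix) would be to leave the tokens alone and apply the boost
to the parsed Query objects afterwards:

    import java.util.Set;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;

    public class SpecialWordBooster {
        // Walk the parsed query and boost every TermQuery whose term text
        // belongs to the (illustrative) set of special words.
        public static Query boostSpecialWords(Query q, Set specialWords, float boost) {
            if (q instanceof TermQuery) {
                TermQuery tq = (TermQuery) q;
                if (specialWords.contains(tq.getTerm().text())) {
                    tq.setBoost(boost);
                }
            } else if (q instanceof BooleanQuery) {
                BooleanClause[] clauses = ((BooleanQuery) q).getClauses();
                for (int i = 0; i < clauses.length; i++) {
                    boostSpecialWords(clauses[i].query, specialWords, boost);
                }
            }
            return q;
        }
    }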

Does anyone know a solution to that problem? Do I have to write my own
QueryParser from scratch, or do I just have to change 2 or 3 lines of
the original QueryParser to make it work the way I'd like it to?

Thanks a lot.

Gilles Moyse.

-Original Message-
From: Erik Hatcher [mailto:[EMAIL PROTECTED]
Sent: Wednesday, November 12, 2003 15:16
To: Lucene Users List
Subject: Re: Can use Lucene be used for this


On Wednesday, November 12, 2003, at 07:34  AM, Hackl, Rene wrote:
 col2 like '%aa%'

 Lucene doesn't handle queries where the start of the term is not known
 very efficiently.

 Is it really able to handle them at all? I thought *foo-type queries 
 were
 not supported.

They are not supported by the QueryParser, but an API-created
WildcardQuery supports it.

I certainly do not recommend using prefix-style wildcard queries 
though, knowing what happens under the covers.

Erik




RE: multiple tokens from a single input token

2003-11-10 Thread MOYSE Gilles (Cetelem)
Hi.

I experienced the same problem, and I used the following solution (maybe not
the best one, but it works, and not too slowly).
The problem was to detect synonyms. I used a synonyms file, made up of lines
of this kind:
a b c
d e f

to define a, b and c as synonyms, and d, e and f as other synonyms.
So, when my filter received a token 'b' for instance, I wanted it to return
three tokens, 'a', 'b' and 'c'.
I used a FIFO stack to solve that.
When the filter receives a token, it checks whether the stack is empty or
not. If it is, then it returns the received token. If it is not empty, then
it returns the popped value from the stack (i.e. the first one that was pushed;
it's better to use a FIFO stack to keep a correct order).
When you receive the 'null' token, indicating the end of the stream, you
continue returning the popped values from your stack until it is empty. Then
you return 'null'.
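
A minimal sketch of such a filter, against the Lucene 1.x TokenStream API (the
synonym map, word -> String[] of its whole group, is assumed to be loaded
elsewhere; all names are illustrative, not my actual code):

    import java.io.IOException;
    import java.util.LinkedList;
    import java.util.Map;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;

    public class SynonymFilter extends TokenFilter {
        private LinkedList queue = new LinkedList(); // FIFO of pending synonym tokens
        private Map synonyms;                        // word -> String[] of its synonym group

        public SynonymFilter(TokenStream in, Map synonyms) {
            this.input = in;    // "input" is the protected field inherited from TokenFilter
            this.synonyms = synonyms;
        }

        public Token next() throws IOException {
            // Return pending synonyms first, in the order they were pushed (FIFO).
            if (!queue.isEmpty()) {
                return (Token) queue.removeFirst();
            }
            Token token = input.next();
            if (token == null) {
                return null;                         // stream (and queue) exhausted
            }
            String[] group = (String[]) synonyms.get(token.termText());
            if (group != null) {
                // Queue every other word of the group at the same position as the original.
                for (int i = 0; i < group.length; i++) {
                    if (group[i].equals(token.termText())) {
                        continue;                    // the original token is returned below
                    }
                    Token syn = new Token(group[i], token.startOffset(), token.endOffset());
                    syn.setPositionIncrement(0);
                    queue.addLast(syn);
                }
            }
            return token;
        }
    }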

Hope it will help.

Gilles Moyse

-Original Message-
From: Peter Keegan [mailto:[EMAIL PROTECTED]
Sent: Monday, November 10, 2003 15:43
To: Lucene user's list
Subject: re: multiple tokens from a single input token


I would appreciate some clarification on how to generate multiple tokens
from a single input token.

In a previous message (see
http://www.mail-archive.com/[EMAIL PROTECTED]/msg04875.html),
Pierrick Brihaye provides the following code:

public final Token next() throws IOException {
    while (true) {
        String emittedText;
        int positionIncrement = 0;
        // New token ?
        if (receivedText.length() == 0) {
            receivedToken = input.next();
            if (receivedToken == null) return null;
            receivedText.append(receivedToken.termText());
            positionIncrement = 1;
        }
        emittedText = getNextPart();
        // Warning : all tokens are emitted with the *same* offset
        if (emittedText.length() > 0) {
            Token emittedToken = new Token(emittedText, receivedToken.startOffset(),
                    receivedToken.endOffset());
            emittedToken.setPositionIncrement(positionIncrement);
            return emittedToken;
        }
    }
}
I assume that you would extend the TokenFilter class and override the 'next'
method. But what I don't understand is how you return more than one Token
(with different settings for 'setPositionIncrement') if the 'next' method is
only called once for each input token.

For example, when my custom filter's 'next()' method receives token 'A' from
'DocumentWriter.invertDocument()', it wants to return token 'A' and token
'B' at the same position. How is this done? It seems I can only return one
token at a time from 'next()'. I think I'm missing something obvious :-(

Thanks,
Peter


Compound expression extraction

2003-10-21 Thread MOYSE Gilles (Cetelem)
Hi.

I'm trying to extract expressions from the term position information, i.e.,
if two words appear frequently side-by-side, then we can consider that the
two words are only one. For instance, 'Object' and 'Oriented' appear
side-by-side 9 times out of 10. This allows us to define a new expression,
'Object_Oriented'.
Does anyone know the statistical method to detect such expressions?

Thanks.

Gilles Moyse

-Original Message-
From: Eric Jain [mailto:[EMAIL PROTECTED]
Sent: Tuesday, October 21, 2003 09:24
To: Lucene Users List
Subject: Re: Lucene on Windows


 The CVS version of Lucene has a patch that allows one to use a
 'Compound Index' instead of the traditional one.  This reduces the
 number of open files.  For more info, see/make the Javadocs for
 IndexWriter.

Interesting option. Do you have a rough idea of what the performance
impact of using this setting is?

--
Eric Jain




Expression Extractions

2003-10-21 Thread MOYSE Gilles (Cetelem)
I've found something about expression extraction (the ability, when one word
and another appear frequently side-by-side, to detect that they form an
expression): http://www.miv.t.u-tokyo.ac.jp/papers/matsuoFLAIRS03.pdf

Gilles Moyse


RE: Does the Lucene search engine work with PDF's?

2003-10-20 Thread MOYSE Gilles (Cetelem)
You can also use the TextMining.org toolbox, which provides classes to
extract text from PDF and DOC files, using the Jakarta POI project. They are
all free, under the Apache Licence.

The URL:
http://www.textmining.org/modules.php?op=modload&name=News&file=article&sid=6&mode=thread&order=0&thold=0
(URL tested today)

You can try the JGuru page : http://www.jguru.com/faq/view.jsp?EID=1074237

Gilles Moyse


-Original Message-
From: Andre Hughes [mailto:[EMAIL PROTECTED]
Sent: Saturday, October 18, 2003 00:05
To: [EMAIL PROTECTED]
Subject: Does the Lucene search engine work with PDF's?


Hello,
Can the Lucene search engine index and search through PDF documents?
What are the file format limits for the Lucene search engine?
 
Thanks in Advance,
 
Andre'


RE: Indexing UTF-8 and lexical errors

2003-10-14 Thread MOYSE Gilles (Cetelem)
Hi.

You should edit the StandardTokenizer.jj file. It contains all the
definitions used to generate the StandardTokenizer.java class that you certainly
use.
At the end of the StandardTokenizer.jj file, you'll find the definition of
the LETTER token. You'll see all the accepted letters, in Unicode. If you
want a table of the different Unicode ranges, go there:
http://www.alanwood.net/unicode/
In the LETTER token definition in the .jj file, Unicode characters are coded as
ranges (like \u0030-\u0039) or as single elements (like \u00f1).
Adding the Arabic Unicode range in this part may solve your problem (add a
line like \u0600-\u06FF, since 0600-06FF is the range for Arabic
characters).
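
For illustration only, the tail of that LETTER definition might then look roughly
like this (an abbreviated excerpt, not the actual file; only the last range is the
addition):

    | <#LETTER:                      // excerpt, existing ranges abbreviated
          [
           "\u0041"-"\u005a",        // A-Z
           "\u0061"-"\u007a",        // a-z
           // ... the other ranges already present in the file ...
           "\u0600"-"\u06ff"         // added: Arabic
          ]
      >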

Once modified, go to the root of your Lucene installation and recompile the
StandardTokenizer.jj file with:
	ant compile
It should regenerate the Java files (and even compile them, if I remember well).

Good Luck

Gilles Moyse

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
Sent: Tuesday, October 14, 2003 12:07
To: [EMAIL PROTECTED]
Subject: Indexing UTF-8 and lexical errors



I am trying to index UTF-8 encoded HTML files with content in various
languages with Lucene. So far I always receive a message

Parse Aborted: Lexical error at line 146, column 79.
Encountered: \u2013 (8211), after :  

when trying to index files with Arabic words. I am aware of the fact
that tokenizing/analyzing/stemming non-latin characters has some issues
but for me tokenizing would be enough. And that should work with Arabic,
Russian etc. shouldn't it ?

So, what steps do I have to take to make Lucene index non-latin
languages/characters encoded in UTF-8 ?

Thank you very much,
Matthias

