RE: Lock obtain timed out
Hi. I got this exception when I had more than one thread trying to create an IndexWriter. I solved it by placing the code using the IndexWriter in a synchronized method. Hope it helps, Gilles.

-Original Message- From: Hohwiller, Joerg [mailto:[EMAIL PROTECTED]] Sent: Tuesday, 16 December 2003 11:37 To: [EMAIL PROTECTED] Subject: Lock obtain timed out

Hi there, I have not yet received any response about my problem. While debugging into the depths of Lucene (really hard to read that deep inside), I discovered that it is possible to disable the locks using a system property: when I start my application with -DdisableLuceneLocks=true, I no longer get the error. I just wonder whether this is legal and won't cause other trouble. As far as I can tell from the source, proper thread synchronization is done with locks on Java objects, and the index-store locks seem to be required only if multiple Lucene instances (in different VMs) work on the same index. In my situation there is only one Java VM running and only one Lucene instance working on one index. Am I safe disabling the locking? Can anybody tell me where to find documentation about the locking strategy (I would still like to know why I have this problem)? Or does anybody know of an official example of how to handle concurrent index modification and searches? Thank you so much, Jörg

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
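Gilles's workaround can be sketched as below. This is a minimal illustration, not code from the thread: the index is mocked as a plain list so the sketch stays self-contained, and the commented lines show where the real Lucene 1.x calls would go. The class name is mine.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the fix above: funnel all index writes through a single
// synchronized method so threads never race to obtain Lucene's write
// lock. The index is mocked as a plain list to keep the sketch
// self-contained; the comments show where the real calls belong.
public class SerializedIndexer {
    private final List<String> index = new ArrayList<String>();

    public synchronized void addDocument(String doc) {
        // Real code (Lucene 1.x style) would be:
        //   IndexWriter writer = new IndexWriter(path, analyzer, false);
        //   writer.addDocument(luceneDoc);
        //   writer.close();  // releases the write lock
        index.add(doc);
    }

    public synchronized int size() {
        return index.size();
    }
}
```

Because every thread goes through the same monitor, only one IndexWriter can ever hold the write lock at a time, which is exactly what the "Lock obtain timed out" exception was complaining about.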
RE: Tokenizing text custom way
Do you want to define expressions, i.e. a set of terms that must be interpreted as a whole? For instance, when the Analyzer catches "time" followed by "out", it returns "time_out"?

-Original Message- From: Dragan Jotanovic [mailto:[EMAIL PROTECTED]] Sent: Wednesday, 26 November 2003 12:12 To: Lucene Users List Subject: Re: Tokenizing text custom way

You will need to write a custom analyzer. Don't worry, though, it's quite straightforward. You will also need to write a Tokenizer, but Lucene helps you a lot here. Wouldn't I achieve the same result if I index "time out" as "time_out" using StandardAnalyzer, and later, if I search for "time out" (inside quotes), I should get a proper result, but if I search for "time" I shouldn't get a result. Is this right?
RE: Tokenizing text custom way
Hi. You should define expressions. To do so, you first have to define an expression file, which contains one expression per line. For instance:

time_out
expert_system
...

You can use any character as the expression link; here, I use the underscore (_). Then you have to build an expression loader. You can store expressions in recursive HashMaps, built so that HashMap.get(word1) returns a HashMap, and (HashMap.get(word1)).get(word2) returns null, if you want to code the expression word1_word2. In other words, HashMap.get(a_word) returns a HashMap containing all the successors of the word a_word. So, if your expression file looks like this:

time_out
expert_system
expert_in_information

you'll have to build a loader which returns a HashMap H so that:

H.keySet() = {time, expert}
((HashMap)H.get(time)).keySet() = {out}
((HashMap)H.get(time)).get(out) = null // null indicates the end of the expression
((HashMap)H.get(expert)).keySet() = {system, in}
((HashMap)H.get(expert)).get(system) = null
((HashMap)((HashMap)H.get(expert)).get(in)).keySet() = {information}
((HashMap)((HashMap)H.get(expert)).get(in)).get(information) = null

These recursive HashMaps encode the following tree:

time ---- out ----- null
expert -- system -- null
       |- in ------ information -- null

Such an expression loader may be designed this way:

public static HashMap getExpressionMap(File wordfile) {
    HashMap result = new HashMap();
    try {
        String line = null;
        LineNumberReader in = new LineNumberReader(new FileReader(wordfile));
        HashMap hashToAdd = null;
        while ((line = in.readLine()) != null) {
            if (line.startsWith(FILE_COMMENT_CHARACTER))
                continue;
            if (line.trim().length() == 0)
                continue;
            StringTokenizer stok = new StringTokenizer(line, " \t_");
            String curTok = "";
            HashMap currentHash = result;
            // Test whether the expression contains at least 2 words or not
            if (stok.countTokens() < 2) {
                System.err.println("Warning: '" + line + "' in file '"
                    + wordfile.getAbsolutePath() + "' line " + in.getLineNumber()
                    + " is not an expression.\n\tA valid expression contains at least 2 words.");
                continue;
            }
            while (stok.hasMoreTokens()) {
                curTok = stok.nextToken();
                if (curTok.startsWith(FILE_COMMENT_CHARACTER)) // comment at the end of the line
                    break;
                if (stok.hasMoreTokens())
                    hashToAdd = new HashMap(6);
                else
                    hashToAdd = (HashMap)null;
                if (!(currentHash.containsKey(curTok)))
                    currentHash.put(curTok, hashToAdd);
                currentHash = (HashMap)currentHash.get(curTok);
            }
        }
        return result;
    }
    // On error, use an empty table
    catch (Exception e) {
        System.err.println("While processing '" + wordfile.getAbsolutePath() + "' : " + e.getMessage());
        e.printStackTrace();
        return new HashMap();
    }
}

Then you must build a filter with 2 FIFO stacks: one is the expression stack, the other is the default stack. You also define a 'curMap' variable, initially pointing to the HashMap returned by the expression file loader. When you receive a token, you check whether it is null or not. If it is null, you check whether the default stack is empty or not: if it is not empty, you pop a token from the default stack and return it; if it is empty, you return null. If the received token is not null, you check whether it
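The matching walk over the recursive map can be made concrete with the following self-contained sketch, which uses plain strings in place of Lucene Tokens. The class and method names are illustrative, not from the original mail; raw maps are kept to match the recursive-HashMap structure described above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Self-contained sketch of the matching walk described above: follow
// the recursive map token by token, and when a null leaf is reached,
// emit the matched words joined by '_'. A failed partial match falls
// back to emitting the first word and re-scanning from the next one
// (the role the "default stack" plays in the real filter).
public class ExpressionMatcher {
    @SuppressWarnings({"rawtypes", "unchecked"})
    public static List<String> filter(List<String> tokens, Map expressions) {
        List<String> out = new ArrayList<String>();
        int i = 0;
        while (i < tokens.size()) {
            Map cur = expressions;
            int j = i;
            StringBuilder expr = new StringBuilder();
            // Follow the map as long as the next token continues an expression
            while (j < tokens.size() && cur != null && cur.containsKey(tokens.get(j))) {
                if (j > i) expr.append('_');
                expr.append(tokens.get(j));
                cur = (Map) cur.get(tokens.get(j));
                j++;
            }
            if (cur == null) {        // reached a null leaf: full expression matched
                out.add(expr.toString());
                i = j;
            } else {                  // no complete expression here: emit the word
                out.add(tokens.get(i));
                i++;
            }
        }
        return out;
    }
}
```

With the "time_out" map from the example, the stream [a, time, out, b] comes out as [a, time_out, b], while a lone "time" without "out" is emitted unchanged.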
RE: Document ID's and duplicates
Hi. You just have to add a field to your Document object before adding it to the index. The field should be a keyword field. You can use code of this kind:

IndexWriter writer = new IndexWriter(path_to_your_index, your_analyzer_object, true); // true = create a new index
Document doc = new Document();
doc.add(Field.Keyword("id", (String)pkey)); // add an id field containing the pkey value (received from the db, for instance)
// you can add other fields here
writer.addDocument(doc);
writer.optimize();
writer.close();

Gilles.

-Original Message- From: jt oob [mailto:[EMAIL PROTECTED]] Sent: Wednesday, 19 November 2003 15:43 To: Lucene-Users-List Subject: Document ID's and duplicates

Hi folks, I've got a feeling the answer to this has either been posted on here recently, or is on the site somewhere, but I can't find it. Apologies if I'm going over old ground. What is the best way to force documents to be indexed only once? Is it a case of having a field with a unique value for the document and searching the index for that field before adding? If that is the way to do it, would it be a good idea to add an additional field type which would take care of this behind the scenes? Many people move to Lucene after discovering the downfalls of text searching in databases (like me), and would love a primary key type field. Regards, jt
RE: Document ID's and duplicates
If you want to replace a possibly existing document, you've got to check whether it exists or not. Let's assume your Lucene primary key field is called "id". First, open an IndexReader:

IndexReader ir = IndexReader.open(your_index_path);

Then, check the term enumeration associated with the value of the primary key you're looking for (let's assume it is called pkey):

TermDocs terms = ir.termDocs(new org.apache.lucene.index.Term(LUCENE_FIELD_ID, pkey));

If your keys are really primary, the enumeration will be empty or contain one element. So:

if (terms.next()) {
    // terms is not empty => the pkey has been found => the document
    // already exists => we have to delete it before adding it again
    ir.delete(terms.doc()); // delete the current document in the TermDocs enumeration
}
ir.close(); // release the reader before re-opening a writer
// Here, you perform the normal addition

Gilles

-Original Message- From: Don Kaiser [mailto:[EMAIL PROTECTED]] Sent: Wednesday, 19 November 2003 18:14 To: Lucene Users List Subject: RE: Document ID's and duplicates

If you do this, will the old version of the document be replaced by the new one? -don

-Original Message- From: MOYSE Gilles (Cetelem) [mailto:[EMAIL PROTECTED]] Sent: Wednesday, November 19, 2003 6:57 AM To: 'Lucene Users List' Subject: RE: Document ID's and duplicates

Hi. You just have to add a field to your Document object before adding it to the index. The field should be a keyword field. You can use code of this kind:

IndexWriter writer = new IndexWriter(path_to_your_index, your_analyzer_object, true); // true = create a new index
Document doc = new Document();
doc.add(Field.Keyword("id", (String)pkey)); // add an id field containing the pkey value
// you can add other fields here
writer.addDocument(doc);
writer.optimize();
writer.close();

Gilles.
Boost in Query Parser
Hello. I've made a Filter which recognizes special words and returns them in a boosted form, in the QueryParser sense. For instance, when the filter receives special_word, it returns special_word^3, so as to boost it. The problem is that QueryParser understands the boost syntax when the string is given as an argument to the parse function, but not when it is generated by a filter in the Analyzer. So, when my filter transforms special_word into special_word^3, QueryParser does not create a Query object with special_word as the value to look for and a boost of 3, but with special_word^3 to search and a boost of 1. Of course, it does not match anything. Does anyone know a solution to that problem? Do I have to write my own QueryParser from scratch, or do I just have to change 2 or 3 lines of the original QueryParser to make it work the way I'd like? Thanks a lot. Gilles Moyse.

-Original Message- From: Erik Hatcher [mailto:[EMAIL PROTECTED]] Sent: Wednesday, 12 November 2003 15:16 To: Lucene Users List Subject: Re: Can use Lucene be used for this

On Wednesday, November 12, 2003, at 07:34 AM, Hackl, Rene wrote: "col2 like %aa% ... Lucene doesn't handle queries where the start of the term is not known very efficiently. Is it really able to handle them at all? I thought *foo-type queries were not supported." They are not supported by the QueryParser, but an API-created WildcardQuery supports it. I certainly do not recommend using prefix-style wildcard queries, though, knowing what happens under the covers. Erik
RE: multiple tokens from a single input token
Hi. I experienced the same problem, and I used the following solution (maybe not the best one, but it works, and not too slowly). The problem was to detect synonyms. I used a synonyms file made up of lines of this kind:

a b c
d e f

to define a, b and c as synonyms, and d, e and f as other synonyms. So, when my filter received a token 'b', for instance, I wanted it to return three tokens: 'a', 'b' and 'c'. I used a FIFO stack to solve that. When the filter receives a token, it checks whether the stack is empty or not. If it is, it returns the received token. If it is not empty, it returns the popped value from the stack (i.e. the first one that was pushed; it's better to use a FIFO stack to keep a correct order). When you receive the 'null' token, indicating the end of the stream, you continue returning popped values from your stack until it is empty. Then you return 'null'. Hope it will help. Gilles Moyse

-Original Message- From: Peter Keegan [mailto:[EMAIL PROTECTED]] Sent: Monday, 10 November 2003 15:43 To: Lucene user's list Subject: re: multiple tokens from a single input token

I would appreciate some clarification on how to generate multiple tokens from a single input token. In a previous message (see: http://www.mail-archive.com/[EMAIL PROTECTED]/msg04875.html), Pierrick Brihaye provides the following code:

public final Token next() throws IOException {
    while (true) {
        String emittedText;
        int positionIncrement = 0;
        // New token ?
        if (receivedText.length() == 0) {
            receivedToken = input.next();
            if (receivedToken == null)
                return null;
            receivedText.append(receivedToken.termText());
            positionIncrement = 1;
        }
        emittedText = getNextPart();
        // Warning: all tokens are emitted with the *same* offset
        if (emittedText.length() > 0) {
            Token emittedToken = new Token(emittedText,
                receivedToken.startOffset(), receivedToken.endOffset());
            emittedToken.setPositionIncrement(positionIncrement);
            return emittedToken;
        }
    }
}

I assume that you would extend the TokenFilter class and override the 'next' method. But what I don't understand is how you return more than one Token (with different settings for 'setPositionIncrement') if the 'next' method is only called once for each input token. For example, when my custom filter's 'next()' method receives token 'A' from 'DocumentWriter.invertDocument()', it wants to return token 'A' and token 'B' at the same position. How is this done? It seems I can only return one token at a time from 'next()'. I think I'm missing something obvious :-( Thanks, Peter
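The FIFO technique Gilles describes can be sketched as below, with plain strings standing in for Lucene Tokens. The class name and the string-based next() are illustrative, not the real TokenFilter API; the point is only the queue-then-drain mechanism.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Sketch of the FIFO approach: when a word belongs to a synonym group,
// the other members of the group are queued, and subsequent calls to
// next() drain the queue before consuming more input. null marks the
// end of the stream, as with TokenStream.next() in Lucene 1.x.
public class SynonymExpander {
    private final Iterator<String> input;
    private final Map<String, List<String>> synonyms;
    private final Deque<String> queue = new ArrayDeque<String>();

    public SynonymExpander(Iterator<String> input, Map<String, List<String>> synonyms) {
        this.input = input;
        this.synonyms = synonyms;
    }

    public String next() {
        if (!queue.isEmpty())
            return queue.removeFirst();     // FIFO: first pushed, first popped
        if (!input.hasNext())
            return null;                    // queue drained and stream finished
        String word = input.next();
        List<String> group = synonyms.get(word);
        if (group != null)
            for (String s : group)
                if (!s.equals(word))
                    queue.addLast(s);       // queue the other synonyms
        return word;
    }
}
```

With the group {a, b, c} and input [x, b, y], successive calls yield x, b, a, c, y, null: exactly the "return the received token first, then pop the stack" order described above. A real TokenFilter would additionally call setPositionIncrement(0) on the queued tokens so they index at the same position.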
Compound expression extraction
Hi. I'm trying to extract expressions from the terms' position information, i.e., if two words appear frequently side-by-side, then we can consider that the two words form only one. For instance, 'Object' and 'Oriented' appear side-by-side 9 times out of 10. That allows us to define a new expression, 'Object_Oriented'. Does anyone know the statistical method to detect such expressions? Thanks. Gilles Moyse

-Original Message- From: Eric Jain [mailto:[EMAIL PROTECTED]] Sent: Tuesday, 21 October 2003 09:24 To: Lucene Users List Subject: Re: Lucene on Windows

The CVS version of Lucene has a patch that allows one to use a 'Compound Index' instead of the traditional one. This reduces the number of open files. For more info, see/make the Javadocs for IndexWriter. Interesting option. Do you have a rough idea of what the performance impact of using this setting is? -- Eric Jain
Expression Extractions
I've found something about expression extraction (the ability, when one word and another appear frequently side-by-side, to detect that they form an expression): http://www.miv.t.u-tokyo.ac.jp/papers/matsuoFLAIRS03.pdf Gilles Moyse
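As a toy illustration of the side-by-side statistic discussed in this thread (and nothing more than that; the cited paper uses a more elaborate co-occurrence measure), one can compute how often a word is immediately followed by another. The class and method names are mine.

```java
import java.util.List;

// Minimal sketch of the "9 times out of 10" idea: count how often w2
// immediately follows w1, divided by the total occurrences of w1. A
// pair whose ratio exceeds some threshold is a candidate expression.
public class BigramStats {
    public static double followRatio(List<String> words, String w1, String w2) {
        int w1Count = 0, pairCount = 0;
        for (int i = 0; i < words.size(); i++) {
            if (words.get(i).equals(w1)) {
                w1Count++;
                if (i + 1 < words.size() && words.get(i + 1).equals(w2))
                    pairCount++;
            }
        }
        return w1Count == 0 ? 0.0 : (double) pairCount / w1Count;
    }
}
```

On the 'Object'/'Oriented' example, a ratio near 0.9 over a large corpus would trigger merging the pair into 'Object_Oriented'; a real implementation would also want a minimum absolute frequency so rare pairs don't qualify by accident.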
RE: Does the Lucene search engine work with PDF's?
You can also use the TextMining.org toolbox, which provides classes to extract text from PDF and DOC files, using the Jakarta POI project. They are all free, under the Apache License. The URL: http://www.textmining.org/modules.php?op=modload&name=News&file=article&sid=6&mode=thread&order=0&thold=0 (URL tested today). You can also try the jGuru page: http://www.jguru.com/faq/view.jsp?EID=1074237 Gilles Moyse

-Original Message- From: Andre Hughes [mailto:[EMAIL PROTECTED]] Sent: Saturday, 18 October 2003 00:05 To: [EMAIL PROTECTED] Subject: Does the Lucene search engine work with PDF's?

Hello, Can the Lucene search engine index and search through PDF documents? What are the file format limits for the Lucene search engine? Thanks in advance, Andre'
RE: Indexing UTF-8 and lexical errors
Hi. You should edit the StandardTokenizer.jj file. It contains all the definitions used to generate the StandardTokenizer.java class, which you certainly use. At the end of the StandardTokenizer.jj file, you'll find the definition of the LETTER token, listing all the accepted letters in Unicode. If you want a table of the different Unicode ranges, see: http://www.alanwood.net/unicode/ In the LETTER token definition in the .jj file, Unicode characters are coded as ranges (like "\u0030"-"\u0039") or as single elements (like "\u00f1"). Adding the Arabic Unicode range in this part may solve your problem (add a line like "\u0600"-"\u06FF", since 0600-06FF is the range for Arabic characters). Once modified, go to the root of your Lucene installation and recompile the StandardTokenizer.jj file with: ant compile It should regenerate the Java files (and even compile them, if I remember correctly). Good luck, Gilles Moyse

-Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] Sent: Tuesday, 14 October 2003 12:07 To: [EMAIL PROTECTED] Subject: Indexing UTF-8 and lexical errors

I am trying to index UTF-8 encoded HTML files with content in various languages with Lucene. So far I always receive a message "Parse Aborted: Lexical error at line 146, column 79. Encountered: \u2013 (8211), after : " when trying to index files with Arabic words. I am aware of the fact that tokenizing/analyzing/stemming non-Latin characters has some issues, but for me tokenizing would be enough, and that should work with Arabic, Russian etc., shouldn't it? So, what steps do I have to take to make Lucene index non-Latin languages/characters encoded in UTF-8? Thank you very much, Matthias