Re: string similarity measures

2008-09-04 Thread mathieu
I submitted a patch to handle Aspell phonetic rules. You can find it in JIRA. On Thu, 4 Sep 2008 17:07:09 +0300, "Cam Bazz" <[EMAIL PROTECTED]> wrote: > let me rephrase the problem. I already have a set of bad words. I want to > avoid people inputting typos of the bad words. > for example 'shit'

Re: could I implement this scenario?

2008-09-19 Thread mathieu
Yes. You can store data in a Lucene index without searching on it: your simdocid. M. On Fri, 19 Sep 2008 16:00:20 +0800 (CST), xh sun <[EMAIL PROTECTED]> wrote: > Hi all, > > How can I implement this scenario in Lucene? > > suppose every document has three fields: docid, doctext and simdocid

Re: could I implement this scenario?

2008-09-19 Thread mathieu
Lucene is just an index. Where do you want to store your data? In a DB, flat files, documents with a URL, in Lucene? M. On Fri, 19 Sep 2008 16:25:27 +0800 (CST), xh sun <[EMAIL PROTECTED]> wrote: > Thank you. Mathieu. > > But the hits don't include the document doc02  i

Re: Lucene vs. Database

2008-10-01 Thread mathieu
Have a look at Compass: http://www.compass-project.org/ It's one of the easiest ways to mix a DB and Lucene. M. On Wed, 1 Oct 2008 00:43:57 -0700 (PDT), agatone <[EMAIL PROTECTED]> wrote: > > Hi, > I asked this question already on "lucene-general" list but also got > advised > to ask here too. >

Re: Lucene vs. Database

2008-10-01 Thread mathieu
Crawling a DB is not a good idea. Indexing while writing/deleting is clever. Doing it inside the DB is a solution. Java users like ORMs. Compass plugs Lucene indexing into the ORM's transaction. If something is written or deleted, Lucene is aware. Compass is open source. M. On Wed, 1 Oct 2008 09:12:41 -0300,

Re: Combining keyword queries with database-style queries

2008-10-23 Thread mathieu
Compass handles that nicely. You can query Lucene first and build an IN (...) clause for your SQL DB. Or you can query your SQL DB first and handle the result with a bitset in Lucene. M. On Thu, 23 Oct 2008 14:27:53 +0200, Niels Ott <[EMAIL PROTECTED]> wrote: > Hi everybody, > > I need to query for documents
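
The two directions mentioned in this reply can be sketched in plain Java, roughly as follows. This is only an illustration under assumptions: the class KeywordSqlCombination, its method names and the column name are hypothetical, and the ids are treated as small integers shared by the index and the database, which is not necessarily how Lucene or Compass number documents.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.Collections;
import java.util.List;

/** Hypothetical helper, not part of Lucene or Compass. */
final class KeywordSqlCombination {

    /** Direction 1: ids come from the keyword search; build a parameterized IN (...) clause. */
    static String inClause(String column, int idCount) {
        if (idCount <= 0) {
            throw new IllegalArgumentException("need at least one id");
        }
        // One "?" placeholder per id, to be bound through a PreparedStatement.
        return column + " IN (" + String.join(", ", Collections.nCopies(idCount, "?")) + ")";
    }

    /** Direction 2: ids come from SQL; mark them in a bit set and keep only the matching hits. */
    static List<Integer> filterHits(List<Integer> hitIds, BitSet allowedIds) {
        List<Integer> kept = new ArrayList<>();
        for (int id : hitIds) {
            if (allowedIds.get(id)) {
                kept.add(id);
            }
        }
        return kept;
    }
}
```

Which direction is cheaper probably depends on which side returns fewer ids; a very long IN list can become a problem on the SQL side.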

Re: Inquiry on Lucene Stemming

2008-12-16 Thread mathieu
You stem both the search query and the text while indexing, so only "flash" is indexed when "flashing" is read. If you don't want to hurt your index with half words, you can use a second index, just like for spelling: http://blog.garambrogne.net/index.php?post/2008/03/07/A-lexicon-approach-for-Lucene-index M.

Re: Indexing PDF documents with structure information

2007-08-14 Thread Mathieu Lecarme
Thomas Arni wrote: > Hello Luceners > > I have started a new project and need to index pdf documents. > There are several projects around, which allow to extract the content, > like pdfbox, xpdf and pjclassic. > > As far as I studied the FAQ's and examples, all these > tools allow simple text ex

Re: reg-ex based stop word removal

2007-08-22 Thread Mathieu Lecarme
sandeep chawla wrote: > Hi, > > I am working on a search application. This application requires me to > implement a stop filter > using a stop word list. I have implemented a stop filter using lucene's API. > > I want to take my application one step further. > > I want to remove all the words

Re: Searching Exact Word from Index

2007-09-10 Thread Mathieu Lecarme
Laxmilal Menaria wrote: > Hello Everyone, > > I want to search 'abc-d' as exact keyword not 'abc d'. KeywordAnalyzer can > be used for this purpose. StandardAnalyzer creates different tokens for > 'abc-d' as 'abc' and 'd'. > But I can not use this, because I am indexing the content of a text fil

Re: Why exactly are fuzzy queries so slow?

2007-11-24 Thread Mathieu Lecarme
Fuzzy queries are simply not indexed. If you want to search quickly with fuzzy search, you should index words and their n-grams; it's the "did you mean" pattern. You first select the indexed words which share n-grams with the query word, the distance is computed with Levenshtein, and you use this word as a synon
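
A minimal sketch of the pattern described in this reply, in plain Java rather than with Lucene's own classes: words are registered under their n-grams, candidates are the words sharing at least one n-gram with the query, and Levenshtein distance ranks them. The class NgramFuzzyLookup, the gram size and the distance threshold are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper: n-gram candidate lookup followed by Levenshtein ranking. */
final class NgramFuzzyLookup {
    private final int gramSize;
    private final int maxDistance;
    // n-gram -> every registered word that contains it
    private final Map<String, Set<String>> wordsByGram = new HashMap<>();

    NgramFuzzyLookup(int gramSize, int maxDistance) {
        this.gramSize = gramSize;
        this.maxDistance = maxDistance;
    }

    /** All substrings of length n, in order of first appearance. */
    static Set<String> ngrams(String word, int n) {
        Set<String> grams = new LinkedHashSet<>();
        for (int i = 0; i + n <= word.length(); i++) {
            grams.add(word.substring(i, i + n));
        }
        return grams;
    }

    /** Registers a word under each of its n-grams. */
    void add(String word) {
        for (String gram : ngrams(word, gramSize)) {
            wordsByGram.computeIfAbsent(gram, k -> new HashSet<>()).add(word);
        }
    }

    /** Classic two-row dynamic-programming edit distance. */
    static int levenshtein(String a, String b) {
        int[] previous = new int[b.length() + 1];
        int[] current = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            previous[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            current[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int substitution = previous[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                current[j] = Math.min(substitution, Math.min(previous[j] + 1, current[j - 1] + 1));
            }
            int[] swap = previous;
            previous = current;
            current = swap;
        }
        return previous[b.length()];
    }

    /** Words sharing an n-gram with the query and within maxDistance, closest first. */
    List<String> suggest(String query) {
        Set<String> candidates = new HashSet<>();
        for (String gram : ngrams(query, gramSize)) {
            candidates.addAll(wordsByGram.getOrDefault(gram, Collections.<String>emptySet()));
        }
        List<String> result = new ArrayList<>();
        for (String candidate : candidates) {
            if (levenshtein(query, candidate) <= maxDistance) {
                result.add(candidate);
            }
        }
        result.sort((x, y) -> {
            int byDistance = Integer.compare(levenshtein(query, x), levenshtein(query, y));
            return byDistance != 0 ? byDistance : x.compareTo(y);
        });
        return result;
    }
}
```

The expensive distance computation only runs on the few words that share an n-gram with the query, instead of on every term, which is presumably why this is faster than a plain fuzzy query.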

Re: Why exactly are fuzzy queries so slow?

2007-11-25 Thread Mathieu Lecarme
Well, javadoc: "prefixLength - length of common (non-fuzzy) prefix". So, this is some kind of "wildcard fuzzy" but not real fuzzy anymore. I understand the optimization but right now I can hardly imagine a reasonable use-case. Who cares whether the Levenshtein distance is at the beginning, middle

Re: Apostrophe filtering in StandardFilter

2008-01-29 Thread Mathieu Lecarme
christophe blin wrote: Hi, thanks for the pointer to the elision filter, but I am currently stuck with lucene-core-2.2.0 found in maven2 central repository (do not contain this class). I'll watch for an upgrade to 2.3 in the future. You can backport it easily with copy-paste. M.

Re: [Resent] Document boosting based on .. semantics?

2008-02-20 Thread Mathieu Lecarme
Markus Fischer wrote: Hi, [Resent: guess I sent the first before I completed my subscription, just in case it comes up twice ...] the subject may be a bit weird but I couldn't find a better way to describe a problem I'm trying to solve. If I'm not mistaken, one factor of scoring is the

Re: Rebuilding Document from index?

2008-02-26 Thread Mathieu Lecarme
Yes, I've found a tester! A patch was submitted for this kind of job: https://issues.apache.org/jira/browse/LUCENE-1190 And here is the svn work in progress: https://admin.garambrogne.net/subversion/revuedepresse/trunk/src/java/lexicon And the web version: https://admin.garambrogne.net/projets

Re: How do i get a text summary

2008-02-28 Thread Mathieu Lecarme
[EMAIL PROTECTED] wrote: If you want something from an index it has to be IN the index. So, store a summary field in each document and make sure that field is part of the query. And how could one create automatically such a summary? Have a look at http://alias-i.com/lingpipe/index.h

Re: Indexing source code files

2008-02-28 Thread Mathieu Lecarme
Dharmalingam wrote: I am working on some sort of search mechanism to link a requirement (i.e. a query) to source code files (i.e., documents). For that purpose, I indexed the source code files using Lucene. Contrary to traditional natural language search scenario, we search for code files that

Re: SOC: Lulu, a Lua implementation of Lucene

2008-02-29 Thread Mathieu Lecarme
Petite Abeille wrote: A proposal for a Lua entry for the "Google Summer of Code" '08: A Lua implementation of Lucene. For me, Lua is just glue between C-coded objects, a super config file, as used in lighttpd or WoW. Will Lulu work on top of Lucy? Did I miss something? M.

Re: SOC: Lulu, a Lua implementation of Lucene

2008-02-29 Thread Mathieu Lecarme
Grant Ingersoll wrote: On Feb 29, 2008, at 5:39 AM, Mathieu Lecarme wrote: Petite Abeille wrote: A proposal for a Lua entry for the "Google Summer of Code" '08: A Lua implementation of Lucene. For me, Lua is just a glue between C coded object, a super config fil

Re: Alternate spelling suggestion (was [Resent] Document boosting based on .. semantics? )

2008-02-29 Thread Mathieu Lecarme
Hi Mathieu Lecarme wrote: On a related topic, I'm also searching for a way to suggest alternate spellings of words to the user, when we find a word which is used very rarely in the index or not in the index at all. I'm based in Austria; when I e.g. search for "r

Re: Does Lucene support partition-by-keyword indexing?

2008-03-01 Thread Mathieu Lecarme
The easiest way is to split the index by Document. In Lucene, an index contains Documents and an inverted index of Terms. If you want to put Terms in different places, Documents will be duplicated on each index, with only a part of their Terms. How will you manage node failure in your network? They were so

Re: Does Lucene support partition-by-keyword indexing?

2008-03-02 Thread Mathieu Lecarme
On 2 Mar 08 at 03:05, 仇寅 wrote: Hi, I agree with your point that it is easier to partition index by document. But the partition-by-keyword approach has much greater scalability over the partition-by-document approach. Each query involves communicating with constant number of nodes; whi

Re: Does Lucene support partition-by-keyword indexing?

2008-03-02 Thread Mathieu Lecarme
he documents to be indexed are not necessarily web pages. They are mostly files stored on each node's file system. Node failures are also handled by replicas. The index for each term will be replicated on multiple nodes, whose nodeIDs are near to each other. This mechanism is handled

Re: Avoid stemming to get exact word in search results

2008-03-03 Thread Mathieu Lecarme
There's no syntax to restore a stemmed word. Stemming is done while reading the news, so the index never knows the complete word. I submitted a patch for that: https://issues.apache.org/jira/browse/LUCENE-1190 Be careful, rssbandit uses the .NET Lucene, not the Java version. M. secou wrote: Hi,

Re: Does Lucene support partition-by-keyword indexing?

2008-03-03 Thread Mathieu Lecarme
k and diff log should be the right approach. M. 仇寅 wrote: Hi Mathieu, You were right. In the early stage, I only intend to implement the basic TermQuery and BooleanQuery function. Fuzzy match and partial match require more complicated algorithms. Cache consistency will certainly be my concern.

Re: bigram analysis

2008-03-03 Thread Mathieu Lecarme
Not sure, you might want to ask on Nutch. From a strict language standpoint, the notion of a stopword in my mind is a bit dubious. If the word really has no meaning, then why does the language have it to begin with? In a search context, it has been treated as of minimal use in the early da

Re: Using a thesaurus/onthology

2008-03-05 Thread Mathieu Lecarme
Borgman, Lennart wrote: Is there any possibility to use a thesaurus or an ontology when indexing/searching with Lucene? Yes. The WordNet contrib does that. And with a token filter, it's easy to use your own. What do you want to do? M.

Re: Best way to do Query inflation?

2008-03-10 Thread Mathieu Lecarme
https://admin.garambrogne.net/projets/revuedepresse/browser/trunk/src/java/lexicon/src/java/org/apache/lucene/lexicon/QueryUtils.java M. Itamar Syn-Hershko wrote: Hi all, I'm looking for the best way to inflate a query, so a query like: "synchronous AND colour" -- will become something lik

Using Lucene from scripting language without any java coding

2008-03-12 Thread Mathieu Lecarme
Here is a POC about using Lucene, via Compass, from PHP or Python (other languages will come later), with only XML configuration, object notation, and native use of scripting language. http://blog.garambrogne.net/index.php?post/2008/03/11/Using-Compass-without-dirtying-its-hands-with-java It's

Re: Language identification ??

2008-03-14 Thread Mathieu Lecarme
Raghu Ram wrote: Hi all, I guess this question is a bit off the track. Are there any language identification modules inside Lucene ??? If not can somebody please suggest me a good one. Thank You. Nutch provides a tool for that, using n-gram patterns, just like OO.o does. M.

Re: Search against an index on a mapped drive ...

2008-03-14 Thread Mathieu Lecarme
Dragon Fly wrote: Hi, I'd like to find out if I can do the following with Lucene (on Windows). On server A: - An index writer creates/updates the index. The index is physically stored on server A. - An index searcher searches against the index. On server B: - Maps to the index directory.

Re: Language identification ??

2008-03-14 Thread Mathieu Lecarme
Itamar Syn-Hershko wrote: For what it's worth, I did something similar in my BidiAnalyzer so I can index both Hebrew/Semitic texts and English/Latin words without switching analyzers, giving each the proper treatment. I did it simply by testing the first char and looking at its numeric value -

Re: Language identification ??

2008-03-14 Thread Mathieu Lecarme
Raghu Ram wrote: to complicate it further ... the text for which language identification has to be done is small, in most cases a short sentence like " I like Pepsi ". Can something be done for this ? Drinking water? More seriously, if n-gram pattern language guessing is too ambiguous, sear

Re: Relevance

2008-03-19 Thread Mathieu Lecarme
luceneuser wrote: Hi All, I need help on retrieving results based on relevance + freshness. As of now, i get based on either of the fields, either on relevance or freshness. how can i achieve this. Lucene retrieves results on relevance but also fetches old results too. i need more relevan

Re: Call Lucene default command line Search from PHP script

2008-03-25 Thread Mathieu Lecarme
milu07 wrote: Hello, My machine is Ubuntu 7.10. I am working with Apache Lucene. I have done with indexer and tried with command line Searcher (the default command line included in Lucene package: http://lucene.apache.org/java/2_3_1/demo2.html). When I use this at command line: java Searcher

Re: Integrating Spell Checker contributed to Lucene

2008-03-25 Thread Mathieu Lecarme
Ivan Vasilev wrote: Hi Guys, Has anybody integrated the Spell Checker contributed to Lucene. http://blog.garambrogne.net/index.php?post/2008/03/07/A-lexicon-approach-for-Lucene-index https://issues.apache.org/jira/browse/LUCENE-1190 I need advice from where to get free dictionary file (one

Re: Integrating Spell Checker contributed to Lucene

2008-03-26 Thread Mathieu Lecarme
Ivan Vasilev wrote: Thanks Mathieu for your help! The contribution that you have made to Lucene by this patch seems to be great, but the hunspell dictionary is under LGPL which the lawyer of our company does not like. It's the spell tool used by OpenOffice and Firefox. Data must be

Re: Integrating Spell Checker contributed to Lucene

2008-03-26 Thread Mathieu Lecarme
Ivan Vasilev wrote: Thanks Mathieu, I tried to checkout but without success. Anyway I can do it manually, but as the contribution is still not approved from Lucene our chiefs will not want it to be included to our project by now. It's a right decision. I hope the third patch will be

Re: stemming in Lucene

2008-04-02 Thread Mathieu Lecarme
Wojtek H wrote: Hi all, Snowball stemmers are part of Lucene, but for few languages only. We have documents in various languages and so need stemmers for many languages (in particular Polish). One of the ideas is to use ispell dictionaries. There are ispell dicts for many languages and so thi

Re: Error tolerant text search with Lucene?

2008-04-04 Thread Mathieu Lecarme
Marjan Celikik wrote: Hi everyone, I know that there are packages that support the "Did you mean ... ?" search features with lucene which tries to find the most suited correct-word query.. however, so far I haven't encountered the opposite search feature: given a correct query, find all docum

Re: Error tolerant text search with Lucene?

2008-04-04 Thread Mathieu Lecarme
Marjan Celikik wrote: Mathieu Lecarme wrote: You have to iterate over your query: if it's a BooleanQuery, keep it; if it's a TermQuery, replace it with a BooleanQuery with all variants of the Term with Occur.SHOULD. M. Thanks.. however I don't fully understand what
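
The rewrite quoted above can be sketched conceptually with a small stand-in query tree. The types below (TermNode, BoolNode) are NOT Lucene's classes; they are hypothetical and only mirror the shape of TermQuery and BooleanQuery, and the variants map is assumed to come from some spelling or lexicon lookup.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

/** Stand-in query tree, only for illustrating the recursion. */
final class QueryExpansionSketch {
    interface Node {}

    /** Leaf: a single term, like a TermQuery. */
    static final class TermNode implements Node {
        final String text;
        TermNode(String text) { this.text = text; }
        @Override public String toString() { return text; }
    }

    /** Branch: all children required (like Occur.MUST) or all optional (like Occur.SHOULD). */
    static final class BoolNode implements Node {
        final boolean required;
        final List<Node> children;
        BoolNode(boolean required, List<Node> children) {
            this.required = required;
            this.children = children;
        }
        @Override public String toString() {
            StringBuilder out = new StringBuilder("(");
            for (int i = 0; i < children.size(); i++) {
                if (i > 0) {
                    out.append(required ? " AND " : " OR ");
                }
                out.append(children.get(i));
            }
            return out.append(")").toString();
        }
    }

    /** Keeps branches, recursing into them; replaces each leaf by an OR of the leaf and its variants. */
    static Node expand(Node node, Map<String, List<String>> variants) {
        if (node instanceof BoolNode) {
            BoolNode branch = (BoolNode) node;
            List<Node> rewritten = new ArrayList<>();
            for (Node child : branch.children) {
                rewritten.add(expand(child, variants));
            }
            return new BoolNode(branch.required, rewritten);
        }
        TermNode term = (TermNode) node;
        List<String> alternatives = variants.getOrDefault(term.text, Collections.<String>emptyList());
        if (alternatives.isEmpty()) {
            return term; // nothing to add, keep the leaf as it is
        }
        List<Node> options = new ArrayList<>();
        options.add(term);
        for (String alternative : alternatives) {
            options.add(new TermNode(alternative));
        }
        return new BoolNode(false, options);
    }
}
```

With real Lucene queries the same recursion would walk the clauses of a BooleanQuery instead of a children list; that part is left out here because it depends on the Lucene version in use.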

Re: Error tolerant text search with Lucene?

2008-04-04 Thread Mathieu Lecarme
Marjan Celikik wrote: Mathieu Lecarme wrote: wever I don't fully understand what do you mean by "iterate over your query". I would like a conceptual answer how is this done with Lucene, not a technical one.. Your query is a tree, with BooleanQuery as branch and other que

Re: Indexing and Searching from within a single Document

2008-04-08 Thread Mathieu Lecarme
[EMAIL PROTECTED] wrote: The need is: I have millions of entries in database, each entry is in such format (more or less) ID NameDescription start (number) stop(number) Currently my application uses the database to do search, queries are in the following format: Select * fr

Re: Questions about use of SpellChecker: Constructor and Simillarity...

2008-04-08 Thread Mathieu Lecarme
Use ShingleFilter. I'm working on a wider SpellChecker, I'll post a third patch soon. https://admin.garambrogne.net/projets/revuedepresse/browser/trunk/src/java M. dreampeppers99 wrote: Hi, I have two questions about this GREAT tool.. (framework, library... "whatever") Well I decide put spe

Re: Questions about use of SpellChecker: Constructor and Simillarity...

2008-04-08 Thread Mathieu Lecarme
On 8 Apr 08 at 18:34, Karl Wettin wrote: dreampeppers99 wrote: 1. Why do I need to pass a Directory object (obligatory) to the constructor of SpellChecker? Mainly because it is a nasty piece of code. But it does a good job. Because SpellChecker uses a directory to store data. It can be FSDirectory

Re: Questions about use of SpellChecker: Constructor and Simillarity...

2008-04-08 Thread Mathieu Lecarme
I'm cool :) I just think you are overcomplicating things. Yes... I can use two words and OR Suppose I query on this The Lord of Rings: Return of King The Lord of Rings: Fellowship The Lord of Rings: The Two towers The Lord of Weapons The Lord of War Suppose a user search: "The Lord of Rings

Re: designing a dictionary filter with multiple word entries

2008-04-09 Thread Mathieu Lecarme
Allen Atamer wrote: My dictionary filter currently implements next() and everything works well when dictionary entries are replaced one-to-one. For example: Can => Canada. A problem arises when I try to replace it with more than one word. Going through next() I encounter "shutdown". But

Re: Use of Lucene for DB Search

2008-04-10 Thread Mathieu Lecarme
Have a look at Compass. M. Prashant Saraf wrote: Hi, We are planning to provide search functionality in a web-based application. Can we use Lucene for it to search data from database like oracle and MS-Sql? Thanks and Regards प्रशांत सराफ (Prashant Saraf) S

Re: Lucene index on relational data

2008-04-11 Thread Mathieu Lecarme
Have a look at Compass 2.0M3 http://www.kimchy.org/searchable-cascading-mapping/ Your multiple indexes will be nice for massive writes. With a classical read/write ratio, Compass will be much easier. M. Rajesh parab wrote: Hi, We are using Lucene 2.0 to index data stored inside relational dat

Re: Using Lucene partly as DB and 'joining' search results.

2008-04-11 Thread Mathieu Lecarme
Antony Bowesman wrote: We're planning to archive email over many years and have been looking at using DB to store mail meta data and Lucene for the indexed mail data, or just Lucene on its own with email data and structure stored as XML and the raw message stored in the file system. For so

Re: Lucene index on relational data

2008-04-11 Thread Mathieu Lecarme
On 11 Apr 08 at 19:29, Rajesh parab wrote: Thanks for these pointers Mathieu. We have earlier looked at Compass, but the main issue with database index is DB vendor support for BLOB locator. I understand that Oracle has this support to get the partial data from BLOB, but I guess

Re: Lucene index on relational data

2008-04-12 Thread Mathieu Lecarme
Regarding data and its relationships - the use case I am trying to solve is to partition my data into 2 indexes, a primary index that will contains majority of the data and it is fairly static. The secondary index will have related information for the same data set in primary index and this relate

Re: Lucene and Google Web 1T 5 Gram

2008-04-23 Thread Mathieu Lecarme
Rafael Turk wrote: Hi Folks, I'm trying to load Google Web 1T 5 Gram to Lucene. (This corpus contains English word n-grams and their observed frequency counts. The length of the n-grams ranges from unigrams(single words) to five-grams) I'm loading each ngram (each row is a ngram) as an

Re: Lucene and Google Web 1T 5 Gram

2008-04-24 Thread Mathieu Lecarme
Rafael Turk wrote: Hi Mathieu, *What do you want to do?* A spell checker and related keyword suggestion Here is a spell checker which I am trying to finalize: https://admin.garambrogne.net/projets/revuedepresse/browser/trunk/src/java If you want an ngram => popularity map, just us

Re: Can I using HFS in lucene 2.3.1?

2008-04-25 Thread Mathieu Lecarme
Alex Chew wrote: Hi, Does somebody have practice building a distributed application with lucene and Hadoop/HFS? Lucene 2.3.1 does not seem to expose HFSDirectory. Any advice will be appreciated. Regards, Alex Have a look at Nutch. M.

Low hits

2007-01-23 Thread DECAFFMEYER MATHIEU
am doing wrong ? Thank u. ______ Mathieu Decaffmeyer Internet communications are not secure and therefore Fortis Banque Luxembourg S.A. does not accept legal responsibility for the contents of this message. The inform

RE: Low hits

2007-01-23 Thread DECAFFMEYER MATHIEU
help. __ Mathieu Decaffmeyer -Original Message- From: Erick Erickson [mailto:[EMAIL PROTECTED] Sent: Tuesday, January 23, 2007 2:01 PM To: java-user@lucene.apache.org Subject: Re: Low hits * This message comes from the Internet Network * What ve

RE: Low hits

2007-01-25 Thread DECAFFMEYER MATHIEU
score for titles of the web pages for the people I develop. I will try Luke but for some reason I can't install it in my company, Can someone give me some suggestions on what I should do ? Thank u. __ Mathieu Decaffmeyer Web Developer Fortis B

Score

2007-01-29 Thread DECAFFMEYER MATHIEU
Hi, I have one index with one document with title "Logistics" I have a second index with the same document with title "Logistics" and other documents (some contains the word "Logistics" as well) If I execute a search title:Logistics in the first index, I have 0.31 for the document with title "Lo

Merge Hits

2007-01-29 Thread DECAFFMEYER MATHIEU
Hi, I have a table of objects Hit, I want to merge the different Hits objects of the table to have one Hits object. Is this possible ? Thank u for any help !

RE: Merge Hits

2007-01-29 Thread DECAFFMEYER MATHIEU
. -Original Message- From: Nicolas Lalevée [mailto:[EMAIL PROTECTED] Sent: Monday, January 29, 2007 12:15 PM To: java-user@lucene.apache.org Subject: Re: Merge Hits * This message comes from the Internet Network * On Monday 29 January 2007 12:08, DECAFFMEYER MATHIEU wrote: > Hi, I h

RE: Merge Hits

2007-01-29 Thread DECAFFMEYER MATHIEU
Network * On Monday 29 January 2007 13:33, DECAFFMEYER MATHIEU wrote: > Thank u for your response, > Actually I want to merge the Hits to get a better score, > For example when user enter Hello > I want to merge : > title:Hello > headlines:Hello > summary:Hello > content:H

RE: Score

2007-01-29 Thread DECAFFMEYER MATHIEU
IndexSearcher.explain(). That'll tell you why. Erik On Jan 29, 2007, at 4:43 AM, DECAFFMEYER MATHIEU wrote: > Hi, > > I have one index with one document with title "Logistics" > > I have a second index with the same document with title "Logistics" >

RE: Score

2007-01-30 Thread DECAFFMEYER MATHIEU
Mon, 29 Jan 2007 21:52:58 +0100 : From: Soeren Pekrul <[EMAIL PROTECTED]> : Reply-To: java-user@lucene.apache.org : To: java-user@lucene.apache.org : Subject: Re: Score : : DECAFFMEYER MATHIEU wrote: : > : > Both are the same document but in different indexes, : > the only difference i

RE: Score

2007-01-31 Thread DECAFFMEYER MATHIEU
equivalent ?! Thank u. ______ Mathieu Decaffmeyer Web Developer Fortis Banque Luxembourg IS Retail Banking - Web Content Management Mobile : 0032 479 / 69 . 42 . 96 -Original Message- From: Chris Hostetter [mailto:[EMAIL PROTECTED] Sent: Tuesday, January 30, 2007

Boost

2007-01-31 Thread DECAFFMEYER MATHIEU
earch on this word I keep having a score of a bit more than 0. Why is my boost not working ? Thank u. ______ Mathieu Decaffmeyer

RE: Boost

2007-01-31 Thread DECAFFMEYER MATHIEU
Sorry I have it working ... __ Mathieu Decaffmeyer From: DECAFFMEYER MATHIEU [mailto:[EMAIL PROTECTED] Sent: Wednesday, January 31, 2007 11:04 AM To: java-user@lucene.apache.org Subject: Boost * This message

RE: Recreating an index

2007-01-31 Thread DECAFFMEYER MATHIEU
Hi, I have exactly the same question. Correct me if I'm wrong : it seems that I can do any I/O operations on the index while querying because of the open IndexReader. So if I had the same situation as gui (the poster of the thread), I can just delete the old index while people query on it ? Then b

RE: Score

2007-02-01 Thread DECAFFMEYER MATHIEU
Thank u Chris for your support. __ Matt -Original Message- From: Chris Hostetter [mailto:[EMAIL PROTECTED] Sent: Thursday, February 01, 2007 12:54 AM To: java-user@lucene.apache.org Subject: RE: Score * This message comes from the Internet Net

Deleting document by file name

2007-02-01 Thread DECAFFMEYER MATHIEU
Hi, I have a list of filenames like Corporate.htm Logistics.htm Merchant.htm that need to be deleted. For now I give this list to my Search application that reads the index and gives the results, and if the path contains one of the filenames, I don't display this hit ... Not really proper

RE: Deleting document by file name

2007-02-01 Thread DECAFFMEYER MATHIEU
closed and reopened. Erick On 2/1/07, DECAFFMEYER MATHIEU <[EMAIL PROTECTED]> wrote: > > Hi, > > I have a list of filenames like > Corporate.htm > Logistics.htm > Merchant.htm > > that need to be deleted. > > For now on I give this list to my Search

RE: Deleting document by file name

2007-02-01 Thread DECAFFMEYER MATHIEU
: index.deleteDocuments(field name, field value); _ From: DECAFFMEYER MATHIEU [mailto:[EMAIL PROTECTED] Sent: 01 February 2007 09:53 To: java-user@lucene.apache.org Subject: Deleting document by file name Hi, I have a list of filenames

Adding headlines, path

2007-02-02 Thread DECAFFMEYER MATHIEU
Hi all, I have simple questions for which I can't find an answer by googling : 1) I want to add headlines for a document : Field headlinesField = new Field("headlines", headlines, Field.Store.YES, Field.Index.TOKENIZED); But how do I separate the headlines between them ? Let's say I want to ad

RE: Adding headlines, path

2007-02-02 Thread DECAFFMEYER MATHIEU
ose are just the fields that demo uses, your application can use any field it needs, like "headlines" above. Otis - Original Message From: DECAFFMEYER MATHIEU <[EMAIL PROTECTED]> To: java-user@lucene.apache.org Sent: Friday, February 2, 2007 9:03:50 AM Subject: Adding head

IDFrequency

2007-02-02 Thread DECAFFMEYER MATHIEU
Hi, The score depends on 1. the query 2. the matched document 3. the index. I don't really understand why the index must influence the score (why it has been implemented that way). Let's say I have this page Logistics.htm I have just one time the word "experience" in it. It will get a high sc

Open & Close Reader

2007-02-22 Thread DECAFFMEYER MATHIEU
Hi, I need to merge indexes, if I want the user to see the changes (the merged indexes), I heard I need to close the index reader and re-open it again. But I will need to do this every x minutes for some reasons, So I wondered what could happen if user does a query just when a re-open of the read

RE: Open & Close Reader

2007-02-22 Thread DECAFFMEYER MATHIEU
My question is what happens when a re-opening of the reader occurs and at the same time a user does a query on the index ? And are there solutions for this. __ Matt -Original Message- From: Michael McCandless [mailto:[EMAIL PROTECTED] Sent: Thursda

RE: Open & Close Reader

2007-02-22 Thread DECAFFMEYER MATHIEU
the user is executing a query"... Erick On 2/22/07, DECAFFMEYER MATHIEU <[EMAIL PROTECTED]> wrote: > > My question is what happen when a re-opening of the reader occurs and in > the same time a user does a query on the index ? And are there solutions > for this. > >

Merge Indexes - addIndexes

2007-02-28 Thread DECAFFMEYER MATHIEU
Hi, I store the Lucene Index of my web applications in a file system. Often, I need to add to this index another index also stored as file system. I have three questions : * What is the best way to do this ? Open an IndexReader on this incoming index-file system and use addIndexes(IndexR

Update - IOException

2007-03-01 Thread DECAFFMEYER MATHIEU
Hi, While updating my index I have the following error : [3/1/07 9:44:19:214 CET] 76414c82 SystemErr R java.io.IOException: Lock obtain timed out: [EMAIL PROTECTED]:\TEMP\lucene-b56f455aea0a705baecaa4411d590aa2-write.lock [3/1/07 9:44:19:214 CET] 76414c82 SystemErr R at org.apache.l

RE: Update - IOException

2007-03-01 Thread DECAFFMEYER MATHIEU
I deleted the lock file, now it seems to work ... When can such an error happen ? __ Matt From: DECAFFMEYER MATHIEU [mailto:[EMAIL PROTECTED] Sent: Thursday, March 01, 2007 9:56 AM To: java-user@lucene.apache.org

Using Stemmers

2007-03-05 Thread DECAFFMEYER MATHIEU
nalyzer = new StandardAnalyzer(); } What I want to achieve is to be able to use an English stemmer, but I can't find any method to associate my stemmer with my Analyzer. I appreciate any help, thank u. ______ Mathieu Decaffmeyer Web Developer Fortis Banqu

RE: Plural word search

2007-03-09 Thread DECAFFMEYER MATHIEU
I needed this myself not long ago. Here is a piece of code to get an Analyzer that will use a tokenizer and an English stemmer (for "bears" it will also return "bear" and vice versa): private static Analyzer createEnglishAnalyzer() { return new Analyzer() { public TokenStream tokenSt

Open / Close when Merging

2007-03-13 Thread DECAFFMEYER MATHIEU
Hi, I need to merge several indexes (I call them incremental index) with my main index. Each incremental index can contain the same url's of the main index, that's why I have a list of url's to update, that I will delete from the main index before merging with an incremental index. I have also

[Urgent] deleteDocuments fails after merging ...

2007-03-13 Thread DECAFFMEYER MATHIEU
Hi, I have put this question as "urgent" because I can notice I don't have often answers, If I'm asking the wrong way, please tell me... Before I delete a document I search it in the index to be sure there is a hit (via a Term object), When I find a hit I delete the document (with the same Term

RE: [Urgent] deleteDocuments fails after merging ...

2007-03-13 Thread DECAFFMEYER MATHIEU
you would get some valuable information from it. http://www.linuxforums.org/forum/linux-newbie/6322-asking-good-questions -2-a.html Erick On 3/13/07, DECAFFMEYER MATHIEU <[EMAIL PROTECTED]> wrote: > > > Hi, > > I have put this question as "urgent" because I

RE: [Urgent] deleteDocuments fails after merging ...

2007-03-14 Thread DECAFFMEYER MATHIEU
deletion would return 0. > On 3/13/07, DECAFFMEYER MATHIEU <[EMAIL PROTECTED]> wrote: >> >> Before I delete a document I search it in the index to be sure there is a >> hit (via a Term object), >> When I find a hit I delete the document (with the same Term object), >

setBoost on Field

2007-03-30 Thread DECAFFMEYER MATHIEU
Hi, I am parsing this file called Logistics.htm I have a field named "headlines" that contains word "clients" among others. When I don't put a boost on this field, I have as score 0.06 when searching for clients. Then when I put a boost of "10", I have a score of 0.21 Yet I was expecting a score

Re: How to implement AJAX search~Lucene Search part?

2007-06-08 Thread Mathieu Lecarme
Have a look at the opensearch.org specification; your auto-completion will work with IE7 and Firefox 2. JSON serialization is quicker than XML stuff. Be careful to limit the number of responses. A search in "test*" works very well in my project with ten thousands of documents. Begin completion onl

Re: How to implement AJAX search~Lucene Search part?

2007-06-08 Thread Mathieu Lecarme
If you do that, you enumerate every term!!! If you use an alphabetically sorted collection, you can stop when the matches stop, but you have to test every term before the first match. Lucene gives you tools to match the beginning of a term, just use them!! M. On 8 Jun 07 at 14:57, Patrick Turcotte wrote: H

Re: Indexing MSword Documents

2007-06-08 Thread Mathieu Lecarme
Why not use Document? http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/javadoc/org/apache/lucene/document/Document.html HTMLDocument manages HTML-specific things like encoding, headers, and other particularities. Nutch uses specific Word tools (http://lucene.apache.org/nutch/apidocs/org/ap

Re: How to implement AJAX search~Lucene Search part?

2007-06-09 Thread Mathieu Lecarme
You can work as with Lucene spelling: a specific index with each word as a Document, boosted with something proportional to the number of occurrences (with log and math magic). The magical stuff is n Fields holding the starting n-grams, not stored, not tokenized. For example, if you want to index the word "carott",
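
Since the example in this snippet is cut off, here is one guess at what the starting n-grams could look like, in plain Java. The class PrefixGrams, the maximum gram length and the log-based boost formula are assumptions for illustration, not the author's code; each returned gram would go into its own unstored, untokenized field of the word's Document.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for a completion index built from leading n-grams. */
final class PrefixGrams {

    /** Leading n-grams of a word, from 1 character up to maxLength characters. */
    static List<String> startingGrams(String word, int maxLength) {
        List<String> grams = new ArrayList<>();
        int limit = Math.min(maxLength, word.length());
        for (int length = 1; length <= limit; length++) {
            grams.add(word.substring(0, length));
        }
        return grams;
    }

    /** One possible "log and math magic": a boost that grows slowly with the occurrence count. */
    static double boost(int occurrences) {
        if (occurrences < 1) {
            throw new IllegalArgumentException("occurrences must be at least 1");
        }
        return 1.0 + Math.log(occurrences);
    }
}
```

Completing a typed prefix then becomes an exact term lookup on the field matching the prefix length, with the boost putting frequent words first.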

Re: Wildcard query with untokenized punctuation (again)

2007-06-14 Thread Mathieu Lecarme
If you don't use the same tokenizer for indexing and searching, you will have troubles like this. Mixing exact match (with ") and wildcard (*) is a strange idea. Typographical rules say that you have a space after a comma, no? Is your field tokenized? M. Renaud Waldura wrote: > My very simple

Re: Several questions about scoring/sorting + random sorting in an image/related application

2007-06-15 Thread Mathieu Lecarme
Your request seems to be a two-step query. First step, you select images, and then collections. Second step, you sort the collections. Could BitVector help you? M. Antoine Baudoux wrote: > Hi, > > I'm developing an image database. Each lucene document > representing an image contains (among ot

Re: Several questions about scoring/sorting + random sorting in an image/related application

2007-06-15 Thread Mathieu Lecarme
The first step is to feed a Set with "collection". The second step is to sort it. With a SortedSet, you can do that, can't you? M. Antoine Baudoux wrote: > Could you be more precise? I don't understand what you mean. > > > > On 15 Jun 2007, at 17:20, Mathieu Lecarme wrote:

Re: Several questions about scoring/sorting + random sorting in an image/related application

2007-06-15 Thread Mathieu Lecarme
ith at most 300 elements you can sort it with strange rules. M. Antoine Baudoux wrote: > The problem is that i want lucene to do the sorting, because the query > would return thousands of results, and I'm displaying documents one > page at a time. > > > On 15 Jun 2007, at 17

Re: Several questions about scoring/sorting + random sorting in an image/related application

2007-06-15 Thread Mathieu Lecarme
e rules. >> >> M. >> >> Antoine Baudoux wrote: >>> The problem is that i want lucene to do the sorting, because the query >>> would return thousands of results, and I'm displaying documents one >>> page at a time. >>> >>>

Re: Several questions about scoring/sorting + random sorting in an image/related application

2007-06-15 Thread Mathieu Lecarme
Walt explained differently what I said. Lucene can be used efficiently for selecting objects, without sorting or scoring anything; then, with the ids stored in Lucene, you can sort them yourself with a simple Sortable implementation. The only limit is that Lucene must not give you too many results, with your

Re: Several questions about scoring/sorting + random sorting in an image/related application

2007-06-15 Thread Mathieu Lecarme
Compass uses a trick to manage parent-child indexing. What if you index each "collection" with a Date field, holding the date of the newest picture inside, and put all of its pictures' keywords into the collection? Then, with a keyword search, you will find the collection with the highest number of tag occurrences and date s

Re: Lucene for chinese search

2007-06-18 Thread Mathieu Lecarme
Lee Li Bin wrote: > Hi, > > I still met problem for searching of Chinese words. > XML file which is the datasource and analyzer has already been encoded. > I have tested StandardAnalyzer, CJKAnalyzer, and ChineseAnalyzer, but it > still can't get any results. > > 1.do we need any encoding
