Re: boosting & StandardAnalyzer, stop words
Perhaps we'd better continue this on lucene-dev.

OK, I will subscribe to that list and ask again. Thanks!
Stefan

--
open technology: http://www.media-style.com
open source: http://www.weta-group.net
open discussion: http://www.text-mining.org

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
Re: Rebuild index?
What version are you running? Something removed one of the files that Lucene needs while it was using the index. Could something have been running in the background and cleaning /tmp?

On Wed, Dec 10, 2003 at 01:43:42PM +0200, Igor Semenko wrote:
> Hello,
>
> We use lucene to search menus, there are around 1 items in
> index and sometimes I see error like this:
> (/tmp/index-menu is index directory)
> java.io.FileNotFoundException: /tmp/index-menu/_6q2.prx (No such file or directory)
>     at java.io.RandomAccessFile.open(Native Method)
>     at java.io.RandomAccessFile.<init>(RandomAccessFile.java:204)
>     at org.apache.lucene.store.FSInputStream$Descriptor.<init>(FSDirectory.java:389)
>     at org.apache.lucene.store.FSInputStream.<init>(FSDirectory.java:418)
>     at org.apache.lucene.store.FSDirectory.openFile(FSDirectory.java:291)
>     at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:132)
>     at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:103)
>     at org.apache.lucene.index.IndexReader$1.doBody(IndexReader.java:119)
>     at org.apache.lucene.store.Lock$With.run(Lock.java:148)
>     at org.apache.lucene.index.IndexReader.open(IndexReader.java:110)
>     ...
>
> When I just rebuild the index the problem is gone.
> Could someone hint what can be the reason of such a strange behavior?
>
> --
> Thanks,
> Igor Semenko,
> http://www.webfood.us

--
Dror Matalon
Zapatec Inc
1700 MLK Way
Berkeley, CA 94709
http://www.fastbuzz.com
http://www.zapatec.com
Re: Good and performance and fuzzy search
Erik Hatcher <[EMAIL PROTECTED]>:
>>You've got quite a tough task ahead of you I think. You originally
>>said you wanted to limit documents, which is what a Filter does. But a

OK, so I need some more English lessons.

>>FuzzyQuery still needs to go over all the terms, otherwise how would it
>>know if there was a match or not before even considering the documents.

I don't know. I haven't dug deeply enough into the code to understand the behaviour and purpose of all the classes, but apparently I need to...

>>It'll be interesting to see what solution you come up with.
>>
>> Erik

Me too :)

Julien.
Re: Good and performance and fuzzy search
On Wednesday, December 10, 2003, at 05:27 PM, julien gerard wrote:
> But in this case the fuzzy is performed on the overall index? The
> QueryFilter do his job after ? I'm not sure to understand the
> QueryFilter meaning? But I test the QueryFilter also this way and the
> time to doing this search it's the same. The fuzzy is time consuming,
> this is normal, so I'm searching a solution to having less term to
> compare with fuzzy algorithm. I'm checking the FuzzyTermEnum class and
> searching how to redifine this to implement a FuzzySubsetTermEnum with
> constructor :
> FuzzySubsetTermEnum(IndexReader reader, Term term, Term subset)
> For retrieving only the term which also

You've got quite a tough task ahead of you, I think. You originally said you wanted to limit documents, which is what a Filter does. But a FuzzyQuery still needs to go over all the terms; otherwise, how would it know whether there was a match before even considering the documents?

It'll be interesting to see what solution you come up with.

Erik
RE: Query Parser AND / OR
What Morus is saying is right: an expression without parentheses, when interpreted, treats the terms on either side of an AND as compulsory and the terms on either side of an OR as optional. However, if you combine AND and OR in one expression, the optional terms have no effect on which documents match (only on the scores), because the other terms are compulsory.

What needs to be done is for the query parser to find every AND in a query string and "put brackets" round it first. As it stands it is of little use, because OR does not work the way you would expect. AND should be given implicit priority.

-----Original Message-----
From: Morus Walter [mailto:[EMAIL PROTECTED]
Sent: 10 December 2003 09:01
To: Lucene Users List
Subject: Re: Query Parser AND / OR

Hi Dror,

thanks for your answer.

> > I'm having problems understanding query parsers handling of AND and OR
> > if there's more than one operator.
> >
> > E.g.
> > a OR b AND c
> > gives the same number of hits as
> > b AND c
> > (only scores are different)
>
> This would make sense if all the document that have a also have both B
> and C in them.

Then the query should be equivalent to (a OR b) AND c. But it isn't.
For specific a, b and c I get 766 hits for a OR b AND c and 1086 for (a OR b) AND c.

> > and
> > a AND b OR c AND d
> > seems to be equivalent to
> > a AND b AND C AND d

a OR b AND c -> a +b +c
 4 documents found
  a b c
  a b c d
  b c
  b c d
(a OR b) AND c -> +(a b) +c
 6 documents found
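The flag assignment described in this message can be modelled in a few lines of plain Java. This is only a sketch of the observed behaviour, not the QueryParser source: the class name FlatBooleanModel and its methods are made up for illustration, the input is assumed to be single-letter-style terms joined by AND / OR with no parentheses, and the matching rule (every +term must be present; with no +terms, one optional term is enough) is an assumption about how the flat clause list is evaluated. The 15 documents are the ones from Morus's test program.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Simplified model (not Lucene code) of a flat AND/OR chain turning into +required / optional clauses. */
class FlatBooleanModel {

    // The 15 test documents from the thread: every non-empty combination of a, b, c, d.
    static final String[] DOCS = {
        "a", "b", "c", "d", "a b", "a c", "a d", "b c", "b d", "c d",
        "a b c", "a b d", "a c d", "b c d", "a b c d"
    };

    /** Rewrites e.g. "a OR b AND c" into "a +b +c". No parentheses supported. */
    static String rewrite(String expression) {
        List<String> terms = new ArrayList<>();
        List<Boolean> required = new ArrayList<>();
        String pendingOperator = "OR"; // assumed default before the first term
        for (String token : expression.trim().split("\\s+")) {
            if (token.equals("AND") || token.equals("OR")) {
                pendingOperator = token;
                continue;
            }
            boolean viaAnd = pendingOperator.equals("AND");
            if (viaAnd && !terms.isEmpty()) {
                // AND also marks the term on its left as required.
                required.set(required.size() - 1, true);
            }
            terms.add(token);
            required.add(viaAnd);
            pendingOperator = "OR";
        }
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < terms.size(); i++) {
            if (i > 0) out.append(' ');
            if (required.get(i)) out.append('+');
            out.append(terms.get(i));
        }
        return out.toString();
    }

    /** Assumed rule: all +terms must be present; with no +terms, any optional term suffices. */
    static boolean matches(String rewritten, String doc) {
        List<String> words = Arrays.asList(doc.split(" "));
        boolean anyRequired = false;
        boolean anyOptionalHit = false;
        for (String clause : rewritten.split(" ")) {
            if (clause.startsWith("+")) {
                anyRequired = true;
                if (!words.contains(clause.substring(1))) return false;
            } else if (words.contains(clause)) {
                anyOptionalHit = true;
            }
        }
        return anyRequired || anyOptionalHit;
    }

    /** Counts how many of the 15 test documents match the rewritten clause list. */
    static int countHits(String rewritten) {
        int hits = 0;
        for (String doc : DOCS) {
            if (matches(rewritten, doc)) hits++;
        }
        return hits;
    }

    public static void main(String[] args) {
        for (String q : new String[] {"a OR b AND c", "b AND c", "a AND b OR c AND d"}) {
            String rewritten = rewrite(q);
            System.out.println(q + " -> " + rewritten + " (" + countHits(rewritten) + " documents)");
        }
    }
}
```

Under these assumptions the optional "a" in "a +b +c" cannot add or remove hits, which is consistent with "a OR b AND c" and "b AND c" both reporting 4 documents in the thread.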
Re: Good and performance and fuzzy search
But in that case, is the fuzzy search performed on the whole index? Does the QueryFilter do its job afterwards? I'm not sure I understand what the QueryFilter does. I also tested the QueryFilter this way, and the search takes the same time. The fuzzy search is time-consuming, which is normal, so I'm looking for a way to have fewer terms to compare with the fuzzy algorithm. I'm looking at the FuzzyTermEnum class to see how to redefine it as a FuzzySubsetTermEnum with the constructor:

FuzzySubsetTermEnum(IndexReader reader, Term term, Term subset)

For retrieving only the term which also

Erik Hatcher wrote:
>>QueryFilter would do the trick if you instead used the query you handed
>>to it to be the one to single out a "sub-category". It would limit the
>>documents searched to just the sub-category, and the fuzzy search would
>>be done using IndexSearcher.search, only handing it the filter then as
>>well.
>>
>>Will this scheme work for you?
>>
>> Erik
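The idea behind the proposed FuzzySubsetTermEnum, comparing the search term only against the terms of one subset instead of every term in the index, can be sketched without Lucene. Everything here is hypothetical: the class SubsetFuzzySketch, the way the subset's terms arrive as a plain list, and the use of a fixed maximum edit distance as the match criterion are assumptions for illustration. How to obtain the terms of a sub-category from a real index is exactly the open question in this thread and is not addressed.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: fuzzy matching restricted to a pre-selected list of terms (plain Java, no Lucene). */
class SubsetFuzzySketch {

    /** Classic dynamic-programming Levenshtein edit distance, using two rolling rows. */
    static int levenshtein(String s, String t) {
        int[] prev = new int[t.length() + 1];
        int[] curr = new int[t.length() + 1];
        for (int j = 0; j <= t.length(); j++) {
            prev[j] = j; // distance from the empty prefix of s
        }
        for (int i = 1; i <= s.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= t.length(); j++) {
                int cost = s.charAt(i - 1) == t.charAt(j - 1) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1] + 1, prev[j] + 1);
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[t.length()];
    }

    /** Returns the subset terms within maxEdits of the query term; cost grows with the subset size only. */
    static List<String> fuzzyMatch(String queryTerm, List<String> subsetTerms, int maxEdits) {
        List<String> hits = new ArrayList<>();
        for (String term : subsetTerms) {
            if (levenshtein(queryTerm, term) <= maxEdits) {
                hits.add(term);
            }
        }
        return hits;
    }
}
```

The point of the sketch is only that the number of distance computations is bounded by the size of the subset's term list rather than by the number of terms in the whole index.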
Re: Good and performance and fuzzy search
On Wednesday, December 10, 2003, at 04:07 PM, julien gerard wrote:
> I'm attempting to optimize a fuzzy search on a big index with ~4.400.000
> Documents ( lucene's meanning ) in 600.000 sub-categories (Simple
> Text.Keyword type a field ).
> My purpose is to limit the amount of documents on wich the fuzzy search
> with levenhstein disance is performed ( an user cannot search on the
> 600.000 sub-categories but on 1 to 3 max )
> the classics lucenes ways to do that are not adapted to my case :
> - multiple indexes : having 600.000 indexes is a nightmare for
>   maintenance.
> - QueryFilter is not adapted because it's the fuzzy search which is in
>   The QueryFilter and the number of different request is too important,
>   so I cannot reuse the same.
> - The BooleanQuery with 'AND' parameter is also not adapted because the
>   two search are executed and after the results are merged.

QueryFilter would do the trick if the query you hand to it is instead the one that singles out a "sub-category". It would limit the documents searched to just that sub-category, and the fuzzy search would then be run through IndexSearcher.search, with the filter passed in as well.

Will this scheme work for you?

Erik
Good and performance and fuzzy search
Hi,

I'm attempting to optimize a fuzzy search on a big index with ~4,400,000 Documents (in Lucene's sense) in 600,000 sub-categories (a simple Text.Keyword-type field).

My goal is to limit the number of documents on which the fuzzy search with Levenshtein distance is performed (a user cannot search across all 600,000 sub-categories, only 1 to 3 at most). The classic Lucene ways of doing that do not suit my case:

- Multiple indexes: having 600,000 indexes is a maintenance nightmare.
- QueryFilter is not suitable because it is the fuzzy search that would be in the QueryFilter, and the number of different requests is too large, so I cannot reuse the same filter.
- A BooleanQuery with the 'AND' parameter is not suitable either, because both searches are executed and the results are merged afterwards.

So my first question is: is there any way to do a fuzzy search on a subset of the index that I have not seen yet? If no such solution exists, what could I implement to perform this kind of search?

I could implement a FuzzyFilter, but I would need to access each document, which is time-consuming. I know that this solution costs a lot of memory, which has already been discussed on this list, but in my case it is the only way I can see to decrease the execution time.

regards,
Julien.

PS: Sorry for my poor English.
Re: Query expansion
How do you model and store your taxonomies/ontologies? Do you use Java data structures or RDF?

Cheers,
Ralf

> Hi Everybody,
>
> I wish to use an hierarchy of concept provided by an Ontology to refine
> or expand my query answer with Lucene.
> May I Know If someone have tryed it yet ?
>
> Thanks,
> Gayo
Re: Query expansion
Hi,

Expanding a query basically means generating a new query that reuses the existing terms plus the ones selected from your ontology/taxonomy. This has been discussed here before, so you should search the archive. Extracting and using the right bits from your ontology is basically a task for your program logic and depends heavily on your reasoning and choices.

Cheers,
Ralf

> Hi Everybody,
>
> I wish to use an hierarchy of concept provided by an Ontology to refine
> or expand my query answer with Lucene.
> May I Know If someone have tryed it yet ?
>
> Thanks,
> Gayo
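A minimal sketch of "generate a new query from the existing terms plus the ones picked from the ontology", in plain Java. The class OntologyExpansionSketch, the representation of the ontology as a map from a concept to its narrower concepts, and the sample concepts are all assumptions for illustration; which related concepts to pull in is the program-logic decision mentioned above. The output is a query string with each expanded term wrapped in explicit parentheses, so the grouping of the OR alternatives does not depend on how the parser treats mixed operators.

```java
import java.util.List;
import java.util.Map;

/** Sketch: expand each query term with related concepts taken from a (hypothetical) ontology map. */
class OntologyExpansionSketch {

    /**
     * Builds a new query string. A term with related concepts becomes
     * "(term OR related1 OR related2)"; a term without any is copied unchanged.
     */
    static String expand(String query, Map<String, List<String>> narrowerConcepts) {
        StringBuilder out = new StringBuilder();
        for (String term : query.trim().split("\\s+")) {
            if (out.length() > 0) {
                out.append(' ');
            }
            List<String> related = narrowerConcepts.get(term);
            if (related == null || related.isEmpty()) {
                out.append(term);
            } else {
                out.append('(').append(term);
                for (String concept : related) {
                    out.append(" OR ").append(concept);
                }
                out.append(')');
            }
        }
        return out.toString();
    }
}
```

For example, with a map that lists "spaghetti" and "penne" under "pasta", expanding "pasta sauce" would give "(pasta OR spaghetti OR penne) sauce", which can then be handed to the query parser like any other query string.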
Rebuild index?
Hello,

We use Lucene to search menus; there are around 1 items in the index, and sometimes I see an error like this (/tmp/index-menu is the index directory):

java.io.FileNotFoundException: /tmp/index-menu/_6q2.prx (No such file or directory)
    at java.io.RandomAccessFile.open(Native Method)
    at java.io.RandomAccessFile.<init>(RandomAccessFile.java:204)
    at org.apache.lucene.store.FSInputStream$Descriptor.<init>(FSDirectory.java:389)
    at org.apache.lucene.store.FSInputStream.<init>(FSDirectory.java:418)
    at org.apache.lucene.store.FSDirectory.openFile(FSDirectory.java:291)
    at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:132)
    at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:103)
    at org.apache.lucene.index.IndexReader$1.doBody(IndexReader.java:119)
    at org.apache.lucene.store.Lock$With.run(Lock.java:148)
    at org.apache.lucene.index.IndexReader.open(IndexReader.java:110)
    ...

When I simply rebuild the index, the problem goes away. Could someone give me a hint as to what might cause such strange behavior?

--
Thanks,
Igor Semenko,
http://www.webfood.us
Query expansion
Hi Everybody,

I would like to use a hierarchy of concepts provided by an ontology to refine or expand my query results with Lucene. Has anyone tried this yet?

Thanks,
Gayo
Re: Query Parser AND / OR
Hi Dror,

thanks for your answer.

> > I'm having problems understanding query parsers handling of AND and OR
> > if there's more than one operator.
> >
> > E.g.
> > a OR b AND c
> > gives the same number of hits as
> > b AND c
> > (only scores are different)
>
> This would make sense if all the document that have a also have both B
> and C in them.

Then the query should be equivalent to (a OR b) AND c. But it isn't.
For specific a, b and c I get 766 hits for a OR b AND c and 1086 for (a OR b) AND c.

> > and
> > a AND b OR c AND d
> > seems to be equivalent to
> > a AND b AND C AND d
>
> That's not what I get.
> http://www.fastbuzz.com/search/results.jsp?query=dean+AND+kerry+AND+clark+AND+gephardt&days=
> returns 479 items
> but
> http://www.fastbuzz.com/search/results.jsp?query=dean+AND+kerry+OR+clark+AND+gephardt&days=
> returns 564 items which indicates that the OR does make a difference.
> As expcted, you end up getting more items with the OR.

Hmm. I was sloppy in not specifying the Lucene version. My tests were on 1.2, but I reindexed part of my documents using 1.3rc3 and found the same. What version does fastbuzz use?

I wrote a small test program that indexes all documents consisting of zero or one occurrences each of a, b, c and d (ignoring order and leaving out the empty document, that's just 15 docs) and performs some queries on them.
The program is below; this is what I get:

a OR b AND c -> a +b +c
 4 documents found
  a b c
  a b c d
  b c
  b c d
(a OR b) AND c -> +(a b) +c
 6 documents found
  a b c
  a b c d
  a c
  b c
  a c d
  b c d
a OR (b AND c) -> a (+b +c)
 10 documents found
  a b c
  a b c d
  b c
  a
  b c d
  a b
  a c
  a d
  a b d
  a c d
b AND c -> +b +c
 4 documents found
  b c
  a b c
  b c d
  a b c d
a AND b OR c AND d -> +a +b +c +d
 1 documents found
  a b c d
(a AND b) OR (c AND d) -> (+a +b) (+c +d)
 7 documents found
  a b c d
  a b
  c d
  a b c
  a b d
  a c d
  b c d
a AND (b OR c) AND d -> +a +(b c) +d
 3 documents found
  a b c d
  a b d
  a c d
((a AND b) OR c) AND d -> +((+a +b) c) +d
 5 documents found
  a b c d
  a b d
  c d
  a c d
  b c d
a AND (b OR (c AND d)) -> +a +(b (+c +d))
 5 documents found
  a b c d
  a c d
  a b
  a b c
  a b d
a AND b AND c AND d -> +a +b +c +d
 1 documents found
  a b c d

This is with 1.3rc3, 1.3rc2 or 1.3rc1; with 1.2 I get the same results in a slightly different order.

So I still get the same results for
 a OR b AND c
and
 b AND c
and for
 a AND b OR c AND d
and
 a AND b AND c AND d
(note that the result of the query's toString method is identical for the last two) but different results for every explicit operator grouping I can think of.

So to me the question remains: what do AND and OR mean when they are combined in one expression? I can understand all the query results where the AND and OR queries are explicitly grouped by parentheses, and those results are what I expect. But the rules for combined AND and OR aren't what I would expect.
greetings
Morus

PS: the test program:

import org.apache.lucene.document.*;
import org.apache.lucene.analysis.*;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.*;
import org.apache.lucene.store.*;
import org.apache.lucene.search.*;
import org.apache.lucene.queryParser.QueryParser;

class LuceneTest {
    static String[] docs = {
        "a", "b", "c", "d",
        "a b", "a c", "a d", "b c", "b d", "c d",
        "a b c", "a b d", "a c d", "b c d",
        "a b c d"
    };
    static String[] queries = {
        "a OR b AND c",
        "(a OR b) AND c",
        "a OR (b AND c)",
        "b AND c",
        "a AND b OR c AND d",
        "(a AND b) OR (c AND d)",
        "a AND (b OR c) AND d",
        "((a AND b) OR c) AND d",
        "a AND (b OR (c AND d))",
        "a AND b AND c AND d"
    };

    public static void main(String argv[]) throws Exception {
        Directory dir = new RAMDirectory();
        String[] stop = {};
        Analyzer analyzer = new StandardAnalyzer(stop);
        IndexWriter writer = new IndexWriter(dir, analyzer, true);
        for ( int i=0; i < docs.length; i++ ) {
            Document doc = new Document();
            doc.add(Field.Text("text", docs[i]));
            writer.addDocument(doc);
        }
        writer.close();

        Searcher searcher = new IndexSearcher(dir);
        for ( int i=0; i < queries.length; i++ ) {
            Query query = QueryParser.parse(queries[i], "text", analyzer);
            System.out.println(queries[i] + " -> " + query.toString("text"));
            Hits hits = searcher.search(query);
            System.out.println(" " + hits.length() + " d