Re: Lucene search result not stable
Ardor Wei writes: What might be the problem? How to solve it? Any suggestion or idea will be appreciated.

The only locking problem I have seen so far is that you have to make sure that the temp dir is the same for all applications. Lucene 1.3 stores its lock file in the directory defined by the system property java.io.tmpdir. I had one component running under Tomcat and one from the shell, and they used different temp dirs, which is fatal in this case. Apart from that, it depends pretty much on your environment. I'm using Lucene on Linux on local filesystems; other operating systems or network filesystems may affect locking. Morus - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: setMaxClauseCount ??
Hi Doug, thank you for the answer so far. I actually wanted to add a large amount of text from an existing document in order to find a closely related one. Can you suggest another good way of doing this? A direct match will not occur anyway. How can I make a query that behaves most like the Vector Space Model (VSM), with each word as a dimension value, to find documents close to that vector? You know as well as I do that the standard VSM has no Boolean logic inside... how do I need to formulate the query to make it as similar as possible to a vector, in order to find similar documents in the vector space of the Lucene index? Cheers, Karl

setMaxClauseCount determines the maximum number of clauses, which is not your problem here. Your problem is with required clauses: there may only be a total of 31 required (or prohibited) clauses in a single BooleanQuery. If you need more, then create more BooleanQueries and combine them with another BooleanQuery. Perhaps this could be done automatically, but I've never heard of anyone encountering this limit before. Do you really mean for 32 different terms to be required? Do any documents actually match this query? Doug

Karl Koch wrote: Hi group, I ran into an IndexOutOfBoundsException: java.lang.IndexOutOfBoundsException: More than 32 required/prohibited clauses in query. The reason: I have more than 32 BooleanClauses. From the mailing list I got the info on how to raise the maximum number of clauses before a loop:

    myBooleanQuery.setMaxClauseCount(Integer.MAX_VALUE);
    while (true) {
        Token token = tokenStream.next();
        if (token == null) {
            break;
        }
        myBooleanQuery.add(new TermQuery(new Term(bla, token.termText())), true, false);
    }

However, the error still remains. Why?
Karl
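Doug's workaround - splitting the required terms across several BooleanQueries - hinges on one step: partitioning the term list into groups of at most 31. Here is a minimal sketch of just that step (a hypothetical helper, not part of Lucene; in real code each chunk would become its own BooleanQuery of required TermQuery clauses, and the chunks would then be combined as required clauses of a parent BooleanQuery):

```java
import java.util.ArrayList;
import java.util.List;

public class ClausePartitioner {
    // Lucene 1.3 allows at most 31 required/prohibited clauses per BooleanQuery.
    static final int MAX_REQUIRED = 31;

    // Split a flat term list into chunks small enough for one BooleanQuery each.
    static List<List<String>> partition(List<String> terms) {
        List<List<String>> chunks = new ArrayList<>();
        for (int i = 0; i < terms.size(); i += MAX_REQUIRED) {
            chunks.add(new ArrayList<>(
                terms.subList(i, Math.min(i + MAX_REQUIRED, terms.size()))));
        }
        return chunks;
    }
}
```

With 70 terms this yields three sub-queries (31 + 31 + 8 clauses), each safely under the limit.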
Vector -> LinkedList for performance reasons...
I'm looking at a lot of the code in Lucene... I assume Vector is used for legacy reasons. In an upcoming version I think it might make sense to migrate to using a LinkedList, since Vector has to do an array copy when it's exhausted. It's also synchronized, which kind of sucks. I'm seeing this used in a lot of tight loops, so things might be sped up a bit by using the Collections framework... Kevin -- Please reply using PGP: http://peerfear.org/pubkey.asc NewsMonster - http://www.newsmonster.org/ Dean in 2004! - http://blog.deanforamerica.com/ Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965 AIM/YIM - sfburtonator, Web - http://peerfear.org/ GPG fingerprint: 4D20 40A0 C734 307E C7B4 DCAA 0303 3AC5 BD9D 7C4D IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster
Re: Vector -> LinkedList for performance reasons...
I agree that synchronization in Vector is a waste of time if it isn't required, but I'm not sure if LinkedList is a better (faster) choice than ArrayList. I think only a profiler could tell. Francesco

Kevin A. Burton [EMAIL PROTECTED] wrote: I'm looking at a lot of the code in Lucene... I assume Vector is used for legacy reasons. In an upcoming version I think it might make sense to migrate to using a LinkedList... since Vector has to do an array copy when it's exhausted. It's also synchronized which kind of sucks... I'm seeing this being used in a lot of tight loops so things might be sped up a bit by using Collections ... Kevin

- Francesco Bellomi Use truth to show illusion, and illusion to show truth.
Re: setMaxClauseCount ??
Karl Koch wrote: I actually wanted to add a large amount of text from an existing document in order to find a closely related one. Can you suggest another good way of doing this? A direct match will not occur anyway. How can I make a query that behaves most like the Vector Space Model (VSM), with each word as a dimension value, to find documents close to that vector?

You should try to reduce the dimensionality by reducing the number of unique features. In this case, you could for example use only keywords (or key phrases) instead of the full content of the documents. -- Best regards, Andrzej Bialecki - Software Architect, System Integration Specialist, CEN/ISSS EC Workshop, ECIMF project chair, EU FP6 E-Commerce Expert/Evaluator - FreeBSD developer (http://www.freebsd.org)
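For readers less familiar with the VSM terminology in this thread: the model Karl describes scores documents by the cosine of the angle between term-frequency vectors. A self-contained sketch of that idea, independent of Lucene (whitespace tokenization only, no stemming or idf weighting - purely illustrative):

```java
import java.util.HashMap;
import java.util.Map;

public class Cosine {
    // Term-frequency vector for a piece of text (whitespace tokens, lower-cased).
    static Map<String, Integer> tf(String text) {
        Map<String, Integer> v = new HashMap<>();
        for (String t : text.toLowerCase().split("\\s+")) {
            if (!t.isEmpty()) v.merge(t, 1, Integer::sum);
        }
        return v;
    }

    // Cosine similarity between two tf vectors: dot product over norms.
    static double similarity(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0);
        }
        double na = 0, nb = 0;
        for (int x : a.values()) na += (double) x * x;
        for (int x : b.values()) nb += (double) x * x;
        return dot == 0 ? 0 : dot / (Math.sqrt(na) * Math.sqrt(nb));
    }
}
```

Reducing dimensionality, as Andrzej suggests, amounts to shrinking these maps to a few keywords before comparing; every unique term is one dimension, so fewer unique features means cheaper, more focused queries.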
HTML tagged terms boosting...
Hello! Is there any way to boost terms in HTML documents that are surrounded by HTML tags such as B, H1, etc.? Can it be done with the existing API, or is a re-implementation of TokenStream with custom Token types needed? It seems to me that even such a re-implementation won't help without changing the indexing and searching code... I hope I'm wrong. Thanks in advance. Alexey.
Re: HTML tagged terms boosting...
It definitely cannot be done with custom token types. You're probably aiming for field-specific boosting, so you will need to parse the HTML into separate fields and use a multi-field search approach. I'm sure there are other tricks that could be used for boosting, like inserting the words inside <b> multiple times into the same field, for example. Erik

On Jan 21, 2004, at 6:50 AM, Alexey Maksakov wrote: Hello! Is there any way to boost terms in HTML documents that are surrounded by HTML tags such as B, H1, etc.? Can it be done with the existing API, or is a re-implementation of TokenStream with custom Token types needed? [...]
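Erik's repeated-words trick might be sketched like this. This is a hypothetical helper, not Lucene API; the naive regex handling of HTML is only for illustration (a real HTML parser would be needed in practice), and note that it discards the original offsets, which matters for the snippet-building discussed in the follow-up:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BoldBooster {
    private static final Pattern BOLD =
        Pattern.compile("<b>(.*?)</b>", Pattern.CASE_INSENSITIVE);

    // Return the plain text with each <b>..</b> span appended `times` extra
    // times, so those terms get a higher term frequency in the indexed field.
    static String boostBold(String html, int times) {
        StringBuilder extra = new StringBuilder();
        Matcher m = BOLD.matcher(html);
        while (m.find()) {
            for (int i = 0; i < times; i++) extra.append(' ').append(m.group(1));
        }
        // Strip all remaining tags and normalize whitespace.
        String plain = html.replaceAll("<[^>]+>", " ");
        return (plain + extra).replaceAll("\\s+", " ").trim();
    }
}
```

Feeding the returned string to the analyzer for the body field makes the bold words score higher via tf, without any change to the searcher.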
Re: HTML tagged terms boosting...
Thanks for the answer. Yes, I'm after field-specific boosting, but I'm also looking at creating short descriptions of the documents found, based on the query (as is done in most search engines). I had thought about those solutions, but it seemed to me that they are not straightforward and would cause trouble when building the result descriptions. On second thought, an answer was found: analyze the document as a stream and put terms into separate fields (or create duplicates) while maintaining the original offsets in the Token objects. After that, building the description is quite simple - just use TermPositions in IndexReader and then get the corresponding text portion(s) from the Field. (Sadly, that only works in the case of a single body field - so only duplicates are usable; several Fields would, I think, require an extra unindexed body Field to fetch document pieces quickly.) I hope I haven't missed anything... Hmm... transparent it is not. :-) I just hope it helps somebody else.

Erik Hatcher [EMAIL PROTECTED] wrote on 21.01.2004 15:27: It definitely cannot be done with custom token types. You're probably aiming for field-specific boosting, so you will need to parse the HTML into separate fields and use a multi-field search approach. I'm sure there are other tricks that could be used for boosting, like inserting the words inside <b> multiple times into the same field, for example. Erik
Re: Query Term Questions
On Jan 20, 2004, at 10:22 AM, Terry Steichen wrote: 1) Is there a way to set the query boost factor depending not on the presence of a term, but on the presence of two specific terms? For example, I may want to boost the relevance of a document that contains both iraq and clerics, but not boost the relevance of documents that contain only one or the other term. (The idea is better discrimination than if I simply boosted both terms.) But doesn't the query itself take this into account? If there are multiple matching terms, then the overlap (coord) factor kicks in. 2) Is it possible to apply (or simulate) a negative query boost factor? For example, I may have a complex query with lots of terms but want to reduce the relevance of a matching document that also includes the term iowa. (The idea is an easier and more discriminating way than simply increasing the relevance of all other terms besides iowa.) Another reply mentioned negative boosting. Is that not working as you'd like? 3) Is there a way to handle variants of a phrase without OR'ing together the variants? For example, I may want to find documents dealing with North Korea; the terms might be north korea, north korean, or north koreans - is there a way to handle this with a single term using wildcards? Sounds like what you're really after is fancier analysis. This is one of the purposes of analysis: to do stemming. Erik
Re: Query Term Questions
Erik, Thanks for your response. My specific comments (TS==) are inserted below. I should make clear that I'm using fairly complex, embedded queries - not ones that the user is expected to enter. Regards, Terry

- Original Message - From: Erik Hatcher To: Lucene Users List Sent: Wednesday, January 21, 2004 9:31 AM Subject: Re: Query Term Questions

But doesn't the query itself take this into account? If there are multiple matching terms, then the overlap (coord) factor kicks in. TS==Except that I'd like to be able to choose to do this on a query-by-query basis. In other words, it's desirable that some specific queries significantly increase their discrimination based on this multiple matching, relative to the normal extra boost given by the coord factor. However, I take it from your answer that there's no way to do this in the query itself (at least using the unmodified, standard Lucene version). Another reply mentioned negative boosting. Is that not working as you'd like? TS==I've not been able to get negative boosting to work at all. Maybe there's a problem with my syntax. If, for example, I do a search with green beret^10, it works just fine. But green beret^-2 gives me a ParseException showing a lexical error. Sounds like what you're really after is fancier analysis. This is one of the purposes of analysis: to do stemming. TS==Well, I hope I'm not trying to be fancy. It's just that listing all the different variants, particularly since (as in my case) I have to do this for multiple fields, gets tedious and error-prone. The example above is simply one such case for a particular query - other queries may have entirely different desired combinations. Constructing a single stemmer to handle all such cases would be (for me, at least) very difficult. Besides, I tend to stay away from stemming because I believe it can introduce some rather unpredictable side effects.
Re: Query Term Questions
On Jan 21, 2004, at 10:01 AM, Terry Steichen wrote: TS==Except that I'd like to be able to choose to do this on a query-by-query basis. [...] I take it from your answer that there's no way to do this in the query itself. Don't interpret my replies as being absolute here - I'm still learning lots about Lucene and am open to being shown new ways of doing things with it. TS==I've not been able to get negative boosting to work at all. Maybe there's a problem with my syntax. If, for example, I do a search with green beret^10, it works just fine. But green beret^-2 gives me a ParseException showing a lexical error. Have you tried it without using QueryParser, boosting a Query with setBoost instead? QueryParser is a double-edged sword, and it looks like it only allows numeric characters (plus a . followed by numeric characters) in a boost. So QueryParser has the problem with negative boosts, but not Query itself. TS==Well, I hope I'm not trying to be fancy. It's just that listing all the different variants gets tedious and error-prone. [...] I'd still recommend trying some of the other analyzer options out there and seeing if you can tweak things to your liking. This is really the answer for what you are after, I'm almost certain. Good stemmers exist - look at the Porter one or the Snowball ones. Write some test cases to analyze the analyzer, like I did in my java.net articles - it really will let you experiment with indexing and searching easily. Erik
Re: Vector -> LinkedList for performance reasons...
Hi, I'd like to help improve Lucene. How can I help?

On Wednesday, January 21, 2004, at 16:38, Doug Cutting wrote: Francesco Bellomi wrote: I agree that synchronization in Vector is a waste of time if it isn't required. It would be interesting to see whether such synchronization actually impairs overall performance significantly. This would be fairly simple to test. but I'm not sure if LinkedList is a better (faster) choice than ArrayList. Correct. ArrayList is the substitute for Vector. One could also try replacing Hashtable with HashMap in many places. I think only a profiler could tell. I wouldn't trust a profiler for this. Rather, benchmarks before and after the change will best show real performance. A substantial indexing benchmark and some search benchmarks, searching fairly large indexes, would be good. My hunch is that the speedup will not be significant: synchronization costs in modern JVMs are very small when there is no contention. But only measurement can say for sure. Doug
Re: Query Term Questions
Erik Hatcher writes: TS==I've not been able to get negative boosting to work at all. Maybe there's a problem with my syntax. If, for example, I do a search with green beret^10, it works just fine. But green beret^-2 gives me a ParseException showing a lexical error. Have you tried it without using QueryParser, boosting a Query with setBoost instead? QueryParser is a double-edged sword, and it looks like it only allows numeric characters (plus a . followed by numeric characters). So QueryParser has the problem with negative boosts, but not Query itself. He said he wants to have one term less important than the others (at least that's what I understood). That's done with positive boost factors smaller than 1.0 (e.g. 0.5 or 0.1), which might be called 'negative boosting' (just as braking is a form of negative acceleration). If you use negative boost factors you would actually decrease the score of a match (not just increase it less) and risk ending up with a negative score. I don't think that would be a good idea. Morus
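Morus's point about boost arithmetic can be seen with a toy score function (purely illustrative, not Lucene's actual scoring formula - real scores also involve tf, idf, and normalization):

```java
public class BoostDemo {
    // A toy score: the sum of per-term contributions, each scaled by its boost.
    static double score(double[] contributions, double[] boosts) {
        double s = 0;
        for (int i = 0; i < contributions.length; i++) {
            s += contributions[i] * boosts[i];
        }
        return s;
    }
}
```

With boosts {1.0, 0.5} the second term still adds to the score, just less; with {1.0, -2.0} it subtracts, and the total can go negative - which is why a small positive boost is the safer way to de-emphasize a term.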
Re: QueryParser and stopwords
Hello Morus, --- Morus Walter [EMAIL PROTECTED] wrote: Hi, I'm currently trying to get rid of query parser problems with stopwords (depending on the query, there are ArrayIndexOutOfBoundsExceptions, e.g. for stop AND nonstop, where stop is a stopword and nonstop is not). While this isn't hard to fix (I'll enter a bug and patch in Bugzilla), There is already a bug report open for this. A very old one, too! there's one issue left that I'm not sure how to deal with: what should the query parser return for a query string containing only stopwords? null? And when I think about this, there's another one: stop AND NOT nonstop creates a BooleanQuery containing only prohibited terms, which AFAIK cannot be used in a search. How to deal with this? Currently it returns an empty BooleanQuery. I think it would be more useful to return null in this case. Either one should be okay - null, to be consistent with the above. Looking forward to the patch for this OLD bug. Otis
Re: setMaxClauseCount ??
Hello Doug, that sounds interesting to me. I refer to a paper written by NIST about relevance feedback which did tests with 20-200 words. This is why I thought it might be good to be able to use all non-stopwords of a document for that and see what happens. Do you know good papers about strategies for selecting keywords effectively, beyond stopword lists and stemming? Using term frequencies of the document is not really possible, since Lucene does not provide access to a document vector, does it? By the way, could you send me Dmitry's code for the Vector extension? I have been asking in another thread but have not got it so far. I really would like to have a look... Also, it would be nice to know the status of integrating it into Lucene 1.3. Who is working on it, and how could I contribute? Cheers, Karl

Andrzej Bialecki wrote: Karl Koch wrote: I actually wanted to add a large amount of text from an existing document in order to find a closely related one. Can you suggest another good way of doing this? You should try to reduce the dimensionality by reducing the number of unique features. In this case, you could for example use only keywords (or key phrases) instead of the full content of documents. Indeed, this is a good approach. In my experience, six or eight terms are usually enough, and they needn't all be required. Doug
Re: setMaxClauseCount ??
Karl: http://nagoya.apache.org/eyebrowse/[EMAIL PROTECTED]msgId=114748 Status: several people have mentioned they wanted to work on it, but nobody has contributed any patches. The code you see at the above URL is not compatible with Lucene 1.3, but could be brought up to date. Otis

--- Karl Koch [EMAIL PROTECTED] wrote: Hello Doug, that sounds interesting to me. [...] By the way, could you send me Dmitry's code for the Vector extension? I have been asking in another thread but have not got it so far. I really would like to have a look... Also, it would be nice to know the status of integrating it into Lucene 1.3. Who is working on it, and how could I contribute? Cheers, Karl
RE: setMaxClauseCount ??
There are just about as many ways of doing it as there are papers about automatic relevance feedback. Many require domain-specific reference documents that are full of facts and are therefore good sources of related words. Some people use WordNet. Some of these techniques can add 400-500 terms to a query if they are searching long documents and using reference documents that are equally long. The technique is most important when searching long documents and almost irrelevant for very short ones. Herb

-Original Message- From: Karl Koch Sent: Wednesday, January 21, 2004 11:09 AM To: Lucene Users List Subject: Re: setMaxClauseCount ?? That sounds interesting to me. I refer to a paper written by NIST about relevance feedback which did tests with 20-200 words. This is why I thought it might be good to be able to use all non-stopwords of a document for that and see what happens. Do you know good papers about strategies for selecting keywords effectively, beyond stopword lists and stemming?
Re: setMaxClauseCount ??
Karl Koch wrote: Do you know good papers about strategies for selecting keywords effectively, beyond stopword lists and stemming? Using term frequencies of the document is not really possible, since Lucene does not provide access to a document vector, does it?

Lucene does let you access the document frequency of terms, with IndexReader.docFreq(). Term frequencies can be computed by re-tokenizing the text, which, for a single document, is usually fast enough. But looking up the docFreq() of every term in the document is probably too slow. You can use some heuristics to prune the set of terms, to avoid calling docFreq() too much, or at all. Since you're trying to maximize a tf*idf score, you're probably most interested in terms with a high tf. Choosing a tf threshold even as low as two or three will radically reduce the number of terms under consideration. Another heuristic is that terms with a high idf (i.e., a low df) tend to be longer. So you could threshold the terms by the number of characters, not selecting anything less than, e.g., six or seven characters. With these sorts of heuristics you can usually find a small set of, e.g., ten or fewer terms that do a pretty good job of characterizing a document. It all depends on what you're trying to do. If you're trying to eke out that last percent of precision and recall regardless of computational difficulty so that you can win a TREC competition, then the techniques I mention above are useless. But if you're trying to provide a more-like-this button on a search results page that does a decent job and has good performance, such techniques might be useful. An efficient, effective more-like-this query generator would be a great contribution, if anyone's interested. I'd imagine that it would take a Reader or a String (the document's text) and an Analyzer, and return a set of representative terms using heuristics like those above. The frequency and length thresholds could be parameters, etc.
Doug
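The tf and length thresholds Doug describes are straightforward to sketch. The helper below is hypothetical (the class name and parameters are mine, not an existing Lucene API); it assumes the term frequencies have already been computed by re-tokenizing the document, as suggested above:

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class MoreLikeThisTerms {
    // Pick up to maxTerms representative terms: tf >= minTf and
    // length >= minLen, highest-tf first. A rough stand-in for the
    // pruning heuristics described in the message above.
    static List<String> select(Map<String, Integer> tf,
                               int minTf, int minLen, int maxTerms) {
        return tf.entrySet().stream()
            .filter(e -> e.getValue() >= minTf && e.getKey().length() >= minLen)
            .sorted((a, b) -> b.getValue() - a.getValue())
            .limit(maxTerms)
            .map(Map.Entry::getKey)
            .collect(Collectors.toList());
    }
}
```

The surviving terms would then be OR'ed together (none required) into the more-like-this query, keeping docFreq() lookups to a handful of candidates.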
Re: Query Term Questions
Morus, Unfortunately, using positive boost factors less than 1 causes the parser to barf just as negative boost factors do. Regards, Terry

- Original Message - From: Morus Walter To: Lucene Users List Sent: Wednesday, January 21, 2004 10:54 AM Subject: Re: Query Term Questions He said he wants to have one term less important than the others (at least that's what I understood). That's done with positive boost factors smaller than 1.0 (e.g. 0.5 or 0.1), which might be called 'negative boosting' (just as braking is a form of negative acceleration). If you use negative boost factors you would actually decrease the score of a match (not just increase it less) and risk ending up with a negative score. I don't think that would be a good idea. Morus
Re: Query Term Questions
On Jan 21, 2004, at 4:21 PM, Terry Steichen wrote: PS: Is this in the docs? If not, maybe it should be mentioned. Depends on what you consider the docs. I looked at QueryParser.jj to see what it parses. Also, http://jakarta.apache.org/lucene/docs/queryparsersyntax.html has an example of 0.2. Documentation patches gladly accepted :)) Erik
Re: Vector -> LinkedList for performance reasons...
On Wednesday 21 January 2004 08:38, Doug Cutting wrote: Francesco Bellomi wrote: I agree that synchronization in Vector is a waste of time if it isn't required. It would be interesting to see if such synchronization actually impairs overall performance significantly. This would be fairly simple to test. True. At the same time, it's questionable whether there's any benefit to not changing it to ArrayList. However: but I'm not sure if LinkedList is a better (faster) choice than ArrayList. Correct. ArrayList is the substitute for Vector. One could also try replacing Hashtable with HashMap in many places. Yes, LinkedList is pretty much never more efficient (either memory- or performance-wise) than ArrayList. The array copy needed when doubling the size (which happens seldom enough as a list grows) is negligible compared to the increased GC activity and memory usage of LinkedList entries (an object overhead of ~24 bytes for each entry, plus allocation/GC). And obviously indexed access is hideously slow, if that's needed. I've yet to find any use for LinkedList; it'd make sense to have some sort of combination (a segmented array list, i.e., a linked list of arrays) for huge lists... but LinkedList just isn't useful even there. ... My hunch is that the speedup will not be significant. Synchronization costs in modern JVMs are very small when there is no contention. But only measurement can say for sure. Apparently 1.4 specifically had significant improvements there, reducing the cost of synchronization. -+ Tatu +-
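The before/after measurement Doug asks for might start with something like this (a toy micro-benchmark; JVM warm-up, JIT effects, and contention are all ignored, so treat any numbers it prints with suspicion - a serious comparison would use a proper harness):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Vector;

public class AppendBench {
    // Time `n` appends into the given list; returns elapsed nanoseconds.
    static long timeAppends(List<Integer> list, int n) {
        long start = System.nanoTime();
        for (int i = 0; i < n; i++) list.add(i);
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        int n = 1_000_000;
        // One throwaway run per class as a crude warm-up, then measure.
        timeAppends(new Vector<>(), n);
        timeAppends(new ArrayList<>(), n);
        long v = timeAppends(new Vector<>(), n);
        long a = timeAppends(new ArrayList<>(), n);
        System.out.println("Vector: " + v + " ns, ArrayList: " + a + " ns");
    }
}
```

Since both classes implement List, the same driver exercises either one, which is exactly what makes an apples-to-apples swap easy to test.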
1.3-final: now giving me java.io.FileNotFoundException (Too many open files)
I'm getting the following stack trace from lucene-1.3-final running on JDK 1.4.2_03-b02 on Linux:

    java.io.FileNotFoundException: /home/matt/blah/idx/_123n.tis (Too many open files)
        at java.io.RandomAccessFile.open(Native Method)
        at java.io.RandomAccessFile.<init>(RandomAccessFile.java:204)
        at org.apache.lucene.store.FSInputStream$Descriptor.<init>(FSDirectory.java:389)
        at org.apache.lucene.store.FSInputStream.<init>(FSDirectory.java:418)
        at org.apache.lucene.store.FSDirectory.openFile(FSDirectory.java:291)
        at org.apache.lucene.index.TermInfosReader.<init>(TermInfosReader.java:79)
        at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:141)
        at org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java:423)
        at org.apache.lucene.index.IndexWriter.maybeMergeSegments(IndexWriter.java:401)
        at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:260)
        at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:244)
        at com.foo.Foo.perform(Foo.java:53)

I've only just upgraded to 1.3-final from 1.3-RC2, and now I've started seeing this error. I'll try to trace it down further and see if it is me leaking file handles, and not Lucene. Any chance this is a Lucene bug? =Matt
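On Linux, a couple of shell commands help tell a handle leak from a genuinely low limit (the fd count below inspects the current shell as a self-contained example; in practice you would run it against your JVM's pid instead of `$$`):

```shell
# Current per-process open-file soft limit.
ulimit -n

# Count file descriptors this process currently holds open; substitute the
# JVM's pid for $$ to watch Lucene's usage climb during a merge.
ls /proc/$$/fd | wc -l
```

If the count keeps climbing during merges, raising the soft limit before starting the JVM (e.g. `ulimit -n 4096`, up to the hard limit shown by `ulimit -Hn`) may help, and lowering IndexWriter's mergeFactor should reduce the number of segments, and therefore descriptors, held open at once. Also make sure old IndexReaders and searchers are actually closed after the upgrade.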