Re: Storing info about the index in the index
> you could use a special document in the index to do this.

I was thinking about doing it that way, but I find that solution very ugly :)

> You could also keep a .properties or .xml file alongside the index.

Can I store such a file inside the index directory? Will Lucene delete my file at some event (at optimize time, or whatever)?

Regards, Sanyi

__ Do you Yahoo!? Yahoo! Mail - Easier than ever with enhanced search. Learn more. http://info.mail.yahoo.com/mail_250 - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
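A metadata file next to the index can be managed with plain java.util.Properties. Whether Lucene will ever delete a foreign file inside its own directory is version-dependent, so the safest answer to the question above is to keep the file in a separate directory. This is a minimal sketch; the file name and property key are made up for illustration.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.util.Properties;

public class IndexMetadata {
    // Hypothetical file name -- ideally kept NEXT TO, not inside, the index directory.
    static final String META_FILE = "index-meta.properties";

    static void save(File dir, Properties props) {
        try (OutputStream out = new FileOutputStream(new File(dir, META_FILE))) {
            props.store(out, "index metadata");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static Properties load(File dir) {
        Properties props = new Properties();
        try (InputStream in = new FileInputStream(new File(dir, META_FILE))) {
            props.load(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }
}
```

A caller would save metadata such as the last index build date after writing the index, and read it back at startup.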
Re: DateFilter on UnStored field
> Following up on PA's reply. Yes, DateFilter works on *indexed* values, so whether a field is stored or not is irrelevant.

Great news, thanx!

> However, DateFilter will not work on fields indexed as 2004-11-05. DateFilter only works on fields that were indexed using the DateField.

Well, can you post a short example here? Where I currently type xxx.UnStored(..., can I simply type xxx.DateField(...? Does it take strings like 2004-11-05?

> One option is to use a QueryFilter instead, filtering with a RangeQuery.

I've read somewhere that classic range filtering can easily exceed the maximum number of boolean query clauses. I need to filter a very large range of dates with day accuracy, and I don't want to increase the max clause count to very high values. So, I decided to use DateFilter, which has no such problems AFAIK. How much impact does DateFilter have on search times?

Regards, Sanyi
Re: DateFilter on UnStored field
> DateField has a utility method to return a String: DateField.timeToString(file.lastModified()) You'd use that String to pass to Field.UnStored. I recommend, though, that you use a different format, such as the yyyy-MM-dd format you're using.

Well, I read the yyyy-MM-dd format string from a database. So, I need to know how to convert yyyy-MM-dd to DateField.timeToString()'s result format. Or I have to convert yyyy-MM-dd to file.lastModified()'s format, which I can then pass to DateField.timeToString(). What is the easiest solution?

> Lucene's latest codebase (though not 1.4.x) includes RangeFilter, which would do the trick for you. If you want to stick with Lucene 1.4.x, that's fine... just grab the code for that filter and use it as a custom filter - it's compatible with 1.4.x.

So, why do you recommend RangeFilter over DateFilter? Does it require less index data, and/or does it have better performance? (I'm using 1.4.2)

> It depends on whether you instantiate a new filter for each search. Building a filter requires scanning through the terms in the index to build a BitSet for the documents that fall in that range. Filters are best used over multiple searches.

Simply put: I let the user enter the search string on an HTML form, then I call my custom Lucene-based Java class through the command line (the calling method may change to the PHP-to-Java bridge if it proves to fit my needs). So, every search is a whole new round: new HTML form post - new command-line JVM call - new index searcher, etc. The OS is caching the index file pretty well (only the memory size is the limit, of course). Will my implementation's performance drop a lot when I implement DateFilter?

Regards, Sanyi
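One easy conversion path, sketched here with only the JDK: parse the yyyy-MM-dd string into epoch milliseconds (the same kind of value file.lastModified() returns), then hand that to DateField.timeToString(). The class and method names below are illustrative, not from Lucene.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;

public class DateConvert {
    // Parse a "yyyy-MM-dd" string into epoch milliseconds -- the same kind of
    // value file.lastModified() returns, and what DateField.timeToString() takes.
    static long toMillis(String yyyyMMdd) {
        try {
            return new SimpleDateFormat("yyyy-MM-dd").parse(yyyyMMdd).getTime();
        } catch (ParseException e) {
            throw new IllegalArgumentException("not a yyyy-MM-dd date: " + yyyyMMdd, e);
        }
    }

    // The reverse direction, useful for displaying stored millisecond values.
    static String toDayString(long millis) {
        return new SimpleDateFormat("yyyy-MM-dd").format(new Date(millis));
    }
}
```

Assuming Lucene 1.4's API, indexing would then look something like `doc.add(Field.UnStored("date", DateField.timeToString(DateConvert.toMillis(dbValue))))`, though the exact call depends on your schema.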
DateFilter on UnStored field
Hi! Does DateFilter work on fields indexed as UnStored? Can I filter an UnStored field with values like 2004-11-05? Regards, Sanyi
Re: PHP-Lucene Integration
Thanx a lot! Sanyi --- [EMAIL PROTECTED] wrote: Howdy, [...]
Re: PHP-Lucene Integration
Hi! Can you please explain how you implemented the Java and PHP parts to let them communicate through this bridge? The bridge's project summary talks about a Java application server or a dedicated Java process, and I'm not that much into Java. Currently I'm using a self-written command-line search program that outputs its results to standard output. I guess your solution must be better ;) If the communication parts of your code aren't top secret, can you please share them with me/us? Regards, Sanyi
Synonyms for AND/OR/NOT operators
Hi! What is the simplest way to add synonyms for the AND/OR/NOT operators? I'd like to support two sets of operator words, so people can use either the original English operators or my custom ones for our local language. Thank you for your attention! Sanyi
Re: Synonyms for AND/OR/NOT operators
Hi! I think we're talking about different things. My question is about using synonyms for the AND/OR/NOT operators, not about synonyms of words in the index. For example, in some language: AND = AANNDD; OR = OORR; NOT = NNOOTT. So, the user can enter: (cat OR kitty) AND black AND tail or: (cat OORR kitty) AANNDD black AANNDD tail. Both sets of operators must work. It must be some kind of query parser modification/parameterization, so it has nothing to do with the index. I hope I was more specific now ;) Thanx, Sanyi

--- Erik Hatcher [EMAIL PROTECTED] wrote: On Dec 21, 2004, at 3:04 AM, Sanyi wrote: What is the simplest way to add synonyms for AND/OR/NOT operators? I'd like to support two sets of operator words, so people can use either the original English operators or my custom ones for our local language. There are two options that I know of: 1) add synonyms during indexing and 2) add synonyms during querying. Generally this would be done using a custom analyzer. If the synonym mappings are static and you don't mind a larger index, adding them during indexing avoids the complexity of rewriting the query. Injecting synonyms during querying allows the synonym mappings to change dynamically, though it does produce more complex queries. Here's an example you'll find with the source code distribution of Lucene in Action, which uses WordNet to look up synonyms. Erik

p.s. I'm sensitive to over-marketing Lucene in Action in this forum, as it would bother me to constantly see an advertisement. You can be sure that any mentions of it from me will coincide with concrete examples (which are freely available) that are directly related to the questions being asked.

% ant -emacs SynonymAnalyzerViewer
Buildfile: build.xml
check-environment:
compile:
build-test-index:
build-perf-index:
prepare:
SynonymAnalyzerViewer:
Using a custom SynonymAnalyzer, two fixed strings are analyzed with the results displayed.
Synonyms, from the WordNet database, are injected into the same positions as the original words. See the Analysis chapter for more on synonym injection and position increments. The Tools and extensions chapter covers the WordNet feature found in the Lucene sandbox. Press return to continue... Running lia.analysis.synonym.SynonymAnalyzerViewer... 1: [quick] [warm] [straightaway] [spry] [speedy] [ready] [quickly] [promptly] [prompt] [nimble] [immediate] [flying] [fast] [agile] 2: [brown] [brownness] [brownish] 3: [fox] [trick] [throw] [slyboots] [fuddle] [fob] [dodger] [discombobulate] [confuse] [confound] [befuddle] [bedevil] 4: [jumps] 5: [over] [o] [across] 6: [lazy] [faineant] [indolent] [otiose] [slothful] 7: [dogs] 1: [oh] 2: [we] 3: [get] [acquire] [aim] [amaze] [arrest] [arrive] [baffle] [beat] [become] [beget] [begin] [bewilder] [bring] [can] [capture] [catch] [cause] [come] [commence] [contract] [convey] [develop] [draw] [drive] [dumbfound] [engender] [experience] [father] [fetch] [find] [fix] [flummox] [generate] [go] [gravel] [grow] [have] [incur] [induce] [let] [make] [may] [mother] [mystify] [nonplus] [obtain] [perplex] [produce] [puzzle] [receive] [scram] [sire] [start] [stimulate] [stupefy] [stupify] [suffer] [sustain] [take] [trounce] [undergo] 4: [both] 5: [kinds] 6: [country] [state] [nationality] [nation] [land] [commonwealth] [area] 7: [western] [westerly] 8: [bb] BUILD SUCCESSFUL Total time: 10 seconds
Re: Synonyms for AND/OR/NOT operators
Well, I guess I'd better recognize the operator synonyms and replace them with their original forms before passing the query to QueryParser. I don't feel comfortable tampering with Lucene's source code. Anyway, thanx for the answers. Sanyi

--- Morus Walter [EMAIL PROTECTED] wrote: Erik Hatcher writes: On Dec 21, 2004, at 3:04 AM, Sanyi wrote: What is the simplest way to add synonyms for AND/OR/NOT operators? I'd like to support two sets of operator words, so people can use either the original English operators or my custom ones for our local language. There are two options that I know of: 1) add synonyms during indexing and 2) add synonyms during querying. Generally this would be done using a custom analyzer. I guess you misunderstood the question. I think he wants to know how to create a query parser that understands something like 'a UND b' as well as 'a AND b', to support localized operator names (German in this case). AFAIK that can only be done by copying the query parser's javacc source and adding the operators there. It shouldn't be difficult, though it's a bit ugly since it implies code duplication. And there will be no way of choosing the operators dynamically at runtime. One will need different query parsers for different languages. Morus
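The recognize-and-replace approach can live entirely outside Lucene: rewrite the localized operator words into QueryParser's English ones before parsing. A minimal sketch, using the AANNDD/OORR/NNOOTT placeholders from this thread (substitute your real localized words):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class OperatorSynonyms {
    // Maps localized operator words to QueryParser's English operators.
    // AANNDD/OORR/NNOOTT are the placeholders from this thread, not real words.
    static final Map<String, String> SYNONYMS = new LinkedHashMap<>();
    static {
        SYNONYMS.put("AANNDD", "AND");
        SYNONYMS.put("OORR", "OR");
        SYNONYMS.put("NNOOTT", "NOT");
    }

    // Replace whole-word occurrences only (\b = word boundary), so terms that
    // merely contain the operator text are left alone.
    static String normalize(String query) {
        String result = query;
        for (Map.Entry<String, String> e : SYNONYMS.entrySet()) {
            result = result.replaceAll("\\b" + e.getKey() + "\\b", e.getValue());
        }
        return result;
    }
}
```

The normalized string is then handed to QueryParser unchanged, so no javacc grammar needs to be touched, and the synonym table can even vary per user language at runtime.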
What is the best file system for Lucene?
Hi! I'm testing Lucene 1.4.2 on two very different configs, but with the same index. I'm very surprised by the results: both systems are searching at about the same speed, but I'd expect (and I really need) Lucene to run a lot faster on my stronger config.

Config #1 (a notebook): WinXP Pro, NTFS, 1.8GHz Pentium-M, 768MB memory, 7200RPM winchester
Config #2 (a desktop PC): SuSE 9.1 Pro, reiserfs, 3.0GHz P4 HT (virtually two 3.0GHz P4s), 3GB RAM, 15000RPM U320 SCSI winchester

You can see that the hardware of #2 is at least twice as fast as #1. I'm searching for the reason, and for a solution to take advantage of the better hardware compared to the poor notebook. Currently #2 can't dramatically outperform the notebook (#1). The question is: what can be worse on #2 than on the poor notebook? I can imagine only software problems. Which are the software parts, then? 1. The OS: is SuSE 9.1 a LOT slower than WinXP Pro? 2. The file system: is reiserfs a LOT slower than NTFS? Regards, Sanyi
Re: What is the best file system for Lucene?
> Interesting, what are your merge settings?

Sorry, I didn't mention that I was talking about search performance. I'm using the same, fully optimized index on both systems. (I've generated both indexes with the same code, from the same database, on the respective OS.)

> which JDK are you using?

I'm using the same Sun JDK on both systems. I've tried j2sdk1.4.2_04, _05 and _06 so far. I didn't notice speed differences between these subversions. Do you know about significant speed differences between them that I should notice?

> Have you tried with hyperthreading turned off on #2?

No, but I will try it if the problem isn't in the file system. I hope that the reason for the slowness is reiserfs, because that is the easiest to change. What file systems are you people using Lucene on? And what are your experiences? Regards, Sanyi
Re: What is the best file system for Lucene?
> Could you try XP on your desktop?

Sure, but I'll only do that if I run out of ideas.

> so your desktop is actually using a 1.5GHz CPU for the search.

No, this is not true. It uses a 3.0GHz P4 then. (HT means that you have two virtual 3.0GHz P4s.) So, it is still surprising to me. Regards, Sanyi
Re: AW: What is the best file system for Lucene?
> The notebook is quite good; e.g. the Pentium-M might be faster than your Pentium 4, or at least has similar speed, because of its better internal design. Never compare CPUs of different types by their frequency.

Ok, this might be true, but: all of my other tests where the CPU is involved run a LOT faster on the desktop PC with the 3GHz P4. Even other Java parts run a LOT faster (nearly twice as fast). So, we can't even say that the Java VM takes no advantage of the 3GHz P4 compared to the 1.8GHz Pentium-M. Everything is a LOT faster, except searching with Lucene (which is also faster, but only slightly).

> Maybe your index is small enough to fit into the cache provided by the operating system. So you wouldn't recognize any difference between your hard disks.

It is a 3GB index and I always reboot between tests, so caching is not the case.

> I don't think so. I'm using Windows 2000 Pro and SuSE 9.0, and (from my memory) Linux seems to be slightly faster, but I can't provide any benchmark now.

Are you using reiserfs with SuSE? Regards, Sanyi
Re: What is the best file system for Lucene?
> How large is the index? If it's less than a couple of GB then it will be entirely in memory.

It is 3GB big and it will grow a lot. I have to search from the HDD, which is very fast compared to the notebook's HDD.

Average seek time: notebook: 8-9ms; desktop: 3.9ms
Data read: notebook: max. ~20MB/sec; desktop: 60-80MB/sec

So, if the bottleneck is the HDD, it has to be 2x-3x faster on the desktop system. Except if reiserfs is a lot slower than NTFS.

> For example (and this is only an example) looking up a hostname in the DNS will take about the same time on almost any machine you can get hold of.

Ok, but I have very simple and pure tests, and everything is measured part by part. ...and every part speeds up a lot on the desktop system, except the Lucene search part.

> You don't say how you're measuring search performance and you don't say what you're seeing.

I call my Java program from the command line on both systems, like: search hello. Then it searches for the word and collects the elapsed milliseconds between every call to anything. Then it displays the results. It is very simple.

> Also, what's the load on the system while you're running the tests? gkrellm on Linux is very useful as an overall view -- are you CPU bound, are you seeing lots of disk traffic? Is the system actually more-or-less idle?

Thanx for the hint. Since my search returns only 30 hits, it completes too fast to let me monitor it in real time. Anyway, if reiserfs proves to be fast enough, I'll search for other reasons and will perform longer tests for real-time monitoring. Regards, Sanyi
Re: What is the best file system for Lucene?
> simply load your index into a RAMDirectory instead of using FSDirectory.

I have 3GB of RAM and my index is currently 3GB big (it'll soon be about 4GB). So, I have to find another way.

> First off, 1.8GHz Pentium-M machines are supposed to run at about the speed of a 2.4GHz machine. The clock speeds on the mobile chips are lower, but they tend to perform much better than rated. I recommend you take a general benchmark of both machines testing both disk speed and cpu speed to get a baseline performance comparison.

I think it is a good general benchmark that almost everything runs at least twice as fast on the 3.0GHz P4, except Lucene search. I can tell one more interesting thing: I have a MySQL table with ~20 million records. I throw a DROP INDEX at that table, and MySQL rebuilds the whole huge table into a tempfile. It completes in 30 minutes on both systems. It doesn't matter, again, that the 15kRPM U320 HDD is 2x-3x as fast. Very surprising again. Hmm... reiserfs must be very, very slow, or I'm completely lost :)

> I also suggest turning off HT for your benchmarks and performance testing.

I'll try this later, and I really hope it won't be the reason.

> Secondly, while the second machine appears to be twice as fast, the disk could actually perform slower on the Linux box, especially if the notebook drive has a big (8M) cache like most 7200RPM ATA disk drives do.

Both drives have an 8M cache.

> I imagine that if you hit the index with lots of simultaneous searches, the Linux box would hold its own for much longer than the XP box, simply due to the random seek performance of the SCSI disk combined with SCSI command queueing.

Are you saying that SCSI command queuing wastes more time than a 15kRPM 3.9ms HDD can gain over a 7.2kRPM 8-9ms HDD? That sounds terrible, and I hope it isn't true.

> RAM speed is a factor too. Is the P4 a Xeon processor? The older HT Xeons have a much slower bus than the newer P4-M processors. Memory speed will be affected accordingly.

It is not a Xeon, just a P4 3.0GHz HT.

> I haven't heard of a hard disk referred to as a winchester disk in a very long time :)

;)

> Once you have an idea of how the two machines actually compare performance-wise, you can then judge how they perform index operations.

Lucene indexing completes in 13-15 hours on the desktop system, while it completes in about 29-33 hours on the notebook. Now, combine that with the DROP INDEX tests completing in the same amount of time on both, and find out why the search is only slightly faster :)

> Until then, all your measurements are subjective and you don't gain much by comparing the two indexing processes.

I'm worried about searching. Indexing is a lot faster on the desktop config. Regards, Sanyi
Re: What is the best file system for Lucene?
Thanx for the replies to you all. I was looking for someone with the same experiences as mine, but it seems that I'll have to test this myself. I'll try out my own ideas and the most interesting ideas from you guys. Regards, Sanyi
Re: WildcardTermEnum skipping terms containing numbers?!
> why reindex?

Well, since I had different experiences with the different analyzers I've tried, I thought that this problem must originate either from the indexing or from a Lucene bug.

> As stated at the end of my mail, I'd expect that to skip the first term in the enum.

Yes, this must be the problem for me, since I took this sentence from the manual as my starting point: "Returns the current Term in the enumeration. Initially invalid, valid after next() called for the first time." So, it seems that it was a bug in the docs, not in the API itself.

> Is that what you miss, or do you lose more than one term?

It seemed to me that it was skipping more, but I'd better not claim so, since I didn't know that the term is valid even before the first next(), so I could have been misled by my own chaotic experiments. Since my code has been completely restructured since then, I don't have all the surrounding stuff needed for further testing. Anyway, we've found a docs bug thanks to you, and my code is cleaner and better the other way. Thanx!
Re: WildcardTermEnum skipping terms containing numbers?!
Enumerating the terms using WildcardTermEnum and an IndexReader seems to be too buggy to use. I'm now reimplementing my code using WildcardTermEnum.wildcardEquals, which seems to work better so far.

--- Sanyi [EMAIL PROTECTED] wrote: Hi! I have the following problem with 1.4.2: I'm searching for c?ca (using StandardAnalyzer) and one of the hits looks something like this: blabla c0ca c0la etc.. etc... (those big o-s are zero characters). Now, I'm enumerating the terms using WildcardTermEnum and all I get is: caca ccca ceca cica coca crca csca cuca cyca. It doesn't know about c0ca at all. Is there any solution to this problem? Thanks, Sanyi
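For reference, the wildcard semantics in question ('?' matches exactly one character, '*' matches any run, including an empty one) fit in a few lines of plain Java. This is a simplified stand-in for WildcardTermEnum.wildcardEquals, not Lucene's actual implementation, but it shows why c0ca should match c?ca: '?' matches the digit zero like any other character.

```java
public class SimpleWildcard {
    // Minimal recursive matcher for Lucene-style wildcards.
    static boolean matches(String pattern, String text) {
        return match(pattern, 0, text, 0);
    }

    private static boolean match(String p, int pi, String t, int ti) {
        if (pi == p.length()) return ti == t.length();
        char pc = p.charAt(pi);
        if (pc == '*') {
            // '*' consumes zero or more characters of the text.
            for (int skip = ti; skip <= t.length(); skip++) {
                if (match(p, pi + 1, t, skip)) return true;
            }
            return false;
        }
        if (ti == t.length()) return false;
        // '?' matches any single character, including digits like '0'.
        if (pc == '?' || pc == t.charAt(ti)) return match(p, pi + 1, t, ti + 1);
        return false;
    }
}
```

A matcher like this can also be used to sanity-check which terms an enumeration over the index *should* have produced.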
Re: Bug in the BooleanQuery optimizer? ..TooManyClauses
- leave the current implementation, raising an exception;
- handle the exception and limit the boolean query to the first 1024 (or whatever the limit is) terms;
- select, among the possible terms, only the first 1024 (or whatever the limit is) most meaningful ones, leaving out all the others.

I like this idea, and I would refine it like this: I'd also create a default rule, to avoid handling exceptions for people who're happy with the default behavior: keep and search for only the longest 1024 fragments, so it'll throw away a, an, at, and, add, etc., but it'll automatically keep 1024 variations like alpha, alfa, advanced, automatical, etc. So, it'll automatically lower the search overhead and will still search fine without throwing exceptions. (For people who prefer the widest search range and don't care about the huge overhead, we could leave a boolean switch for keeping not the longest but the shortest fragments.)
Anyone implemented custom hit ranking?
Hi! I have problems with short-text ranking. I've read about the same ranking problems in the list archives, but found only hints and thoughts (adjust DefaultSimilarity, Similarity, etc...), not complete solutions with source code. Has anyone implemented a good solution for this problem? (Example: my search application returns about 10-20 pages of 1-2 word hits for "hello", and only then does it start to list the longer texts.) I've implemented a very simple solution: at index time I boost documents shorter than 300 chars with 1/300*doclength. Now it works a lot better. In fact, I can't see any problems now. Anyway, I think this is not the solution; this is a patch or workaround. So, I'd be interested in some kind of well-designed, complete solution for this problem. Regards, Sanyi
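The index-time boost described above (1/300*doclength for documents shorter than 300 characters) can be sketched as a small helper; in Lucene 1.4 the result would presumably go into Document.setBoost() before the document is added to the IndexWriter. The class and method names are illustrative.

```java
public class ShortDocBoost {
    // Boost from the thread: documents shorter than 300 characters are
    // boosted by doclength/300, so a 150-char document gets 0.5 and anything
    // of 300 chars or more keeps the neutral boost of 1.0.
    static float boostFor(int docLengthChars) {
        if (docLengthChars >= 300) return 1.0f;
        return docLengthChars / 300.0f;
    }
}
```

Usage would look roughly like `doc.setBoost(ShortDocBoost.boostFor(text.length()))`, which down-weights very short documents so they no longer crowd out longer matches.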
Re: Phrase search for more than 4 words throws exception in QueryParser
It works for me too on Linux. Thanks for the test!

--- Morus Walter [EMAIL PROTECTED] wrote: Sanyi writes: How do I perform phrase searches for more than four words? This works well with 1.4.2: aa bb cc dd. I pass the query as a command-line parameter on XP: \"aa bb cc dd\". QueryParser translates it to: text:aa text:bb text:cc text:dd. Runs, searches, finds proper matches. This throws an exception in QueryParser: aa bb cc dd ee. I pass the query as a command-line parameter on XP: \"aa bb cc dd ee\". The exception's text is: org.apache.lucene.queryParser.ParseException: Lexical error at line 1, column 13. Encountered: EOF after : \"aa bb cc dd

Works for me on Linux: java -cp lucene.jar org.apache.lucene.queryParser.QueryParser 'a b c d e f g h i j k l m n o p q r s t u v w x y z' prints a b c d e f g h i j k l m n o p q r s t u v w x y z. Must be an XP command line problem. HTH Morus
Re: Bug in the BooleanQuery optimizer? ..TooManyClauses
> It is normally possible to reduce the number of such complaints a lot by imposing a minimum prefix length

I've already limited it to a minimum of 5 characters (abcde*). I can still easily find (on the first try) situations where it searches for minutes, while other 5-character partial words search in a second. So, this is not a solution at all.

> and e.g. doubling or tripling the max. nr. of clauses.

This is the only useful thing I could do, and the other way I've found is similar: unlimiting the number of clauses, but limiting the memory given to Java. It'll then throw an exception if things get too hard for the searcher. Anyway, this avoids DoS attacks, but results in a very poor user interface and search ability. For example: rareword AND commonfragment* would still refuse to work. I won't be able to explain this to my users, since they don't need my technical reasons. They'll only notice that dodge AND vip* fails to search instead of returning 1000 documents. If I unlimit everything and don't care about possible DoS attacks, it is still poor. It'll search for dodge AND vip* for two minutes, just because vip* is too common in the entire document set. It doesn't matter that dodge is pretty rare and we're AND-ing it with vip*.
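The minimum-prefix-length check mentioned above is typically enforced before the query ever reaches QueryParser. A sketch of such a guard (the class and method names are made up for illustration):

```java
public class WildcardGuard {
    // Accept a wildcard term only if the literal prefix before the first
    // '*' or '?' is at least minPrefix characters long; terms without
    // wildcards are always fine.
    static boolean acceptable(String term, int minPrefix) {
        int firstWildcard = term.length();
        for (int i = 0; i < term.length(); i++) {
            char c = term.charAt(i);
            if (c == '*' || c == '?') { firstWildcard = i; break; }
        }
        if (firstWildcard == term.length()) return true; // no wildcard at all
        return firstWildcard >= minPrefix;
    }
}
```

As the thread points out, such a guard limits but does not eliminate expensive expansions; a short prefix like vip* can still hit a huge number of terms.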
Bug in the BooleanQuery optimizer? ..TooManyClauses
Hi! First of all, I've read about BooleanQuery$TooManyClauses, so I know that it has a 1024-clause limit by default, which is good enough for me, but I still think it works strangely.

Example: I have an index with about 20 million documents. Let's say there are about 3000 variants in the entire document set of this word mask: cab*. Let's say that about 500 documents contain the word: spectrum. Now, when I search for cab* AND spectrum, I don't expect it to throw an exception. It should first restrict the search to the 500 documents containing the word spectrum, then collect the variants of cab* within these documents, which turns out to be two or three variants of cab* (cable, cables, maybe some more), and the search should return, let's say, 10 documents.

Similar example: when I search for cab* AND nonexistingword, it still throws a TooManyClauses exception instead of saying "No results", since there is no nonexistingword in my document set, so it doesn't even have to start collecting the variations of cab*.

Is there any patch for this issue? Thank you for your time! Sanyi (I'm using: Lucene 1.4.2)

p.s.: Sorry for re-sending this message; I first sent it as an accidental reply to a wrong thread.
RE: Bug in the BooleanQuery optimizer? ..TooManyClauses
Yes, I understand all of this, but I don't want to set it to MaxInt, since that can easily lead to (even accidental) DoS attacks. What I'm saying is that there is no reason for the optimizer to expand wild* to more than 1024 variations when I search for somerareword AND wild*: since somerareword is present in only, let's say, 100 documents, wild* should only expand to words beginning with wild within those 100 documents, and then it would work fine with the default 1024-clause limit. But it doesn't, so I can choose between unusable queries and accidental DoS attacks.

--- Will Allen [EMAIL PROTECTED] wrote: Any wildcard search will automatically expand your query to the number of terms it finds in the index that suit the wildcard. For example: wild* would become wild OR wilderness OR wildman etc. for each of the terms that exist in your index. It is because of this that you quickly reach the 1024-clause limit. I automatically set it to max int with the following line: BooleanQuery.setMaxClauseCount( Integer.MAX_VALUE ); -Original Message- From: Sanyi Sent: Thursday, November 11, 2004 6:46 AM Subject: Bug in the BooleanQuery optimizer? ..TooManyClauses
Re: Bug in the BooleanQuery optimizer? ..TooManyClauses
> That's the point: there is no query optimizer in Lucene.

Sorry, I'm not very much into Lucene's internal classes; I'm just telling you the viewpoint of a user. You know, my users aren't technicians, so answers like yours won't make them happy. They will only see that I randomly don't allow them to search (with the 1024 limit). They won't understand why I'm displaying "Please restrict your search a bit more..." when they've just searched for dodge AND vip* and there are only a few documents matching this criteria. So, is the only way to let them search happily to set the max clause limit to MaxInt?!
Phrase search for more than 4 words throws exception in QueryParser
Hi! How do I perform phrase searches for more than four words?

This works well with 1.4.2: aa bb cc dd
I pass the query as a command line parameter on XP: \aa bb cc dd\
QueryParser translates it to: text:aa text:bb text:cc text:dd
It runs, searches, and finds proper matches.

This throws an exception in QueryParser: aa bb cc dd ee
I pass the query as a command line parameter on XP: \aa bb cc dd ee\
The exception's text is: org.apache.lucene.queryParser.ParseException: Lexical error at line 1, column 13. Encountered: EOF after : \aa bb cc dd

It doesn't matter what words I enter; the only thing that matters is the number of words, which can be four at most. Regards, Sanyi
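One plausible cause (an assumption, not confirmed in the thread): the "EOF after : \aa bb cc dd" message suggests QueryParser received an opening phrase quote that was never closed, i.e. the shell mangled or truncated the argument before Lucene ever saw it. A quick way to check is to dump exactly what the JVM receives; ArgsDump is a hypothetical helper class:

```java
public class ArgsDump {
    // Joins the args back with visible delimiters so mangled shell
    // quoting is obvious before the string ever reaches QueryParser.
    static String dump(String[] args) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < args.length; i++) {
            sb.append(i).append(": [").append(args[i]).append("]\n");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(dump(args));
    }
}
```

If the five-word phrase arrives split across several arguments, or with its closing quote missing, the problem is in the cmd.exe quoting, not in QueryParser's word limit (it has none for phrases).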
stopword AND validword throws exception
Hi! I've left custom stopwords out of my index using StopAnalyzer(customstopwords). Now, when I try to search the index the same way (StopAnalyzer(customstopwords)), it seems to act strangely.

This query works as expected: validword AND stopword (throws out the stopword part and searches for validword)
This query seems to crash: stopword AND validword (java.lang.ArrayIndexOutOfBoundsException: -1)

Maybe it can't handle the case where it has to remove the very first part of the query?! Can anyone else test this for me? How can I overcome this problem? (lucene-1.4-final.jar) Thanks for your time! Sanyi
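For anyone reproducing this, here is a minimal sketch (the field name and stopword list are made up for illustration) that parses both orderings with a custom-stopword StopAnalyzer; on lucene-1.4-final the second parse is the one that blew up:

```java
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;

public class StopwordQueryDemo {
    public static void main(String[] args) throws Exception {
        String[] customStopwords = { "stopword" }; // hypothetical list
        StopAnalyzer analyzer = new StopAnalyzer(customStopwords);

        // Works: the stopped clause is dropped from the tail of the query.
        Query ok = QueryParser.parse("validword AND stopword", "text", analyzer);
        System.out.println(ok.toString("text"));

        // On lucene-1.4-final this threw ArrayIndexOutOfBoundsException: -1
        // because the very first clause was removed by the analyzer.
        Query q = QueryParser.parse("stopword AND validword", "text", analyzer);
        System.out.println(q.toString("text"));
    }
}
```

As the follow-ups below note, the fix shipped in 1.4.2, where the second parse simply drops the stopped clause as well.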
Re: stopword AND validword throws exception
Thanx for your replies guys. Now, I was trying to locate the latest patch for this problem group, and the last thread I've read about this is: http://issues.apache.org/bugzilla/show_bug.cgi?id=25820 It ends with an open question from Morus: "If you want me to change the patch, let me know. That's no big deal." Did you change the patch since then? In other words: what is the latest development on this topic? Can I simply download the latest compiled development version of lucene.jar, and will it fix my problem? The latest builds I could find are these: http://cvs.apache.org/builds/jakarta-lucene/nightly/2003-09-09/ They seem to be quite old, so please help me out! Thanx, Sanyi

--- Morus Walter [EMAIL PROTECTED] wrote:

Sanyi writes:
This query works as expected: validword AND stopword (throws out the stopword part and searches for validword)
This query seems to crash: stopword AND validword (java.lang.ArrayIndexOutOfBoundsException: -1)
Maybe it can't handle the case where it has to remove the very first part of the query?! Can anyone else test this for me? How can I overcome this problem?

see bug: http://issues.apache.org/bugzilla/show_bug.cgi?id=9110 Morus
Re: stopword AND validword throws exception
But the fix seems to be included in 1.4.2. See http://cvs.apache.org/viewcvs.cgi/*checkout*/jakarta-lucene/CHANGES.txt?rev=1.96.2.4 item 5. Thank you! I'm just downloading 1.4.2. I hope it'll work ;) Sanyi
Bug in the BooleanQuery optimizer? ..TooManyClauses
Hi! First of all, I've read about BooleanQuery$TooManyClauses, so I know that it has a 1024-clause limit by default, which is good enough for me, but I still think it works strangely.

Example: I have an index with about 20 million documents. Let's say there are about 3000 variants in the entire document set of this word mask: cab*. Let's say about 500 documents contain the word: spectrum. Now, when I search for cab* AND spectrum, I don't expect it to throw an exception. It should first restrict the search to the 500 documents containing the word spectrum, then collect the variants of cab* within those documents, which turns out to be two or three variants of cab* (cable, cables, maybe some more), and the search should return, let's say, 10 documents.

Similar example: when I search for cab* AND nonexistingword, it still throws a TooManyClauses exception instead of saying "No results", since there is no nonexistingword in my document set, so it doesn't even have to start collecting the variations of cab*.

Is there any patch for this issue? Thank you for your time! Sanyi (I'm using: lucene 1.4.2)