Re: Field.java - STORED, NOT_STORED, etc...
On Sunday 11 July 2004 10:03, Doug Cutting wrote:
> Doug Cutting wrote:
>> The calls would look like:
>>   new Field(name, value, Stored.YES, Indexed.NO, Tokenized.YES);
>
> Actually, while we're at it, Indexed and Tokenized are confounded. A single entry would be better, something like:
> ...
> then calls would look like just:
>   new Field(name, value, Store.YES, Index.TOKENIZED);
> ...
> and adding a boolean clause would look like:
>   booleanQuery.add(new TermQuery(...), Occur.MUST);
>
> Then we can deprecate the old methods.
> Comments?

I was about to suggest this, instead of int/boolean constants, since it is a recommended good practice and allows better type safety (until JDK 1.5's real enums at least). I would prefer this over un-typesafe consts, although even just defining and using simple consts would be an improvement over the existing situation.

Another possibility (or maybe a complementary approach) would be to do away with constructor access completely: make the constructors private or protected, and only allow factory methods to be used externally. This would have the benefit of even better readability: a minimum number of arguments (the method name would replace one or two args) and full type checking. Plus it'd be easier to modify implementations should that become necessary. Factory methods are especially useful for classes like Field that are not designed to be sub-classed.

-+ Tatu +-

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
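The two ideas in this message (typesafe constant classes instead of int/boolean flags, and factory methods instead of public constructors) can be sketched together roughly as follows. This is only an illustration: SketchField and its factory method names are hypothetical and are not Lucene's Field class or API.

```java
// Hypothetical sketch; SketchField is NOT Lucene's Field class.
final class SketchField {

    /** Typesafe constant class in the pre-JDK-1.5 style: fixed instances, private constructor. */
    static final class Store {
        static final Store YES = new Store("YES");
        static final Store NO = new Store("NO");
        private final String name;
        private Store(String name) { this.name = name; }
        @Override public String toString() { return name; }
    }

    /** One constant class covers "indexed" and "tokenized" together, as suggested in the thread. */
    static final class Index {
        static final Index NO = new Index("NO");
        static final Index TOKENIZED = new Index("TOKENIZED");
        static final Index UN_TOKENIZED = new Index("UN_TOKENIZED");
        private final String name;
        private Index(String name) { this.name = name; }
        @Override public String toString() { return name; }
    }

    private final String name;
    private final String value;
    private final Store store;
    private final Index index;

    // Private constructor: callers have to go through the factory methods below.
    private SketchField(String name, String value, Store store, Index index) {
        this.name = name;
        this.value = value;
        this.store = store;
        this.index = index;
    }

    /** Stored and indexed as a single, untokenized term. */
    static SketchField keyword(String name, String value) {
        return new SketchField(name, value, Store.YES, Index.UN_TOKENIZED);
    }

    /** Stored, tokenized and indexed. */
    static SketchField text(String name, String value) {
        return new SketchField(name, value, Store.YES, Index.TOKENIZED);
    }

    /** Tokenized and indexed, but not stored. */
    static SketchField unStored(String name, String value) {
        return new SketchField(name, value, Store.NO, Index.TOKENIZED);
    }

    String name() { return name; }
    String value() { return value; }
    boolean isStored() { return store == Store.YES; }
    boolean isIndexed() { return index != Index.NO; }
    boolean isTokenized() { return index == Index.TOKENIZED; }
}
```

Because the constants are distinct classes, passing a Store where an Index is expected fails at compile time, which is the type-safety benefit mentioned above.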
Re: Bridge with OpenOffice
On Monday 19 April 2004 14:01, Mario Ivankovits wrote:
> Stephane James Vaucher wrote:
>> Anyone try what Joerg suggested here?
>> http://nagoya.apache.org/eyebrowse/[EMAIL PROTECTED] pache.orgmsgNo=6231
>
> Don't know what you would like to do, but if you simply would like to extract text, you could simply try this snippet:

This leads to a question I have been thinking about: it seems that this thread originally started with someone pointing out that OO can be used as a converter from other formats... but how about a tokenizer for native OO documents?

I have written full-featured converters from OO to (simplified) DocBook and HTML, and creating one just for tokenizing, to be used by Lucene, would be much easier. Even if it tokenized into separate fields (document metadata, content, maybe bibliography separately, etc.), it'd be easy to do.

Would anyone find a full-featured, customizable OpenOffice document tokenizer useful?

-+ Tatu +-
Re: Suggestion for Token.java
On Tuesday 13 April 2004 15:31, Holger Klawitter wrote:
> Hi Erik,
>
>> What is wrong with simply creating a new token that replaces an incoming one for synonyms? I'm just playing devil's advocate here since you can already get the termText() through the public _method_.
>
> Well, you're right; I forgot about cloning, but ... (Lords advocate :-)
>
> 1.) Cloning implies the need to change filters whenever the fields in Token change.

On the other hand, one needs to be sure that no other code assumes Tokens are immutable. For example, if they weren't, one couldn't reliably use tokens in Sets or Maps (not sure if it's useful to do that, just an example).

I guess it's really a matter of whether tokens were designed as immutable (which often makes sense for similar objects), or if they just happen to be, due to lack of modifier method(s).

-+ Tatu +-
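The Set/Map concern can be shown with a small sketch. MutableToken here is a hypothetical stand-in whose equality is based on its text; it is not Lucene's Token class, and whether Token defines equals/hashCode this way is not something this sketch assumes.

```java
import java.util.HashSet;
import java.util.Set;

final class MutableTokenSketch {

    /** Hypothetical mutable token; equals/hashCode depend on the (changeable) text. */
    static final class MutableToken {
        private String text;
        MutableToken(String text) { this.text = text; }
        void setText(String text) { this.text = text; }
        @Override public boolean equals(Object o) {
            return o instanceof MutableToken && ((MutableToken) o).text.equals(text);
        }
        @Override public int hashCode() { return text.hashCode(); }
    }

    /** Returns whether a HashSet still finds a token after its text was changed in place. */
    static boolean foundAfterMutation() {
        Set<MutableToken> seen = new HashSet<MutableToken>();
        MutableToken token = new MutableToken("quick");
        seen.add(token);
        token.setText("fast");       // hash code changes while the token sits in the set
        return seen.contains(token); // lookup now uses the new hash and misses the entry
    }
}
```

The token is still physically in the set, but lookups can no longer find it, which is why code that keeps tokens in hash-based collections has to be able to rely on them not changing.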
Re: Zero hits for queries ending with a number
On Saturday 03 April 2004 08:34, [EMAIL PROTECTED] wrote:
> On Saturday 03 April 2004 17:11, Erik Hatcher wrote:
>> No objections that error messages and such could be made clearer. Patches welcome! Care to submit better error message handling in this case? Or perhaps allow lower-case to?
>
> I think the best would be if Lucene would simply have a setCaseSensitive(boolean). IMHO it's in any case a bad idea to make searches case-sensitive (per default).

I'd have to disagree. I think that the search engine core should not have to bother with details of character sets, such as lower-casing. Rules for lower/upper/initial/mixed case for all Unicode languages are rather involved... and if you tried to do that, the next thing would be whether accentuation and umlaut marks should matter or not (which is language-dependent).

That's why to me the natural way to go is for the core to do direct comparison when executing queries, without concerning itself with case. This does not prevent anyone from implementing such functionality (see below).

I think the architecture and design of the Lucene core is delightfully simple. One can easily create case-independent functionality by using proper analyzers and (for the most part) configuring QueryParser.

I would agree, however, that QueryParser is a victim of its success; it's too often used in situations where one really should create a proper GUI that builds the query. Backend code can then mangle input as it sees fit, and build query objects. QueryParser is more natural for quick-n-dirty scenarios, where one just has to slap something together quickly, or if one only has a textual interface to deal with. It's a nice thing to have, but it has its limitations; there's no way to create one parser that's perfect for every use(r).

What could be done would be to make sure all examples / demo web apps implement case-insensitive indexing and searching, since that is often what is needed?

-+ Tatu +-

>> But, also, folks need to really step back and practice basic troubleshooting skills. I asked you if that string was what you passed to the QueryParser and you said yes, when in fact it was not. And you
>
> I forgot that I did lower-case it. In fact I even output it in its original state but lower-case it just before I pass it to lucene. That lower-casing is what I would call a hack and hence it's no surprise that I forgot it :-)
>
> Timo
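As a small illustration of why case rules are language-dependent, and of doing the folding outside the core, here is a sketch of an application-side normalizer that would be applied to terms both when indexing and when building queries. CaseFoldingSketch and its method names are made up for this example; only the standard java.lang.String and java.util.Locale calls are real.

```java
import java.util.Locale;

final class CaseFoldingSketch {

    /** Lower-cases a term using the rules of the given locale. */
    static String lower(String term, Locale locale) {
        return term.toLowerCase(locale);
    }

    /**
     * Locale-independent folding; the same method would have to be used on the
     * indexing side and on the query side so that both produce identical terms.
     */
    static String normalize(String term) {
        return lower(term, Locale.ROOT);
    }
}
```

With Turkish rules, lower("TITLE", Locale.forLanguageTag("tr-TR")) produces a dotless i (U+0131) instead of "i", so a term folded one way at index time and another way at query time would not match.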
Re: Performing exact search with Lucene
On Friday 02 April 2004 08:12, Phil brunet wrote:
> Hi all.
>
> I'm migrating a part of an application from Oracle intermedia to Lucene (1.3) to perform full text searches.

Congratulations! :-)

> I'd like to know if there is a way to perform exact queries. By exact query, I mean being able to match ONLY documents that are exactly equal to the terms of the query.

I believe plain old PhraseQuery does exactly that? You can build one yourself, or, using QueryParser, use something like

  +"this is an example"

(making sure you use the correct analyzer, depending on whether you want 'an' to be a significant token in there).

Note, too, that the '+' prefix in there is not absolutely needed if you don't have multiple parts to the query; even without it, it'd only consider documents that have that exact phrase.

-+ Tatu +-

> Example:
> document 1 = this is an example
> document 2 = this is an example of document
> document 3 = this is an other example
>
> Is it possible to match ONLY document 1 if I search for "this is an example"?
>
> Currently, I'm trying to override the DefaultSimilarity class in order to be able to deduce an exact match from the score. My query consists of a BooleanQuery composed of n TermQuery.
>
> I know I can develop by myself a post filter that could compare the number of tokens of the query and the number of tokens of the indexed document. But I would like to know if there is a proper way to do this:
> - directly with Lucene (i.e. a Lucene query that would match only document 1 in my example)
> - by redefining the Similarity and so by interpreting the scores
> - any idea
>
> Thanks.
> Philippe
Re: Caching and paging search results
On Monday 08 March 2004 12:34, Erik Hatcher wrote:
> In the RealWorld... many applications actually just re-run a search and jump to the appropriate page within the hits; searching is generally plenty fast enough to alleviate concerns of caching.
>
> However, if you need to cache Hits, you need to be sure to keep around the originating IndexSearcher as well.

Further, oftentimes the search index only contains a key to the actual content indexed (which itself is stored as a file, in a database, or so)... so it's enough to cache just the set of such ids, not actual search result objects. And assuming the ids are simple (int id, short String), such information can be stored in, say, the user session.

In the system I'm working on, we store up to 500 hits, keeping only the document id (int) and hit quality (byte) for each, in the session.

-+ Tatu +-

> A stateful session bean could be used, but I'd opt for a much simpler solution as a first pass, such as the first point of just re-running a search from scratch.
>
> Erik
>
> On Mar 8, 2004, at 2:14 PM, Clandes Tino wrote:
>> Hi all,
>> could someone describe his experience in implementation of caching, sorting and paging search results.
>> Is Stateful Session bean appropriate for this?
>> My wish is to obtain all search hits only in first call, and after that, to iterate through Hit Collection and display cached results.
>> I have checked SearchBean in contribution section, but it does not provide real caching and paging.
>>
>> Regards and thanx in advance!
>> Milan
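A minimal sketch of the "cache only ids and quality" approach described above, assuming scores are already normalized to the range 0..1 and that page numbers start at 0. CachedHitsSketch and its methods are hypothetical names; the 500-hit cap is the figure from the message, while storing quality as a percentage is this sketch's own choice.

```java
import java.util.Arrays;

final class CachedHitsSketch {

    /** Upper bound on cached hits, as in the message above. */
    static final int MAX_CACHED = 500;

    private final int[] docIds;   // application-level document ids
    private final byte[] quality; // hit quality as a percentage, 0..100

    /** ids and scores are parallel arrays, best hit first; scores are assumed to be in 0..1. */
    CachedHitsSketch(int[] ids, float[] scores) {
        int n = Math.min(ids.length, MAX_CACHED);
        docIds = new int[n];
        quality = new byte[n];
        for (int i = 0; i < n; i++) {
            docIds[i] = ids[i];
            float clamped = Math.max(0f, Math.min(1f, scores[i]));
            quality[i] = (byte) Math.round(clamped * 100f);
        }
    }

    int size() { return docIds.length; }

    int qualityPercent(int i) { return quality[i]; }

    /** Ids for the given zero-based page; empty when the page lies past the cached hits. */
    int[] pageIds(int page, int pageSize) {
        int from = Math.min(page * pageSize, docIds.length);
        int to = Math.min(from + pageSize, docIds.length);
        return Arrays.copyOfRange(docIds, from, to);
    }
}
```

At five bytes per hit, a full cache of 500 hits is about 2.5 KB of payload, which is small enough to keep in a user session; the actual content is then fetched by id only for the page being displayed.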
Re: Vector - LinkedList for performance reasons...
On Wednesday 21 January 2004 08:38, Doug Cutting wrote:
> Francesco Bellomi wrote:
>> I agree that synchronization in Vector is a waste of time if it isn't required,
>
> It would be interesting to see if such synchronization actually impairs overall performance significantly. This would be fairly simple to test.

True. At the same time, it's questionable whether there's any benefit of not changing it to ArrayList. However:

>> but I'm not sure if LinkedList is a better (faster) choice than ArrayList.
>
> Correct. ArrayList is the substitute for Vector. One could also try replacing Hashtable with HashMap in many places.

Yes, LinkedList is pretty much never more efficient than ArrayList, or even as efficient (either memory- or performance-wise). The arraycopy needed when doubling the size (which happens seldom enough as the list grows) is negligible compared to the increased GC activity and memory usage for entries in LinkedList (object overhead of 24 bytes for each entry, alloc/GC). And obviously indexed access is hideously slow, if that's needed.

I've yet to find any use for LinkedList; it'd make sense to have some sort of combination (segmented array list, i.e. linked list of arrays) for huge arrays... but LinkedList just isn't useful even there.

> ...
> My hunch is that the speedup will not be significant. Synchronization costs in modern JVMs are very small when there is no contention. But only measurement can say for sure.

Apparently 1.4 specifically had a significant improvement there, reducing the cost of synchronization.

-+ Tatu +-

> Doug
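Since only measurement can say for sure, here is a rough sketch of how the indexed-access claim could be checked. ListAccessSketch is a made-up name, and a single nanoTime measurement like this is only indicative (no warm-up, no repetition), so treat any numbers it prints as a hint rather than a benchmark.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

final class ListAccessSketch {

    /** Appends 0..n-1 to the given list and returns it. */
    static List<Integer> fill(List<Integer> list, int n) {
        for (int i = 0; i < n; i++) {
            list.add(Integer.valueOf(i));
        }
        return list;
    }

    /** Sums all elements using get(i); O(n) for ArrayList, O(n^2) for LinkedList. */
    static long sumByIndex(List<Integer> list) {
        long sum = 0;
        for (int i = 0; i < list.size(); i++) {
            sum += list.get(i).intValue();
        }
        return sum;
    }

    /** Crude single-shot timing of sumByIndex, in nanoseconds. */
    static long timeNanos(List<Integer> list) {
        long start = System.nanoTime();
        sumByIndex(list);
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        int n = 20000;
        System.out.println("ArrayList : " + timeNanos(fill(new ArrayList<Integer>(), n)) + " ns");
        System.out.println("LinkedList: " + timeNanos(fill(new LinkedList<Integer>(), n)) + " ns");
    }
}
```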
Re: Performance question
On Wednesday 07 January 2004 20:48, Dror Matalon wrote:
> On Wed, Jan 07, 2004 at 07:24:22PM -0700, Scott Smith wrote:
> ...
>> Thanks for the suggestions. I wonder how much faster I can go if I implement some of those?
>
> 25 msecs to insert a document is on the high side, but it depends of course on the size of your document. You're probably spending 90% of your time in the XML parsing. I believe that there are other parsers that are faster than xerces, you might want to look at these. You might want to look at http://dom4j.org/.

I think that more significant than whether one uses DOM or some other full-document in-memory parser is whether to use streaming (usually event-based) parsers instead, such as ones using SAX. These are generally an order of magnitude faster, at least for bigger documents. Fortunately many standard XML parsers can work as both DOM and SAX parsers (I believe Xerces at least does, in any case).

It's a bit more cumbersome to use event-based parsers (push vs. pull; one needs to explicitly keep track of the current subtree, if parent tag order matters), but from a performance perspective (memory usage, speed) it may be worth it.

-+ Tatu +-
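The "keep track of the current subtree" part is the main extra work with SAX. A minimal sketch using the JDK's bundled SAX parser is below; SaxPathSketch and textByPath are hypothetical names, and keying the collected text by a slash-separated element path is just one possible way to do the bookkeeping.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

final class SaxPathSketch {

    /** Streams through the XML once and collects character data per element path, e.g. "/doc/title". */
    static Map<String, String> textByPath(String xml) {
        final Map<String, StringBuilder> collected = new LinkedHashMap<String, StringBuilder>();
        final Deque<String> open = new ArrayDeque<String>(); // currently open elements, outermost first

        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                open.addLast(qName);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                open.removeLast();
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                String path = "/" + String.join("/", open);
                StringBuilder sb = collected.get(path);
                if (sb == null) {
                    sb = new StringBuilder();
                    collected.put(path, sb);
                }
                sb.append(ch, start, length); // text may arrive in several chunks
            }
        };

        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }

        Map<String, String> result = new LinkedHashMap<String, String>();
        for (Map.Entry<String, StringBuilder> entry : collected.entrySet()) {
            String text = entry.getValue().toString().trim();
            if (!text.isEmpty()) {
                result.put(entry.getKey(), text);
            }
        }
        return result;
    }
}
```

No document tree is ever built; the only state kept is the stack of open element names plus the text collected so far, which is where the memory advantage over DOM comes from.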
Re: Lock obtain timed out
On Tuesday 16 December 2003 03:37, Hohwiller, Joerg wrote:
> Hi there,
>
> I have not yet got any response about my problem.
>
> While debugging into the depth of lucene (really hard to read deep inside) I discovered that it is possible to disable the Locks using a System property.
> ...
> Am I safe disabling the locking???
> Can anybody tell me where to get documentation about the Locking strategy (I still would like to know why I have that problem) ???
>
> Or does anybody know where to get an official example of how to handle concurrent index modification and searches?

One problem I have seen, and am still trying to solve, is that if my web app is terminated (running from console during development, ctrl+c on unix), sometimes it seems the commit.lock file is left behind. Now the problem is that the method that apparently checks whether there is a lock (so that one can subsequently ask for it to be removed via the API) doesn't consider that file to be the lock (sorry for not having details, writing this from home without source).

So I'll probably see if disabling locks would get rid of this lock file (as I never have multiple writers, or even a writer and a reader, working on the same index... I just always make a full file copy of the index before doing incremental updates), or physically delete commit.lock if necessary when starting the app.

The problem I describe above happens fairly infrequently, but that's actually what makes it worse... our QA people (on a different continent) have been bitten by it a couple of times. :-/

-+ Tatu +-
Re: Index and Field.Text
On Friday 05 December 2003 10:45, Doug Cutting wrote:
> Tatu Saloranta wrote:
>> Also, shouldn't there be at least 3 methods that take Readers; one for Text-like handling, another for UnStored, and last for UnIndexed.
>
> How do you store the contents of a Reader? You'd have to double-buffer it, first reading it into a String to store, and then tokenizing the StringReader. A key feature of Reader values is that they're streamed:

Not really, you can pass the Reader to the tokenizer, which then reads and tokenizes directly (I think that's the way the code also works). This is because internally a String is read using StringReader, so passing a String looks more like a convenience feature?

> the entire value is never in RAM. Storing a Reader value would remove that advantage. The current API makes this explicit: when you want something streamed, you pass in a Reader, when you're willing to have the entire value in memory, pass in a String.

I guess for things that are both tokenized and stored, passing a Reader can't really help a lot; if one wants to reduce memory usage, the text needs to be read twice, or the analyzer needs to help in writing output; or the text needs to be read into memory much like what happens now. It'd simplify application code a bit, but wouldn't do much more.

So I guess I need to downgrade my suggestion to require just 2 Reader-taking factory methods? :-)

I still think that index-only and store-only versions would both make sense. In the latter case, storing could be done in fully streaming fashion; in the former, tokenization can be done that way?

> Yes, it is a bit confusing that Text(String, String) stores its value, while Text(String, Reader) does not, but it is at least well documented. And we cannot change it: that would break too many applications. But we can put this on the list for Lucene 2.0 cleanups.

Yes, I understand that. It'd not be reasonable to do such a change. But how about adding a more intuitive factory method (UnStored(String, Reader))?

> When I first wrote these static methods I meant for them to be constructor-like. I wanted to have multiple Field(String, String) constructors, but that's not possible, so I used capitalized static methods instead. I've never seen anyone else do this (capitalize any method but a real constructor) so I guess I didn't start a fad! :-) This too should someday be cleaned up. Lucene was the first Java program that I ever wrote, and thus its style is in places non-standard. Sorry.

The best standards are created by people doing things others use, follow or imitate... so it was worth a try! :-)

-+ Tatu +-
Re: SearchBlox J2EE Search Component Version 1.1 released
On Tuesday 02 December 2003 09:51, Tun Lin wrote:
> Anyone knows a search engine that supports xml formats?

There's no way to generally support xml formats, as xml is just a meta-language. However, when building specific search engines on the Lucene core, it should be reasonably straightforward to implement more accurate xml-structure-aware tokenization for specific xml applications like DocBook or other domain-specific apps.

So, if any search engine advertises indexing xml content, one had better read the fine print to learn what they really claim.

It might be interesting to create a Lucene plug-in that, given a specification of how sub-trees under specific elements should be handled, would tokenize and index content into separate fields. Plus the implementation shouldn't be very difficult -- just use a standard XML parser (SAX, DOM), match xpaths, feed the matching content to the analyzer and then add it to the index. This could also be used for HTML (pre-filtering with JTidy or similar first to get xml-compliant HTML).

I wouldn't be surprised if someone on the list has already done this?

-+ Tatu +-
Re: Dates and others
On Monday 01 December 2003 15:13, Dion Almaer wrote:
> ...
> Interesting. I implemented an approach which boosted based on the number of months in the past, and after tweaking the boost amounts, it seems to do the job. I do a fresh reindex every night (since the indexing process takes no time at all... unlike our old search solution!)

This sounds interesting, as I have been thinking about the best way to boost newer documents. Can you share some of your experience regarding boost values that seemed to make sense?

In my case, the CMS I'm working on stores support documentation for software/hardware, meaning that content is highly time-sensitive (i.e. documents decay pretty quickly). Since the system is already doing both incremental reindexing and nightly full reindexing (the latter to make sure that even if some changed content was temporarily not [fully] reindexed, it eventually gets indexed properly), I can fairly easily add boosting, I think.

On a related note, it would also be nice if there was a way to start categorizing general hot topics for Lucene developers; it seems like there are about half a dozen areas where there's lots of interest in improvements (most of them related to ranking). If so, perhaps there could be more specific discussion groups, and also perhaps web pages summarizing some of the discussions and the consensus achieved, even if there's no code to show for it?

-+ Tatu +-

> I read content for the index from different sources. Sometimes the source gives me documents loosely in date order, but not all of them. So, it seems that one of the other approaches should be taken (adding a month/week field etc). I should look more into the HitCollector and see how it can help me.
>
> The other issue I have is that I would like to prioritize the title field. At the moment I am lazy and add the title to the body (contents = title + body) which seems to be OK... however sometimes something that mentions the search term in the title should appear higher up in the pecking order.
>
> I am using the QueryParser (subclassed to disallow wildcards etc) to do the dirty work for me. Should I get away from this and manage the queries myself (and run a Multi against the title field as well as the contents)?
>
> Thanks for the great feedback,
> Dion
Re: inter-term correlation [was Re: Vector Space Model in Lucene?]
On Monday 17 November 2003 07:40, Chong, Herb wrote:
> i don't know what the Java implementation is like but the C++ one is very fast.
> ...
>> I personally do not have any experience with the BreakIterator in Java. Has anyone used it in any production environment? I'd be very interested to learn more about its efficiency.

Even if that implementation wasn't fast (which it should be), it should be fairly easy to write one that is pretty much as efficient as any of the basic tokenizers; i.e. not much slower than full scanning speed over the text data plus token creation overhead.

-+ Tatu +-
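For anyone who wants to try the JDK's java.text.BreakIterator before judging its speed, a minimal sentence-splitting sketch looks roughly like this. SentenceSplitSketch is a made-up wrapper name, and Locale.US is an arbitrary choice for the example.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class SentenceSplitSketch {

    /** Splits text into trimmed sentences using the JDK's sentence BreakIterator. */
    static List<String> sentences(String text) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.US);
        it.setText(text);
        List<String> out = new ArrayList<String>();
        int start = it.first();
        // Each call to next() returns the end offset of the following sentence.
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                out.add(sentence);
            }
        }
        return out;
    }
}
```

The iterator only reports boundary offsets, so an analyzer could use them to mark sentence breaks without creating the intermediate strings this sketch builds for readability.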
Re: Contributing to Lucene (was RE: inter-term correlation [was Re: Vector Space Model in Lucene?])
On Monday 17 November 2003 08:39, Chong, Herb wrote:
> the core of the search engine has to have certain capabilities, however, because they are next to impossible to add as a layer on top with any efficiency. detecting sentence boundaries outside the core search engine is really hard to do without building another search engine index. if i have to do that, there is no point in using Lucene.

It's also good to know what exactly constitutes the core; I would assume that analyzer implementations are not part of it per se, as long as the core knows how to use analyzers.

But as long as the index structure has some way to store the information needed (perhaps by using the existing property of distances between tokens, which allows both overlapping tokens and gaps, like someone suggested?), the core need not know the specifics of how analyzers determine structural (sentence etc.) boundaries.

To me this seems like one of many issues where it's possible to retain the distinction between the Lucene kernel (lean mean core) and more specialized functionality; highlighting was another one.

-+ Tatu +-
Re: positional token info
On Tuesday 21 October 2003 17:31, Otis Gospodnetic wrote:
>> It does seem handy to avoid exact phrase matches on "phone boy" when a stop word is removed though, so patching StopFilter to put in the missing positions seems reasonable to me currently. Any objections to that?
>
> So "phone boy" would match documents containing "phone the boy"? That

Hmmh. WWGD (What Would Google Do)? :-)

> doesn't sound right to me, as it assumes what the user is trying to do. Wouldn't it be better to allow the user to decide what he wants? (i.e. "phone boy" returns documents with that _exact_ phrase. "phone boy"~2 also returns documents containing "phone the boy").

As long as phrase queries work appropriately with proximity modifiers, one alternative (from an app standpoint) would be to:

(a) Tokenize stop words out, adding a skip value; either one per stop word, or one for each non-empty sequence of stop words ("top of the world" might make sense to tokenize as "top - world", with "-" signifying a 'hole').
(b) With phrase queries, first do an exact match.
(c) If the number of matches is too low (whatever the definition of low is), use a phrase query match with a slop of 2 instead.

The tricky part would be to do the same for combination queries, where it's not easy to check matches for individual query components.

Perhaps it'd be possible to create Yet Another Query object that would, given a threshold, do one or two searches (as described above), to allow for self-adjusting behaviour? Or perhaps there should be a container query that could execute an ordered sequence of sub-queries until one returns a good enough set of matches, and then return that set (or the last result(s), if there are no good matches); the above-mentioned "sloppy if need be" phrase query would then just be a special case?

-+ Tatu +-
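The container idea at the end of this message (run an ordered sequence of searches until one returns enough matches) can be sketched independently of any query classes. FallbackSearchSketch and firstGoodEnough are hypothetical names, and each search is represented here simply as a Supplier that returns a list of results; none of this is an existing Lucene API.

```java
import java.util.Collections;
import java.util.List;
import java.util.function.Supplier;

final class FallbackSearchSketch {

    /**
     * Runs the attempts in order and returns the first result list that has at
     * least minHits entries. If none qualifies, the result of the last attempt
     * is returned (an empty list when there were no attempts at all).
     */
    static <T> List<T> firstGoodEnough(List<Supplier<List<T>>> attempts, int minHits) {
        List<T> last = Collections.emptyList();
        for (Supplier<List<T>> attempt : attempts) {
            last = attempt.get(); // later attempts run only if earlier ones fell short
            if (last.size() >= minHits) {
                return last;
            }
        }
        return last;
    }
}
```

For steps (b) and (c), the first supplier would run the exact phrase search and the second the same phrase with a slop of 2; because suppliers are evaluated lazily, the sloppy search costs nothing when the exact one already returns enough hits.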
Re: Hierarchical document
On Monday 20 October 2003 16:41, Erik Hatcher wrote:
> One more thought related to this subject - once a nice scheme for representing hierarchies within a Lucene index emerges, having XPath as a query language would rock! Has anyone implemented O/R or XPath-like query expressions on top of Lucene?

Not me... but at some point I think I briefly mentioned that someone with extra time might want to write a very simple JDBC driver to be used with Lucene. Obviously it would be very minimal for queries (and might need to invent new SQL operators for some searches), but it could also expose metadata about the index. Should be an interesting exercise at least. :-)

Plus, if done properly, tools like DBVis could be used for simple Lucene testing as well. If so, who knows; perhaps that would make it even easier to do prototype implementations of Lucene replacing home-grown SQL-bound search functionality of apps.

Most of the above would just be a nice little hack, though. :-)

-+ Tatu +-
Re: Struts logic iterate
On Monday 06 October 2003 08:35, Lars Hammer wrote:
> ...
> to iterate the Hits. I thought that Hits was an array of pointers to docs,

Actually, Hits contains a Vector (could be an array as well), but is not a Collection itself (one can not extend array classes in Java, so no Object besides basic arrays can be an array or be treated as one). Hits could be made a Collection, though.

In fact, I think it would be a reasonable thing to do, to make Hits a simple Collection (or perhaps a List, since it is an ordered collection). You could file an RFE for this, or better yet, implement it. :-)
I'd think including such a patch in Lucene would make sense as well.

> Has anyone any experience in using the logic:iterate tag or is it necessary to write a custom JSP tag which does the iteration??

No, it should be enough to write a simple wrapper that implements Collection, and accesses the Hits instance via next() method.

-+ Tatu +-
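A wrapper of this kind is only a few lines when built on java.util.AbstractList. In the sketch below, HitSource is a hypothetical stand-in interface for a Hits-like result object, assumed to offer a length and indexed access to documents; it is not a Lucene type, and it ignores any checked exceptions that a real result object's accessors may declare.

```java
import java.util.AbstractList;
import java.util.List;

final class HitsListSketch {

    /** Stand-in for the two accessors this sketch assumes a Hits-like object provides. */
    interface HitSource<D> {
        int length();
        D doc(int n);
    }

    /** Read-only List view over the hits; documents are fetched lazily, one get() at a time. */
    static <D> List<D> asList(final HitSource<D> hits) {
        return new AbstractList<D>() {
            @Override
            public D get(int index) {
                if (index < 0 || index >= hits.length()) {
                    throw new IndexOutOfBoundsException("index " + index + ", size " + hits.length());
                }
                return hits.doc(index);
            }

            @Override
            public int size() {
                return hits.length();
            }
        };
    }
}
```

AbstractList supplies iterator(), contains() and the rest of the Collection methods on top of get() and size(), so the resulting view can be handed to anything that iterates a Collection, which is what a tag like logic:iterate needs.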
Re: HTML Parsing problems...
On Thursday 18 September 2003 14:50, Michael Giles wrote:
> I know, I know, the HTML Parser in the demo is just that (i.e. a demo), but I also know that it is updated from time to time and performs much better than the other ones that I have tested. Frustratingly, the very first page I tried to parse failed (http://www.theregister.co.uk/content/54/32593.html). It seems to be choking on tags that are being written inside of JavaScript code (i.e. document.write('</scr' + 'ipt>');). Obviously, the simple solution (that I am using with another parser) is to just ignore everything inside of <script> tags. It appears that the parser is ignoring text inside <script> tags, but it seems like it needs to be a bit smarter (or maybe dumber) about how it deals with this (so it doesn't get

I would guess that ignoring stuff in <script> often makes sense (for indexing purposes); the exception being if someone wants to create an HTML site creation IDE (and specifically wants to search for stuff in javascript sections?). Nonetheless, the HTML parser has to be able to handle these, I think.

> confused by such occurrences). I see a bug has been filed regarding trouble parsing JavaScript, has anyone given it thought?

I implemented a rather robust (X[HT])ML parser (QnD) that was able to work through many such issues (<script> tag, unquoted single '<' and '>' chars in attr values and elements, a simplistic approach to optional end tags). Since it was dead-optimized for speed (everything fully in memory in a char array, optimizing based on that), I thought it might be useful for indexing (even more so than for its original purpose, which was to be a very fast utility for filtering [adding and/or removing stuff] HTML pages).

If anyone is interested I could provide the source code and/or (if I have time) implement an efficient fault-tolerant indexer. Like I said, this also works equally well for well-formed XML, but that's nothing special.

-+ Tatu +-
Re: Lucene demo ideas?
On Wednesday 17 September 2003 07:07, Erik Hatcher wrote:
> On Wednesday, September 17, 2003, at 08:43 AM, Killeen, Tom wrote:
>> I would suggest XML as well.
>
> Again, I'd like to hear more about how you'd do this generically. Tell me what the field names and values would correspond to when presented with an XML file.

Perhaps just one generic content field, which would contain tokenized content from all XML segments. That could be done easily and efficiently with just SAX event handling?

Since it's a simple demo, you can't get much simpler than that, but it should still be fairly useful?

Attributes could/should be ignored by default; common practice for XML markup seems to be for attributes not to contain any content that would make sense to index. So I'd think just stripping out all tags (and comments, PIs etc.) might be a reasonable, plain and simple approach for a demo app.

-+ Tatu +-
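The "one generic content field" approach needs very little code with the JDK's SAX support. The sketch below is an assumption-laden illustration: XmlContentSketch and contents are made-up names, the returned string is what would be handed to an analyzer as the single content field, and inserting a space at each element end is this sketch's own choice so that words from adjacent elements don't run together.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

final class XmlContentSketch {

    /** Returns all character data of the document as one string; tags, attributes, comments and PIs are dropped. */
    static String contents(String xml) {
        final StringBuilder text = new StringBuilder();

        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                text.append(' '); // keep text of neighbouring elements apart
            }
            // startElement is not overridden, so attribute values never reach the output.
        };

        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
        return text.toString().replaceAll("\\s+", " ").trim();
    }
}
```

Comments and processing instructions are skipped because DefaultHandler's default callbacks for them do nothing, so only element text ends up in the content string.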
Re: Keyword search with space and wildcard
On Friday 29 August 2003 10:02, Terry Steichen wrote:
> I agree. One problem, however, that new (and not-so-new) Lucene users face is a learning curve when they want to get past the simplest and most obvious uses of Lucene. For example, I don't think any of the docs mention the fact that you can't combine a phrase and a wildcard query. Other things that are obviously quite well understood by many members of the list, are still less-than-clear to others. For example, I found (and still find) it a bit difficult to find concrete examples/advice of how to get good benefit from filters.
>
> My whole point is that this is a *very* powerful and flexible technology. But I think it's often very difficult for those most experienced in using Lucene to fully appreciate how it looks from the newbie point of view.

I agree completely. Perhaps I worded my reply badly; I didn't mean to sound hostile towards new users at all -- after all I consider myself to be one (I just happened to work on simple improvements to QueryParser and learnt how it works).

I wish the documentation was more complete; perhaps some section could list common workarounds or insights. And perhaps the incompatibility of phrase and wildcard queries could be added to a document that lists current limitations.

I guess the reason I think it's valuable to document the flexibility of query construction is that I have been working on something similar (although with database queries) in a system I'm working on, and I have also seen systems whose query syntax is too intertwined with the backend implementation (for example, while Hibernate is a good ORM, its queries don't seem to have a backend-independent intermediate representation... which makes it hard to develop different kinds of backends). So, it's useful to know that there are 2 levels of interfaces to Lucene's query functionality.

-+ Tatu +-
Re: 2,147,483,647 max documents?
On Monday 11 August 2003 01:07, Kevin A. Burton wrote:
> Why was an int chosen to represent document handles? Is there a reason for this? Why wasn't a long chosen to represent document handles? 64 bits seems like the obvious choice here except for a potentially bloated datastore (32 extra bits)

I can't speak for the actual reasons (not being a core Lucene developer), but the general benefits of 32-bit ints vs. longs are:
- Better performance on pretty much any current architecture (even so-called 64-bit CPUs often prefer 32-bit data access, and 64-bit representations are more important for addressing). Also, a smaller data set size is usually good for performance (caching).
- Atomicity of access (read access can often be done without synchronizing); reads and writes of non-volatile longs are not guaranteed to be atomic in Java.

Another question is whether the limited address space presents a real problem. Since Lucene can reuse doc ids (or rather, there is no persistent id per se? doc id is just an index, and holes left by removed docs can be reused?), perhaps this is usually not much of an issue?
-+ Tatu +-
Re: interesting phrase query issue
On Thursday 17 July 2003 07:20, greg wrote:
> I have several document sections that are being indexed via the StandardAnalyzer. One of these documents has the line "access, the manager". When searching for the phrase "access manager", this document is being returned. I understand why (at least i think i do), because a stop word is "the" and the "," is being removed by the tokenizer, my question is is there any way I can avoid having this returned in the results? My thoughts were to create a new analyzer that indexes the word "the" (blick too many of those), or index the "," in some way (also not good). Any suggestions?

You can also replace all stop words with a dummy token ( might be an ok candidate?). That would be similar to indexing "the" (which probably is a better idea than indexing ","). I'm planning to do something similar for paragraph breaks (in case of plain text, double linefeed, for HTML <p> etc), to prevent similar problems.
-+ Tatu +-
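A minimal sketch of the placeholder idea above, in plain Java rather than against the Lucene analyzer API. The stop list, the `_` placeholder and the class and method names are all assumptions made for illustration; the only point is that keeping some token in the stop word's position prevents "access manager" from matching "access, the manager".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class StopWordPlaceholderSketch {
    // Hypothetical stop list and placeholder token, chosen only for this example.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("the", "a", "an", "of"));
    static final String PLACEHOLDER = "_";

    // Lower-cases the text, splits on anything that is not a letter or digit,
    // and replaces each stop word with the placeholder instead of dropping it.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.toLowerCase().split("[^a-z0-9]+")) {
            if (raw.isEmpty()) {
                continue; // split() yields an empty string when the text starts with a separator
            }
            tokens.add(STOP_WORDS.contains(raw) ? PLACEHOLDER : raw);
        }
        return tokens;
    }

    // True if the phrase tokens occur consecutively somewhere in the document tokens.
    static boolean containsPhrase(List<String> doc, List<String> phrase) {
        return Collections.indexOfSubList(doc, phrase) >= 0;
    }

    public static void main(String[] args) {
        List<String> doc = tokenize("access, the manager");
        System.out.println(doc);
        System.out.println(containsPhrase(doc, tokenize("access manager")));
        System.out.println(containsPhrase(doc, tokenize("access the manager")));
    }
}
```

The same tokenization has to be applied to queries, so a phrase that really contains the stop word still matches while the shortened phrase no longer does.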
Re: Multiuser environments
On Monday 14 July 2003 08:52, Guilherme Barile wrote: Hi I'm writing a web application which will index files using textmining to extract text and lucene to store it. I do have the following implementation questions: 1) Only one user can write to an index at each time. How are you people dealing with this ? Maybe some kind of connection pooling ? Two obvious candidates are locking bottleneck methods and doing index writing in a critical section, or having a background thread that does reindexing, and other threads add requests to a queue. In CMS I'm working we are doing the latter (so as not to block actual request threads which could happen with first approach, adding/deleting documents is done as post-processing when documents are created/edited/deleted). In either case you usually have a singleton instance that represents the search engine functionality (assuming single index), and from there on it's reasonably easy to reuse IndexReader as necessary. -+ Tatu +- - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
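The queue-based approach described above could be sketched roughly as follows. This is plain Java with no Lucene calls: the list named `indexed` merely stands in for the index, and the class and method names are hypothetical. What it shows is that request threads only enqueue work, while a single background thread is the only one that ever writes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

class IndexQueueSketch {
    // Unique sentinel object; compared by identity to tell the worker to stop.
    private static final String STOP = new String("stop");

    private final BlockingQueue<String> requests = new LinkedBlockingQueue<>();
    // Stand-in for the index: a real version would add or delete documents here.
    private final List<String> indexed = new ArrayList<>();
    private final Thread worker = new Thread(this::drain, "index-writer");

    IndexQueueSketch() {
        worker.start();
    }

    // Called from request threads; returns immediately without touching the index.
    void submit(String docId) {
        requests.add(docId);
    }

    // Runs on the single writer thread, so index updates never overlap.
    private void drain() {
        try {
            for (String docId = requests.take(); docId != STOP; docId = requests.take()) {
                synchronized (indexed) {
                    indexed.add(docId);
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Processes everything queued so far, then stops the worker.
    void shutdown() {
        requests.add(STOP);
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    List<String> indexedSoFar() {
        synchronized (indexed) {
            return new ArrayList<>(indexed);
        }
    }
}
```

A singleton instance of something like this would be the natural place to also hold the shared reader mentioned above.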
Re: commercial websites powered by Lucene?
On Wednesday 25 June 2003 09:47, Ulrich Mayring wrote: John Takacs wrote: I'd love to try Lucene with the above, but the Lucene install fails because of JavaCC issues. Surprised more people haven't encountered this problem, as the install instructions are out of date. Well, what do you need JavaCC for? Isn't it just the technology for building the supplied HTML-Parser? There are much better HTML parsers out there, which you can use. On a related note; has anyone done performance measurements for various HTML parsers used for indexing? I have written couple of XML/HTML parsers that were optimized for speed (and/or leniency to be able to handle/fix non-valid documents), and was wondering if they might be useful for indexing purposes for other people (one is in general pretty optimal if document contents are fully in memory already, like when fetching from DB; another uses very little memory, while being only slightly slower). However, using those as opposed to more standard ones would only make sense if there are significant speed improvements. And to do that, it would be good to have baseline measurements, and/or to know what are current best candidates, from performance perspective. The thing is that creating a parser that only cares about textual content (and perhaps in some cases about surrounding element, but not about attributes, or structure, or DTD/Schema, validity etc) is fairly easy, and since indexing is often the most CPU-intensive part of search engine, it may make sense to try to optimize this part heavily, up to and including using specialized parsers. -+ Tatu +- - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Weighted Search by Field using MultiFieldQueryParser
On Tuesday 17 June 2003 05:43, Kevin L. Cobb wrote: I have an index that has three fields in it. When I do a search using MultiFieldQueryParser, the search applies the same importance (weight) to each of the fields. BUT, what if I want to apply a different weight to each field, i.e. I want to consider found terms from certain fields as less important than others. I have applied an algorithm to help me do this, which involves searching each field separately and then recombining the results into a single collection, but hate to reinvent the wheel if I don't have to. Have you looked at MultiFieldQueryParser source? It's a very simple class, and modifying it (making a new class) should be easy; pass in not only field names but also weights to apply? (as a sidenote, MultiFieldQueryParser does some unnecessary work as is... it seems to re-parse same query once for each field, could just clone it) -+ Tatu +- - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Lowercasing wildcards - why?
On Friday 30 May 2003 09:55, Leo Galambos wrote:
> Ah, I got it. THX. In the good old days, the wildcards were used as a fix for missing stemming module. I am not sure if you can combine these two opposite approaches successfully. I see the following drawbacks of your solution. Example: built* (->built) could be changed to build* (no built, but ->builder, building, etc.), and precision will go down drastically. You probably use a stemmer with one important bug (a.k.a. feature) - overstemming, so here is another example: political* (->political, politically) is transformed to polic* (->policer, policy, policies, policemen etc.) by Porter alg., and the precision is again affected drastically

Yes, this is the exact problem that was brought up last time this was discussed. It may not be a very common problem (most of the time stemming a wildcard part probably works ok, somebody had tried that), but still a potential one. And that's why default lower casing was added, as it solved one of the FAQs. It is much more common that the analyzer used for non-wildcard queries does lower casing than not, and thus the default setting (which means some users have to turn the feature off) seems to make sense. A more general problem then is that there's no real way to stem foo?ar, or any non-prefix wildcard query, but that could be figured out by QueryParser if necessary.
-+ Tatu +-
Re: Wildcard workaround
On Wednesday 28 May 2003 05:43, David Medinets wrote: - Original Message - From: Andrei Melis [EMAIL PROTECTED] As far as I have understood, lucene does not allow search queries starting with wildcards. I have a file database indexed by content and also by filename. It would be nice if the user could perform a usual search like *.ext. Does anyone know if Oracle patented the technique that they use for *ext searching in the Oracle Text product. If not, I'm sure the technique can be borrowed. On the other hand, the slow technique of comparing each term to *.ext can certainly be implemented with a minimum of effort, I think. [apologies if somebody else already pointed this out... I missed some mails to the list from yesterday] One of the most interesting solutions somebody posted earlier, was to use 2 indexes; one for 'normal' searches, with normal analyzer etc, and second one that uses reversed words; ie. analyzer reverses words tokenized by standard analyzer. This second index would then allow for searches to do prefix match, in this case query would be something like reverse_field:txe.* This would work efficiently, although pretty much double the size of index for content that has to be prefix-searchable. Still, this solution somehow appeals to my hacker side. :-) In this specific case, though, what others have suggested (add file prefix as separate field to search on), is probably more practical. -+ Tatu +- - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
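To make the reversed-index trick concrete, here is a small sketch in plain Java. It does not use any Lucene classes; the class and method names are made up for illustration, and it assumes the file name is indexed as a single term. It only shows how a leading-wildcard pattern such as *.ext turns into the prefix pattern txe.* once terms are stored reversed.

```java
class ReversedTermSketch {
    // What the second index would store for each term: the term spelled backwards.
    static String reverse(String term) {
        return new StringBuilder(term).reverse().toString();
    }

    // Rewrites a pattern with exactly one wildcard, at the start ("*.ext"),
    // into a prefix pattern for the reversed index ("txe.*").
    static String toReversedPrefixQuery(String pattern) {
        if (!pattern.startsWith("*") || pattern.indexOf('*', 1) >= 0 || pattern.indexOf('?') >= 0) {
            throw new IllegalArgumentException("expected a single leading '*': " + pattern);
        }
        return reverse(pattern.substring(1)) + "*";
    }

    // Prefix match of a reversed term against a rewritten pattern.
    static boolean matches(String reversedTerm, String reversedPrefixQuery) {
        String prefix = reversedPrefixQuery.substring(0, reversedPrefixQuery.length() - 1);
        return reversedTerm.startsWith(prefix);
    }

    public static void main(String[] args) {
        String query = toReversedPrefixQuery("*.ext");
        System.out.println(query);
        System.out.println(matches(reverse("report.ext"), query));
        System.out.println(matches(reverse("report.txt"), query));
    }
}
```

The cost is the one already mentioned: every term that should be searchable this way is stored twice.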
Re: Analyzer Incorrect?
On Friday 04 April 2003 05:24, Rob Outar wrote:
> Hi all, Sorry for the flood of questions this week, clients finally started using the search engine I wrote which uses Lucene. When I first started

Yup... that's the root of all evil. :-) (I'm in similar situation, going through user acceptance test as we speak... and getting ready to do second version that'll have more advanced metadata based search using Lucene).

> developing with Lucene the Analyzers it came with did some odd things so I decided to implement my own but it is not working the way I expect it to. First and foremost I would like to have case insensitive searches and I do not want to tokenize the fields. No field will ever have a space

If you don't need to tokenize a field, you don't need an analyzer either. However, to get case insensitive search, you should lower-case field contents before adding them to the document. QueryParser will do lower casing for search terms automatically (if you are using it), so matching should work fine then.
-+ Tatu +-
Re: Wildcard searching - Case sensitiv?
On Friday 28 March 2003 08:37, [EMAIL PROTECTED] wrote: Ok, thanks Otis, you have to write the terms lowercase when you're searching with wildcards. Or use the set method in QueryParser to ask it to automatically lower case those terms. Patch for that was added before 1.3RC1 (check javadocs or source for exact method to call). I think default was not to enable this feature, for backwards compatibility (unless Otis changed it as was suggested?). -+ Tatu +- - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Alternate Boolean Query Parser?
On Friday 28 March 2003 15:48, Shah, Vineel wrote:
> One of my clients is asking for an old-style boolean query search on my keywords fields. A string might look like this: "oracle admin*" and java and oracle and (8.1.6 or 8.1.7) and (solaris or unix or linux) There would probably be need for nested parenthesis, although I can't think of an example. Is there a parser I can plug into lucene to make this happen? It doesn't seem like the normal QueryParser class would like this string, or would it? Any ideas or comments would be appreciated. Making my

Actually I think it should, as long as you change 'and' to 'AND' and 'or' to 'OR' (upper case versions are used, I think, to make it less likely user meant to match words 'and' and 'or'?).

> own grammar and parser class is too expensive a proposition.

Well, writing a simple grammar and parser is fairly easy to do, if you've ever used java_cup or javacc (or just (b)yacc / bison); it shouldn't take all that long since all the actual query classes already exist. But I don't think you need to do even that. :-) The only feature that might need some additional work is matching "oracle admin*"; PhrasePrefixQuery allows doing something like that, but it's not integrated with QueryParser (I think it probably should be, and that might be quite easy to do).
-+ Tatu +-
Re: org.apache.lucene.demo.IndexHTML - parse JSP files?
On Monday 24 March 2003 18:03, Michael Wechner wrote: John Bresnik wrote: anyone know of a quick and easy way to get this demo [org.apache.lucene.demo.IndexHTML] to parse JSP files as well? I used to a crawler to create a local [static] version of the site [i.e. they are not longer JSP files just the html output from the original JSP file - but in the interest of keeping the URL intact, I need to parse the JSP extentions - the short question is, does anyone know of a way to *not* ignore the *.jsp files? just modify IndexHTML: there is one line in there which decides what extension it will index. There is another question I was wondering; since JSP is not XML (ie. can not be reliably parse using an XML or even HTML parser [or for that matter, even with simplest XML markup tokenizer that ignores nesting], needs a lower level scanner), has anyone tried connecting an actual JSP processor to Lucene? Or writing a simple one just meant for indexing, without having to execute code embedded? [the problem with JSP compared to XML is that it need not nest properly with HTML content around; one can use JSP inside attribute values, for example; thus, first JSP has to be processed to HTML, and then HTML needs to be further tokenized] Jakarta has to have at least one such processor (haven't looked at whether there's a separate component or if Tomcat just has one embedded?). Of course parsing JSP is problematic in many ways, not just getting jsp tagging out; dynamic portions probably just have to be ignored, and all text inside included (except for things inside comments). -+ Tatu +- - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Create my own Analyzer...
On Friday 21 March 2003 03:55, Pierre Lacchini wrote: Heya, as u can see, I want to create my own french Analyzer, using the snowball's FrenchStemmer... But i don't really know how to proceed... Does anyone know where I can find a tutorial, or a clear example of How to create an analyzer ?? Sorry for all those noob questions, but as i said, i'm kinda noob in java ;) Well, analyzer classes are about as simple as it gets, so perhaps try to look default analyzers Lucene core comes with (under org.apache.lucene.analysis)? (the only slightly more advanced one is StandardAnalyzer as it uses javacc) -+ Tatu +- - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: multiple collections indexing
On Wednesday 19 March 2003 01:44, Morus Walter wrote: ... Searches must be able on any combination of collections. A typical search includes ~ 40 collections. Now the question is, how to implement this in lucene best. Currently I see basically three possibilities: - create a data field containing the collection name for each document and extend the query by a or-combined list of queries on this name filed. - create an index per collection and use a MultiSearcher to search all interesting indexes. - (a third on I just discovered): create a data field containing a marker for each collection x10... for the first collection x01... for the second x001000... for the third and so on. The query might use a wildcard search on this field using x?0?0... specifying '?' for each collection that should be searched on, and '0' for the others. The marker would be very long though (the number of collections is growing, so we have to keep space for new one also). This might still be a feasible thing to do, except if number of collections changes very frequently (as you need to reindex all docs, not just incremental). Another possibility would be to have a new kind of Query; one to use with numeric field values (probably would be easiest to use hex numbers). In a way it'd be a specialized/optimized version of WildcardQuery. For example, one could define required bit pattern after ORing field value with mask (in your case you'd use one bit per type, and require non-interesting type flags to be zeroes, knowing that then at least one other bit, matching interesting type, is one). Implementing this would be fairly easy; first find the range (like RangeQuery does), and iterate over all existing terms in that range, and for each match against bit pattern, and add term if it matches the pattern. Actual search would then search pretty much like prefix, wildcard or range query, as Terms at that point have been expanded and search part need not care how they were obtained. 
This would make the representation more compact (4 bits in a char instead of one), potentially making the index a bit smaller (which usually also means faster). And of course if you really want to push the limit, you could use an even more efficient encoding (although, assuming indexes use UTF-8, base64 might be almost as efficient as it gets, as ASCII chars only take one byte whereas other chars take 2 or 3 bytes each within the 16-bit range, and up to 4 beyond it). Adding such a query would need to be done outside QueryParser (as length of bitfield field would be variable), but in your case that probably shouldn't be a problem? Anyway, just an idea I thought might be worth sharing, -+ Tatu +-
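A rough sketch of the term-selection step of such a bit-pattern query, in plain Java. Nothing here is Lucene API: the hex encoding, the class name and both method names are assumptions for illustration, and the sketch skips the "find the range first" optimization and simply walks a list of terms. A term is kept when it has no bits set outside the collections of interest.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;

class BitmaskTermSketch {
    // Encodes a set of 0-based collection numbers as a hex string, one bit per collection.
    static String encode(int... collections) {
        BigInteger bits = BigInteger.ZERO;
        for (int c : collections) {
            bits = bits.setBit(c);
        }
        return bits.toString(16);
    }

    // Keeps the indexed terms whose bits all fall inside the wanted mask.
    static List<String> selectTerms(List<String> indexedTerms, String wantedHex) {
        BigInteger wanted = new BigInteger(wantedHex, 16);
        List<String> selected = new ArrayList<>();
        for (String term : indexedTerms) {
            BigInteger value = new BigInteger(term, 16);
            // andNot() clears the wanted bits; anything left belongs to an unwanted collection.
            if (value.signum() != 0 && value.andNot(wanted).signum() == 0) {
                selected.add(term);
            }
        }
        return selected;
    }
}
```

The selected terms would then be searched like the expanded terms of a prefix or range query. BigInteger is used so the number of collections can keep growing without changing the code.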
Re: QueryParser and compound words
On Thursday 13 March 2003 00:52, Magnus Johansson wrote:
> Tatu Saloranta wrote:
>> ... But same happens during indexing; fotbollsmatch should be properly split and stemmed to fotboll and match terms, right?
> Yes but the word fotbollsmatch was never indexed in this example. Only the word fotboll. I want a query for fotbollsmatch to match a document containing the word fotboll.

Ok I think I finally understand what you meant. :-) So, basically, in your case you would prefer the query fotbollsmatch to expand (after stemming etc) to the two separate terms fotboll match and not to the phrase "fotboll match", so that matching just one of the words would be enough for a hit (either of them, or just the first word, or just the last word).

It would be possible to implement this functionality by overriding the default QueryParser and modifying its functionality slightly. In QueryParser you should be able to override default handling for terms, so that whenever you get just a single token (in this case fotbollsmatch) that expands to multiple Terms, you do not construct a phrase query, but just a BooleanQuery with TermQueries (look at getFieldQuery(); it handles basic search terms). You may need to use simple heuristics for figuring out when you have white space(s) that indicate normal phrases, which probably should still be handled using PhraseQuery.

Of course this is all assuming you still do want that functionality. :-) And if you do, it would be a good idea to get the patch back in case someone else finds that useful later on (I think many non-English languages have the concept of compound words; German and Finnish at least do).
-+ Tatu +-
Re: Regarding Setup Lucine for my site
On Wednesday 05 March 2003 13:35, Leo Galambos wrote:
> I'm all eyes and I'm a serious grown-up with good manners :) Constructive suggestions for improvement are always welcome.

First a disclaimer: I don't mean to sound too negative. I'm genuinely curious about many of the issues you mention. But I'm not sure I really understand them. :-)

> 1. 2 threads per request may improve speed up to 50%

Hmm? Could you clarify? During indexing, multithreading may speed things up (splitting docs to index in 2 or more sets, indexing separately, combining indexing). But... isn't that a good thing? Or are you saying that it'd be good to have multi-threaded search functionality for a single search? (in my experience searching is seldom the slow part)

> 2. Merger is hard coded

In a way that is bad because... ? (ie. what is the specific problem... I assume you mean index merging functionality?)

> ...
> 4. you cannot implement dissemination + wrappers for internet servers which would serve as static barrels.

Could you explain this bit more thoroughly (or give pointers to a longer explanation)?

> 5. Document metadata cannot be stored as a programmer wants, he must translate the object to a set of fields

Yes? I'd think that the possibility of doing separate fields is a good thing; after all, all a plain text search engine needs to provide (to be considered one) is indexing of plain text data, right? Plus, Lucene is not a Content Management System (or database), but a content indexing system. As such I'm not sure why storage should not be optimized to allow for fast searches (which means flattening contents, amongst other things). That is not to say that things couldn't be improved; it might be a good idea to define a small set of base interfaces / classes to make it easier to convert from 'objectified' textual data to straight-forward indexing. FWIW I am actually using Lucene for storing documents that have extensive metadata associated, and I don't find the restrictions too bad... but that's certainly a matter of taste. :-)

> 6. Lucene cannot implement your own dynamization

(sorry, I must sound real thick here). Could you elaborate on this... what do you mean by dynamization?
-+ Tatu +-
Re: AW: How is that possible ?
On Friday 28 February 2003 05:15, Alain Lauzon wrote:
> At 07:16 2003-02-28 +0100, you wrote:
>> May it be, that microsoft is found, because the search is not case sensitive (text) and ct is not found because there the search is case sensitive (Keyword) Did you try +state:CT +company:microsoft~10 ?
> I don't think so because the StandardAnalyzer will put everything in lowercase. I will try without the StandardAnalyzer.

Yes, but only fields that are tokenizable. Keywords are not touched, they are indexed as is. So if the 'state' field is a keyword field, it would be stored in upper case (this is explained in the FAQ).
-+ Tatu +-
Re: IndexWriter addDocument NullPointerException
On Friday 21 February 2003 13:22, Günter Kukies wrote:
> Hello, I don't have any line number.

You unfortunately do need to know the line number, if you do get an exception and try to see where it occurs. Another less frequent problem is that you actually get the exception as an object and print out that exception; in that case you would just see java.lang.NullPointerException, and nothing else? Otherwise, based on your code, you should see a stack trace, with or without line numbers. But you would at least see the method call stack, which would help in figuring out where the problem occurred.

However, if you do catch an exception, and the stack trace doesn't have line numbers (it seems that some JVMs do not have line number info available when running JIT'ed code) there are basically two ways to figure out the exact location:
(1) Try to make the JVM get the line number info (by running in interpreted mode; I think there was an option, something like '-Djava.compiler= ' to disable JIT?)
(2) Run the code in a debugger. One nice free debugger (if you are not using an IDE that has one) is JSwat: http://www.bluemarsh.com/java/jswat/

Hope this helps,
-+ Tatu +-

> this is the code snippet:
>   Document doc;
>   IndexWriter writer;
>   .
>   try{ writer.addDocument(doc); } catch(Exception ex){ ex.printStackTrace(); }
> this is the output on Standard.out: java.lang.NullPointerException and nothing more. The doc is not null and System.out.println(doc) seems to be ok. There is no difference between the working 80% and the not working 20% doc's. Thanks, Günter
>
> On Friday 21 February 2003 05:33, Günter Kukies wrote:
>>> Hello, writer.addDocument(doc) is throwing a NullPointerException. The stack trace from the caught Exception is only one line, NullPointerException, without anything else. I open the IndexWriter with create true. Run over the files in a Directory and add all found documents. After that I close the IndexWriter. 80% of the documents were added without problems. The rest gets that NullPointerException. Any Ideas?
>> Perhaps look at the line where the null pointer exception is thrown and see what happens? NullPointerException is thrown when a null reference is being de-referenced. Seeing the immediate cause should be easy, given the line number. Perhaps you have added a field with null value? (just a guess, I don't know if that's even illegal).
>> -+ Tatu +-
Re: Number range search through Query subclass
On Friday 14 February 2003 02:58, Volker Luedeling wrote:
> Hi, I am writing an application that constructs Lucene searches from XML queries. Each item from the XML is represented by a Query of the corresponding type. I have a problem when I try to search for number ranges, since RangeQuery compares strings, not numbers, so 15 < 155 < 20. What I need is a subclass of Query that evaluates numbers correctly. I have tried subclassing RangeQuery, MultiTermQuery or Query directly, but each time I have run into problems with inheritance and access rights to various methods or inner classes. Does anyone know of a solution to this problem? If there is none, the only way I can think of would be indexing numbers as something like #15#. But it's not a very elegant solution when all I need is a slight variation of one existing class. Thanks for any help you can offer,

Actually the problem is not (just) the query, it's the tokenizer/analyzer/indexer as well. For a range query to work, tokens have to be correctly ordered lexically (~= in alphabetic order). I don't think using #s as markers would work, as they do not make tokens get ordered properly (plus, most analyzers would just remove those chars).

The usual way to do this is to use a suitable numeric format for indexed data; for dates a format like YYYY-MM-DD works ok (ie. dates are correctly ordered when ordering date tokens alphabetically), for other numbers (like timestamps) what is usually done is padding, so that the numbers in your case would be 015, 155 and 020 (instead of a leading 0, any other character that sorts before '1' would do). So, you need to know the biggest number you'd need to index and use appropriate zero padding.

Now, if you store these numbers as single values in a separate field, padding is easy to do. If you are trying to get random numeric data contained in otherwise plain text content, things are a bit more complicated.

Hope this helps,
-+ Tatu +-
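A minimal sketch of the padding idea from the message above, in plain Java with hypothetical names. The width of 3 is an assumption that fits the example numbers; in practice it has to cover the largest value that will ever be indexed. The `inRange` helper mimics a string-based inclusive range comparison.

```java
class PaddedNumberSketch {
    // Left-pads a non-negative number with zeroes so that string order equals numeric order.
    static String pad(long value, int width) {
        if (value < 0) {
            throw new IllegalArgumentException("negative values need a different encoding");
        }
        String digits = Long.toString(value);
        if (digits.length() > width) {
            throw new IllegalArgumentException(value + " does not fit in " + width + " digits");
        }
        StringBuilder sb = new StringBuilder(width);
        for (int i = digits.length(); i < width; i++) {
            sb.append('0');
        }
        return sb.append(digits).toString();
    }

    // Inclusive range check using plain string comparison, as a term-based range would.
    static boolean inRange(String term, String lower, String upper) {
        return lower.compareTo(term) <= 0 && term.compareTo(upper) <= 0;
    }

    public static void main(String[] args) {
        // Unpadded: "20" sorts after "155", so it falls outside 15..155.
        System.out.println(inRange("20", "15", "155"));
        // Padded: "020" sits between "015" and "155" as expected.
        System.out.println(inRange(pad(20, 3), pad(15, 3), pad(155, 3)));
    }
}
```

The query side has to pad its bounds with the same width, otherwise the comparison goes wrong again.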
Re: OutOfMemoryException while Indexing an XML file
On Friday 14 February 2003 07:27, Aaron Galea wrote:
> I had this problem when using xerces to parse xml documents. The problem I think lies in the Java garbage collector. The way I solved it was to create

It's unlikely that GC is the culprit. Current ones are good at purging objects that are unreachable, and only throw an OutOfMemoryError when they really have no other choice. Usually it's the app that has some dangling references to objects that prevent GC from collecting objects that are not useful any more.

However, it's good to note that Xerces (and DOM parsers in general) generally use more memory than the input XML files they process; this is because they usually have to keep the whole document structure in memory, and there is overhead on top of the text segments. So it's likely to be at least 2 * input file size (files usually use UTF-8 which most of the time uses 1 byte per char; in memory Java uses 16-bit chars), plus some additional overhead for storing element structure information and all that. And since the default max. java heap size is 64 megs, big XML files can cause problems. More likely however is that references to already processed DOM trees are not nulled in a loop or something like that? Especially if doing one JVM process per item solves the problem.

> a shell script that invokes a java program for each xml file that adds it to the index.

-+ Tatu +-
Re: % of Relevance
On Tuesday 11 February 2003 07:48, Nellai wrote:
> Hi! can anyone tell me how to calculate the % of relevance using Lucene.

Lucene's hit score is a normalized float in the range ] 0.0, 1.0 ], ie. greater than 0.0 and at most 1.0 (since hits scoring 0.0 are never included). From there it's basic arithmetic (perhaps this could be included in the FAQ, even though it is fairly trivial). The simplest way would be:

  ... // get the search results,
  float score = hits.score(docNr); // between 0.0 and 1.0 (including 1.0)
  int pctScore = (int) (100.0f * score);

Also note that it's not guaranteed that a search has any 100% matching docs; for example when none of the docs matches all clauses, and the clauses are combined with an OR-query. The same may also happen (I think?) if the best match for different sub-clauses is different? You may also want to normalize the score if you always want your top match to be 100% (or have some range that gets rounded up)... users are known to want silly features like that. :-)
-+ Tatu +-
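For the "top match is always 100%" variant mentioned above, a small sketch in plain Java. It takes the scores as a float array instead of calling any Lucene class, and the class and method names are hypothetical.

```java
class RelevancePercentSketch {
    // Straight conversion, as in the message above: 1.0 becomes 100.
    static int toPercent(float score) {
        return (int) (100.0f * score);
    }

    // Rescales so that the best score in the array becomes 100.
    static int[] toPercentOfTop(float[] scores) {
        float top = 0.0f;
        for (float s : scores) {
            top = Math.max(top, s);
        }
        int[] percents = new int[scores.length];
        if (top <= 0.0f) {
            return percents; // no positive scores: leave everything at 0
        }
        for (int i = 0; i < scores.length; i++) {
            percents[i] = Math.round(100.0f * scores[i] / top);
        }
        return percents;
    }
}
```

Rounding is used in the rescaled version so that a ratio landing just below a whole percent because of float arithmetic is not truncated downwards.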
Re: '-' character not interpreted correctly in field names
On Monday 03 February 2003 07:19, Terry Steichen wrote: I believe that the tokenizer treats a dash as a token separator. Hence, the only way, as I recall, to eliminate this behavior is to modify QueryParser.jj so it doesn't do this. However, doing this can cause some other problems, like hyphenated words at a line break and the like. It might be enough to just replace analyzer passed in to QueryParser to do this? This is the case if QueryParser only handles modifiers outside terms, and terms are passed to analyzer. I think this is the case (QueryParser does call the analyzer in couple of places, and one word may actually expand to a phrase or vice versa)? Still, it seems like using a hyphen as separator shouldn't necessarily cause big problems when indexer does the same; queries against 2 - 5 would be phrase queries for 2 5, which is still reasonably specific (and should match the content). On the other hand, simple analyzer and standard analyzer have pretty different tokenization rules, so it's important to make sure same analyzer is used for both indexing and searching (that mismatch can prevent matches easily). -+ Tatu +- - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Wildchar based search??
On Saturday 01 February 2003 00:19, Otis Gospodnetic wrote: 1) to what extent are wildcards supported by lucenes? You can use * and ? the way they usually are used. I think there was one exception; first character of a simple term can not be a wildcard? (this from query syntax page). -+ Tatu +- - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Range queries
On Wednesday 22 January 2003 07:49, Erik Hatcher wrote: Unfortunately I don't believe date field range queries work with QueryParser, or at least not human-readable dates. Is that correct? I think it supports date ranges if they are turned into a numeric format, but no human would type that kind of query in. I'm sure supporting true date range queries gets tricky with locale issues and such too. Right. In my case that's ok -- the documents I'll be indexing are hybrid documents, with some structured/plain text content and additional metadata (in DB normalized form). Thus the dates (from normalized metadata fields) can easily be converted to numeric form and indexed (for things like last modified etc that'd be normally searched via DB). The other part (UI) needs more work... either need to add a new quoting mechanism for dates (or just do that for if certain field prefix is used), or (more likely) the UI will use simple web forms for constructing query. Thanks to everyone for quick replies, -+ Tatu +- -- To unsubscribe, e-mail: mailto:[EMAIL PROTECTED] For additional commands, e-mail: mailto:[EMAIL PROTECTED]
Re: Range queries
On Wednesday 22 January 2003 08:27, Michael Barry wrote: I utilize the earlier version and queries such as this work fine with QueryParser: field:[ 20030120 - 20030125 ] of course the back-end indexer canonicalizes all date fields to YYYYMMDD. The front-end search code is responsible for canonicalizing the user-inputted dates to YYYYMMDD. I think the key here would be either to not allow users to enter free-form dates (provide some type of UI element to enter year, month, day separately) or give some copy stating dates should be in YYYYMMDD format. Thanks, this is along the lines I was thinking too. -+ Tatu +-
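The front-end step described above can be sketched as follows, assuming the eight-digit year-month-day form used in the example (20030120) and the bracketed range syntax quoted in the message. The `DateRangeSketch` class, its method names, and the `modified` field name are made up for illustration; the result is only a query string and is not parsed here.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

/** Hypothetical front-end helper that canonicalizes dates before they reach the query parser. */
final class DateRangeSketch {

    /** Formats year/month/day (e.g. from three separate form fields) as an eight-digit string. */
    static String canonical(int year, int month, int day) {
        // LocalDate.of rejects impossible dates such as month 13 or February 30.
        return LocalDate.of(year, month, day).format(DateTimeFormatter.BASIC_ISO_DATE);
    }

    /** Builds a range query string in the form shown in the message: field:[ from - to ]. */
    static String rangeQuery(String field, LocalDate from, LocalDate to) {
        if (to.isBefore(from)) {
            throw new IllegalArgumentException("range end is before range start");
        }
        return field + ":[ " + from.format(DateTimeFormatter.BASIC_ISO_DATE)
                + " - " + to.format(DateTimeFormatter.BASIC_ISO_DATE) + " ]";
    }

    public static void main(String[] args) {
        System.out.println(canonical(2003, 1, 20)); // 20030120
        System.out.println(rangeQuery("modified", LocalDate.of(2003, 1, 20), LocalDate.of(2003, 1, 25)));
    }
}
```

Taking year, month and day as separate inputs sidesteps the locale question entirely, which is the reason given above for not accepting free-form dates.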
Range queries
My apologies if this is a FAQ (which is possible as I am new to Lucene; however, I tried checking the web page for the answer). I read through the Query syntax web page first, and then checked the matching query classes. It seems like the query syntax page is missing some details; the one I was wondering about was the range query. Since the query parser seems to construct these queries, I guess they have been implemented, even though the syntax page didn't explain them. Is that correct? Looking at QueryParser, it seems that an inclusive range query uses [ and ], and an exclusive query { and }? Is this right? And does it expect exactly two arguments? Also, am I right in assuming that a range uses lexicographic ordering, so that it basically includes all possible words (terms) between the specified terms (which will work ok with numbers/dates as long as they have been padded with zeroes or such)? Another question I have is regarding wildcard search. The page mentions that there is a restriction that a search term cannot start with a wildcard (as that would render the index useless I guess... it would need a full scan?). However, it doesn't mention whether multiple wildcards are allowed. All the example cases just have a single wildcard. Sorry for the newbie questions, -+ Tatu +- ps. Thanks to the developers for the neat indexing engine. I am currently evaluating it for use in a large-scale enterprise content management system.
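The zero-padding point can be checked with plain string comparison. This sketch assumes, as the message does, that a range is evaluated by lexicographic order on terms, and it models inclusive and exclusive bounds the way the [ ] and { } reading suggests. `PaddedRangeSketch` and its methods are hypothetical helpers, not Lucene code.

```java
import java.util.Locale;

/** Shows why numeric terms need zero padding when ranges compare terms as strings. */
final class PaddedRangeSketch {

    /** Left-pads a non-negative number with zeroes to a fixed width, e.g. pad(9, 4) gives "0009". */
    static String pad(long value, int width) {
        if (value < 0) {
            throw new IllegalArgumentException("negative values are not handled in this sketch");
        }
        String padded = String.format(Locale.ROOT, "%0" + width + "d", value);
        if (padded.length() > width) {
            throw new IllegalArgumentException("value does not fit in " + width + " digits");
        }
        return padded;
    }

    /** Lexicographic range test; inclusive models [lower upper], exclusive models {lower upper}. */
    static boolean inRange(String term, String lower, String upper, boolean inclusive) {
        int lo = term.compareTo(lower);
        int hi = term.compareTo(upper);
        return inclusive ? (lo >= 0 && hi <= 0) : (lo > 0 && hi < 0);
    }

    public static void main(String[] args) {
        // Unpadded: "9" sorts after "10", so 9 falls outside the range 2..10.
        System.out.println(inRange("9", "2", "10", true)); // false
        // Padded to the same width, string order agrees with numeric order.
        System.out.println(inRange(pad(9, 4), pad(2, 4), pad(10, 4), true)); // true
    }
}
```

The same reasoning covers dates written as fixed-width digits with the year first, since those already sort chronologically as strings.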