Re: lucene query (sql kind)
I like your idea and think you are quite right. I see quite a few people pushing Lucene to the extreme, to the point where it replaces relational database functionality. But storing everything in Lucene and using it as a relational database is reinventing the wheel: for example, sorting on a date field, or any other range query. I think the better way is to integrate Lucene tightly into a Java relational database, such as HSQL, McKoi or Derby. In particular, that integration would make possible queries like contains(...), which is part of the full-text search syntax of MySQL and other major relational database vendors. I would like to contribute whatever help I can to make that happen.

Thanks,
Jian

On Fri, 28 Jan 2005 13:01:40 +0000 (GMT), mark harwood [EMAIL PROTECTED] wrote:

I've added some user-defined Lucene functions to HSQLDB and I've been able to run queries like the following one:

    select top 10 lucene_highlight(adText) from ads
    where pricePounds < 200 and lucene_query('bass guitar drums', id) > 0
    order by lucene_score(id) DESC

I've had similar success with Derby (Cloudscape). This approach has some appeal, and I've been able to use the same class as a UDF in both databases, but it does have issues: it looks like this UDF-based integration won't scale. The above query took 80 milliseconds on 10,000 records; another index/database with 50,000 records was taking a matter of seconds. I think a scalable integration is likely to require modification of the core RDBMS code.
I think it is worth considering developing such a tight RDBMS integration if you consider the issues commonly associated with using Lucene:

1) Sorting on float/date fields and the associated memory consumption
2) Representing numbers/dates in Lucene (eg having to pad with sufficient leading zeros, adding to the index's list of terms)
3) Retrieving only certain stored fields from a document (all storage can be done in the db)
4) Issues to do with updating volatile data, eg price data used in sorts
5) Manually coding joins with RDBMS content as custom filters
6) Too-many-terms exceptions produced by range queries
7) Grouping results, eg by website
8) Boosting docs based on stored content, eg date

I'm not saying there aren't answers to the above using Lucene. However, I do wonder if these can be addressed more effectively in a project which seeks tighter integration with an RDBMS and leverages its capabilities. Anyone else been down this route?

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
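The UDF bridge Mark describes has a simple shape: the database calls a function per row, passing the row's id, and gets back a match flag or a Lucene score. Here is a minimal sketch of that shape in plain Java. The names (luceneQuery, luceneScore) and the precomputed score map are assumptions standing in for a real IndexSearcher call; the per-row calling convention is also why this approach struggles to scale, since the database invokes the function once for every candidate row.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Sketch of a Lucene UDF bridge for an SQL engine. Hypothetical names;
 * a real version would run an IndexSearcher lazily on the first call
 * per statement and cache id -> score for the ORDER BY.
 */
public class LuceneUdf {
    // Per-query cache: filled by the first luceneQuery() call,
    // read by luceneScore() for every matching row.
    private static final Map<Integer, Float> scores = new HashMap<>();

    /** Stand-in for running the Lucene query; returns 1 if the row matched. */
    public static int luceneQuery(String query, int id) {
        // A real UDF would search the index here. We fake two hits
        // so the SQL-side control flow is visible.
        if (scores.isEmpty()) {
            scores.put(1, 0.9f);
            scores.put(7, 0.4f);
        }
        return scores.containsKey(id) ? 1 : 0;
    }

    /** Score consulted by ORDER BY lucene_score(id) DESC. */
    public static float luceneScore(int id) {
        return scores.getOrDefault(id, 0f);
    }
}
```

Note the scaling problem is structural: even with the cache, the engine still calls luceneQuery once per row, which is why pushing the Lucene hit list into the query plan (core RDBMS changes) would be needed for large tables.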
google mini? who needs it when Lucene is there
Hi, I was searching using Google and just found there is a new feature called Google Mini. Initially I thought it was another free service for small companies. Then I realized that it costs quite some money ($4,995) for the hardware and software. (I guess the proprietary software costs a whole lot more than the actual hardware.) The "nice" feature is that you can only index up to 50,000 documents at this price. If you need to index more, sorry, send in the check...

It seems to me that any small biz will be ripped off if they install this Google Mini thing, compared to using Lucene to implement an easy-to-use search application, which could search up to whatever number of documents you could imagine. I hope the Lucene project gets more exposure in the enterprise so that people know they have not only cheaper but, more importantly, BETTER alternatives.

Jian
Re: google mini? who needs it when Lucene is there
Overall, even if Google Mini gives a lot of cool features compared to a bare-bones Lucene project, what good is the 50,000-document limit? It is useless with that limit. That is just their way of trying to turn it into another cash cow.

Jian

On Thu, 27 Jan 2005 17:45:03 -0800 (PST), Otis Gospodnetic [EMAIL PROTECTED] wrote:

500 times the original data? Not true! :)

Otis

--- Xiaohong Yang (Sharon) [EMAIL PROTECTED] wrote:

Hi, I agree that Google Mini is quite expensive. It might be similar to the desktop version in quality. Does anyone know Google's ratio of index to text? Is it true that Lucene's index is about 500 times the original text size (not including image size)? I don't have one installed, so I cannot measure.

Best,
Sharon

jian chen [EMAIL PROTECTED] wrote: [snip]
Re: Suggestions for documentation or LIA
Hi, just to continue this discussion. I think right now Lucene's retrieval algorithm is based purely on the vector space model, which is simple and efficient. However, there may be cases where folks like me want to use a completely different set of ranking algorithms, ones which do not even use tf/idf. For example, I am thinking about adding a cover density ranking algorithm to Lucene, which is based purely on proximity information and does not require any global ranking variables. But looking into the Lucene code, it seems not very easy to hack that in, at least for me, a novice Lucene user. I read on the Lucene 2.0 whiteboard that Lucene will become more accommodating in terms of what can be indexed and such. That move might be good for implementing other or ad hoc ranking algorithms.

Cheers,
Jian

On Wed, 26 Jan 2005 10:25:15 -0500, Ian Soboroff [EMAIL PROTECTED] wrote:

Erik Hatcher [EMAIL PROTECTED] writes:

By all means, if you have other suggestions for our site, let us know at [EMAIL PROTECTED]

One of the things I would like to see, but which isn't in the Lucene site, the documentation, or Lucene in Action, is a complete description of how the retrieval algorithm works: that is, how the HitCollector, Scorers, Similarity, etc. all fit together. I'm involved in a project which to some degree is looking at poking deeply into this part of the Lucene code. We have a nice (non-Lucene) framework for working with more different kinds of similarity functions (beyond tf-idf), which should also be expandable to include query expansion, relevance feedback, and the like. I used to think that integrating it would be as simple as hacking in Similarity, but I'm beginning to think it might need broader changes. I could obviously hook in our whole retrieval setup by just diving for an IndexReader and doing it all by hand, but then I would have to redo the incremental search and possibly the rich query structure, which would be a loss.
So anyway, I got LIA hoping for a good explanation (not a good Explanation) of this bit, but it wasn't there. There are some hints on the Lucene site, but nothing complete. If I muddle it out before anything gets contributed, I'll try to write something up, but don't expect anything too soon...

Ian
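For readers following this thread, the tf-idf vector space scoring being discussed has a compact textbook shape: per query term, score a document by a tf component times a squared idf times a document-length normalization. The sketch below uses the common sqrt/log/1-over-sqrt forms; these echo the shapes in Lucene's default similarity, but treat the exact constants as assumptions rather than Lucene's precise formula.

```java
/**
 * Minimal textbook tf-idf scoring, single-term query. Not Lucene's
 * exact DefaultSimilarity formula; the component shapes (sqrt tf,
 * log idf, 1/sqrt length norm) are the usual assumptions.
 */
public class TfIdf {
    /** Term frequency component: dampened growth with raw frequency. */
    public static double tf(int freq) { return Math.sqrt(freq); }

    /** Inverse document frequency: rare terms weigh more. */
    public static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    /** Length normalization: long documents are penalized. */
    public static double lengthNorm(int numTerms) {
        return 1.0 / Math.sqrt(numTerms);
    }

    /** Score one document for a single-term query. */
    public static double score(int termFreq, int docFreq, int numDocs, int docLen) {
        double w = idf(docFreq, numDocs);
        return tf(termFreq) * w * w * lengthNorm(docLen);
    }
}
```

A BooleanQuery then sums these per-term scores across its clauses (with a coordination factor for how many clauses matched), which is the "straight cosine among siblings" behavior mentioned later in this thread.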
Re: Suggestions for documentation or LIA
Hi Ian, thanks for your information. It would be really helpful to have some documentation, maybe on the wiki, about the retrieval algorithm and how to hack it. At least something there, even if only a few paragraphs to get started...

Thanks,
Jian

On Wed, 26 Jan 2005 12:40:54 -0500, Ian Soboroff [EMAIL PROTECTED] wrote:

jian chen [EMAIL PROTECTED] writes:

Just to continue this discussion. I think right now Lucene's retrieval algorithm is based purely on the vector space model, which is simple and efficient.

As I understand it, it is indeed a tf-idf vector space approach, except that the queries are structured, and as such the tf-idf weights are totaled as a straight cosine among siblings of a BooleanQuery, while other query nodes may do things differently; for example, I haven't read the code, but I assume PhraseQueries require all terms to be present and adjacent in order to contribute to the score. There is also a document-specific boost factor in the equation, which is essentially a hook for document-level things like recency, PageRank, etc. You can tweak this by defining custom Similarity classes which say what tf, idf, norm, and boost mean. You can also affect the term normalization at the query end in BooleanScorer (I think? through the sumOfSquares method?).

We've implemented something kind of like the Similarity class, but based on a model which describes a larger family of similarity functions. (For the curious or similarly IR-geeky, it's from Justin Zobel's paper from a few years ago in SIGIR Forum.) Essentially I need more general hooks than the Lucene Similarity provides. I think those hooks might exist, but I'm not sure which classes they're in. I'm also interested in things like relevance feedback, which can affect term weights as well as add terms to the query... just how many places in the code do I have to subclass or change? It's clear that if I'm interested in a completely different model like language modeling, the IndexReader is the way to go.
In which case, what parts of the Lucene class structure should I adapt to maintain the incremental results return, inverted-list skips, and the other features which make the inverted search fast?

Ian
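The "larger family of similarity functions" idea Ian mentions can be pictured as a hook interface: the scorer stays fixed, and each weighting scheme is one object implementing the hooks. This is only an illustrative sketch of that design, not Lucene's Similarity API; the interface name and the two example schemes (raw vs. sublinear tf) are hypothetical.

```java
/**
 * Hypothetical pluggable-weighting sketch: the scorer calls tf/idf
 * hooks, and whole families of weighting functions can be swapped
 * without touching scoring code. Illustrative only; not Lucene's API.
 */
public class PluggableWeights {
    /** The hook: one object per weighting scheme. */
    interface TermWeight {
        double tf(int freq);
        double idf(int docFreq, int numDocs);
    }

    /** Sublinear tf: 1 + ln(f), dampens frequent terms. */
    static final TermWeight SUBLINEAR = new TermWeight() {
        public double tf(int f) { return 1 + Math.log(f); }
        public double idf(int df, int n) { return Math.log((double) n / df); }
    };

    /** Raw tf: frequency used as-is. */
    static final TermWeight RAW = new TermWeight() {
        public double tf(int f) { return f; }
        public double idf(int df, int n) { return Math.log((double) n / df); }
    };

    /** Fixed scorer: term weight is tf * idf under whichever scheme. */
    static double score(TermWeight w, int freq, int docFreq, int numDocs) {
        return w.tf(freq) * w.idf(docFreq, numDocs);
    }
}
```

The design point is that relevance feedback or query expansion would then only need to adjust inputs to these hooks, rather than requiring subclassing in several unrelated places.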
Re: How to give recent documents a boost?
Hi, I think setting a boost on recent documents is tricky. There is no clear-cut way, other than trial and error, to get the boost value right. Could you instead let the user specify a date range, and sort the documents within that range by relevance? This way the users get exactly what they specified, and won't be annoyed by an improper setting of the boost factor. Workable?

Thanks,
Jian

On Tue, 25 Jan 2005 10:30:21 -0800, aurora [EMAIL PROTECTED] wrote:

What is the best way to give recent documents a boost? Not sorting them in strict date order, but giving them some preference. If document 1, filed last week, has a score of 0.5 and document 2, filed last month, has a score of 0.55, then list document 1 first. But if document 1 has a score of only 0.05, then keep it at the end. Any experience of fine tuning by date order?
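One way to get exactly the behavior aurora describes (recent beats slightly-higher-scoring old, but a very low score still sinks) is to multiply relevance by a bounded recency factor that decays with age. A minimal sketch, with the caveat Jian raises: the half-life and the 50% cap below are arbitrary assumptions that would need tuning by trial and error.

```java
/**
 * Blend relevance with recency: multiply by a decay in (0, 1],
 * capped so recency can add at most ~50% on top of relevance.
 * HALF_LIFE_DAYS and the 0.5 cap are tuning assumptions.
 */
public class RecencyBoost {
    static final double HALF_LIFE_DAYS = 30.0;

    /** ageDays = now - filed date, in days. */
    public static double boosted(double relevance, double ageDays) {
        double decay = Math.pow(0.5, ageDays / HALF_LIFE_DAYS);
        return relevance * (1.0 + 0.5 * decay);
    }
}
```

With these settings, the example from the question works out: a week-old 0.5 outranks a month-old 0.55, while a week-old 0.05 stays near the bottom.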
Re: Opening up one large index takes 940M of memory?
Hi, if it is really the case that every 128th term is loaded into memory, could you use a relational database or a B-tree index to do the work of indexing the terms instead? Even if you create another level of indexing on top of the .tii file, it is just a hack and would not scale well. I would think a B/B+-tree based approach is the way to go for better memory utilization.

Cheers,
Jian

On Sat, 22 Jan 2005 08:32:50 -0800 (PST), Otis Gospodnetic [EMAIL PROTECTED] wrote:

There Kevin, that's what I was referring to, the .tii file.

Otis

--- Paul Elschot [EMAIL PROTECTED] wrote:

On Saturday 22 January 2005 01:39, Kevin A. Burton wrote:

Kevin A. Burton wrote:

We have one large index right now... it's about 60G... When I open it, the Java VM uses 940M of memory. The VM does nothing else besides open this index. After thinking about it, I guess 1.5% of memory per index really isn't THAT bad. What would be nice is a way to do this from disk, and then use a buffer (either via the filesystem or in-VM memory) to access these variables.

It's even documented. From http://jakarta.apache.org/lucene/docs/fileformats.html :

The term info index, or .tii file. This contains every IndexIntervalth entry from the .tis file, along with its location in the .tis file. This is designed to be read entirely into memory and used to provide random access to the .tis file.

My guess is that this is what you see happening. To see the actual .tii file, you need the non-default file format. Once searching starts you'll also see that the field norms are loaded; these take one byte per searched field per document.

This would be similar to the way the MySQL index cache works...

It would be possible to add another level of indexing to the terms. No one has done this yet, so I guess it's preferred to buy RAM instead...
Regards,
Paul Elschot
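Paul's description yields a simple back-of-envelope model for the memory cost: the in-memory term index holds one entry per IndexInterval (default 128) terms, and the norms add one byte per searched field per document. The bytes-per-entry figure below is a rough assumption (term text plus file pointers), not a measured constant.

```java
/**
 * Back-of-envelope memory estimate for the .tii discussion.
 * Every indexInterval-th term is resident; bytesPerEntry is an
 * assumed average (term text + pointers), not a Lucene constant.
 */
public class TiiEstimate {
    /** Bytes for the in-memory term index (.tii contents). */
    public static long termIndexBytes(long numTerms, int indexInterval, int bytesPerEntry) {
        return (numTerms / indexInterval) * bytesPerEntry;
    }

    /** Bytes for field norms: one byte per searched field per doc. */
    public static long normBytes(long numDocs, int searchedFields) {
        return numDocs * searchedFields;
    }
}
```

For instance, 128 million terms at the default interval of 128 means one million resident entries; at an assumed 64 bytes each, that is about 64 MB before norms are loaded, which shows how a 60G index can plausibly pull hundreds of megabytes into the VM.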
Re: Lucene in Action
Hi, I am not sure. However, I see that the book has an electronic version you can buy online...

Cheers,
Jian

On Sun, 23 Jan 2005 10:30:24 +0800, ansi [EMAIL PROTECTED] wrote:

hi, all. Does anyone know how to buy Lucene in Action in China?

Ansi
Re: Newbie: Human Readable Stemming, Lucene Architecture, etc!
Hi, one thing to point out: I think Lucene is not using LSI as the underlying retrieval model. It uses the vector space model and also proximity-based retrieval. Personally, I don't know much about LSI, and I don't think fancy stuff like LSI is workable in industry. I believe we are far away from the era of artificial intelligence and from using any elusive way to do information retrieval.

Cheers,
Jian

On Thu, 20 Jan 2005 14:50:10 -0700, Owen Densmore [EMAIL PROTECTED] wrote:

Hi. I'm new to the list, so forgive a dumb question or two as I get started. We're in the midst of converting a small collection (1200-1500 documents currently) of scientific literature to be easily searchable/navigable. We'll likely provide both a text query interface as well as a graphical way to search and discover. Our initial approach will be vector based, looking at Latent Semantic Indexing (LSI) as a potential tool, although if that's not needed, we'll stop at reasonably simple stemming with a weighted document term matrix (DTM). (Bear in mind I couldn't even pronounce most of these concepts last week, so go easy if I'm incoherent!)

It looks to me like Lucene has a quite well-factored architecture. I should at the very least be able to use the analyzer and stemmer to create a good starting point for the project. I'd also like to leave a nice architecture behind in case we or others end up experimenting with, or extending, the system. So a couple of questions:

1 - I'm a bit concerned that reasonable stemming (Porter/Snowball) apparently produces non-word stems, i.e. not really human-readable. (Example: generate, generates, generated, generating -> generat) Although in typical queries this is not important, because the result of the search is a document list, it *would* be important if we use the stems within a graphical navigation interface. So the question is: is there a way to have the stemmer produce English base forms of the words being stemmed?
2 - We're probably using Lucene in ways it was not designed for, such as DTM/LSI and graphical clustering and navigation. Naturally we'll provide code for those parts that are not in Lucene. But the question arises: is this kinda dumb?! Has anyone stretched Lucene's design center with positive results? Are we barking up the wrong tree?

3 - A nit on hyphenation: our collection is scientific, so it has many hyphenated words. I'm wondering about your experiences with hyphenation. In our collection, things like self-organization, power-law, space-time, small-world, agent-based, etc. occur often. So the question is: do folks break up hyphenated words? If not, do you stem the parts and glue them back together? Do you apply stoplists to the parts?

Thanks for any help and pointers you can fling along,

Owen
http://backspaces.net/
http://redfish.com/
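On question 1, one common workaround (not a Lucene feature, just a pattern): keep the non-word stems internally for matching, but record, per stem, the most frequent surface form seen at index time, and show that form in the UI. A sketch of the idea follows; the stem() here is a toy suffix-stripper standing in for a real Porter/Snowball stemmer.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Human-readable display forms for stems: index on the stem, but
 * remember the most frequent surface form for UI display.
 * stem() is a toy suffix-stripper, NOT the Porter algorithm.
 */
public class DisplayStems {
    // stem -> (surface form -> count seen at index time)
    private final Map<String, Map<String, Integer>> forms = new HashMap<>();

    /** Toy stemmer: strips a few common suffixes. Illustrative only. */
    static String stem(String w) {
        for (String suf : new String[] {"ing", "ed", "es", "e", "s"})
            if (w.endsWith(suf) && w.length() > suf.length() + 2)
                return w.substring(0, w.length() - suf.length());
        return w;
    }

    /** Call once per token at index time. */
    public void observe(String word) {
        forms.computeIfAbsent(stem(word), k -> new HashMap<>())
             .merge(word, 1, Integer::sum);
    }

    /** Most frequent surface form for a stem, for the navigation UI. */
    public String display(String stem) {
        Map<String, Integer> counts = forms.get(stem);
        if (counts == null) return stem;
        return counts.entrySet().stream()
                     .max(Map.Entry.comparingByValue())
                     .get().getKey();
    }
}
```

So "generat" would be displayed as whichever of generate/generates/generated/generating dominates the collection, which is usually readable enough for a graphical interface while leaving the index untouched.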