RE: Lucene applicability
Hi,

Thank you all for taking the time to answer my questions! However, there are a few more issues which are not quite clear yet, and I hope to get advice on those too:

1.) How is the index maintained? In another product where we use an indexer different from Lucene, we have one central index and a few JBoss servers all accessing the same index. So how does Lucene handle synchronization between multiple threads (and different JVMs)? How does it maintain the index after update/delete operations on the database?

2.) Is the index always up-to-date? In the FAQs it says we have to re-open the IndexReader periodically ... how expensive (in computational terms) is it to do that on every request, for instance?

3.) I'm still not sure about performance. According to the FAQs we need to build our own MultiPhraseQuery parser to support multiple terms and wildcards. For example, consider 50.000.000 documents, where 50.000 of them match term T1 in category A, 50.000 match term T2 in category B and 1.000.000 match term T3 in category C, and 50 match T1 in A and T2 in B and T3 in C. How fast is the algorithm in this case? Who guarantees that it doesn't start at the 1 million side?

Cheers,
w

-----Original Message-----
From: Lance Norskog [mailto:goks...@gmail.com]
Sent: Thursday, 26 August 2010 05:25
To: java-user@lucene.apache.org
Subject: Re: Lucene applicability

A stepping stone to the above is that, in DB terms, a Lucene index is only one table. It has a suite of indexing features that are very different from database search. The features are oriented to searching large bodies of text for "ideas" rather than concrete words. It searches a lot faster than a DB. It also spends more time creating its various indexes than a DB. Other points: you can't add or drop fields or indexes.

On Wed, Aug 25, 2010 at 10:33 AM, Erick Erickson wrote:

> The SOLR wiki has lots of good information, start there: http://wiki.apache.org/solr/
>
> Otherwise, see below...
>
> On Wed, Aug 25, 2010 at 6:20 AM, Schreiner Wolfgang <wolfgang.schrei...@itsv.at> wrote:
>
>> Hi all,
>>
>> We are currently evaluating potential search frameworks (such as Hibernate Search) which might be suitable for use in our project (using Spring, JPA with Hibernate) ... I am sending this e-mail in the hope that you can advise me on a few issues that would help us in our decision-making process.
>>
>> 1.) Is Lucene suitable for full-text database searches? I read that Lucene was designed to index and search documents, but how does it behave when querying relational data sets in general?
>
> Let's start by talking about the phrase "full text database searches". One thing virtually all db-centric people trip over is trying to use SOLR as if it were a database. You just can't think about tables. The first time you think about using SOLR to do something join-like, stop, take a deep breath, and think about documents instead. The general approach is to flatten your data so that each "document" contains all the relevant info. Yes, this leads to de-normalization. Yes, denormalized data makes a good DBA cringe. But that's the difference between searching and using an RDBMS.
>
> "Document" is somewhat misleading. A document in SOLR terms is just a collection of fields. And, BTW, there's no requirement that each document have the same fields (very unlike a DB).
>
>> 2.) Can we make assumptions about query performance considering combined searches, range queries on structured data and wildcard searches? If we consider a data structure consisting of, say, 3 tables where each table contains a few million entries (e.g. first name, last name and address fields) and we search for common values (such as 'John', 'Smith' and 'New York') where
>>
>> a. each value by itself and each combination would result in millions of hits
>
> Sure, but what those assumptions are is totally dependent on how you've set things up. SOLR has been successfully used on indexes of several billion documents. There are tools for making all that work (i.e. replication, sharding, etc.) built into SOLR. So I suspect you can make things work. Several million documents is not that large a data set.
>
> As always, there are tradeoffs between speed and complexity. But from what you've described I see no show stoppers.
>
>> b. a person can have multiple first names and we want to make sure to receive any combination of the last name with any first name
>
> This just sounds like an OR, but the queries can be pretty complex queries. Some examples of what you expect would help. See multi-valued fields: a "document" can have multiple "firstname" entries. Again, not like a DB (your reflexes will trip you up on this point).
>
>> c. we search for a last name and a range of birth dates
>
> Sure, range queries work just fine. Note that dates can trip you up; look at trie date fields if you experiment.
>
>> 3.) Trans
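To make the advice above about flattening data into documents, multi-valued fields and range queries concrete, here is a minimal sketch against the Lucene 3.0-era Java API that was current at the time (Solr expresses the same ideas in schema.xml instead of Java code). The class name, field names, sample values, index path and the yyyyMMdd integer encoding of the birth date are all invented for illustration, not taken from the thread.

import java.io.File;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.NumericField;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.NumericRangeQuery;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class DenormalizedPersonIndex {
    public static void main(String[] args) throws Exception {
        Directory dir = FSDirectory.open(new File("/tmp/person-index"));
        IndexWriter writer = new IndexWriter(dir,
                new StandardAnalyzer(Version.LUCENE_30),
                IndexWriter.MaxFieldLength.UNLIMITED);

        // One "document" flattens the person + address rows. A person with two
        // first names just gets the "firstname" field added twice (multi-valued).
        Document doc = new Document();
        doc.add(new Field("firstname", "John", Field.Store.YES, Field.Index.ANALYZED));
        doc.add(new Field("firstname", "Jacob", Field.Store.YES, Field.Index.ANALYZED));
        doc.add(new Field("lastname", "Smith", Field.Store.YES, Field.Index.ANALYZED));
        doc.add(new Field("city", "New York", Field.Store.YES, Field.Index.ANALYZED));
        // Encode the birth date numerically (here yyyyMMdd) so range queries are cheap.
        doc.add(new NumericField("birthdate", Field.Store.YES, true).setIntValue(19701224));
        writer.addDocument(doc);
        writer.commit();
        writer.close();

        // Last name AND birth-date range in one BooleanQuery. TermQuery is not
        // analyzed, so use the lowercased form StandardAnalyzer produced at index time.
        BooleanQuery query = new BooleanQuery();
        query.add(new TermQuery(new Term("lastname", "smith")), BooleanClause.Occur.MUST);
        query.add(NumericRangeQuery.newIntRange("birthdate", 19700101, 19791231, true, true),
                  BooleanClause.Occur.MUST);

        IndexReader reader = IndexReader.open(dir);
        IndexSearcher searcher = new IndexSearcher(reader);
        TopDocs hits = searcher.search(query, 10);
        System.out.println("total hits: " + hits.totalHits);
        searcher.close();
        reader.close();
    }
}

In Solr, the same multi-valued "firstname" field would typically be declared with multiValued="true" in schema.xml, and the birth date would use a trie-based date field as suggested above.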
Re: Lucene applicability
See below.

On Tue, Aug 31, 2010 at 5:17 AM, Schreiner Wolfgang <wolfgang.schrei...@itsv.at> wrote:

> Hi,
>
> Thank you all for taking the time to answer my questions! However, there are a few more issues which are not quite clear yet, and I hope to get advice on those too:
>
> 1.) How is the index maintained? In another product where we use an indexer different from Lucene, we have one central index and a few JBoss servers all accessing the same index. So how does Lucene handle synchronization between multiple threads (and different JVMs)? How does it maintain the index after update/delete operations on the database?

It doesn't; you have to do it yourself. You'll have to write some app that periodically queries your db and, when it detects changes, updates the Lucene index. You should only have one JVM with an open writer at a time. You can have as many readers in as many JVMs as you want. Note that the single JVM that has a writer can use multiple write threads; in other words, writers are thread safe.

> 2.) Is the index always up-to-date? In the FAQs it says we have to re-open the IndexReader periodically ... how expensive (in computational terms) is it to do that on every request, for instance?

The index is always up to date; how could it be otherwise? When you open a reader, Lucene essentially takes a snapshot of the index at that moment. Any later updates are not visible to that reader, hence the comment about reopening the reader.

> 3.) I'm still not sure about performance. According to the FAQs we need to build our own MultiPhraseQuery parser to support multiple terms and wildcards. For example, consider 50.000.000 documents, where 50.000 of them match term T1 in category A, 50.000 match term T2 in category B and 1.000.000 match term T3 in category C, and 50 match T1 in A and T2 in B and T3 in C. How fast is the algorithm in this case? Who guarantees that it doesn't start at the 1 million side?

That can't be answered in the abstract because there are too many variables; you have to measure. But there are Lucene installations with many more documents than that, so you can probably get there. Consider SOLR if you need to replicate or shard your indexes because they get too big.

I think you're probably thinking of Lucene in terms of an application, not an engine. Lucene is what you build your application around; it provides the guts of the searching. The rest is up to you.

HTH
Erick

> Cheers,
>
> w
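The single-writer / many-readers advice and the snapshot-then-reopen behaviour described above can be sketched roughly as follows, again against the Lucene 3.0-era Java API. This is only an illustration under invented assumptions: the class name, the NOT_ANALYZED "id" field holding the database primary key, and the external scheduler that calls syncChangedRow() are all made up for the example, and the reader handling is deliberately naive (a production version would, for instance, reference-count readers before closing them).

import java.io.File;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

/**
 * One JVM runs this job and owns the only open IndexWriter; any number of
 * other JVMs may open their own read-only IndexReaders on the same directory.
 */
public class IndexSyncJob {

    private final IndexWriter writer;      // the single writer (thread safe within this JVM)
    private volatile IndexReader reader;   // shared snapshot, refreshed periodically

    public IndexSyncJob(File indexDir) throws Exception {
        Directory dir = FSDirectory.open(indexDir);
        writer = new IndexWriter(dir, new StandardAnalyzer(Version.LUCENE_30),
                                 IndexWriter.MaxFieldLength.UNLIMITED);
        reader = IndexReader.open(dir);
    }

    /**
     * Called by some scheduler for each DB row changed since the last run.
     * Assumes each document carries a NOT_ANALYZED "id" field holding the
     * database primary key (an assumption of this sketch, not a Lucene rule).
     */
    public void syncChangedRow(String primaryKey, Document doc, boolean deleted) throws Exception {
        Term id = new Term("id", primaryKey);
        if (deleted) {
            writer.deleteDocuments(id);     // row was deleted in the DB
        } else {
            writer.updateDocument(id, doc); // delete-then-add covers inserts and updates
        }
        writer.commit();                    // flush so new or reopened readers can see the change
    }

    /** Refresh the snapshot; reopen() is cheap when nothing changed, since unchanged segments are reused. */
    public IndexSearcher refreshAndGetSearcher() throws Exception {
        IndexReader newReader = reader.reopen();
        if (newReader != reader) {
            reader.close();                 // naive: real code should wait for in-flight searches
            reader = newReader;
        }
        return new IndexSearcher(reader);
    }
}

How often to commit and reopen is the knob that trades index freshness against the reopen cost asked about in question 2; batching commits and reopening on a schedule rather than per request is the usual compromise.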
Lucene Revolution Update
Hi,

(apologies for the cross-post)

Just a quick update on Lucene Revolution - coming up in Boston, October 7-8 (see http://lucenerevolution.org).

- Marten Mickos (CEO of Eucalyptus Systems, ex-MySQL CEO) will be giving a keynote on "How Open Source Leads Infrastructure Innovation"
- Bill Press from Salesforce.com has been added to the Cutting Edge of Search panel (joining LinkedIn, Twitter, and eHarmony)
- We've added several new talks, including Jon Gifford (Loggly) and Erik Arnold (Lucene in Government/Search.USA.gov). I'll be doing a session comparing Apache Lucene, Solr and NoSQL
- Submit your tough Solr challenges for the "Stump The Chump" session (send email to st...@lucenerevolution.org)

There are still a limited number of seats available for the Lucene and Solr two-day trainings preceding the conference. If you're interested, please register now. Also, please be aware that the early bird rate expires September 10.

Hope to see you there.

Cheers,
Grant

--
Grant Ingersoll
http://lucenerevolution.org
Apache Lucene/Solr Conference, Boston Oct 7-8