RE: Lucene applicability

2010-08-31 Thread Schreiner Wolfgang
Hi,

Thank you all for your time to answer my questions!
However, there are a few more issues which are not quite clear yet, and I hope to 
get advice on those too:

1.) How is the index maintained? In another product where we use an indexer 
different from Lucene, we got one central index and a few JBoss servers all 
accessing the same index. So how does Lucene handle synchronization between 
multiple threads (different JVMs)? How does it maintain the index after 
update/delete operations on the database?
2.) Is the index always up-to-date? In the FAQs it says we have to re-open the 
IndexReader periodically ... how expensive (in computational terms) is it to do 
that on every request for instance?
3.) I'm still not sure about performance. According to the FAQs we need to 
build our own MultiPhraseQuery parser to support multiple terms and wildcards. 
For example, consider 50.000.000 documents: 50.000 of them match term T1 in 
category A, 50.000 match term T2 in category B, and 1.000.000 match term T3 in 
category C. 50 match T1 in A and T2 in B and T3 in C. How fast is the algorithm 
in this case? Who guarantees that it doesn't start at the 1 million side?

Cheers,

w


-Original Message-
From: Lance Norskog [mailto:goks...@gmail.com] 
Sent: Donnerstag, 26. August 2010 05:25
To: java-user@lucene.apache.org
Subject: Re: Lucene applicability

A stepping stone to the above is that, in DB terms, a Lucene index is
only one table. It has a suite of indexing features that are very
different from database search. The features are oriented to searching
large bodies of text for "ideas" rather than concrete words. It
searches a lot faster than a DB. It also spends more time creating its
various indexes than a DB. Other points- you can't add or drop fields
or indexes.

On Wed, Aug 25, 2010 at 10:33 AM, Erick Erickson
 wrote:
> The SOLR wiki has lots of good information, start there:
> http://wiki.apache.org/solr/
>
> Otherwise, see below...
>
> On Wed, Aug 25, 2010 at 6:20 AM, Schreiner Wolfgang <
> wolfgang.schrei...@itsv.at> wrote:
>
>> Hi all,
>>
>> We are currently evaluating potential search frameworks (such as Hibernate
>> Search) which might be suitable to use in our project (using Spring, JPA
>> with Hibernate) ...
>> I am sending this E-Mail in hope you can advise me on a few issues that
>> would help us in our decision making process.
>>
>>
>> 1.)    Is Lucene suitable for full text database searches? I read Lucene
>> was designed to index and search documents but how does it behave querying
>> relational data sets in general?
>>
>
> Let's start by talking about the phrase "full text database searches". One
> thing virtually all db-centric
> people trip over is trying to use SOLR as if it were a database. You just
> can't think about tables. The
> first time you think about using SOLR to do something join-like, stop and
> take a deep breath and
> think about documents instead. The general approach is to flatten your data
> so that each "document"
> contains all the relevant info. Yes, this leads to de-normalization. Yes,
> denormalized data makes a
> good DBA cringe. But that's the difference between searching and using a
> RDBMS.
>
> "Document" is somewhat misleading. A document in SOLR terms is just a
> collection of fields. And, BTW,
> there's no requirement that each document have the same fields (very unlike
> a DB).
>
>
>>
>> 2.)    Can we make assumptions on query performance considering combined
>> searches, range queries or structured data and wildcard searches? If we
>> consider a data structure consisting of say 3 tables and each table contains
>> a few million entries (e.g. first name, last name and address fields) and we
>> search for common values (such as 'John', 'Smith' and 'New York') where
>>
>> a.       each value for itself and each combination would result in
>> millions of hits
>>
>
> Sure, but what those assumptions are is totally dependent on how you've set
> things up. SOLR has been successfully
> used on several billion document indexes. There are tools for making all
> that work (i.e. replication, sharding, etc)
> built into SOLR. So I suspect you can make things work. Several million
> documents is not that large a data set.
>
> As always, there are tradeoffs between speed and complexity. But from what
> you've described
> I see no show stoppers.
>
>
>>
>> b.      a person can have multiple first names and we want to make sure to
>> receive any combination of the last name with any first name
>>
>>
> This just sounds like an OR. But the queries can be pretty complex.
> Some examples of what you expect would help.
> See multi-valued fields. So, a "document" can have multiple "firstname"
> entries. Again, not like a DB (your reflexes will trip you
> up on this point).
>
>
>> c.       we search for a last name and a range of birth dates
>>
>>
> Sure, range queries work just fine. Note that dates can trip you up; look at
> TrieDateField if you experiment.
>
>
>> 3.)    Trans

Re: Lucene applicability

2010-08-31 Thread Erick Erickson
See below

On Tue, Aug 31, 2010 at 5:17 AM, Schreiner Wolfgang <
wolfgang.schrei...@itsv.at> wrote:

> Hi,
>
> Thank you all for your time to answer my questions!
> However, there are a few more issues which are not quite clear yet, and I hope
> to get advice on those too:
>
> 1.) How is the index maintained? In another product where we use an indexer
> different from Lucene, we got one central index and a few JBoss servers all
> accessing the same index. So how does Lucene handle synchronization between
> multiple threads (different JVMs)? How does it maintain the index after
> update/delete operations on the database?
>

It doesn't, you have to do it yourself. You'll have to write some app that
periodically queries your db and, when it detects changes, updates the Lucene
index. You should have only one JVM with an open writer at a time. You can
have as many readers in as many JVMs as you want. Note that the single JVM
that has the writer can use multiple write threads. In other words, writers
are thread-safe.
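
To make that concrete, here is a minimal sketch of the single-writer setup on a
recent Lucene API (which differs from the 3.x releases current when this thread
was written). The class and field names (PersonIndexer, "id", "lastname",
"firstname", "city") are invented for illustration, and the job that polls the
database for changes is assumed to live elsewhere in your application:

    import java.nio.file.Paths;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.store.FSDirectory;

    public class PersonIndexer {

        // Exactly one IndexWriter per index, opened in exactly one JVM.
        // It is thread-safe, so worker threads in that JVM may share it.
        private final IndexWriter writer;

        public PersonIndexer(String indexPath) throws Exception {
            writer = new IndexWriter(
                    FSDirectory.open(Paths.get(indexPath)),
                    new IndexWriterConfig(new StandardAnalyzer()));
        }

        // Called by the job that polls the database for changes.
        // updateDocument() removes any earlier document with the same primary
        // key and adds the new version, which is how updates are propagated.
        public void upsertPerson(long dbId, String lastName,
                                 String[] firstNames, String city) throws Exception {
            Document doc = new Document();
            doc.add(new StringField("id", Long.toString(dbId), Field.Store.YES));
            doc.add(new TextField("lastname", lastName, Field.Store.YES));
            for (String firstName : firstNames) {
                // multi-valued field: any first name will match in a search
                doc.add(new TextField("firstname", firstName, Field.Store.YES));
            }
            // flattened ("denormalized") address data lives in the same document
            doc.add(new TextField("city", city, Field.Store.YES));
            writer.updateDocument(new Term("id", Long.toString(dbId)), doc);
        }

        // Deletes are handled the same way when the poller sees a removed row.
        public void deletePerson(long dbId) throws Exception {
            writer.deleteDocuments(new Term("id", Long.toString(dbId)));
        }

        public void commit() throws Exception {
            writer.commit();  // make changes durable and visible to new readers
        }
    }

The multi-valued "firstname" field also shows the flattening discussed earlier
in the thread: one document per person, with everything searchable in one place.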


> 2.) Is the index always up-to-date? In the FAQs it says we have to re-open
> the IndexReader periodically ... how expensive (in computational terms) is
> it to do that on every request for instance?
>

The index is always up to date; how could it be otherwise? When you open
a reader, Lucene essentially takes a snapshot of the index at that moment.
Any updates after that are not visible to that reader, thus the comment about
reopening the reader.
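
For a picture of what the reopening looks like in code: later Lucene versions
ship a SearcherManager that wraps the snapshot/refresh cycle (the lower-level
route is DirectoryReader.openIfChanged). A minimal sketch, again on a recent
API and with an invented PersonSearcher class:

    import java.nio.file.Paths;

    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.SearcherManager;
    import org.apache.lucene.store.FSDirectory;

    public class PersonSearcher {

        private final SearcherManager manager;

        public PersonSearcher(String indexPath) throws Exception {
            // Opens the initial reader; every acquire() hands back a searcher
            // over a point-in-time snapshot of the index.
            manager = new SearcherManager(FSDirectory.open(Paths.get(indexPath)), null);
        }

        // Refreshing only reopens the segments that changed since the last
        // snapshot, so it is much cheaper than opening a brand-new reader.
        public void refresh() throws Exception {
            manager.maybeRefresh();
        }

        public long count(Query query) throws Exception {
            IndexSearcher searcher = manager.acquire();
            try {
                return searcher.count(query);
            } finally {
                manager.release(searcher);  // searchers are reference-counted
            }
        }
    }

Refreshing on every request is possible, but you then pay the reopen (and any
cache warming) on the request path, which is why a periodic background refresh
is the usual compromise.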


> 3.) I'm still not sure about performance. According to the FAQs we need to
> build our own MultiPhraseQuery parser to support multiple terms and
> wildcards. For example, consider 50.000.000 documents: 50.000 of them match
> term T1 in category A, 50.000 match term T2 in category B, and 1.000.000
> match term T3 in category C. 50 match T1 in A and T2 in B and T3 in C. How
> fast is the algorithm in this case? Who guarantees that it doesn't start at
> the 1 million side?
>
>
That can't be answered in the abstract because there are too many variables;
you have to measure. But there are Lucene installations that have many more
documents than that, so you can probably get there. Consider SOLR if you need
to replicate or shard your indexes because they get too big.
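
To make the shape of such a query concrete, here is a minimal sketch on a
recent Lucene API. The field names "categoryA"/"categoryB"/"categoryC" and the
prefix wildcard are made up for illustration; if the three conditions really
are a plain AND, an ordinary BooleanQuery expresses it (MultiPhraseQuery only
becomes necessary for wildcards inside a phrase):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.WildcardQuery;

    public class CategoryQueries {

        // "T1 in category A AND T2 in category B AND T3* in category C"
        public static Query t1AndT2AndT3(String t1, String t2, String t3Prefix) {
            return new BooleanQuery.Builder()
                    .add(new TermQuery(new Term("categoryA", t1)), BooleanClause.Occur.MUST)
                    .add(new TermQuery(new Term("categoryB", t2)), BooleanClause.Occur.MUST)
                    // wildcard clause: expands to every term starting with t3Prefix,
                    // so very broad patterns (or leading wildcards) get expensive
                    .add(new WildcardQuery(new Term("categoryC", t3Prefix + "*")),
                         BooleanClause.Occur.MUST)
                    .build();
        }
    }

As far as I understand Lucene's conjunction scoring, the intersection is driven
from the clause with the fewest matches, skipping ahead in the larger postings
lists, so it does not walk the 1.000.000-hit list document by document; broad
wildcards are the bigger risk, since they expand into many terms. But the only
real answer is to measure on your own data.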

I think you're probably thinking of Lucene in terms of an application, not
an engine. Lucene is what you build your application around; it provides the
guts of the searching. The rest is up to you.

HTH
Erick

> Cheers,
>
> w
>
>
> -Original Message-
> From: Lance Norskog [mailto:goks...@gmail.com]
> Sent: Donnerstag, 26. August 2010 05:25
> To: java-user@lucene.apache.org
> Subject: Re: Lucene applicability
>
> A stepping stone to the above is that, in DB terms, a Lucene index is
> only one table. It has a suite of indexing features that are very
> different from database search. The features are oriented to searching
> large bodies of text for "ideas" rather than concrete words. It
> searches a lot faster than a DB. It also spends more time creating its
> various indexes than a DB. Other points- you can't add or drop fields
> or indexes.
>
> On Wed, Aug 25, 2010 at 10:33 AM, Erick Erickson
>  wrote:
> > The SOLR wiki has lots of good information, start there:
> > http://wiki.apache.org/solr/
> >
> > Otherwise, see below...
> >
> > On Wed, Aug 25, 2010 at 6:20 AM, Schreiner Wolfgang <
> > wolfgang.schrei...@itsv.at> wrote:
> >
> >> Hi all,
> >>
> >> We are currently evaluating potential search frameworks (such as
> Hibernate
> >> Search) which might be suitable to use in our project (using Spring, JPA
> >> with Hibernate) ...
> >> I am sending this E-Mail in hope you can advise me on a few issues that
> >> would help us in our decision making process.
> >>
> >>
> >> 1.) Is Lucene suitable for full text database searches? I read Lucene
> >> was designed to index and search documents but how does it behave
> querying
> >> relational data sets in general?
> >>
> >
> > Let's start by talking about the phrase "full text database searches".
> One
> > thing virtually all db-centric
> > people trip over is trying to use SOLR as if it were a database. You just
> > can't think about tables. The
> > first time you think about using SOLR to do something join-like, stop and
> > take a deep breath and
> > think about documents instead. The general approach is to flatten your
> data
> > so that each "document"
> > contains all the relevant info. Yes, this leads to de-normalization. Yes,
> > denormalized data makes a
> > good DBA cringe. But that's the difference between searching and using a
> > RDBMS.
> >
> > "Document" is somewhat misleading. A document in SOLR terms is just a
> > collection of fields. And, BTW,
> > there's no requirement that each document have the same fields (very
> unlike
> > a DB).
> >
> >
> >>
> >> 2.) Can we make assumptions on query performance considering combined
> >> searches, range queries or structured data and wildcard searches? If we
> >> consider a data structure consisting of

Lucene Revolution Update

2010-08-31 Thread Grant Ingersoll
Hi, (apologies for the cross-post)

Just a quick update on Lucene Revolution - coming up in Boston, October 7-8 
(see http://lucenerevolution.org).

- Marten Mickos, CEO of Eucalyptus Systems and ex-MySQL CEO, will be giving a keynote 
on "How Open Source Leads Infrastructure Innovation"
- Bill Press from Salesforce.com has been added to the Cutting Edge of Search 
panel (joining LinkedIn, Twitter, and eHarmony)
- We've added several new talks, including Jon Gifford (Loggly) and Erik Arnold 
(Lucene in Government/Search.USA.gov). I'll be doing a session comparing Apache 
Lucene, Solr and NoSQL
- Submit your tough Solr challenges for the "Stump The Chump" session (send 
email to st...@lucenerevolution.org)

There's still a limited number of seats available for the Lucene and Solr 
two-day trainings preceding the conference. If you're interested, please 
register now. Also, please be aware that the early-bird rate expires September 
10.

Hope to see you there.

Cheers,
Grant

--
Grant Ingersoll
http://lucenerevolution.org Apache Lucene/Solr Conference, Boston Oct 7-8