[SOLR] DisMaxQParserPlugin and Tokenization

2010-11-22 Thread jan.kurella
Hi,

If there is a Solr newsgroup better suited for my question, please point me 
there.

Using the SearchHandler with the defType="dismax" option enables the 
DisMaxQParserPlugin. From investigating it, it seems the query is just tokenized 
on whitespace.

Looking at the code, though, I could not find the place where this behavior is 
enforced. I only found that for each field the getFieldQuery() method is called, 
which either throws an "unknownField" exception or applies the correct analyzer 
(tokenizer and filters) for the given field.
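
For illustration, this is roughly what I understand the parser to build, sketched 
in plain Lucene (the field list, tie-breaker value and getAnalyzerFor() lookup are 
made-up placeholders, not the actual Solr code):

import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.DisjunctionMaxQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.util.Version;

// Hedged sketch of dismax-style query building, not the actual Solr code.
public class DisMaxSketch {
  static Query buildDisMax(String userQuery, List<String> fields, float tieBreaker)
      throws Exception {
    BooleanQuery main = new BooleanQuery();
    for (String word : userQuery.trim().split("\\s+")) {    // the whitespace split in question
      DisjunctionMaxQuery dmq = new DisjunctionMaxQuery(tieBreaker);
      for (String field : fields) {
        Analyzer analyzer = getAnalyzerFor(field);          // hypothetical: the field's schema analyzer
        QueryParser qp = new QueryParser(Version.LUCENE_30, field, analyzer);
        dmq.add(qp.parse(QueryParser.escape(word)));        // per-field analysis happens here
      }
      main.add(dmq, Occur.SHOULD);
    }
    return main;
  }

  static Analyzer getAnalyzerFor(String field) {
    throw new UnsupportedOperationException("placeholder for the per-field analyzer lookup");
  }
}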

We want to use a fancier tokenizer/filter setup with the DisMax query handling.

Where is the best place to hook in?

Jan


Re: [SOLR] DisMaxQParserPlugin and Tokenization

2010-11-22 Thread Ian Lea
> If there is a Solr newsgroup better suited for my question, please point me 
> there.

http://lucene.apache.org/solr/mailing_lists.html


--
Ian.

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: best practice: 1.4 billions documents

2010-11-22 Thread Erick Erickson
Are you looking at Solr? It already has a lot of the infrastructure you would
otherwise be building yourself on top of Lucene, including replication,
distributed searching, etc. Yes, there's a learning curve for something new, but
your Lucene experience will help you a LOT with that. It has support for
sharding (which is what you'll certainly have to do to handle your billion+
documents). Don't re-invent the wheel!

In conjunction, see SolrJ, which provides a Java interface to Solr and may
come in handy.
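
A minimal SolrJ sketch, assuming a stock Solr 1.4 instance on localhost and
made-up field names:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrInputDocument;

public class SolrJSketch {
  public static void main(String[] args) throws Exception {
    CommonsHttpSolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

    // index one document
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "doc-1");
    doc.addField("title", "hello lucene");
    server.add(doc);
    server.commit();

    // query it back
    QueryResponse rsp = server.query(new SolrQuery("title:lucene"));
    System.out.println("hits: " + rsp.getResults().getNumFound());
  }
}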

Start here: http://wiki.apache.org/solr/

Best
Erick

On Mon, Nov 22, 2010 at 1:46 AM, Luca Rondanini wrote:

> Hi David, thanks for your answer. it really helped a lot! so, you have an
> index with more than 2 billions segments. this is pretty much the answer I
> was searching for: lucene alone is able to manage such a big index.
>
> which kind of problems do you have with the parallel searchers? I'm going
> to
> build my index in the next couple of weeks if you want we can confront our
> data
>
> thanks again
> Luca
>
>
> On Sun, Nov 21, 2010 at 6:22 PM, David Fertig  wrote:
>
> > Actually I've been bitten by an still-unresolved issue with the parallel
> > searchers and recommend a MultiReader instead.
> > We have a couple billion docs in our archives as well.  Breaking them up
> by
> > day worked well for us, but you'll need to do something.
> >
> > -Original Message-
> > From: Luca Rondanini [mailto:luca.rondan...@gmail.com]
> > Sent: Sunday, November 21, 2010 8:13 PM
> > To: java-user@lucene.apache.org; yo...@lucidimagination.com
> > Subject: Re: best practice: 1.4 billions documents
> >
> > thank you both!
> >
> > Johannes, katta seems interesting but I will need to solve the problems
> of
> > "hot" updates to the index
> >
> > Yonik, I see your point - so your suggestion would be to build an
> > architecture based on ParallelMultiSearcher?
> >
> >
> > On Sun, Nov 21, 2010 at 3:48 PM, Yonik Seeley <
> yo...@lucidimagination.com
> > >wrote:
> >
> > > On Sun, Nov 21, 2010 at 6:33 PM, Luca Rondanini
> > >  wrote:
> > > > Hi everybody,
> > > >
> > > > I really need some good advice! I need to index in lucene something
> > like
> > > 1.4
> > > > billions documents. I had experience in lucene but I've never worked
> > with
> > > > such a big number of documents. Also this is just the number of docs
> at
> > > > "start-up": they are going to grow and fast.
> > > >
> > > > I don't have to tell you that I need the system to be fast and to
> > support
> > > > real time updates to the documents
> > > >
> > > > The first solution that came to my mind was to use
> > ParallelMultiSearcher,
> > > > splitting the index into many "sub-index" (how many docs per index?
> > > > 100,000?) but I don't have experience with it and I don't know how
> well
> > > will
> > > > scale while the number of documents grows!
> > > >
> > > > A more solid solution seems to build some kind of integration with
> > > hadoop.
> > > > But I didn't find match about lucene and hadoop integration.
> > > >
> > > > Any idea? Which direction should I go (pure lucene or hadoop)?
> > >
> > > There seems to be a common misconception about hadoop regarding search.
> > > Map-reduce as implemented in hadoop is really for batch oriented jobs
> > > only (or those types of jobs where you don't need a quick response
> > > time).  It's definitely not for normal queries (unless you have
> > > unusual requirements).
> > >
> > > -Yonik
> > > http://www.lucidimagination.com
> > >
> > > -
> > > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > > For additional commands, e-mail: java-user-h...@lucene.apache.org
> > >
> > >
> >
>


RE: best practice: 1.4 billions documents

2010-11-22 Thread David Fertig
>> We have a couple billion docs in our archives as well...Breaking them up by 
>> day worked well for us

We do not have 2 billion segments in one index. We have roughly 5-10 million 
documents per index. We are currently using a MultiSearcher, but unresolved 
Lucene issues with it will force us to move to a MultiReader.

As far as the parallel searcher goes, read back on the thread with the subject 
"Search returning documents matching a NOT range".
There is an acknowledged/proven bug with a small unit test, but there is some 
disagreement about the internal reasons it fails. I have been unable to generate 
further discussion or a resolution. This was supposed to be added as a bug to 
JIRA for the 3.3 release, but has not been. I am not sure which class Solr uses, 
but if it uses MultiSearcher, it will have the same bug.

-Original Message-
From: Luca Rondanini [mailto:luca.rondan...@gmail.com] 
Sent: Monday, November 22, 2010 1:47 AM
To: java-user@lucene.apache.org
Subject: Re: best practice: 1.4 billions documents

Hi David, thanks for your answer. it really helped a lot! so, you have an
index with more than 2 billions segments. this is pretty much the answer I
was searching for: lucene alone is able to manage such a big index.

which kind of problems do you have with the parallel searchers? I'm going to
build my index in the next couple of weeks if you want we can confront our
data

thanks again
Luca


On Sun, Nov 21, 2010 at 6:22 PM, David Fertig  wrote:

> Actually I've been bitten by an still-unresolved issue with the parallel
> searchers and recommend a MultiReader instead.
> We have a couple billion docs in our archives as well.  Breaking them up by
> day worked well for us, but you'll need to do something.
>
> -Original Message-
> From: Luca Rondanini [mailto:luca.rondan...@gmail.com]
> Sent: Sunday, November 21, 2010 8:13 PM
> To: java-user@lucene.apache.org; yo...@lucidimagination.com
> Subject: Re: best practice: 1.4 billions documents
>
> thank you both!
>
> Johannes, katta seems interesting but I will need to solve the problems of
> "hot" updates to the index
>
> Yonik, I see your point - so your suggestion would be to build an
> architecture based on ParallelMultiSearcher?
>
>
> On Sun, Nov 21, 2010 at 3:48 PM, Yonik Seeley  >wrote:
>
> > On Sun, Nov 21, 2010 at 6:33 PM, Luca Rondanini
> >  wrote:
> > > Hi everybody,
> > >
> > > I really need some good advice! I need to index in lucene something
> like
> > 1.4
> > > billions documents. I had experience in lucene but I've never worked
> with
> > > such a big number of documents. Also this is just the number of docs at
> > > "start-up": they are going to grow and fast.
> > >
> > > I don't have to tell you that I need the system to be fast and to
> support
> > > real time updates to the documents
> > >
> > > The first solution that came to my mind was to use
> ParallelMultiSearcher,
> > > splitting the index into many "sub-index" (how many docs per index?
> > > 100,000?) but I don't have experience with it and I don't know how well
> > will
> > > scale while the number of documents grows!
> > >
> > > A more solid solution seems to build some kind of integration with
> > hadoop.
> > > But I didn't find match about lucene and hadoop integration.
> > >
> > > Any idea? Which direction should I go (pure lucene or hadoop)?
> >
> > There seems to be a common misconception about hadoop regarding search.
> > Map-reduce as implemented in hadoop is really for batch oriented jobs
> > only (or those types of jobs where you don't need a quick response
> > time).  It's definitely not for normal queries (unless you have
> > unusual requirements).
> >
> > -Yonik
> > http://www.lucidimagination.com
> >
> > -
> > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > For additional commands, e-mail: java-user-h...@lucene.apache.org
> >
> >
>


RE: best practice: 1.4 billions documents

2010-11-22 Thread Uwe Schindler
There is no reason to use MultiSearcher instead of the much more consistent and 
effective MultiReader! We (Robert and I) are already planning to deprecate it. 
Since Lucene 2.9, MultiSearcher itself has no benefit over a simple IndexSearcher 
on top of a MultiReader; it only has problems.
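
A minimal sketch of what I mean, assuming Lucene 3.x and placeholder index paths:

import java.io.File;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.FSDirectory;

// One IndexSearcher over one MultiReader, instead of a MultiSearcher over many searchers.
public class MultiReaderExample {
  public static IndexSearcher open() throws Exception {
    IndexReader day1 = IndexReader.open(FSDirectory.open(new File("/indexes/2010-11-21")));
    IndexReader day2 = IndexReader.open(FSDirectory.open(new File("/indexes/2010-11-22")));
    // docFreq/idf are computed across all sub-readers, so scoring stays consistent
    return new IndexSearcher(new MultiReader(day1, day2));
  }
}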

The only use cases for real MultiSearchers are the subclasses for "remote search" 
or (perhaps) multi-threaded search, but I would not recommend the latter 
(instead, leave the additional CPUs in your machine free for other users doing 
searches in parallel). Multithreading a single search should not be done, as it 
slows down multiple users accessing the same index at the same time. Spend the 
additional CPU power on other things like warming searchers, indexing additional 
documents, or filling the FieldCache in parallel.

Uwe

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de


> -Original Message-
> From: David Fertig [mailto:dfer...@cymfony.com]
> Sent: Monday, November 22, 2010 4:54 PM
> To: java-user@lucene.apache.org
> Subject: RE: best practice: 1.4 billions documents
> 
> >> We have a couple billion docs in our archives as well...Breaking them
> >> up by day worked well for us
> 
> We do not have 2 billion segments in one index  We have roughly 5-10 million
> documents per index. We are currently using a miltisearcher but unresolved
> lucene issues in this will force us to move to a multireader.
> 
> As far as the parallel searcher goes, read back on the thread with subject
> "Search returning documents matching a NOT range".
> There is an acknowledged/proven bug with a small unit test, but there is some
> disagreement about the internal reasons it fails. I have been unable to
> generate further discussion or a resolution. This was supposed to be added as 
> a
> bug to the JIRA for the 3.3 release, but has not been.  I am not which class 
> Solr
> uses, but if it uses MultiSearcher, it will have the same bug.
> 
> -Original Message-
> From: Luca Rondanini [mailto:luca.rondan...@gmail.com]
> Sent: Monday, November 22, 2010 1:47 AM
> To: java-user@lucene.apache.org
> Subject: Re: best practice: 1.4 billions documents
> 
> Hi David, thanks for your answer. it really helped a lot! so, you have an 
> index
> with more than 2 billions segments. this is pretty much the answer I was
> searching for: lucene alone is able to manage such a big index.
> 
> which kind of problems do you have with the parallel searchers? I'm going to
> build my index in the next couple of weeks if you want we can confront our
> data
> 
> thanks again
> Luca
> 
> 
> On Sun, Nov 21, 2010 at 6:22 PM, David Fertig  wrote:
> 
> > Actually I've been bitten by an still-unresolved issue with the
> > parallel searchers and recommend a MultiReader instead.
> > We have a couple billion docs in our archives as well.  Breaking them
> > up by day worked well for us, but you'll need to do something.
> >
> > -Original Message-
> > From: Luca Rondanini [mailto:luca.rondan...@gmail.com]
> > Sent: Sunday, November 21, 2010 8:13 PM
> > To: java-user@lucene.apache.org; yo...@lucidimagination.com
> > Subject: Re: best practice: 1.4 billions documents
> >
> > thank you both!
> >
> > Johannes, katta seems interesting but I will need to solve the
> > problems of "hot" updates to the index
> >
> > Yonik, I see your point - so your suggestion would be to build an
> > architecture based on ParallelMultiSearcher?
> >
> >
> > On Sun, Nov 21, 2010 at 3:48 PM, Yonik Seeley
> >  > >wrote:
> >
> > > On Sun, Nov 21, 2010 at 6:33 PM, Luca Rondanini
> > >  wrote:
> > > > Hi everybody,
> > > >
> > > > I really need some good advice! I need to index in lucene
> > > > something
> > like
> > > 1.4
> > > > billions documents. I had experience in lucene but I've never
> > > > worked
> > with
> > > > such a big number of documents. Also this is just the number of
> > > > docs at
> > > > "start-up": they are going to grow and fast.
> > > >
> > > > I don't have to tell you that I need the system to be fast and to
> > support
> > > > real time updates to the documents
> > > >
> > > > The first solution that came to my mind was to use
> > ParallelMultiSearcher,
> > > > splitting the index into many "sub-index" (how many docs per index?
> > > > 100,000?) but I don't have experience with it and I don't know how
> > > > well
> > > will
> > > > scale while the number of documents grows!
> > > >
> > > > A more solid solution seems to build some kind of integration with
> > > hadoop.
> > > > But I didn't find match about lucene and hadoop integration.
> > > >
> > > > Any idea? Which direction should I go (pure lucene or hadoop)?
> > >
> > > There seems to be a common misconception about hadoop regarding
> search.
> > > Map-reduce as implemented in hadoop is really for batch oriented
> > > jobs only (or those types of jobs where you don't need a quick
> > > response time).  It's definitely not for normal queries (unless you
>

incremental indexation

2010-11-22 Thread ZYWALEWSKI, DANIEL (DANIEL)
Hello,
 I'm stuck with a problem and don't know how to work it out. I'm working on 
indexing objects that live in computer memory (they exist only in my Java code). 
I don't have any problems indexing them; however, I have no idea how to re-index 
them if they change during the execution of the code. One of my ideas is to add 
change events to these objects (if any parameter changes -> re-index the 
object), but I'm not sure how efficient that would be.
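
Here is the idea in code as I picture it (Lucene 3.x; MyObject, getId() and
getText() are just placeholders for my real classes):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

// Hypothetical sketch: whenever a domain object changes, replace its document
// in the index, keyed by a stable id field.
public class ReindexOnChange {
  private final IndexWriter writer;

  public ReindexOnChange(IndexWriter writer) { this.writer = writer; }

  // called from the object's change event
  public void objectChanged(MyObject obj) throws Exception {
    Document doc = new Document();
    doc.add(new Field("id", obj.getId(), Field.Store.YES, Field.Index.NOT_ANALYZED));
    doc.add(new Field("body", obj.getText(), Field.Store.NO, Field.Index.ANALYZED));
    // delete-then-add in one call; commit/reopen in batches, not per change
    writer.updateDocument(new Term("id", obj.getId()), doc);
  }

  // placeholder for the in-memory domain class
  public interface MyObject { String getId(); String getText(); }
}
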
 Thank you in advance
   Daniel



RE: best practice: 1.4 billions documents

2010-11-22 Thread David Fertig
> it has only problems.
Perhaps these known problems should be added to the API docs, so users who are 
encouraged to start clean with the 3.x API don't build bad applications from 
scratch?

Parallel searching is extremely powerful and should not be abandoned.



-Original Message-
From: Uwe Schindler [mailto:u...@thetaphi.de] 
Sent: Monday, November 22, 2010 11:19 AM
To: java-user@lucene.apache.org
Subject: RE: best practice: 1.4 billions documents

There is no reason to use MultiSearcher instead the much more consistent and 
effective  MultiReader! We (Robert and me) are already planning to deprecate 
it. MultiSearcher itsself has no benefit over a simple IndexSearcher on top of 
a MultiReader since Lucene 2.9, it has only problems.

Use cases for real MultiSearchers are only the subclasses for "remote search" 
or (perhaps) multi-threaded search, but the latter I would not recommend 
(instead let the additional CPUs in your machine be free for other users doing 
searches in parallel). Multithreading a single search should not be done, as it 
slows down multiple users accessing the same index at the same time. Spend the 
additional CPU power for other things like warming searchers, indexing 
additional documents, or filling FieldCache in parallel.

Uwe

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de


> -Original Message-
> From: David Fertig [mailto:dfer...@cymfony.com]
> Sent: Monday, November 22, 2010 4:54 PM
> To: java-user@lucene.apache.org
> Subject: RE: best practice: 1.4 billions documents
> 
> >> We have a couple billion docs in our archives as well...Breaking them
> >> up by day worked well for us
> 
> We do not have 2 billion segments in one index  We have roughly 5-10 million
> documents per index. We are currently using a miltisearcher but unresolved
> lucene issues in this will force us to move to a multireader.
> 
> As far as the parallel searcher goes, read back on the thread with subject
> "Search returning documents matching a NOT range".
> There is an acknowledged/proven bug with a small unit test, but there is some
> disagreement about the internal reasons it fails. I have been unable to
> generate further discussion or a resolution. This was supposed to be added as 
> a
> bug to the JIRA for the 3.3 release, but has not been.  I am not which class 
> Solr
> uses, but if it uses MultiSearcher, it will have the same bug.
> 
> -Original Message-
> From: Luca Rondanini [mailto:luca.rondan...@gmail.com]
> Sent: Monday, November 22, 2010 1:47 AM
> To: java-user@lucene.apache.org
> Subject: Re: best practice: 1.4 billions documents
> 
> Hi David, thanks for your answer. it really helped a lot! so, you have an 
> index
> with more than 2 billions segments. this is pretty much the answer I was
> searching for: lucene alone is able to manage such a big index.
> 
> which kind of problems do you have with the parallel searchers? I'm going to
> build my index in the next couple of weeks if you want we can confront our
> data
> 
> thanks again
> Luca
> 
> 
> On Sun, Nov 21, 2010 at 6:22 PM, David Fertig  wrote:
> 
> > Actually I've been bitten by an still-unresolved issue with the
> > parallel searchers and recommend a MultiReader instead.
> > We have a couple billion docs in our archives as well.  Breaking them
> > up by day worked well for us, but you'll need to do something.
> >
> > -Original Message-
> > From: Luca Rondanini [mailto:luca.rondan...@gmail.com]
> > Sent: Sunday, November 21, 2010 8:13 PM
> > To: java-user@lucene.apache.org; yo...@lucidimagination.com
> > Subject: Re: best practice: 1.4 billions documents
> >
> > thank you both!
> >
> > Johannes, katta seems interesting but I will need to solve the
> > problems of "hot" updates to the index
> >
> > Yonik, I see your point - so your suggestion would be to build an
> > architecture based on ParallelMultiSearcher?
> >
> >
> > On Sun, Nov 21, 2010 at 3:48 PM, Yonik Seeley
> >  > >wrote:
> >
> > > On Sun, Nov 21, 2010 at 6:33 PM, Luca Rondanini
> > >  wrote:
> > > > Hi everybody,
> > > >
> > > > I really need some good advice! I need to index in lucene
> > > > something
> > like
> > > 1.4
> > > > billions documents. I had experience in lucene but I've never
> > > > worked
> > with
> > > > such a big number of documents. Also this is just the number of
> > > > docs at
> > > > "start-up": they are going to grow and fast.
> > > >
> > > > I don't have to tell you that I need the system to be fast and to
> > support
> > > > real time updates to the documents
> > > >
> > > > The first solution that came to my mind was to use
> > ParallelMultiSearcher,
> > > > splitting the index into many "sub-index" (how many docs per index?
> > > > 100,000?) but I don't have experience with it and I don't know how
> > > > well
> > > will
> > > > scale while the number of documents grows!
> > > >
> > > > A more solid solution seems to build some kind of integration with

RE: best practice: 1.4 billions documents

2010-11-22 Thread Uwe Schindler
A local multithreaded search can be done in another way, even for a single 
index, but not using the implementation of (Parallel)MultiSearcher. It could be a 
new class directly extending IndexSearcher, which might even do the parallel 
search over e.g. different segments (because searching a MultiReader is no longer 
different from searching a DirectoryReader; the segments of a single Lucene index 
are internally also handled like MultiReaders).
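
Just to illustrate the idea, a hand-rolled sketch (not an existing Lucene class)
that searches each sub-reader in its own thread and remaps hits into the
top-level doc id space; it assumes a composite reader and keeps the
per-sub-reader scoring caveats discussed in this thread:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;

public class ParallelSubReaderSearch {
  public static List<ScoreDoc> search(IndexReader top, final Query query, final int n)
      throws Exception {
    IndexReader[] subs = top.getSequentialSubReaders();   // composite readers only
    ExecutorService pool = Executors.newFixedThreadPool(subs.length);
    List<Future<ScoreDoc[]>> futures = new ArrayList<Future<ScoreDoc[]>>();
    int docBase = 0;
    for (final IndexReader sub : subs) {
      final int base = docBase;
      docBase += sub.maxDoc();
      futures.add(pool.submit(new Callable<ScoreDoc[]>() {
        public ScoreDoc[] call() throws Exception {
          ScoreDoc[] hits = new IndexSearcher(sub).search(query, n).scoreDocs;
          for (ScoreDoc hit : hits) {
            hit.doc += base;                  // remap into the top-level doc id space
          }
          return hits;
        }
      }));
    }
    List<ScoreDoc> all = new ArrayList<ScoreDoc>();
    for (Future<ScoreDoc[]> f : futures) {
      for (ScoreDoc hit : f.get()) {
        all.add(hit);
      }
    }
    pool.shutdown();
    // a real implementation would sort "all" by score and keep only the top n
    return all;
  }
}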

The problem with MultiSearcher is the way query rewriting is handled, because 
scoring needs every sub-searcher to execute the same rewritten query.

Uwe

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de


> -Original Message-
> From: David Fertig [mailto:dfer...@cymfony.com]
> Sent: Monday, November 22, 2010 5:57 PM
> To: java-user@lucene.apache.org
> Subject: RE: best practice: 1.4 billions documents
> 
> > it has only problems.
> Perhaps these known problems should be added to the doc api, so users who
> are encouraged to start clean with the 3.x API don't build bad applications 
> from
> scratch?
> 
> Parallel searching is extremely powerful and should not be abandoned.
> 
> 
> 
> -Original Message-
> From: Uwe Schindler [mailto:u...@thetaphi.de]
> Sent: Monday, November 22, 2010 11:19 AM
> To: java-user@lucene.apache.org
> Subject: RE: best practice: 1.4 billions documents
> 
> There is no reason to use MultiSearcher instead the much more consistent and
> effective  MultiReader! We (Robert and me) are already planning to deprecate
> it. MultiSearcher itsself has no benefit over a simple IndexSearcher on top 
> of a
> MultiReader since Lucene 2.9, it has only problems.
> 
> Use cases for real MultiSearchers are only the subclasses for "remote search"
> or (perhaps) multi-threaded search, but the latter I would not recommend
> (instead let the additional CPUs in your machine be free for other users doing
> searches in parallel). Multithreading a single search should not be done, as 
> it
> slows down multiple users accessing the same index at the same time. Spend
> the additional CPU power for other things like warming searchers, indexing
> additional documents, or filling FieldCache in parallel.
> 
> Uwe
> 
> -
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: u...@thetaphi.de
> 
> 
> > -Original Message-
> > From: David Fertig [mailto:dfer...@cymfony.com]
> > Sent: Monday, November 22, 2010 4:54 PM
> > To: java-user@lucene.apache.org
> > Subject: RE: best practice: 1.4 billions documents
> >
> > >> We have a couple billion docs in our archives as well...Breaking
> > >> them up by day worked well for us
> >
> > We do not have 2 billion segments in one index  We have roughly 5-10
> > million documents per index. We are currently using a miltisearcher
> > but unresolved lucene issues in this will force us to move to a multireader.
> >
> > As far as the parallel searcher goes, read back on the thread with
> > subject "Search returning documents matching a NOT range".
> > There is an acknowledged/proven bug with a small unit test, but there
> > is some disagreement about the internal reasons it fails. I have been
> > unable to generate further discussion or a resolution. This was
> > supposed to be added as a bug to the JIRA for the 3.3 release, but has
> > not been.  I am not which class Solr uses, but if it uses MultiSearcher, it 
> > will
> have the same bug.
> >
> > -Original Message-
> > From: Luca Rondanini [mailto:luca.rondan...@gmail.com]
> > Sent: Monday, November 22, 2010 1:47 AM
> > To: java-user@lucene.apache.org
> > Subject: Re: best practice: 1.4 billions documents
> >
> > Hi David, thanks for your answer. it really helped a lot! so, you have
> > an index with more than 2 billions segments. this is pretty much the
> > answer I was searching for: lucene alone is able to manage such a big index.
> >
> > which kind of problems do you have with the parallel searchers? I'm
> > going to build my index in the next couple of weeks if you want we can
> > confront our data
> >
> > thanks again
> > Luca
> >
> >
> > On Sun, Nov 21, 2010 at 6:22 PM, David Fertig 
> wrote:
> >
> > > Actually I've been bitten by an still-unresolved issue with the
> > > parallel searchers and recommend a MultiReader instead.
> > > We have a couple billion docs in our archives as well.  Breaking
> > > them up by day worked well for us, but you'll need to do something.
> > >
> > > -Original Message-
> > > From: Luca Rondanini [mailto:luca.rondan...@gmail.com]
> > > Sent: Sunday, November 21, 2010 8:13 PM
> > > To: java-user@lucene.apache.org; yo...@lucidimagination.com
> > > Subject: Re: best practice: 1.4 billions documents
> > >
> > > thank you both!
> > >
> > > Johannes, katta seems interesting but I will need to solve the
> > > problems of "hot" updates to the index
> > >
> > > Yonik, I see your point - so your suggestion would be to build an
> > > architecture based on ParallelMultiSear

Re: best practice: 1.4 billions documents

2010-11-22 Thread eks dev
Am I the only one who thinks this is not the way to go? MultiReader (or
MultiSearcher) is not going to fix your problems. Having 1.4B documents on
one machine is a big number, no matter how you partition them (unless you
have some really expensive hardware at your disposal). Did I miss the point
somewhere with this recommendation of "use MultiReader and you are good for
1.4B documents"?

IMO, you must distribute your index across many machines.

Your best chance is to look at Solr Cloud and Solr replication (the Solr wiki is
your friend). Of course, you can do it yourself, but building a distributed
setup with what you call "real time updates" is a huge pain.

Alternatively, google for Lucene or Solr on Cassandra (it has some very nice
properties regarding update latency and architectural simplicity). I do not know
whether that is in production anywhere.

Good luck,
e.



On Mon, Nov 22, 2010 at 5:18 PM, Uwe Schindler  wrote:

> There is no reason to use MultiSearcher instead the much more consistent
> and effective  MultiReader! We (Robert and me) are already planning to
> deprecate it. MultiSearcher itsself has no benefit over a simple
> IndexSearcher on top of a MultiReader since Lucene 2.9, it has only
> problems.
>
> Use cases for real MultiSearchers are only the subclasses for "remote
> search" or (perhaps) multi-threaded search, but the latter I would not
> recommend (instead let the additional CPUs in your machine be free for other
> users doing searches in parallel). Multithreading a single search should not
> be done, as it slows down multiple users accessing the same index at the
> same time. Spend the additional CPU power for other things like warming
> searchers, indexing additional documents, or filling FieldCache in parallel.
>
> Uwe
>
> -
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: u...@thetaphi.de
>
>
> > -Original Message-
> > From: David Fertig [mailto:dfer...@cymfony.com]
> > Sent: Monday, November 22, 2010 4:54 PM
> > To: java-user@lucene.apache.org
> > Subject: RE: best practice: 1.4 billions documents
> >
> > >> We have a couple billion docs in our archives as well...Breaking them
> > >> up by day worked well for us
> >
> > We do not have 2 billion segments in one index  We have roughly 5-10
> million
> > documents per index. We are currently using a miltisearcher but
> unresolved
> > lucene issues in this will force us to move to a multireader.
> >
> > As far as the parallel searcher goes, read back on the thread with
> subject
> > "Search returning documents matching a NOT range".
> > There is an acknowledged/proven bug with a small unit test, but there is
> some
> > disagreement about the internal reasons it fails. I have been unable to
> > generate further discussion or a resolution. This was supposed to be
> added as a
> > bug to the JIRA for the 3.3 release, but has not been.  I am not which
> class Solr
> > uses, but if it uses MultiSearcher, it will have the same bug.
> >
> > -Original Message-
> > From: Luca Rondanini [mailto:luca.rondan...@gmail.com]
> > Sent: Monday, November 22, 2010 1:47 AM
> > To: java-user@lucene.apache.org
> > Subject: Re: best practice: 1.4 billions documents
> >
> > Hi David, thanks for your answer. it really helped a lot! so, you have an
> index
> > with more than 2 billions segments. this is pretty much the answer I was
> > searching for: lucene alone is able to manage such a big index.
> >
> > which kind of problems do you have with the parallel searchers? I'm going
> to
> > build my index in the next couple of weeks if you want we can confront
> our
> > data
> >
> > thanks again
> > Luca
> >
> >
> > On Sun, Nov 21, 2010 at 6:22 PM, David Fertig 
> wrote:
> >
> > > Actually I've been bitten by an still-unresolved issue with the
> > > parallel searchers and recommend a MultiReader instead.
> > > We have a couple billion docs in our archives as well.  Breaking them
> > > up by day worked well for us, but you'll need to do something.
> > >
> > > -Original Message-
> > > From: Luca Rondanini [mailto:luca.rondan...@gmail.com]
> > > Sent: Sunday, November 21, 2010 8:13 PM
> > > To: java-user@lucene.apache.org; yo...@lucidimagination.com
> > > Subject: Re: best practice: 1.4 billions documents
> > >
> > > thank you both!
> > >
> > > Johannes, katta seems interesting but I will need to solve the
> > > problems of "hot" updates to the index
> > >
> > > Yonik, I see your point - so your suggestion would be to build an
> > > architecture based on ParallelMultiSearcher?
> > >
> > >
> > > On Sun, Nov 21, 2010 at 3:48 PM, Yonik Seeley
> > >  > > >wrote:
> > >
> > > > On Sun, Nov 21, 2010 at 6:33 PM, Luca Rondanini
> > > >  wrote:
> > > > > Hi everybody,
> > > > >
> > > > > I really need some good advice! I need to index in lucene
> > > > > something
> > > like
> > > > 1.4
> > > > > billions documents. I had experience in lucene but I've never
> > > > > worked
> > > with
> > > > > such a big number of docum

RE: best practice: 1.4 billions documents

2010-11-22 Thread Uwe Schindler
The latest discussion was more about MultiReader vs. MultiSearcher.

But you are right, 1.4 B documents is not easy to handle, especially when your
index grows and you reach the 2.1 B mark; then no MultiSearcher or anything else
helps.

On the other hand, even distributed Solr has the same problems as MultiSearcher:
scoring MultiTermQueries (e.g. Fuzzy) doesn't work correctly, and negative MTQ
clauses may produce wrong results if the query rewriting is done as in
MultiSearcher (which is unsolvably broken for some queries used as Boolean
clauses - see De Morgan's laws).

Uwe

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de


> -Original Message-
> From: eks...@googlemail.com [mailto:eks...@googlemail.com] On Behalf Of
> eks dev
> Sent: Monday, November 22, 2010 5:55 PM
> To: java-user@lucene.apache.org
> Subject: Re: best practice: 1.4 billions documents
> 
> Am I the only one who thinks this is not the way to go, MultiReader (or
> MulitiSearcher) is not going to fix your problems. Having 1.4B Documents
on
> one machine is a big number, does not matter how you partition them (or
you
> have some really expensive hardware at your disposal).  Did I miss the
point
> somewhere with this recommendation "use MultiReader and you are good for
> 1.4B Document"?
> 
> Imo, you must distribute your index across many machines.
> 
> Your best chance is to look at solr cloud and solr replication (solr Wiki
is your
> friend). Of course, you can do it yourself, but building distributet setup
with
> what you call "real time updates" is a huge pain.
> 
> Alternatively, google for lucene or solr on cassandra (has some very nice
> properties about update latency and architectural simplicity).I do not
know if
> this is somewhere in production.
> 
> Good luck,
> e.
> 
> 
> 
> On Mon, Nov 22, 2010 at 5:18 PM, Uwe Schindler  wrote:
> 
> > There is no reason to use MultiSearcher instead the much more
> > consistent and effective  MultiReader! We (Robert and me) are already
> > planning to deprecate it. MultiSearcher itsself has no benefit over a
> > simple IndexSearcher on top of a MultiReader since Lucene 2.9, it has
> > only problems.
> >
> > Use cases for real MultiSearchers are only the subclasses for "remote
> > search" or (perhaps) multi-threaded search, but the latter I would not
> > recommend (instead let the additional CPUs in your machine be free for
> > other users doing searches in parallel). Multithreading a single
> > search should not be done, as it slows down multiple users accessing
> > the same index at the same time. Spend the additional CPU power for
> > other things like warming searchers, indexing additional documents, or
filling
> FieldCache in parallel.
> >
> > Uwe
> >
> > -
> > Uwe Schindler
> > H.-H.-Meier-Allee 63, D-28213 Bremen
> > http://www.thetaphi.de
> > eMail: u...@thetaphi.de
> >
> >
> > > -Original Message-
> > > From: David Fertig [mailto:dfer...@cymfony.com]
> > > Sent: Monday, November 22, 2010 4:54 PM
> > > To: java-user@lucene.apache.org
> > > Subject: RE: best practice: 1.4 billions documents
> > >
> > > >> We have a couple billion docs in our archives as well...Breaking
> > > >> them up by day worked well for us
> > >
> > > We do not have 2 billion segments in one index  We have roughly 5-10
> > million
> > > documents per index. We are currently using a miltisearcher but
> > unresolved
> > > lucene issues in this will force us to move to a multireader.
> > >
> > > As far as the parallel searcher goes, read back on the thread with
> > subject
> > > "Search returning documents matching a NOT range".
> > > There is an acknowledged/proven bug with a small unit test, but
> > > there is
> > some
> > > disagreement about the internal reasons it fails. I have been unable
> > > to generate further discussion or a resolution. This was supposed to
> > > be
> > added as a
> > > bug to the JIRA for the 3.3 release, but has not been.  I am not
> > > which
> > class Solr
> > > uses, but if it uses MultiSearcher, it will have the same bug.
> > >
> > > -Original Message-
> > > From: Luca Rondanini [mailto:luca.rondan...@gmail.com]
> > > Sent: Monday, November 22, 2010 1:47 AM
> > > To: java-user@lucene.apache.org
> > > Subject: Re: best practice: 1.4 billions documents
> > >
> > > Hi David, thanks for your answer. it really helped a lot! so, you
> > > have an
> > index
> > > with more than 2 billions segments. this is pretty much the answer I
> > > was searching for: lucene alone is able to manage such a big index.
> > >
> > > which kind of problems do you have with the parallel searchers? I'm
> > > going
> > to
> > > build my index in the next couple of weeks if you want we can
> > > confront
> > our
> > > data
> > >
> > > thanks again
> > > Luca
> > >
> > >
> > > On Sun, Nov 21, 2010 at 6:22 PM, David Fertig 
> > wrote:
> > >
> > > > Actually I've been bitten by an still-unresolved issue with the
> > 

Re: best practice: 1.4 billions documents

2010-11-22 Thread Yonik Seeley
On Mon, Nov 22, 2010 at 12:17 PM, Uwe Schindler  wrote:
> The latest discussion was more about MultiReader vs. MultiSearcher.
>
> But you are right, 1.4 B documents is not easy to go, especially when you
> index grows and you get to the 2.1 B marker, then no MultiSearcher or
> whatever helps.
>
> On the other hand even distributed Solr has the same problems like
> MultiSearcher: scoring MultiTermQueries (Fuzzy) doesn't work correctly

Are you referring to the idf being local to the shard instead of
global to the whole collection?
Andrzej has a patch in the works, but it's not committed yet.

> negative MTQ clauses may produce wrong results if the query rewriting is
> done like in MultiSearcher (which is unsolveable broken and broken and
> broken and again broken for some queries as Boolean clauses - see DeMorgan
> laws).

I don't think this is a problem for Solr.  Queries are executed on
each shard as normal (no difference from a non-distributed query).

-Yonik
http://www.lucidimagination.com

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: best practice: 1.4 billions documents

2010-11-22 Thread Uwe Schindler
Hi Yonik,

Can we do the same for Lucene? The problem is combining the rewritten
queries using the broken combining method in the Query class.

As far as I know, the problem is that e.g. MTQs rewrite *per searcher*, so each
searcher uses a different rewritten query (with different terms). So the scores
are totally different even with a distributed tf-idf patch (Fuzzy scores on
MultiSearcher and Solr are totally wrong because each shard uses a different
rewritten query). To work around that, the Query class has a hopelessly broken
method to combine queries, which violates De Morgan's laws when there are e.g.
negative clauses. And this method cannot be fixed to work with all queries
(https://issues.apache.org/jira/browse/LUCENE-2756).

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de


> -Original Message-
> From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik
> Seeley
> Sent: Monday, November 22, 2010 6:29 PM
> To: java-user@lucene.apache.org
> Subject: Re: best practice: 1.4 billions documents
> 
> On Mon, Nov 22, 2010 at 12:17 PM, Uwe Schindler  wrote:
> > The latest discussion was more about MultiReader vs. MultiSearcher.
> >
> > But you are right, 1.4 B documents is not easy to go, especially when
> > you index grows and you get to the 2.1 B marker, then no MultiSearcher
> > or whatever helps.
> >
> > On the other hand even distributed Solr has the same problems like
> > MultiSearcher: scoring MultiTermQueries (Fuzzy) doesn't work correctly
> 
> Are you referring to the idf being local to the shard instead of global to
the
> whole colleciton?
> Andrzej has a patch in the works, but it's not committed yet.
> 
> > negative MTQ clauses may produce wrong results if the query rewriting
> > is done like in MultiSearcher (which is unsolveable broken and broken
> > and broken and again broken for some queries as Boolean clauses - see
> > DeMorgan laws).
> 
> I don't think this is a problem for Solr.  Queries are executed on each
shard as
> normal (no difference from a non-distributed query).
> 
> -Yonik
> http://www.lucidimagination.com
> 
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: best practice: 1.4 billions documents

2010-11-22 Thread Luca Rondanini
Thank you all, I really got some good hints!

Of course I will distribute my index over many machines: storing everything on
one computer is just crazy; 1.4B docs is going to be an index of almost 2 TB
(in my case).

The best solution for me at the moment (from your suggestions) seems to be to
identify a criterion that forces a request (search/update) to access only a
subset of the index. Multi or parallel searchers... I'll decide later.
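
For example, something as simple as hashing a stable document key to pick the
sub-index (just a sketch, names made up):

// Sketch: route each document (and each key-based lookup/update) to one of N sub-indexes.
// Broad searches would still have to fan out over all shards.
public final class ShardRouter {
  private final int numShards;

  public ShardRouter(int numShards) { this.numShards = numShards; }

  public int shardFor(String docKey) {
    // mask the sign bit so Integer.MIN_VALUE hash codes can't yield a negative index
    return (docKey.hashCode() & 0x7fffffff) % numShards;
  }
}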

Solr is a really good option and I've already planned on "stealing" parts of its
code, but I have the time and resources to try to build my own platform,
especially since my data needs heavy processing.

I'll keep you posted
Luca






On Mon, Nov 22, 2010 at 8:54 AM, eks dev  wrote:

> Am I the only one who thinks this is not the way to go, MultiReader (or
> MulitiSearcher) is not going to fix your problems. Having 1.4B Documents on
> one machine is a big number, does not matter how you partition them (or you
> have some really expensive hardware at your disposal).  Did I miss the
> point
> somewhere with this recommendation "use MultiReader and you are good for
> 1.4B Document"?
>
> Imo, you must distribute your index across many machines.
>
> Your best chance is to look at solr cloud and solr replication (solr Wiki
> is
> your friend). Of course, you can do it yourself, but building distributet
> setup with what you call "real time updates" is a huge pain.
>
> Alternatively, google for lucene or solr on cassandra (has some very nice
> properties about update latency and architectural simplicity).I do not know
> if this is somewhere in production.
>
> Good luck,
> e.
>
>
>
> On Mon, Nov 22, 2010 at 5:18 PM, Uwe Schindler  wrote:
>
> > There is no reason to use MultiSearcher instead the much more consistent
> > and effective  MultiReader! We (Robert and me) are already planning to
> > deprecate it. MultiSearcher itsself has no benefit over a simple
> > IndexSearcher on top of a MultiReader since Lucene 2.9, it has only
> > problems.
> >
> > Use cases for real MultiSearchers are only the subclasses for "remote
> > search" or (perhaps) multi-threaded search, but the latter I would not
> > recommend (instead let the additional CPUs in your machine be free for
> other
> > users doing searches in parallel). Multithreading a single search should
> not
> > be done, as it slows down multiple users accessing the same index at the
> > same time. Spend the additional CPU power for other things like warming
> > searchers, indexing additional documents, or filling FieldCache in
> parallel.
> >
> > Uwe
> >
> > -
> > Uwe Schindler
> > H.-H.-Meier-Allee 63, D-28213 Bremen
> > http://www.thetaphi.de
> > eMail: u...@thetaphi.de
> >
> >
> > > -Original Message-
> > > From: David Fertig [mailto:dfer...@cymfony.com]
> > > Sent: Monday, November 22, 2010 4:54 PM
> > > To: java-user@lucene.apache.org
> > > Subject: RE: best practice: 1.4 billions documents
> > >
> > > >> We have a couple billion docs in our archives as well...Breaking
> them
> > > >> up by day worked well for us
> > >
> > > We do not have 2 billion segments in one index  We have roughly 5-10
> > million
> > > documents per index. We are currently using a miltisearcher but
> > unresolved
> > > lucene issues in this will force us to move to a multireader.
> > >
> > > As far as the parallel searcher goes, read back on the thread with
> > subject
> > > "Search returning documents matching a NOT range".
> > > There is an acknowledged/proven bug with a small unit test, but there
> is
> > some
> > > disagreement about the internal reasons it fails. I have been unable to
> > > generate further discussion or a resolution. This was supposed to be
> > added as a
> > > bug to the JIRA for the 3.3 release, but has not been.  I am not which
> > class Solr
> > > uses, but if it uses MultiSearcher, it will have the same bug.
> > >
> > > -Original Message-
> > > From: Luca Rondanini [mailto:luca.rondan...@gmail.com]
> > > Sent: Monday, November 22, 2010 1:47 AM
> > > To: java-user@lucene.apache.org
> > > Subject: Re: best practice: 1.4 billions documents
> > >
> > > Hi David, thanks for your answer. it really helped a lot! so, you have
> an
> > index
> > > with more than 2 billions segments. this is pretty much the answer I
> was
> > > searching for: lucene alone is able to manage such a big index.
> > >
> > > which kind of problems do you have with the parallel searchers? I'm
> going
> > to
> > > build my index in the next couple of weeks if you want we can confront
> > our
> > > data
> > >
> > > thanks again
> > > Luca
> > >
> > >
> > > On Sun, Nov 21, 2010 at 6:22 PM, David Fertig 
> > wrote:
> > >
> > > > Actually I've been bitten by an still-unresolved issue with the
> > > > parallel searchers and recommend a MultiReader instead.
> > > > We have a couple billion docs in our archives as well.  Breaking them
> > > > up by day worked well for us, but you'll need to do something.
> > > >
> > > > -Original Message-
> > > > From: Luca Rondani

RE: best practice: 1.4 billions documents

2010-11-22 Thread spring
> of course I will distribute my index over many machines: 
> store everything on
> one computer is just crazy, 1.4B docs is going to be an index 
> of almost 2T
> (in my case)

billion = 10^9 (giga) in English
billion = 10^12 (tera) in many other languages

2T docs would be 2,000,000,000,000 docs... ;)

AFAIK 2^31 - 1 docs (about 2.1 billion) is still the max for a single Lucene index, since doc IDs are Java ints.


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: best practice: 1.4 billions documents

2010-11-22 Thread Luca Rondanini
eheheheh,

1.4 billion documents = 1,400,000,000 documents, for an index of almost 2T = 2
terabytes = 2,000 gigabytes on disk!





On Mon, Nov 22, 2010 at 10:16 AM,  wrote:

> > of course I will distribute my index over many machines:
> > store everything on
> > one computer is just crazy, 1.4B docs is going to be an index
> > of almost 2T
> > (in my case)
>
> billion = giga in english
> billion = tera in non-english
>
> 2T docs = 2.000.000.000.000 docs... ;)
>
> AFAIK 2 ^ 32 - 1 docs is still the max for a lucene instance.
>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>


ICUTokenizer and CJK

2010-11-22 Thread Burton-West, Tom
Hi all,

I see in the javadoc for the ICUTokenizer that it has special handling for Lao, 
Myanmar, and Khmer word breaking, but no details in the javadoc about what it 
does with CJK, which for C and J appears to be breaking into unigrams. Is this 
correct?
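
For what it's worth, this is how I have been checking it, assuming the contrib
ICU module is on the classpath and the 3.x/trunk attribute API - just dump the
tokens for a short CJK string:

import java.io.StringReader;

import org.apache.lucene.analysis.icu.segmentation.ICUTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Print the tokens ICUTokenizer emits for a short CJK sample to see whether
// Chinese/Japanese text comes out as single-character tokens.
public class DumpIcuTokens {
  public static void main(String[] args) throws Exception {
    ICUTokenizer ts = new ICUTokenizer(new StringReader("我购买了道具和服装"));
    CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
    ts.reset();
    while (ts.incrementToken()) {
      System.out.println(term.toString());
    }
    ts.end();
    ts.close();
  }
}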


Tom