-Original Message-
From: Anshum [mailto:ansh...@gmail.com]
Sent: Wednesday, August 11, 2010 10:38 AM
To: java-user@lucene.apache.org
Subject: Re: Scaling Lucene to 1bln docs
So, you didn't really use the setRamBuffer.. ?
Any reasons for that?
-Original Message-
From: Pablo Mendes [mailto:pablomen...@gmail.com]
Sent: Tuesday, August 10, 2010 7:22 PM
To: java-user@lucene.apache.org
Subject: Re: Scaling Lucene to 1bln docs
Shelly,
Do you mind sharing with the list the final settings you used for your best results?
2nd Ed. It'll help you get the hang of a lot of things! :)
>
> --
> Anshum
> http://blog.anshumgupta.net
>
> Sent from BlackBerry®
-Original Message-
From: Shelly_Singh
Date: Tue, 10 Aug 2010 19:11:11
To: java-user@lucene.apache.org
Reply-To: java-user@lucene.apache.org
Subject: RE: Scaling Lucene to 1bln docs
Hi folks,
Thanks for the excellent support and guidance on my very first day on this
mailing list...
At the end of the day, I have very optimistic results. 100bln search in less than
-Original Message-
From: Danil ŢORIN [mailto:torin...@gmail.com]
Sent: Tuesday, August 10, 2010 6:52 PM
To: java-user@lucene.apache.org
Subject: Re: Scaling Lucene to 1bln docs
That won't work...if you'll have something like "A Basic Crazy
Document E-something F-something G-something...you get the point" it
will go to all shards so the whole point of shards will be
compromised.
...multiple indices (one for each token). So,
the same document may be indexed into Shard "A", "M", "N" and "D".
I am not able to think of another option.
Comments welcome.
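Shelly's per-token routing can be sketched in plain Java (the class and method names are illustrative, not from the thread): a document is assigned to one shard per distinct first letter among its tokens.

```java
import java.util.Set;
import java.util.TreeSet;

// Illustrative sketch of the per-token shard routing described above:
// one shard key per distinct first letter among the document's tokens.
class ShardRouter {
    // Returns the shard keys ("A".."Z") this document would be indexed into.
    static Set<String> shardsFor(String text) {
        Set<String> shards = new TreeSet<>();
        for (String token : text.split("\\s+")) {
            if (token.isEmpty()) continue;
            char c = Character.toUpperCase(token.charAt(0));
            if (c >= 'A' && c <= 'Z') shards.add(String.valueOf(c));
        }
        return shards;
    }

    public static void main(String[] args) {
        // A document with many distinct initials fans out to many shards --
        // exactly the objection Danil raises in this thread.
        System.out.println(shardsFor("A Basic Crazy Document")); // prints [A, B, C, D]
    }
}
```

The fan-out is the weakness: a multi-token document is written to (and must be searched in) every shard its initials touch.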
-Original Message-
From: Danil ŢORIN [mailto:torin...@gmail.com]
Sent: Tuesday, August 10, 2010 6:11 PM
To: java-user@lucene.apache.org
Subject: Re: Scaling Lucene to 1bln docs
I'd second t
-Original Message-
From: Dan OConnor [mailto:docon...@acquiremedia.com]
Sent: Tuesday, August 10, 2010 6:02 PM
To: java-user@lucene.apache.org
Subject: RE: Scaling Lucene to 1bln docs
Shelly:
You wouldn't necessarily have to use a multisearcher. A suggested alternative
is:
- shard into 10 indices
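Dan's list is cut off here, but the pattern it points at -- query each shard for its top hits, then merge the per-shard ranked lists into a global top-k -- is a standard k-way merge. A hedged sketch in plain Java (not code from the thread; it assumes each shard already returns hits sorted by descending score):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Illustrative sketch: merge per-shard top hits into a global top-k,
// assuming each shard's list is already sorted by descending score.
class HitMerger {
    static final class Hit {
        final String docId; final float score;
        Hit(String docId, float score) { this.docId = docId; this.score = score; }
    }

    // Classic k-way merge: one cursor per shard in a max-heap on score.
    static List<Hit> merge(List<List<Hit>> perShard, int k) {
        // Each int[] cursor holds {shardIndex, positionInShard}.
        PriorityQueue<int[]> heap = new PriorityQueue<>(
            Comparator.comparingDouble(c -> -perShard.get(c[0]).get(c[1]).score));
        for (int s = 0; s < perShard.size(); s++)
            if (!perShard.get(s).isEmpty()) heap.add(new int[]{s, 0});
        List<Hit> out = new ArrayList<>();
        while (!heap.isEmpty() && out.size() < k) {
            int[] c = heap.poll();
            List<Hit> shard = perShard.get(c[0]);
            out.add(shard.get(c[1]));
            if (c[1] + 1 < shard.size()) heap.add(new int[]{c[0], c[1] + 1});
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<Hit>> shards = new ArrayList<>();
        shards.add(Arrays.asList(new Hit("a", 0.9f), new Hit("b", 0.5f)));
        shards.add(Arrays.asList(new Hit("c", 0.7f)));
        for (Hit h : merge(shards, 3))
            System.out.println(h.docId + " " + h.score); // a 0.9, then c 0.7, then b 0.5
    }
}
```

Only the per-shard top-k ever crosses shard boundaries, which is what keeps the merge cheap relative to searching one huge index.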
...@gmail.com]
Sent: Tuesday, August 10, 2010 5:59 PM
To: java-user@lucene.apache.org
Subject: Re: Scaling Lucene to 1bln docs
Searching on all indices shouldn't be that bad an idea instead of searching
a single huge index, especially considering you have a constraint on the
usable memory
From: Shelly_Singh [mailto:shelly_si...@infosys.com]
Sent: Tuesday, August 10, 2010 8:20 AM
To: java-user@lucene.apache.org
Subject: RE: Scaling Lucene to 1bln docs
No sort. I will need relevance based on TF. If I shard, I will have to search
in all indices.
-Original Message-
From: anshum.gu...@naukri.com [mailto:ansh...@gmail.com]
Sent: Tuesday, August 10, 2010 1:54 PM
To: java-user@lucene.apache.org
Subject: Re: Scaling Lucene to 1bln docs
Would like to know, are you using a particular type of sort? Do you need to
me but the search
> time is highly unacceptable.
>
> Help again.
-Original Message-
From: Shelly_Singh
Date: Tue, 10 Aug 2010 13:31:38
To: java-user@lucene.apache.org
Reply-To: java-user@lucene.apache.org
Subject: RE: Scaling Lucene to 1bln docs
Hi Anshum,
I am already running with the 'setCompoundFile' option off.
And thanks for pointing out mergeFactor. I had tried a higher mergeFactor
coup
multisearcher for
searching. Will that help?
-Original Message-
From: Danil ŢORIN [mailto:torin...@gmail.com]
Sent: Tuesday, August 10, 2010 1:06 PM
To: java-user@lucene.apache.org
Subject: Re: Scaling Lucene to 1bln docs
The problem actually won't be the indexing part.
Searching such large dataset will require a LOT of memory.
If you'll need sorting or faceting on one of the fields, jvm will explode ;)
Also GC times on large jvm heap are pretty disturbing (if you care
about your search performance)
So I'd advise
-Original Message-
From: Anshum [mailto:ansh...@gmail.com]
Sent: Tuesday, August 10, 2010 12:55 PM
To: java-user@lucene.apache.org
Subject: Re: Scaling Lucene to 1bln docs
Hi Shelly,
That seems like a reasonable data set size. I'd suggest you increase your
mergeFactor as a mergeFactor of 10 says, you are only buffering 10 docs in
memory before writing it to a file (and incurring I/O). You could actually
flush by RAM usage instead of a Doc count. Turn off using the Compound File format.
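Anshum's advice maps onto the IndexWriter knobs of the Lucene 3.x era. This is a sketch only: the API shown is the old one from the time of this thread (later versions moved these setters onto IndexWriterConfig), and the path, RAM buffer size, and mergeFactor values are illustrative assumptions, not recommendations from the thread.

```java
import java.io.File;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

// Sketch of the indexing settings discussed above (Lucene 3.x-era API).
class TunedWriter {
    public static void main(String[] args) throws Exception {
        FSDirectory dir = FSDirectory.open(new File("/tmp/bigindex")); // path is illustrative
        IndexWriter writer = new IndexWriter(dir,
                new StandardAnalyzer(Version.LUCENE_30),
                IndexWriter.MaxFieldLength.UNLIMITED);
        writer.setRAMBufferSizeMB(256);   // flush by RAM usage, not doc count
        writer.setMergeFactor(30);        // fewer, larger merges during bulk indexing
        writer.setUseCompoundFile(false); // skip compound-file packing at index time
        // ... addDocument() loop ...
        writer.close();
    }
}
```

Flushing by RAM rather than doc count matters most when documents vary widely in size, since a fixed doc count either underfills or overflows the buffer.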
cores.
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
- Original Message
> From: Marcus Herou
> To: solr-u...@lucene.apache.org; java-user@lucene.apache.org
> Sent: Wednesday, July 1, 2009 10:31:28 AM
> Subject: Re: Scaling out/up or a mix
Hi, I agree that faceting might be the thing that defines this app. The app is
mostly snappy during daytime since we optimize the index around 7.00 GMT.
However faceting is never snappy.
We speeded things up a whole bunch by creating various "less cardinal"
fields from the originating publishedDate w
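Those "less cardinal" fields can be derived from publishedDate at index time. A stdlib-only sketch (the bucket granularities and the field names in the comments are illustrative assumptions):

```java
import java.text.SimpleDateFormat;
import java.util.Arrays;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

// Sketch: derive coarse year/month/day buckets from a publishedDate so
// faceting runs over a few thousand distinct terms instead of millions.
class DateBuckets {
    static String format(Date d, String pattern) {
        SimpleDateFormat f = new SimpleDateFormat(pattern, Locale.ROOT);
        f.setTimeZone(TimeZone.getTimeZone("UTC"));
        return f.format(d);
    }

    // Values you might index alongside the raw date field.
    static String[] buckets(Date publishedDate) {
        return new String[] {
            format(publishedDate, "yyyy"),     // e.g. publishedYear
            format(publishedDate, "yyyyMM"),   // e.g. publishedMonth
            format(publishedDate, "yyyyMMdd"), // e.g. publishedDay
        };
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(buckets(new Date(0L)))); // prints [1970, 197001, 19700101]
    }
}
```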
Hi, I like the sound of this.
What I am not familiar with in terms of Lucene is how the index gets
swapped in and out of memory. When it comes to database tables
(non-partitionable tables at least) I know that one should have enough memory to
fit the entire index into memory to avoid file-sorts for
Hi.
The number of concurrent users today is insignificant but once we push for
the service we will get into trouble... I know that since even one simple
faceting query (which we will use to display trend graphs) can take forever
(talking about Solr btw). "Normal" Lucene queries (title:blah OR
desc
I have improved date-sorted searching performance pretty dramatically by
replacing the two step "search then sort" operation with a one step "use the
date as the score" algorithm. The main gotcha was making sure to not affect
which results get counted as hits in boolean searches, but overall I onl
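One way to realize "use the date as the score" -- an illustrative sketch, not necessarily the poster's implementation -- is to map the document date to a float at day granularity. Day counts stay exact in float precision for any realistic date range, and the mapping is monotonic, so newer documents rank higher with no separate sort pass:

```java
// Sketch: date-as-score. Days-since-epoch as a float is monotonic in time,
// so ordering by this "score" is ordering by recency.
class DateScore {
    static final long MS_PER_DAY = 24L * 60 * 60 * 1000;

    static float score(long epochMillis) {
        // Day granularity: values stay small enough to be exact in a float.
        return (float) (epochMillis / MS_PER_DAY);
    }

    public static void main(String[] args) {
        System.out.println(score(10 * MS_PER_DAY)); // prints 10.0
    }
}
```

The gotcha the poster mentions still applies: the score substitution must not change which documents count as hits, only how the hits are ordered.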
On Tue, 2009-06-30 at 11:29 +0200, Uwe Schindler wrote:
> So the simple answer is always:
> If 64 bit platform with lots of RAM, use MMapDirectory.
Fair enough. That makes the RAM-focused solution much more scalable.
My point still stands though, as Marcus is currently examining his
hardware optio
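MMapDirectory scales here because it reads the index through OS memory mapping: the page cache (not the JVM heap) holds the hot data, and a 64-bit process has the address space to map large files. A stdlib-only sketch of that underlying mechanism -- java.nio mapping, not Lucene's API:

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

// Stdlib sketch of the mmap mechanism MMapDirectory builds on: the file is
// mapped into virtual address space and the OS page cache serves reads,
// which is why a 64-bit JVM (large address space) is the prerequisite.
class MmapDemo {
    static byte readFirstByte(File f) throws Exception {
        try (RandomAccessFile raf = new RandomAccessFile(f, "r");
             FileChannel ch = raf.getChannel()) {
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            return buf.get(0); // served via the page cache, no explicit read() call
        }
    }

    public static void main(String[] args) throws Exception {
        File f = File.createTempFile("mmap", ".bin");
        f.deleteOnExit();
        try (FileOutputStream out = new FileOutputStream(f)) {
            out.write(new byte[]{42, 7});
        }
        System.out.println(readFirstByte(f)); // prints 42
    }
}
```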
On Mon, 2009-06-29 at 09:47 +0200, Marcus Herou wrote:
> Index size(and growing): 16Gx8 = 128G
> Doc size (data): 20k
> Num docs: 90M
> Num users: Few hundred but most critical is that the admin staff which is
> using the index all day long.
> Query types: Example: title:"Iphone" OR description:"I
Question:
Based on your findings what is the most challenging part to tune? Sorting
or querying or what else?
//Marcus
On Sat, 2009-06-27 at 00:00 +0200, Marcus Herou wrote:
> We currently have about 90M documents and it is increasing rapidly so
> getting into the G+ document range is not going to be too far away.
We've performed fairly extensive tests regarding hardware for searches
and some minor tests on hardwa
Thanks for the answer.
Don't you think that part 1 of the email would give you a hint of the nature of
the index?
Index size(and growing): 16Gx8 = 128G
Doc size (data): 20k
Num docs: 90M
Num users: Few hundred but most critical is that the admin staff which is
using the index all day long.
Query typ
There is no single answer -- this is always application specific.
Without knowing anything about what you are doing:
1. disk i/o is probably the most critical. Go SSD or even RAM disk if
you can, if performance is absolutely critical
2. Sometimes CPU can become an issue, but 8 cores is probably
Hi. I think I need to be more specific.
What I am trying to find out is if I should aim for:
CPU (2x4 cores, 2.0-3.0GHz)? Or perhaps just 4 cores is enough.
Fast disk IO: 8 disks, RAID1+0 ? or perhaps 2 disks is enough...
RAM - if the index does not fit into RAM how much RAM should I then buy ?
...identical top results to the global idf policy for the vast majority of searches.
Cheers
Mark
- Original Message
From: Karl Wettin <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Friday, 18 July, 2008 2:33:29 PM
Subject: Re: Scaling
RMI search in Lucene uses Searchable.int[] docFreqs(Term[] terms) to obtain
the docfreqs for all terms in the query from each server. Which it then
turns into a globalized Weight that is submitted to all the Searchables
(servers). Look at MultiSearcher. This is fine for most systems even with
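Numerically, the globalization described here boils down to summing each term's docFreq across servers and plugging the totals into the classic Lucene idf formula, idf = 1 + ln(numDocs / (docFreq + 1)). An illustrative sketch (not MultiSearcher's actual code):

```java
// Sketch: global idf from per-server docFreqs, using the classic
// Lucene (DefaultSimilarity-style) formula idf = 1 + ln(N / (df + 1)).
class GlobalIdf {
    static int globalDocFreq(int[] perServerDocFreqs) {
        int sum = 0;
        for (int df : perServerDocFreqs) sum += df;
        return sum;
    }

    static float idf(int docFreq, int numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    public static void main(String[] args) {
        // Three servers report docFreqs for one term; every server then
        // weights with the same global idf, keeping scores comparable.
        int df = globalDocFreq(new int[]{3, 5, 2});
        System.out.println(df + " -> " + idf(df, 1000));
    }
}
```

Because every server scores with the same idf, results merged from different servers are directly comparable, which is what makes the merged ranking coherent.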
On 18 Jul 2008, at 09:49, Eric Bowman wrote:
One thing I have trouble understanding is how scoring works in this
case. Does Lucene really "just work", or are there special things
we have to do to make sure that the scores are coherent so we can
actually decide which was the best match? What
The scaling per machine should be linear. The overhead from the network is
minimal because the Lucene object sizes are not impacting. Google mentions
in one of their early white papers on scaling
http://labs.google.com/papers/googlecluster-ieee.pdf that they have sub
indexes which are now popular
A subset of your questions are answered (or at least examined) in my
postings on multi-thread queries on a multiple-core single system:
http://zzzoot.blogspot.com/2008/06/simultaneous-threaded-query-lucene.html
http://zzzoot.blogspot.com/2008/06/lucene-concurrent-search-performance.html
-Glen
Hi Murali (redirecting to the more appropriate java-user list)
Sounds doable. I'd go with FSDirectory (or even its memory mapped cousin)
instead of RAMDirectory - let the OS cache Lucene indices. I'm looking at a
search cluster with 3 times that many machines (but not as high-end as your 8
CP
Thanks for your comments and suggestions everyone :)
It looks like the general trend is to be in favour of (2) splitting
the frontend web application and the searching application.
Solr looks a lot like what we would have liked, but unfortunately we
finished our application a while before Solr initia
Basically you need to separate your web app from your searching, for a
scalable solution. Searching is a different concern. You can develop more
kinds of search when new requirements come in.
Technorati's way is very similar to one of DBSight's configurations. One
machine is dedicated for indexing,
Hadoop is not designed for this type of scenario.
Have a look at Solr (http://lucene.apache.org/solr), this is pretty
much one of its main use cases. I think it will do what you need to
do and will more than likely work with minimal configuration on
your existing index (but don't hold
Samuel LEMOINE wrote:
> I'm acutely interested by this issue too, as I'm working on
> distributed architecture of Lucene. I'm only at the very beginning of
> my study so that I can't help you much, but Hadoop maybe could fit
> your requirements. It's a sub-project of Lucene aiming to paralle
Chun Wei Ho wrote:
Hi,
We are currently running a Tomcat web application serving searches
over our Lucene index (10GB) on a single server machine (Dual 3GHz
CPU, 4GB RAM). Due to performance issues and to scale up to handle
more traffic/search requests, we are getting another server machine.
Server One handles the website.
Server Two is a light version of Tomcat which handles Lucene search.
In front, a lighttpd which uses server two for /search, and server one
for all other things.
You can add Lucene servers with round robin in lighttpd with this scheme.
Careful with fault tolerance and index