Re: Scaling Lucene to 1bln docs

2010-08-16 Thread Danil ŢORIN
m [mailto:ansh...@gmail.com] > Sent: Wednesday, August 11, 2010 10:38 AM > To: java-user@lucene.apache.org > Subject: Re: Scaling Lucene to 1bln docs > > So, you didn't really use the setRamBuffer.. ? > Any reasons for that? > > -- > Anshum Gupta > http://ai-cafe.blogspot.c

RE: Scaling Lucene to 1bln docs

2010-08-16 Thread Shelly_Singh
nfosys.com Phone: (M) 91 992 369 7200, (VoIP)2022978622 -Original Message- From: Anshum [mailto:ansh...@gmail.com] Sent: Wednesday, August 11, 2010 10:38 AM To: java-user@lucene.apache.org Subject: Re: Scaling Lucene to 1bln docs So, you didn't really use the setRamBuffer.. ? Any r

RE: Scaling Lucene to 1bln docs

2010-08-10 Thread Shelly_Singh
> -Original Message- > From: Pablo Mendes [mailto:pablomen...@gmail.com] > Sent: Tuesday, August 10, 2010 7:22 PM > To: java-user@lucene.apache.org > Subject: Re: Scaling Lucene to 1bln docs > > Shelly, > Do you mind sharing with the list the final settings you used for

Re: Scaling Lucene to 1bln docs

2010-08-10 Thread Anshum
- > From: Pablo Mendes [mailto:pablomen...@gmail.com] > Sent: Tuesday, August 10, 2010 7:22 PM > To: java-user@lucene.apache.org > Subject: Re: Scaling Lucene to 1bln docs > > Shelly, > Do you mind sharing with the list the final settings you used for your best > results?

RE: Scaling Lucene to 1bln docs

2010-08-10 Thread Shelly_Singh
compare with regular docs. -Original Message- From: Pablo Mendes [mailto:pablomen...@gmail.com] Sent: Tuesday, August 10, 2010 7:22 PM To: java-user@lucene.apache.org Subject: Re: Scaling Lucene to 1bln docs Shelly, Do you mind sharing with the list the final settings you used for your best

Re: Scaling Lucene to 1bln docs

2010-08-10 Thread Pablo Mendes
2nd Ed. It'll help you get the hang of a lot of things! :) > > -- > Anshum > http://blog.anshumgupta.net > > Sent from BlackBerry® > > -Original Message- > From: Shelly_Singh > Date: Tue, 10 Aug 2010 19:11:11 > To: java-user@lucene.apache.org > Reply

Re: Scaling Lucene to 1bln docs

2010-08-10 Thread anshum.gu...@naukri.com
2010 19:11:11 To: java-user@lucene.apache.org Reply-To: java-user@lucene.apache.org Subject: RE: Scaling Lucene to 1bln docs Hi folks, Thanks for the excellent support and guidance on my very first day on this mailing list... At the end of the day, I have very optimistic results. 100bln search in less tha

RE: Scaling Lucene to 1bln docs

2010-08-10 Thread Shelly_Singh
To: java-user@lucene.apache.org Subject: Re: Scaling Lucene to 1bln docs That won't work...if you'll have something like "A Basic Crazy Document E-something F-something G-something... you get the point" it will go to all shards so the whole point of shards will be compromi

RE: Scaling Lucene to 1bln docs

2010-08-10 Thread Shelly_Singh
anil ŢORIN [mailto:torin...@gmail.com] Sent: Tuesday, August 10, 2010 6:52 PM To: java-user@lucene.apache.org Subject: Re: Scaling Lucene to 1bln docs That won't work...if you'll have something like "A Basic Crazy Document E-something F-something G-something... you get the point" i

Re: Scaling Lucene to 1bln docs

2010-08-10 Thread Danil ŢORIN
of another option. > > Comments welcome. > > > -Original Message- > From: Danil ŢORIN [mailto:torin...@gmail.com] > Sent: Tuesday, August 10, 2010 6:11 PM > To: java-user@lucene.apache.org > Subject: Re: Scaling Lucene to 1bln docs > > I'd second t

RE: Scaling Lucene to 1bln docs

2010-08-10 Thread Shelly_Singh
le indices (one for each token). So, the same document may be indexed into Shard "A", "M", "N" and "D". I am not able to think of another option. Comments welcome. -Original Message- From: Danil ŢORIN [mailto:torin...@gmail.com] Sent: Tuesday, August

Re: Scaling Lucene to 1bln docs

2010-08-10 Thread prashant ullegaddi
efficient merging algorithm. > > -Original Message- > From: Dan OConnor [mailto:docon...@acquiremedia.com] > Sent: Tuesday, August 10, 2010 6:02 PM > To: java-user@lucene.apache.org > Subject: RE: Scaling Lucene to 1bln docs > > Shelly: > > You wouldn't

Re: Scaling Lucene to 1bln docs

2010-08-10 Thread Danil ŢORIN
sage- > From: Shelly_Singh [mailto:shelly_si...@infosys.com] > Sent: Tuesday, August 10, 2010 8:20 AM > To: java-user@lucene.apache.org > Subject: RE: Scaling Lucene to 1bln docs > > No sort. I will need relevance based on TF. If I shard, I will have to search > in all indi

RE: Scaling Lucene to 1bln docs

2010-08-10 Thread Shelly_Singh
. -Original Message- From: Dan OConnor [mailto:docon...@acquiremedia.com] Sent: Tuesday, August 10, 2010 6:02 PM To: java-user@lucene.apache.org Subject: RE: Scaling Lucene to 1bln docs Shelly: You wouldn't necessarily have to use a multisearcher. A suggested alternative is: - shard into 10 in
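The routing alternative Dan sketches (shard into N indices, send each document to exactly one shard) is the opposite of the token-first-letter scheme rejected earlier in the thread, where a single document could land in many shards. A minimal, self-contained sketch of the idea (the class and method names are hypothetical, not from the thread):

```java
// Hypothetical sketch: route each document to exactly one of N shards by a
// hash of its unique ID, so no document is duplicated across shards and each
// query fans out to all N smaller indices.
public class ShardRouter {
    private final int numShards;

    public ShardRouter(int numShards) {
        this.numShards = numShards;
    }

    /** Every document lands in exactly one shard, regardless of its terms. */
    public int shardFor(String docId) {
        // Mask the sign bit instead of Math.abs (which overflows on MIN_VALUE).
        return (docId.hashCode() & 0x7fffffff) % numShards;
    }
}
```

The cost is that every query must visit all shards, but each shard's index stays small enough to search (and warm) independently.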

RE: Scaling Lucene to 1bln docs

2010-08-10 Thread Shelly_Singh
...@gmail.com] Sent: Tuesday, August 10, 2010 5:59 PM To: java-user@lucene.apache.org Subject: Re: Scaling Lucene to 1bln docs Searching on all indices shouldn't be that bad an idea instead of searching a single huge index, especially considering you have a constraint on the usable memory

RE: Scaling Lucene to 1bln docs

2010-08-10 Thread Dan OConnor
ly_si...@infosys.com] Sent: Tuesday, August 10, 2010 8:20 AM To: java-user@lucene.apache.org Subject: RE: Scaling Lucene to 1bln docs No sort. I will need relevance based on TF. If I shard, I will have to search in all indices. -Original Message- From: anshum.gu...@naukri.com [mailto:ansh...@gmai

Re: Scaling Lucene to 1bln docs

2010-08-10 Thread Anshum
> -Original Message- > From: anshum.gu...@naukri.com [mailto:ansh...@gmail.com] > Sent: Tuesday, August 10, 2010 1:54 PM > To: java-user@lucene.apache.org > Subject: Re: Scaling Lucene to 1bln docs > > Would like to know, are you using a particular type of sort? Do you need to

RE: Scaling Lucene to 1bln docs

2010-08-10 Thread Shelly_Singh
No sort. I will need relevance based on TF. If I shard, I will have to search in all indices. -Original Message- From: anshum.gu...@naukri.com [mailto:ansh...@gmail.com] Sent: Tuesday, August 10, 2010 1:54 PM To: java-user@lucene.apache.org Subject: Re: Scaling Lucene to 1bln docs Would

Re: Scaling Lucene to 1bln docs

2010-08-10 Thread findbestopensource
me but the search > time is highly unacceptable. > > Help again. > > -Original Message- > From: Anshum [mailto:ansh...@gmail.com] > Sent: Tuesday, August 10, 2010 12:55 PM > To: java-user@lucene.apache.org > Subject: Re: Scaling Lucene to 1bln docs > > Hi Shelly, > That se

Re: Scaling Lucene to 1bln docs

2010-08-10 Thread Michael McCandless
o: java-user@lucene.apache.org > Reply-To: java-user@lucene.apache.org > Subject: RE: Scaling Lucene to 1bln docs > > Hi Anshum, > > I am already running with the 'setCompoundFile' option off. > And thanks for pointing out mergeFactor. I had tried a higher mergeFa

Re: Scaling Lucene to 1bln docs

2010-08-10 Thread anshum.gu...@naukri.com
, 10 Aug 2010 13:31:38 To: java-user@lucene.apache.org Reply-To: java-user@lucene.apache.org Subject: RE: Scaling Lucene to 1bln docs Hi Anshum, I am already running with the 'setCompoundFile' option off. And thanks for pointing out mergeFactor. I had tried a higher mergeFactor coup

RE: Scaling Lucene to 1bln docs

2010-08-10 Thread Shelly_Singh
multisearcher for searching. Will that help? -Original Message- From: Danil ŢORIN [mailto:torin...@gmail.com] Sent: Tuesday, August 10, 2010 1:06 PM To: java-user@lucene.apache.org Subject: Re: Scaling Lucene to 1bln docs The problem actually won't be the indexing part. Searching such

RE: Scaling Lucene to 1bln docs

2010-08-10 Thread Shelly_Singh
java-user@lucene.apache.org Subject: Re: Scaling Lucene to 1bln docs Hi Shelly, That seems like a reasonable data set size. I'd suggest you increase your mergeFactor as a mergeFactor of 10 says, you are only buffering 10 docs in memory before writing it to a file (and incurring I/O). You could actual

Re: Scaling Lucene to 1bln docs

2010-08-10 Thread Danil ŢORIN
The problem actually won't be the indexing part. Searching such large dataset will require a LOT of memory. If you'll need sorting or faceting on one of the fields, jvm will explode ;) Also GC times on large jvm heap are pretty disturbing (if you care about your search performance) So I'd advise

Re: Scaling Lucene to 1bln docs

2010-08-10 Thread Anshum
Hi Shelly, That seems like a reasonable data set size. I'd suggest you increase your mergeFactor as a mergeFactor of 10 says, you are only buffering 10 docs in memory before writing it to a file (and incurring I/O). You could actually flush by RAM usage instead of a Doc count. Turn off using the Co
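The tuning knobs mentioned in this reply (flush by RAM usage rather than doc count, a larger merge factor, compound files off) can be sketched against the Lucene 2.9/3.0-era API that was current for this 2010 thread. The path and values below are placeholders, and later Lucene versions moved these setters onto IndexWriterConfig and the merge policy:

```java
import java.io.File;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

// Sketch only: bulk-indexing settings in the spirit of this thread's advice.
public class TunedWriter {
    public static IndexWriter open(File indexDir) throws Exception {
        IndexWriter writer = new IndexWriter(
                FSDirectory.open(indexDir),
                new StandardAnalyzer(Version.LUCENE_30),
                true,                                  // create a new index
                IndexWriter.MaxFieldLength.UNLIMITED);
        writer.setUseCompoundFile(false); // faster indexing, more file handles
        writer.setMergeFactor(30);        // merge segments less often
        writer.setRAMBufferSizeMB(256);   // flush by RAM usage, not doc count
        return writer;
    }
}
```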

Re: Scaling out/up or a mix

2009-07-02 Thread Otis Gospodnetic
cores. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message > From: Marcus Herou > To: solr-u...@lucene.apache.org; java-user@lucene.apache.org > Sent: Wednesday, July 1, 2009 10:31:28 AM > Subject: Re: Scaling out/up or a mix > > Hi ag

Re: Scaling out/up or a mix

2009-07-01 Thread Marcus Herou
Hi, I agree that faceting might be the thing that defines this app. The app is mostly snappy during daytime since we optimize the index around 7.00 GMT. However faceting is never snappy. We sped things up a whole bunch by creating various "less cardinal" fields from the originating publishedDate w

Re: Scaling out/up or a mix

2009-07-01 Thread Toke Eskildsen
On Tue, 2009-06-30 at 22:59 +0200, Marcus Herou wrote: > The number of concurrent users today is insignificant but once we push > for the service we will get into trouble... I know that since even one > simple faceting query (which we will use to display trend graphs) can > take forever (talking abo

Re: Scaling out/up or a mix

2009-06-30 Thread Marcus Herou
Hi, like the sound of this. What I am not familiar with in terms of Lucene is how the index gets swapped in and out of memory. When it comes to database tables (non partitionable tables at least) I know that one should have enough memory to fit the entire index into memory to avoid file-sorts for

Re: Scaling out/up or a mix

2009-06-30 Thread Marcus Herou
Hi. The number of concurrent users today is insignificant but once we push for the service we will get into trouble... I know that since even one simple faceting query (which we will use to display trend graphs) can take forever (talking about SOLR btw). "Normal" Lucene queries (title:blah OR desc

Re: Scaling out/up or a mix

2009-06-30 Thread Andy Goodell
I have improved date-sorted searching performance pretty dramatically by replacing the two step "search then sort" operation with a one step "use the date as the score" algorithm. The main gotcha was making sure to not affect which results get counted as hits in boolean searches, but overall I onl
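The "use the date as the score" idea can be illustrated with a self-contained sketch (this is not Andy's actual code; the names and the toy matching logic are hypothetical). The point is that hit-ness is decided by the boolean match alone, while the date-derived score only decides ordering, so one scored pass replaces "search then sort":

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Sketch: fold each document's date into its score so a single scored pass
// returns newest-first. Each doc is {id, epochDay, text}.
public class DateAsScore {
    /** Map an epoch day to a score; later dates score higher. */
    public static float scoreFor(long epochDay) {
        return (float) epochDay;
    }

    /** Return matching ids ordered by descending date-score. */
    public static List<String> search(List<String[]> docs, String term) {
        List<String[]> hits = new ArrayList<String[]>();
        for (String[] d : docs) {
            if (d[2].contains(term)) {  // boolean match alone decides hit-ness
                hits.add(d);
            }
        }
        // ...and the date-derived score alone decides the order.
        hits.sort(Comparator.comparingDouble(
                (String[] d) -> scoreFor(Long.parseLong(d[1]))).reversed());
        List<String> ids = new ArrayList<String>();
        for (String[] d : hits) ids.add(d[0]);
        return ids;
    }
}
```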

RE: Scaling out/up or a mix

2009-06-30 Thread Uwe Schindler
> On Mon, 2009-06-29 at 09:47 +0200, Marcus Herou wrote: > > Index size(and growing): 16Gx8 = 128G > > Doc size (data): 20k > > Num docs: 90M > > Num users: Few hundred but most critical is that the admin staff which > is > > using the index all day long. > > Query types: Example: title:"Iphone" OR

RE: Scaling out/up or a mix

2009-06-30 Thread Toke Eskildsen
On Tue, 2009-06-30 at 11:29 +0200, Uwe Schindler wrote: > So the simple answer is always: > If 64 bit platform with lots of RAM, use MMapDirectory. Fair enough. That makes the RAM-focused solution much more scalable. My point still stands though, as Marcus is currently examining his hardware optio

Re: Scaling out/up or a mix

2009-06-30 Thread Toke Eskildsen
On Mon, 2009-06-29 at 09:47 +0200, Marcus Herou wrote: > Index size(and growing): 16Gx8 = 128G > Doc size (data): 20k > Num docs: 90M > Num users: Few hundred but most critical is that the admin staff which is > using the index all day long. > Query types: Example: title:"Iphone" OR description:"I

Re: Scaling out/up or a mix

2009-06-29 Thread Marcus Herou
uestion: Based on your findings what is the most challenging part to tune ? Sorting or querying or what else? //Marcus > > > > > > > - Original Message > > From: Marcus Herou > > To: java-user@lucene.apache.org > > Sent: Monday, 29 June, 2009 9:47:

Re: Scaling out/up or a mix

2009-06-29 Thread Toke Eskildsen
On Sat, 2009-06-27 at 00:00 +0200, Marcus Herou wrote: > We currently have about 90M documents and it is increasing rapidly so > getting into the G+ document range is not going to be too far away. We've performed fairly extensive tests regarding hardware for searches and some minor tests on hardwa

Re: Scaling out/up or a mix

2009-06-29 Thread eks dev
> From: Marcus Herou > To: java-user@lucene.apache.org > Sent: Monday, 29 June, 2009 9:47:13 > Subject: Re: Scaling out/up or a mix > > Thanks for the answer. > > Don't you think that part 1 of the email would give you a hint of nature of > the index ? > >

Re: Scaling out/up or a mix

2009-06-29 Thread Marcus Herou
Thanks for the answer. Don't you think that part 1 of the email would give you a hint of nature of the index ? Index size(and growing): 16Gx8 = 128G Doc size (data): 20k Num docs: 90M Num users: Few hundred but most critical is that the admin staff which is using the index all day long. Query typ

Re: Scaling out/up or a mix

2009-06-28 Thread Eric Bowman
There is no single answer -- this is always application specific. Without knowing anything about what you are doing: 1. disk i/o is probably the most critical. Go SSD or even RAM disk if you can, if performance is absolutely critical 2. Sometimes CPU can become an issue, but 8 cores is probably

Re: Scaling out/up or a mix

2009-06-28 Thread Marcus Herou
Hi. I think I need to be more specific. What I am trying to find out is if I should aim for: CPU (2x4 cores, 2.0-3.0Ghz)? or perhaps just 4 cores is enough. Fast disk IO: 8 disks, RAID1+0? or perhaps 2 disks is enough... RAM - if the index does not fit into RAM how much RAM should I then buy?

Re: Scaling

2008-07-18 Thread mark harwood
ntical top results to the global idf policy for the vast majority of searches. Cheers Mark - Original Message From: Karl Wettin <[EMAIL PROTECTED]> To: java-user@lucene.apache.org Sent: Friday, 18 July, 2008 2:33:29 PM Subject: Re: Scaling On 18 Jul 2008 at 09:49, Eric Bo

Re: Scaling

2008-07-18 Thread Jason Rutherglen
RMI search in Lucene uses Searchable's int[] docFreqs(Term[] terms) to obtain the docfreqs for all terms in the query from each server, which it then turns into a globalized Weight that is submitted to all the Searchables (servers). Look at MultiSearcher. This is fine for most systems even with
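The globalized-weight step can be sketched in a self-contained way (assumed shapes, not the actual MultiSearcher code): each shard reports its local docFreq per term, the sums give the global df, and every shard then scores with the same idf. The formula below is Lucene's classic DefaultSimilarity idf:

```java
// Sketch: derive one global idf per term from per-shard document frequencies,
// so scores from different shards are comparable.
public class GlobalIdf {
    /** Sum per-shard document frequencies for one term. */
    public static int globalDocFreq(int[] perShardDocFreqs) {
        int sum = 0;
        for (int df : perShardDocFreqs) sum += df;
        return sum;
    }

    /** Lucene's classic idf: 1 + ln(numDocs / (docFreq + 1)). */
    public static double idf(int globalDocFreq, int globalNumDocs) {
        return 1.0 + Math.log(globalNumDocs / (double) (globalDocFreq + 1));
    }
}
```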

Re: Scaling

2008-07-18 Thread Karl Wettin
On 18 Jul 2008 at 09:49, Eric Bowman wrote: One thing I have trouble understanding is how scoring works in this case. Does Lucene really "just work", or are there special things we have to do to make sure that the scores are coherent so we can actually decide which was the best match? What

Re: Scaling

2008-07-18 Thread Eric Bowman
Jason Rutherglen wrote: The scaling per machine should be linear. The overhead from the network is minimal because the Lucene object sizes are not impacting. Google mentions in one of their early white papers on scaling http://labs.google.com/papers/googlecluster-ieee.pdf that they have sub ind

Re: Scaling

2008-07-17 Thread Jason Rutherglen
The scaling per machine should be linear. The overhead from the network is minimal because the Lucene object sizes are not impacting. Google mentions in one of their early white papers on scaling http://labs.google.com/papers/googlecluster-ieee.pdf that they have sub indexes which are now popular

Re: Scaling

2008-07-16 Thread Glen Newton
A subset of your questions are answered (or at least examined) in my postings on multi-thread queries on a multiple-core single system: http://zzzoot.blogspot.com/2008/06/simultaneous-threaded-query-lucene.html http://zzzoot.blogspot.com/2008/06/lucene-concurrent-search-performance.html -Glen 200
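A fan-out in the spirit of those threaded-query experiments can be sketched with plain java.util.concurrent (the "shards" below are lists standing in for IndexSearchers; names are hypothetical): run the same query against several sub-indexes in parallel, one task per shard, then merge the per-shard results.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch: query N sub-indexes concurrently and merge the per-shard hit counts.
public class ParallelSearch {
    public static int totalHits(List<List<String>> shards, final String term) {
        ExecutorService pool = Executors.newFixedThreadPool(shards.size());
        try {
            List<Future<Integer>> futures = new ArrayList<Future<Integer>>();
            for (final List<String> shard : shards) {
                futures.add(pool.submit(new Callable<Integer>() {
                    public Integer call() {
                        int hits = 0;
                        for (String doc : shard) {
                            if (doc.contains(term)) hits++;
                        }
                        return hits;
                    }
                }));
            }
            int total = 0;
            for (Future<Integer> f : futures) {
                try {
                    total += f.get();  // merge each shard's result
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }
}
```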

Re: Scaling Lucene to 500 million documents - preferred architecture

2007-07-16 Thread Otis Gospodnetic
Hi Murali (redirecting to the more appropriate java-user list) Sounds doable. I'd go with FSDirectory (or even its memory mapped cousin) instead of RAMDirectory - let the OS cache Lucene indices. I'm looking at a search cluster with 3 times that many machines (but not as high-end as your 8 CP

Re: Scaling up to several machines with Lucene

2007-07-07 Thread Chun Wei Ho
Thanks for your comments and suggestions everyone :) It looks like the general trend is to be in favour of (2) splitting the frontend web application and the searching application. Solr looks a lot like what we would have liked, but unfortunately we finished our application a while before Solr initia

Re: Scaling up to several machines with Lucene

2007-06-28 Thread Chris Lu
Basically you need to separate your web app from your searching, for a scalable solution. Searching is a different concern. You can develop more kinds of search when new requirements come in. Technorati's way is very similar to one of DBSight's configurations. One machine is dedicated for indexing,

Re: Scaling up to several machines with Lucene

2007-06-28 Thread Grant Ingersoll
Hadoop is not designed for this type of scenario. Have a look at Solr (http://lucene.apache.org/solr), this is pretty much one of its main use cases. I think it will do what you need to do and will more than likely work w/ a minimum of configuration on your existing index (but don't hold

Re: Scaling up to several machines with Lucene

2007-06-28 Thread Mathieu Lecarme
Samuel LEMOINE wrote: > I'm acutely interested in this issue too, as I'm working on > distributed architecture of Lucene. I'm only at the very beginning of > my study so that I can't help you much, but Hadoop could maybe fit > your requirements. It's a sub-project of Lucene aiming to paralle

Re: Scaling up to several machines with Lucene

2007-06-28 Thread Samuel LEMOINE
Chun Wei Ho wrote: Hi, We are currently running a Tomcat web application serving searches over our Lucene index (10GB) on a single server machine (Dual 3GHz CPU, 4GB RAM). Due to performance issues and to scale up to handle more traffic/search requests, we are getting another server machine.

Re: Scaling up to several machines with Lucene

2007-06-28 Thread Mathieu Lecarme
Server One handles the website. Server Two is a light version of Tomcat which handles Lucene search. In front, a lighttpd which uses server two for /search, and server one for all other things. You can add Lucene servers with round robin in lighttpd with this scheme. Careful with fault tolerance and index