Amount of RAM needed to support a growing lucene index?

2007-08-12 Thread lucene user
Hi, Folks -

Two quick questions - need to size a server to run our new index.

If I have an index with 111k articles and 90 million words indexed, how much
RAM should I have to get really fast access speeds?

If I have an index with 290k articles and 234 million words indexed, how
much RAM should I have to get really fast access speeds?

Any other advice about sizing a server?

What other info do you need to have to help size the server?

Does it matter if the server has a 64 bit processor?

Speed of processor important? Speed of disks?

Thanks!


Re: Amount of RAM needed to support a growing lucene index?

2007-08-12 Thread karl wettin


On 12 Aug 2007, at 09:03, lucene user wrote:

> If I have an index with 111k articles and 90 million words indexed,
> how much RAM should I have to get really fast access speeds?
>
> If I have an index with 290k articles and 234 million words indexed,
> how much RAM should I have to get really fast access speeds?


Define really fast.

I say you need 1.3x as much RAM as the size of your FSDirectory to  
ensure that the file system cache is never flushed out. But it also  
depends on user load. Each thread consumes RAM and CPU.


In order to really find out, set up the benchmarker to run on your
index, and limit the amount of memory your file system cache and JVM
are allowed.
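
For illustration, a crude timing harness along those lines (a sketch
only: the index path, field name, and query string are assumptions, and
the API is the Lucene 2.x style of the day):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;

    public class CrudeQueryBenchmark {
        public static void main(String[] args) throws Exception {
            IndexSearcher searcher = new IndexSearcher("/path/to/index");
            Query q = new QueryParser("contents",
                    new StandardAnalyzer()).parse("some query");
            long start = System.currentTimeMillis();
            int runs = 1000;
            for (int i = 0; i < runs; i++) {
                Hits hits = searcher.search(q);
                hits.length();  // touch the result so the search really runs
            }
            long elapsed = System.currentTimeMillis() - start;
            System.out.println("avg ms/query: " + (double) elapsed / runs);
            searcher.close();
        }
    }

Run it against your real index with different -Xmx settings (and, if you
can, different OS cache limits) and watch how the latency moves.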



> Any other advice about sizing a server?
> What other info do you need to have to help size the server?


Sizing?


> Does it matter if the server has a 64 bit processor?


In a 64-bit environment a reference to an instance consumes twice as
much RAM as in a 32-bit environment. It should not affect a file-centric
Lucene store (Directory), but your OS and the application that uses
Lucene might consume somewhat more resources. Again, benchmark.



> Speed of processor important?


Yes.


> Speed of disks?


May or may not be interesting, depending on how much RAM you have.



--
karl





Re: Amount of RAM needed to support a growing lucene index?

2007-08-12 Thread lucene user
Thanks, Karl.

Do you know if 290k articles and 234 million words is a large Lucene
index or a medium one? Do people build them this big all the time?

Thanks!



Re: Amount of RAM needed to support a growing lucene index?

2007-08-12 Thread karl wettin


On 12 Aug 2007, at 14:01, lucene user wrote:

> Do you know if 290k articles and 234 million words is a large Lucene
> index or a medium one? Do people build them this big all the time?


If the calculator in my head works, you have roughly 300k documents at
4k of text each (234 million words / 290k documents is about 800 words
per document, and at ~5 characters per word that is roughly 4 KB).

I say your corpus is borderline small.


--
karl





Re: Amount of RAM needed to support a growing lucene index?

2007-08-12 Thread eks dev
300k documents is something I would consider very small. Anything under
10 million documents IMHO is small for Lucene (meaning, on commodity
hardware with 1 GB of RAM you should get well-under-a-second response
times).
The number of words is not all that important; much more important is
the number of unique words.
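
For what it's worth, counting the unique terms of an existing index is
cheap; a sketch, assuming a Lucene 2.x index at a hypothetical path:

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.TermEnum;

    public class CountUniqueTerms {
        public static void main(String[] args) throws Exception {
            IndexReader reader = IndexReader.open("/path/to/index");
            TermEnum terms = reader.terms();  // iterates all terms, all fields
            int unique = 0;
            while (terms.next()) {
                unique++;
            }
            System.out.println("unique terms: " + unique);
            terms.close();
            reader.close();
        }
    }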




Re: Range queries in Lucene - numerical or lexicographical

2007-08-12 Thread Erick Erickson
As has been discussed several times, Lucene is a string-only engine, and
has no native understanding of numerical values. You have to normalize
them for string searches. See NumberTools.
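
For example (a minimal sketch; the field name and values are
hypothetical), NumberTools pads longs so that lexicographic order
matches numeric order:

    import org.apache.lucene.document.NumberTools;

    public class NumberToolsDemo {
        public static void main(String[] args) {
            String two = NumberTools.longToString(2L);
            String thirty = NumberTools.longToString(30L);
            // The padded forms compare correctly as strings: 2 sorts below 30
            System.out.println(two.compareTo(thirty) < 0);  // prints true
            // Index the padded form and build range queries against it, e.g.
            // price:[longToString(10) TO longToString(30)]
        }
    }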

Best
Erick

On 8/11/07, Nilesh Bansal <[EMAIL PROTECTED]> wrote:
>
> Hi all,
>
> Lucene query parser syntax page
> (http://lucene.apache.org/java/docs/queryparsersyntax.html) provides
> the following two examples of range query:
> mod_date:[20020101 TO 20030101]
> and
> title:{Aida TO Carmen}
>
> Now my question is: numerically 10 is greater than 2, but in
> string-only comparison 2 is greater than 10. So if I search for
> field:[10 TO 30]
> will a document with field=2 be in the result or not?
>
> And if I search on a string field,
> field:[AA TO CC]
> will a document with field="B" be in the result or not?
>
> The semantics of ranges (numerical or lexicographical) is not clear
> from the documentation.
>
> thanks
> Nilesh
>
> --
> Nilesh Bansal.
> http://queens.db.toronto.edu/~nilesh/
>


Re: Indexing correctly?

2007-08-12 Thread Erick Erickson
Where are your source files and index? If they're somewhere
out there on the network, you may be having some slowdown
because of network latency (the part about "/mount/." leads
me to ask this one).

If this is the case, you might get an improvement if all the files are
local...

Best
Erick

On 8/11/07, John Paul Sondag <[EMAIL PROTECTED]> wrote:
>
> It takes roughly 6 hours for me to index a Gig of data.  The benchmarks
> take
> quite a bit less if I'm reading it correctly.  I'll try out the
> StringBuffer/Builder and let you know.  Thanks for the quick response and
> if
> you have any more suggestions please let me know.
>
> --JP
>
> On 8/11/07, karl wettin <[EMAIL PROTECTED]> wrote:
> >
> > How much slower than anticipated is it?
> >
> > I would start by using a StringBuffer/Builder rather than appending
> > (immutable) strings to each other.
> >
> >
> > On 11 Aug 2007, at 19:05, John Paul Sondag wrote:
> >
> > > Hi,
> > >
> > > I was hoping that maybe you guys could see if I'm somehow indexing
> > > inefficiently.  I'm putting relevant parts of my code below.  I've
> > > looked at
> > > the "benchmarks" page on Lucene and my indexing time is taking a
> > > substantial
> > > amount of time more than what I see posted.  I'm not sure when I
> > > should call
> > > flush() (I saw that I should be doing that on the
> > > ImproveIndexingSpeed
> > > page).  I'd really appreciate any advice.
> > >
> > > Here's my code:
> > >
> > >   File directory = new File("/mounts/falcon5/disks/0/tcheng3/Dataset");
> > >   File[] theFiles = directory.listFiles();
> > >
> > >   // go through each file inside the directory and index it
> > >   for (int curFile = 0; curFile < theFiles.length; curFile++) {
> > >       File fin = theFiles[curFile];
> > >
> > >       // open up the file
> > >       FileInputStream inf = new FileInputStream(fin);
> > >       InputStreamReader isr = new InputStreamReader(inf, "US-ASCII");
> > >       BufferedReader in = new BufferedReader(isr);
> > >       String text = "";
> > >       String docid = "";
> > >
> > >       while (true) {
> > >           // read in the file one line at a time, and act accordingly
> > >           String line = in.readLine();
> > >           if (line == null) { break; }
> > >
> > >           // NB: the tag literals were eaten by the mail archiver;
> > >           // "<DOC>" / "</DOC>" below are reconstructed guesses
> > >           if (line.startsWith("<DOC>")) {
> > >               // get docID from the following line
> > >               line = in.readLine();
> > >               String tempStr = line.substring(8, line.length());
> > >               int pos = tempStr.indexOf(' ');
> > >               docid = tempStr.substring(0, pos);
> > >           } else if (line.startsWith("</DOC>")) {
> > >               Document doc = new Document();
> > >               doc.add(new Field("contents", text, Field.Store.NO,
> > >                   Field.Index.TOKENIZED,
> > >                   Field.TermVector.WITH_POSITIONS));
> > >               doc.add(new Field("DocID", docid, Field.Store.YES,
> > >                   Field.Index.NO));
> > >               writer.addDocument(doc);
> > >               text = "";
> > >           } else {
> > >               text = text + "\n" + line;
> > >           }
> > >       }
> > >   }
> > >
> > >
> > >   int numIndexed = writer.docCount();
> > >
> > >   writer.optimize();
> > >   writer.close();
> > >
> > >
> > > Thanks,
> > >
> > > --JP
> >
> >
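
To make Karl's suggestion concrete, a sketch of the accumulation loop
rewritten with a StringBuilder (not JP's actual code; Document, Field,
writer, and in are as in the snippet above, and isDocumentEnd is a
hypothetical stand-in for the end-of-document test). Repeated String
concatenation is quadratic in the total text size; append() is not:

    StringBuilder text = new StringBuilder();
    String line;
    while ((line = in.readLine()) != null) {
        if (isDocumentEnd(line)) {               // hypothetical helper
            Document doc = new Document();
            doc.add(new Field("contents", text.toString(), Field.Store.NO,
                    Field.Index.TOKENIZED, Field.TermVector.WITH_POSITIONS));
            writer.addDocument(doc);
            text.setLength(0);                   // reuse the builder
        } else {
            text.append('\n').append(line);
        }
    }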


RE: High CPU usage during index and search

2007-08-12 Thread Chew Yee Chuang
Hi testn,

I have tested Filter; it is pretty fast, but it still takes a lot of CPU
resource. Maybe that is due to the number of filters I run.
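
One way to attack the 20,000-query aggregation specifically (a sketch
under assumptions: the Lucene 2.x-era Filter.bits() API, hypothetical
field names and index path) is to cache one BitSet per field value and
intersect cached bits per report cell, instead of running a full search
per Gender x Department combination:

    import java.util.BitSet;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.QueryFilter;
    import org.apache.lucene.search.TermQuery;

    public class AggregateCounts {
        public static void main(String[] args) throws Exception {
            IndexReader reader = IndexReader.open("/path/to/index");
            // Build (and keep!) one BitSet per distinct field value.
            BitSet male = new QueryFilter(
                    new TermQuery(new Term("Gender", "M"))).bits(reader);
            BitSet accounting = new QueryFilter(
                    new TermQuery(new Term("Department", "Accounting")))
                    .bits(reader);
            // Each report cell is then a cheap in-memory intersection.
            BitSet cell = (BitSet) male.clone();
            cell.and(accounting);
            System.out.println("Gender:M AND Department:Accounting = "
                    + cell.cardinality());
            reader.close();
        }
    }

With one cached BitSet per distinct value, every combination becomes a
clone-and-AND rather than a search.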

Thank you
eChuang, Chew


-Original Message-
From: testn [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, August 07, 2007 10:37 PM
To: java-user@lucene.apache.org
Subject: RE: High CPU usage during index and search


Check out the Filter class. You can create a separate filter for each
field and then chain them together using ChainedFilter. If you cache the
filters, they will be pretty fast.


Chew Yee Chuang wrote:
> 
> Greetings,
> 
> Yes, processing a little bit and stopping for a while really reduces the
> CPU usage, but I need to find a balance so that the indexing or searching
> will not have so much delay.
> 
> Executing 20,000 queries at a time is necessary because the process is
> generating the aggregation data for reporting.
> E.g. Gender (M, F), Department (Accounting, R&D, Financial, etc.):
> 1Q - Gender:M AND Department: Accounting
> 2Q - Gender:M AND Department: R&D
> 3Q - Gender:M AND Department: Financial
> 4Q - Gender:F AND Department: Accounting
> 5Q - 
> Thus, the more combinations, the more queries need to run. For now, I
> still can't get any idea on how to reduce this; I am just thinking maybe
> there is a different way to index so that I can get the counts easily.
> 
> Any help would be appreciated.
> 
> Thanks
> eChuang, Chew
> 
> -Original Message-
> From: karl wettin [mailto:[EMAIL PROTECTED] 
> Sent: Thursday, August 02, 2007 7:11 AM
> To: java-user@lucene.apache.org
> Subject: Re: High CPU usage during index and search
> 
> It sounds like you have a fairly busy system; perhaps 100% load on the
> process is not that strange, at least not during short periods of time.
> 
> A simpler solution would be to nice the process a little bit in order to
> give your background jobs some more time to think.
> 
> Running a profiler is still the best advice I can think of. It should
> clearly show you what is going on when you run out of CPU.
> 
> --  
> karl
> 
> On 1 Aug 2007, at 04:29, Chew Yee Chuang wrote:
> 
>> Hi,
>>
>> Thanks for the links provided; actually I had gone through those
>> articles when developing the index and search functions for my
>> application. I haven't tried a profiler yet, but I monitored the CPU
>> usage and noticed that whenever indexing or searching is performed,
>> the CPU usage rises to 100%. Below I will try to elaborate more on
>> what my application is doing and how I index and search.
>>
>> There are many concurrent processes running. First, the application
>> writes the records it receives into a text file, with tabs separating
>> the fields. The application points to a new file every 10 minutes and
>> starts writing to it, so every file contains only 10 minutes of
>> records, approximately 600,000 records per file. Then the indexing
>> process checks whether there is a text file to be indexed; if there
>> is, the thread wakes up and starts indexing.
>>
>> The indexing process will first add documents to a RAMDir, then later
>> add the RAMDir into the FSDir by calling addIndexesNoOptimize() when
>> there are 100,000 documents (32 fields per doc) in the RAMDir. Only
>> one IndexWriter (FSDir) is created, but a few IndexWriters (RAMDir)
>> are created during the whole process. Below is some configuration for
>> the IndexWriters I mentioned:
>>
>> IndexWriter (RAMDir)
>> - SimpleAnalyzer
>> - setMaxBufferedDocs(1)
>> - Field.Store.YES
>> - Field.Index.NO_NORMS
>>
>> IndexWriter (FSDir)
>> - SimpleAnalyzer
>> - setMergeFactor(20)
>> - addIndexesNoOptimize()
>>
>> As for the searching: there are many queries (20,000) run continuously
>> to generate the aggregate table for reporting purposes. All these
>> queries run in a nested loop, and only one Searcher is created. I
>> tried both a plain searcher and a filter; the filter gives me better
>> results, but both utilize lots of CPU resources.
>>
>> Hope this info will help and sorry for my bad English.
>>
>> Thanks
>> eChuang, Chew
>>
>> -Original Message-
>> From: karl wettin [mailto:[EMAIL PROTECTED]
>> Sent: Tuesday, July 31, 2007 5:54 PM
>> To: java-user@lucene.apache.org
>> Subject: Re: High CPU usage during index and search
>>
>>
>> On 31 Jul 2007, at 05:25, Chew Yee Chuang wrote:
>>> But I just noticed that when Lucene performs search or indexing,
>>> the CPU usage on my machine rises to 100%. Because of this issue,
>>> some of my other backend processes will eventually slow down. Just
>>> want to know, has anyone faced this problem before? And is there
>>> any idea on how to overcome it?
>>
>> Did you run a profiler to see what it is that consumes all the
>> resources?
>> It is very hard to guess based on the information you supplied. Start
>> here:
>>
>> http://wiki.apache.org/lucene-java/BasicsOfPerformance
>> http://wiki.apache.org/lucene-java/ImproveIndexingSpeed
>> http://wiki.apache.org/lucene-java/Impr

Nested concept fields

2007-08-12 Thread Jeff French

I'm trying to index concepts within a document, and search them within the
context of a multivalued field. I'm not even sure it's possible with the
QueryParser or QsolParser syntax. Does anyone know if it is / is not
possible? If not, is it conceptually possible using the Query API?

What I'd like to do: I'm currently indexing sentences as individual 'sent'
fields. I plan to create indices on other information I parse out of a
document (e.g. numerics, people names, company names). Suppose I call my
numeric index 'num'. Then I would like to do something like this (in search
pseudocode):

sent:(expired num[1 TO 5] "days ago")

I don't see how to do this using either Lucene's QueryParser or the
QsolParser. Is it possible to do it using the Query API (and the appropriate
indexing changes)?

Thanks for any pointers.

Jeff



performance on filtering against thousands of different publications

2007-08-12 Thread Cedric Ho
Hi all,

My problem is as follows:

Our documents each comes from a different publication. And we
currently have > 5000 different publication sources.

Our clients can arbitrarily choose a subset of the publications when
performing a search. It is not uncommon that a search has to match
hundreds or thousands of publications.

I currently index the publication information as a field in each
document and use a TermsFilter when performing a search. However, the
performance is less than satisfactory: many simple searches take more
than 2-3 seconds (our goal: < 0.5 seconds).

Using the CachingWrapperFilter is great for search speed. But I've done
some calculations and figured that it is basically impossible to cache
all combinations of publications, or even some common combinations.


Is there any other more effective way to do the filtering?

(I know that the slowness is not purely due to the publication filter;
we also have some other things that slow down the search. But this one
definitely contributes quite a lot to the overall search time.)
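
For reference, a sketch of the setup described above (TermsFilter from
the contrib-queries jar; the field name, ids, searcher, and query are
assumptions):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.CachingWrapperFilter;
    import org.apache.lucene.search.Filter;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermsFilter;

    public class PublicationSearch {
        public static Hits search(IndexSearcher searcher, Query query,
                String[] selectedPublications) throws Exception {
            TermsFilter pubs = new TermsFilter();
            for (String id : selectedPublications) {  // may be thousands
                pubs.addTerm(new Term("publication", id));
            }
            // Caching only pays off if the exact same subset recurs.
            Filter cached = new CachingWrapperFilter(pubs);
            return searcher.search(query, cached);
        }
    }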

Regards,
Cedric




Re: Range queries in Lucene - numerical or lexicographical

2007-08-12 Thread Nilesh Bansal
Thanks. Probably this should be mentioned on the documentation page.

-Nilesh



-- 
Nilesh Bansal.
http://queens.db.toronto.edu/~nilesh/




Re: Range queries in Lucene - numerical or lexicographical

2007-08-12 Thread Mohammad Norouzi
Thanks Erick, but unfortunately NumberTools works only with the long
primitive type. I am wondering why you didn't put in some methods for
double and float.
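
In the meantime, a hedged workaround sketch (the helper below is mine,
not a Lucene API): map a double onto a long whose signed order matches
the double's numeric order, then reuse NumberTools for the padded string
form:

    import org.apache.lucene.document.NumberTools;

    public final class DoubleTools {
        public static String doubleToString(double d) {
            long bits = Double.doubleToLongBits(d);
            // Flip the value bits of negatives so that signed-long order
            // agrees with numeric double order.
            if (bits < 0) {
                bits ^= 0x7fffffffffffffffL;
            }
            return NumberTools.longToString(bits);
        }
    }

Index doubleToString(x) and range-query the padded strings just as with
longs.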





-- 
Regards,
Mohammad
--
see my blog: http://brainable.blogspot.com/
another in Persian: http://fekre-motefavet.blogspot.com/


Index file size limitation of 2GB

2007-08-12 Thread rohit saini
Hi all,

I have a bulk of data to be indexed, and the index may cross a file size
of 2GB. The Lucene FAQ says there will be problems if an index file
grows to 2GB, and it suggests making an index subdirectory in this case.
I tried this: I made a subdirectory inside the main index directory when
the index file size reached 2GB, but during search I don't get any
results from the index subdirectory. Do I need to search recursively? In
that case there will be more than one Hits object, so how do I combine
them and return a single result to the user? Please tell me.
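
One way to combine results from several index directories (a sketch; the
directory layout, field name, and query are assumptions) is
MultiSearcher, which merges the sub-searchers' results into a single
ranked Hits:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.MultiSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.Searchable;

    public class SearchAllParts {
        public static void main(String[] args) throws Exception {
            Searchable[] parts = {
                new IndexSearcher("/index/part0"),
                new IndexSearcher("/index/part1"),
            };
            MultiSearcher searcher = new MultiSearcher(parts);
            Query q = new QueryParser("contents",
                    new StandardAnalyzer()).parse("some query");
            Hits hits = searcher.search(q);  // one merged result set
            System.out.println("total hits: " + hits.length());
            searcher.close();
        }
    }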

Thanks & Regards,

Rohit


Re: Range queries in Lucene - numerical or lexicographical

2007-08-12 Thread Chris Hostetter

: Subject: Re: Range queries in Lucene - numerical or lexicographical
:
: Thanks. Probably this should be mentioned on the documentation page.

it does say right above the "date" example: "Sorting is done
lexicographically."

(Admittedly I'm not sure why the word "Sorting" is used in that sentence,
but it should make it clear that it's a lexicographical comparison.)

patches to improve documentation are always appreciated!




-Hoss

