Amount of RAM needed to support a growing lucene index?
Hi, folks - need to size a server to run our new index. Two quick questions:

If I have an index with 111k articles and 90 million words indexed, how much RAM should I have to get really fast access speeds?

If I have an index with 290k articles and 234 million words indexed, how much RAM should I have to get really fast access speeds?

Any other advice about sizing a server? What other info do you need to help size the server? Does it matter if the server has a 64-bit processor? Is processor speed important? Disk speed? Thanks!
Re: Amount of RAM needed to support a growing lucene index?
On 12 Aug 2007, at 09:03, lucene user wrote:

> If I have an index with 111k articles and 90 million words indexed,
> how much RAM should I have to get really fast access speeds?
>
> If I have an index with 290k articles and 234 million words indexed,
> how much RAM should I have to get really fast access speeds?

Define "really fast". I say you need 1.3x as much RAM as the size of your FSDirectory to ensure that the file system cache is never flushed out. But it also depends on user load: each thread consumes RAM and CPU.

To really find out, set up the benchmarker to run on your index, and limit the amount of memory your file system cache and JVM are allowed.

> Any other advice about sizing a server? What other info do you need
> to help size the server?

Sizing?

> Does it matter if the server has a 64-bit processor?

In a 64-bit environment a reference to an instance consumes twice as much RAM as in a 32-bit environment. That should not affect a file-centric Lucene store (Directory), though your OS and the application that uses Lucene might consume some more resources. Again, benchmark.

> Is processor speed important?

Yes.

> Disk speed?

May or may not be interesting, depending on how much RAM you have.

--
karl
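[Editor's note: one way to follow Karl's advice about limiting the JVM's memory is simply to cap the heap when launching the benchmark. A hypothetical invocation - the class and jar names are placeholders, not from this thread:

    java -Xmx128m -cp lucene-core.jar:mybench.jar MyBenchmark /path/to/index

Repeating the run at several -Xmx settings shows where response times start to degrade.]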
Re: Amount of RAM needed to support a growing lucene index?
Thanks, Karl. Do you know if 290k articles and 234 million words is a large Lucene index or a medium one? Do people build them this big all the time? Thanks!

On 8/12/07, karl wettin <[EMAIL PROTECTED]> wrote:
> [snip]
Re: Amount of RAM needed to support a growing lucene index?
On 12 Aug 2007, at 14:01, lucene user wrote:

> Do you know if 290k articles and 234 million words is a large Lucene
> index or a medium one? Do people build them this big all the time?

If the calculator in my head works, you have 300k documents at 4k of text each. I say your corpus is borderline small.

--
karl
Re: Amount of RAM needed to support a growing lucene index?
300k documents is something I would consider very small. Anything under 10 million documents is, IMHO, small for Lucene (meaning that commodity hardware with 1G of RAM should give you well under one-second response times). The number of words is not all that important; much more important would be the number of unique words.

----- Original Message -----
From: lucene user <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Sunday, 12 August, 2007 2:01:28 PM
Subject: Re: Amount of RAM needed to support a growing lucene index?
> [snip]
Re: Range queries in Lucene - numerical or lexicographical
As has been discussed several times, Lucene is a string-only engine and has no native understanding of numerical values. You have to normalize them for string searches. See NumberTools.

Best
Erick

On 8/11/07, Nilesh Bansal <[EMAIL PROTECTED]> wrote:
>
> Hi all,
>
> The Lucene query parser syntax page
> (http://lucene.apache.org/java/docs/queryparsersyntax.html) provides
> the following two examples of range query:
> mod_date:[20020101 TO 20030101]
> and
> title:{Aida TO Carmen}
>
> Now my question is: numerically 10 is greater than 2, but in a
> string-only comparison "2" is greater than "10". So if I search for
> field:[10 TO 30]
> will a document with field=2 be in the result or not?
>
> And if I search a string field with
> field:[AA TO CC]
> will a document with field="B" be in the result or not?
>
> The semantics of the range (numerical or lexicographical) are not
> clear from the documentation.
>
> thanks
> Nilesh
>
> --
> Nilesh Bansal.
> http://queens.db.toronto.edu/~nilesh/
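[Editor's note: a minimal sketch of the normalization Erick describes, using Lucene 2.x's NumberTools; the field name "price" and the values are made up for illustration:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.NumberTools;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.RangeQuery;

    // At index time, store the number as a fixed-width, sortable string.
    Document doc = new Document();
    doc.add(new Field("price", NumberTools.longToString(2L),
                      Field.Store.YES, Field.Index.UN_TOKENIZED));

    // At search time, normalize both range endpoints the same way, so
    // the lexicographic comparison matches the numeric one.
    RangeQuery query = new RangeQuery(
        new Term("price", NumberTools.longToString(10L)),
        new Term("price", NumberTools.longToString(30L)),
        true); // inclusive

With this encoding, a document with field=2 correctly falls outside [10 TO 30].]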
Re: Indexing correctly?
Where are your source files and index? If they're somewhere out there on the network, you may be having some slowdown because of network latency (the part about "/mount/." leads me to ask this one). If this is the case, you might get an improvement if all the files are local... Best Erick On 8/11/07, John Paul Sondag <[EMAIL PROTECTED]> wrote: > > It takes roughly 6 hours for me to index a Gig of data. The benchmarks > take > quite a bit less if I'm reading it correctly. I'll try out the > StringBuffer/Builder and let you know. Thanks for the quick response and > if > you have any more suggestions please let me know. > > --JP > > On 8/11/07, karl wettin <[EMAIL PROTECTED]> wrote: > > > > How much slower than anticipated is it? > > > > I would start by using a StringBuffer/Builder rather than appending > > (immutable) strings to each other. > > > > > > 11 aug 2007 kl. 19.05 skrev John Paul Sondag: > > > > > Hi, > > > > > > I was hoping that maybe you guys could see if I'm somehow indexing > > > inefficiently. I'm putting relevant parts of my code below. I've > > > looked at > > > the "benchmarks" page on Lucene and my indexing time is taking a > > > substantial > > > amount of time more than what I see posted. I'm not sure when I > > > should call > > > flush() ( I saw that I should be doing that on the > > > ImproveIndexingSpeed > > > page). I'd really appreciate any advice. > > > > > > Here's my code: > > > > > > File directory = new File( "/mounts/falcon5/disks/0/tcheng3/ > > > Dataset"); > > > File[] theFiles = directory.listFiles(); > > > > > > //go through each file inside the directory and index it > > > for(int curFile = 0; curFile < theFiles.length; curFile++) > > > { > > > File fin=theFiles[curFile]; > > > > > > //open up the file > > > FileInputStream inf = new FileInputStream(fin); > > > InputStreamReader isr = new InputStreamReader(inf, > > > "US-ASCII"); > > > BufferedReader in = new BufferedReader(isr); > > > String text=""; > > > String docid=""; > > > > > > while (true) { > > > > > > //read in the file one line at a time, and act accordingly > > > String line = in.readLine(); > > > if (line == null) { break;} > > > > > > if (line.startsWith("") ) { > > > //get docID > > > line = in.readLine(); > > > String tempStr = line.substring(8,line.length()); > > > int pos = tempStr.indexOf(' '); > > > docid = tempStr.substring(0,pos); > > > }else if (line.startsWith("")) { > > > > > > Document doc = new Document(); > > > > > > doc.add(new Field("contents",text, > > > Field.Store.NO, > > > Field.Index.TOKENIZED, Field.TermVector.WITH_POSITIONS )); > > > doc.add(new Field("DocID",docid, Field.Store.YES, > > > Field.Index.NO)); > > > writer.addDocument(doc); > > > text=""; > > > } else { > > > text = text + "\n" + line; > > > } > > > } > > > > > > } > > > > > > > > > int numIndexed = writer.docCount(); > > > > > > writer.optimize(); > > > writer.close(); > > > > > > > > > Thanks, > > > > > > --JP > > > > > > - > > To unsubscribe, e-mail: [EMAIL PROTECTED] > > For additional commands, e-mail: [EMAIL PROTECTED] > > > > >
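[Editor's note: a minimal sketch of the StringBuffer/StringBuilder change Karl suggested earlier in this thread, slotted into JP's read loop. The isEndOfDocument() helper is a hypothetical stand-in for JP's marker checks, which were garbled in the archive:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    // Accumulate lines in a StringBuilder: each `text = text + "\n" + line`
    // copies the whole accumulated string, so cost grows quadratically with
    // document length, while append() stays amortized linear.
    StringBuilder text = new StringBuilder();
    String line;
    while ((line = in.readLine()) != null) {
        if (isEndOfDocument(line)) { // stand-in for JP's marker check
            Document doc = new Document();
            doc.add(new Field("contents", text.toString(), Field.Store.NO,
                    Field.Index.TOKENIZED, Field.TermVector.WITH_POSITIONS));
            writer.addDocument(doc);
            text.setLength(0); // reset the buffer for the next document
        } else {
            text.append('\n').append(line);
        }
    }
]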
RE: High CPU usage during index and search
Hi testn, I have tested Filter; it is pretty fast but still takes a lot of CPU resource. Maybe that is due to the number of filters I run. Thank you. eChuang, Chew

-----Original Message-----
From: testn [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, August 07, 2007 10:37 PM
To: java-user@lucene.apache.org
Subject: RE: High CPU usage during index and search

Check out the Filter class. You can create a separate filter for each field and then chain them together using ChainFilter. If you cache the filters, it will be pretty fast.

Chew Yee Chuang wrote:
>
> Greetings,
>
> Yes, processing a little bit and then stopping for a while really reduces
> the CPU usage, but I need to find a balance so that indexing and searching
> are not delayed too much.
>
> Executing 20,000 queries at a time is because the process generates the
> aggregation data for reporting.
> E.g. Gender (M, F), Department (Accounting, R&D, Financial, ... etc):
> 1Q - Gender:M AND Department:Accounting
> 2Q - Gender:M AND Department:R&D
> 3Q - Gender:M AND Department:Financial
> 4Q - Gender:F AND Department:Accounting
> 5Q - ...
> Thus, the more combinations, the more queries need to run. For now I still
> can't get any idea of how to reduce this; I am just thinking maybe there is
> a different way to index the data so that I can get the aggregates easily.
>
> Any help would be appreciated.
>
> Thanks
> eChuang, Chew
>
> -----Original Message-----
> From: karl wettin [mailto:[EMAIL PROTECTED]]
> Sent: Thursday, August 02, 2007 7:11 AM
> To: java-user@lucene.apache.org
> Subject: Re: High CPU usage during index and search
>
> It sounds like you have a fairly busy system; perhaps 100% load on the
> process is not that strange, at least not during short periods of time.
>
> A simpler solution would be to nice the process a little bit in order to
> give your background jobs some more time to think.
>
> Running a profiler is still the best advice I can think of. It should
> clearly show you what is going on when you run out of CPU.
>
> --
> karl
>
> On 1 Aug 2007, at 04:29, Chew Yee Chuang wrote:
>
>> Hi,
>>
>> Thanks for the links provided; I actually went through those articles
>> while developing the index and search functions for my application. I
>> haven't tried a profiler yet, but I monitor the CPU usage and notice that
>> whenever indexing or searching is performed, the CPU usage rises to 100%.
>> Below I will try to elaborate more on what my application is doing and
>> how I index and search.
>>
>> There are many concurrent processes running. First, the application
>> writes the records it receives into a text file, with tabs separating the
>> fields. The application points to a new file every 10 minutes and starts
>> writing to it, so every file contains only 10 minutes' worth of records,
>> approximately 600,000 records per file. The indexing process then checks
>> whether there is a text file to be indexed; if there is, the thread wakes
>> up and starts indexing.
>>
>> The indexing process first adds documents to a RAMDir, then later adds
>> the RAMDir into the FSDir by calling addIndexesNoOptimize() when there
>> are 100,000 documents (32 fields per doc) in the RAMDir. Only one
>> IndexWriter (FSDir) is created, but a few IndexWriters (RAMDir) are
>> created during the whole process.
>>
>> Below are some configurations for the IndexWriters that I mentioned:
>>
>> IndexWriter (RAMDir)
>> - SimpleAnalyzer
>> - setMaxBufferedDocs(1)
>> - Field.Store.YES
>> - Field.Index.NO_NORMS
>>
>> IndexWriter (FSDir)
>> - SimpleAnalyzer
>> - setMergeFactor(20)
>> - addIndexesNoOptimize()
>>
>> For the searching, many queries (20,000) run continuously to generate the
>> aggregate table for reporting purposes. All these queries run in a nested
>> loop, and only one Searcher is created. I tried a plain searcher and a
>> filter as well; the filter gave me a better result, but both utilize a
>> lot of CPU resources.
>>
>> Hope this info helps, and sorry for my bad English.
>>
>> Thanks
>> eChuang, Chew
>>
>> -----Original Message-----
>> From: karl wettin [mailto:[EMAIL PROTECTED]]
>> Sent: Tuesday, July 31, 2007 5:54 PM
>> To: java-user@lucene.apache.org
>> Subject: Re: High CPU usage during index and search
>>
>> On 31 Jul 2007, at 05:25, Chew Yee Chuang wrote:
>>> But I just noticed that when Lucene performs search or index work, the
>>> CPU usage on my machine rises to 100%. Because of this issue, some of my
>>> other backend processes eventually slow down. I just want to know
>>> whether anyone has faced this problem before, and whether there is any
>>> idea on how to overcome it.
>>
>> Did you run a profiler to see what it is that consumes all the resources?
>> It is very hard to guess based on the information you supplied. Start
>> here:
>>
>> http://wiki.apache.org/lucene-java/BasicsOfPerformance
>> http://wiki.apache.org/lucene-java/ImproveIndexingSpeed
>> http://wiki.apache.org/lucene-java/ImproveSearchingSpeed
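[Editor's note: a rough sketch of testn's chained-filter suggestion, applied to the Gender/Department aggregation above. This is illustrative, not from the thread; it assumes Lucene 2.x class names (ChainedFilter ships in the contrib "miscellaneous" jar) and an existing `searcher`:

    import org.apache.lucene.index.Term;
    import org.apache.lucene.misc.ChainedFilter;
    import org.apache.lucene.search.CachingWrapperFilter;
    import org.apache.lucene.search.Filter;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.MatchAllDocsQuery;
    import org.apache.lucene.search.QueryFilter;
    import org.apache.lucene.search.TermQuery;

    // Build each single-value filter once and wrap it in a cache; the
    // expensive bitset computation then happens only on first use.
    Filter genderM = new CachingWrapperFilter(
        new QueryFilter(new TermQuery(new Term("Gender", "M"))));
    Filter deptAcc = new CachingWrapperFilter(
        new QueryFilter(new TermQuery(new Term("Department", "Accounting"))));

    // Intersect the two cached bitsets instead of re-running a full
    // BooleanQuery for every Gender x Department combination.
    Filter combined = new ChainedFilter(
        new Filter[] { genderM, deptAcc }, ChainedFilter.AND);

    Hits hits = searcher.search(new MatchAllDocsQuery(), combined);
    int cell = hits.length(); // aggregate count for (M, Accounting)

With two genders and N departments, only 2 + N filters need to be cached, while all 2xN combinations reuse them.]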
Nested concept fields
I'm trying to index concepts within a document and search them within the context of a multivalued field. I'm not even sure it's possible with the QueryParser or QsolParser syntax. Does anyone know whether it is or is not possible? If not, is it conceptually possible using the Query API?

What I'd like to do: I'm currently indexing sentences as individual 'sent' fields. I plan to create indices on other information I parse out of a document (e.g. numerics, people's names, company names). Suppose I call my numeric index 'num'. Then I would like to do something like this (in search pseudocode):

sent:(expired num:[1 TO 5] "days ago")

I don't see how to do this using either Lucene's QueryParser or the QsolParser. Is it possible using the Query API (and the appropriate indexing changes)?

Thanks for any pointers.

Jeff
performance on filtering against thousands of different publications
Hi all, my problem is as follows: our documents each come from a different publication, and we currently have > 5000 different publication sources. Our clients can choose an arbitrary subset of the publications when performing a search. It is not uncommon for a search to have to match hundreds or thousands of publications.

I currently index the publication information as a field in each document and use a TermsFilter when performing the search. However, the performance is less than satisfactory: many simple searches take more than 2-3 seconds (our goal: < 0.5 seconds).

Using CachingWrapperFilter is great for search speed, but I've done some calculations and figured that it is basically impossible to cache all combinations of publications, or even some common combinations. Is there any other more effective way to do the filtering?

(I know the slowness is not purely due to the publication filter; we also have some other things that slow down the search. But this one definitely contributes quite a lot to the overall search time.)

Regards,
Cedric
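[Editor's note: one direction that might help - a sketch, not advice given in this thread - is to cache one filter per publication rather than per combination, and OR the cached bitsets together per request with contrib's ChainedFilter. The field name "publication" and the class structure are assumptions:

    import java.util.HashMap;
    import java.util.Map;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.misc.ChainedFilter;
    import org.apache.lucene.search.CachingWrapperFilter;
    import org.apache.lucene.search.Filter;
    import org.apache.lucene.search.QueryFilter;
    import org.apache.lucene.search.TermQuery;

    public class PublicationFilterCache {
        // One cached filter per publication (~5000 entries), built lazily.
        private final Map filters = new HashMap();

        public synchronized Filter filterFor(String pub) {
            Filter f = (Filter) filters.get(pub);
            if (f == null) {
                f = new CachingWrapperFilter(new QueryFilter(
                        new TermQuery(new Term("publication", pub))));
                filters.put(pub, f);
            }
            return f;
        }

        // OR the cached per-publication bitsets; only the union is
        // computed per request, not the term lookups themselves.
        public Filter subsetFilter(String[] pubs) {
            Filter[] chain = new Filter[pubs.length];
            for (int i = 0; i < pubs.length; i++) {
                chain[i] = filterFor(pubs[i]);
            }
            return new ChainedFilter(chain, ChainedFilter.OR);
        }
    }

The per-publication bitsets are computed once per index reader, so a query over a thousand publications pays only for a thousand bitset unions rather than a thousand term scans.]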
Re: Range queries in Lucene - numerical or lexicographical
Thanks. Probably this should be mentioned on the documentation page.

-Nilesh

On 8/12/07, Erick Erickson <[EMAIL PROTECTED]> wrote:
> [snip]

--
Nilesh Bansal.
http://queens.db.toronto.edu/~nilesh/
Re: Range queries in Lucene - numerical or lexicographical
Thanks, Erick, but unfortunately NumberTools works only with the long primitive type. I am wondering why there is no equivalent method for double and float.

On 8/13/07, Nilesh Bansal <[EMAIL PROTECTED]> wrote:
> [snip]

--
Regards,
Mohammad
--
see my blog: http://brainable.blogspot.com/
another in Persian: http://fekre-motefavet.blogspot.com/
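[Editor's note: as of Lucene 2.x only the long variant ships, but a double can first be mapped to a long whose signed ordering matches the double's natural ordering, then encoded with NumberTools. A sketch of that bit-twiddling trick - an addition here, not part of the NumberTools API:

    import org.apache.lucene.document.NumberTools;

    public class DoubleTools {
        // Double.doubleToLongBits already orders non-negative doubles
        // correctly; negative doubles compare in reverse, so flip their
        // lower 63 bits to restore the ordering.
        public static String doubleToString(double d) {
            long bits = Double.doubleToLongBits(d);
            if (bits < 0) {
                bits ^= 0x7fffffffffffffffL;
            }
            return NumberTools.longToString(bits);
        }
    }

Index and query values through the same method, and lexicographic range queries then behave numerically for doubles as well.]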
Index file size limitation of 2GB
Hi all, I have a bulk of data to be indexed, and the index may cross a file size of 2GB. The Lucene FAQ says there will be problems if an index file grows past 2GB, and it suggests making index subdirectories in that case. I have tried to do so: I made an index subdirectory in the main index directory when the index file size reached 2GB, but during search I don't get any results from the index subdirectory. Do I need to search recursively? In that case there will be more than one Hits object; how do I combine them and return a single result to the user? Please tell me.

Thanks & Regards,
Rohit
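[Editor's note: one standard way to search several index subdirectories and hand the user a single result set is Lucene's MultiSearcher - a sketch, not a reply from this thread; the directory paths and `query` are placeholders:

    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.MultiSearcher;
    import org.apache.lucene.search.Searchable;

    // Open one IndexSearcher per index subdirectory.
    Searchable[] parts = new Searchable[] {
        new IndexSearcher("/data/index/part1"),
        new IndexSearcher("/data/index/part2"),
    };

    // MultiSearcher merges hits and scores across the sub-indexes, so
    // the caller sees one Hits object instead of one per directory.
    MultiSearcher searcher = new MultiSearcher(parts);
    Hits hits = searcher.search(query);
]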
Re: Range queries in Lucene - numerical or lexicographical
: Subject: Re: Range queries in Lucene - numerical or lexicographical
:
: Thanks. Probably this should be mentioned on the documentation page.

It does say, right above the "date" example: "Sorting is done lexicographically." (Admittedly, I'm not sure why the word "Sorting" is used in that sentence, but it should make it clear that it's a lexicographical comparison.)

Patches to improve documentation are always appreciated!

-Hoss