Re: Cache full text into memory
You have two options: 1. Store the compressed text as a stored field in Solr. 2. Use external caching. http://www.findbestopensource.com/tagged/distributed-caching You could use ehcache / Memcache / Membase. The problem with external caching is that you need to synchronize deletions and modifications. Fetching the stored field from Solr is also faster. Regards Aditya www.findbestopensource.com On Wed, Jul 14, 2010 at 12:08 PM, Li Li wrote: > I want to cache full text into memory to improve performance. > Full text is only used for highlighting in my application (but it's very > time consuming: my avg query time is about 250ms, and I guess it would cost > about 50ms if I just fetched the top 10 full texts. Things get worse when fetching > more full texts because on disk they are scattered everywhere for a query). > My full text per machine is about 200GB. The memory available for > storing full text is about 10GB, so I want to compress it in memory. > Suppose the compression ratio is 1:5; then I can load 1/4 of the full text into > memory. I need a Cache component for it. Has anyone faced this problem > before? I need some advice. Is it possible to use external tools such > as MemCached? Thank you. >
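Li Li's plan above (keep a slice of the corpus in memory as compressed text, within a fixed byte budget, evicting the rest) can be sketched as an LRU cache over deflate-compressed strings. This is a minimal pure-JDK sketch under my own assumptions — the class and method names are invented, and a real version would also need the deletion/modification synchronization Aditya warns about:

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

/** LRU cache keeping documents as deflate-compressed bytes within a byte budget. */
public class CompressedDocCache {
    private final long maxBytes;  // budget for compressed payloads
    private long usedBytes = 0;
    // accessOrder=true makes iteration order least-recently-used first
    private final LinkedHashMap<String, byte[]> map = new LinkedHashMap<>(16, 0.75f, true);

    public CompressedDocCache(long maxBytes) { this.maxBytes = maxBytes; }

    public synchronized void put(String id, String fullText) {
        byte[] compressed = compress(fullText.getBytes(StandardCharsets.UTF_8));
        byte[] old = map.put(id, compressed);
        if (old != null) usedBytes -= old.length;
        usedBytes += compressed.length;
        // evict least-recently-used entries until we are back under budget
        Iterator<Map.Entry<String, byte[]>> it = map.entrySet().iterator();
        while (usedBytes > maxBytes && it.hasNext()) {
            usedBytes -= it.next().getValue().length;
            it.remove();
        }
    }

    /** Returns the cached full text, or null on a cache miss. */
    public synchronized String get(String id) {
        byte[] c = map.get(id);
        return c == null ? null : new String(decompress(c), StandardCharsets.UTF_8);
    }

    static byte[] compress(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!deflater.finished()) out.write(buf, 0, deflater.deflate(buf));
        deflater.end();
        return out.toByteArray();
    }

    static byte[] decompress(byte[] data) {
        try {
            Inflater inflater = new Inflater();
            inflater.setInput(data);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            while (!inflater.finished()) out.write(buf, 0, inflater.inflate(buf));
            inflater.end();
            return out.toByteArray();
        } catch (DataFormatException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

On a cache miss the application would fall back to the Solr stored field; with the 1:5 ratio from the mail, a 10GB budget holds roughly 50GB of raw text, i.e. about a quarter of the 200GB corpus.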
Re: ShingleFilter failing with more terms than index phrase
Hi Steve, Thanks for your kind response. I checked PositionFilterFactory (re-indexed as well) but that also didn't solve the problem. Interestingly, the problem is not reproducible from Solr's Field Analysis page; it manifests only when it's in a query. I guess the subject for this post is not quite correct: it's not that ShingleFilter is failing, but that, using ShingleFilter, there is no score provided by the shingle field when I pass more terms than the indexed terms. I observe this using debugQuery. I had actually posted to solr-user but have received no response yet, probably because the problem is not clear at first glance. However, there's an example I have put in the mail for anyone interested to try out and check if there's a problem. Let's see if I receive any response. -Ethan On Tue, Jul 13, 2010 at 9:15 PM, Steven A Rowe wrote: > Hi Ethan, > > You'll probably get better answers about Solr specific stuff on the > solr-u...@a.l.o list. > > Check out PositionFilterFactory - it may address your issue: > > http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.PositionFilterFactory > > Steve > >> -Original Message- >> From: Ethan Collins [mailto:collins.eth...@gmail.com] >> Sent: Tuesday, July 13, 2010 3:42 AM >> To: java-user@lucene.apache.org >> Subject: ShingleFilter failing with more terms than index phrase >> >> I am using lucene 2.9.3 (via Solr 1.4.1) on windows and am trying to >> understand ShingleFilter. I wrote the following code and find that if I >> provide more words than the actual phrase indexed in the field, then the >> search on that field fails (no score found with debugQuery=true). >> >> Here is an example to reproduce, with field names: >> Id: 1 >> title_1: Nina Simone >> title_2: I put a spell on you >> >> Query (dismax) with: >> - “Nina Simone I put” <- Fails i.e. 
no score shown from title_1 search >> (using debugQuery) >> - “Nina Simone” <- SUCCESS >> >> But, when I used Solr’s Field Analysis with the ‘shingle’ field (given >> below) and tried “Nina Simone I put”, it succeeds. It’s only during the >> query that no score is provided. I also checked ‘parsedquery’ and it shows >> disjunctionMaxQuery issuing the string “Nina_Simone Simone_I I_put” to the >> title_1 field. >> >> title_1 and title_2 fields are of type ‘shingle’, defined as: >> >> > positionIncrementGap="100" indexed="true" stored="true"> >> >> >> >> > maxShingleSize="2" outputUnigrams="false"/> >> >> >> >> >> > maxShingleSize="2" outputUnigrams="false"/> >> >> >> >> Note that I also have a catchall field which is text. I have qf set >> to: 'id^2 catchall' and pf set to: 'title_1^1.5 title_2^1.2' >> >> If I am missing something or doing something wrong please let me know. >> >> -Ethan >> >> - >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org >> For additional commands, e-mail: java-user-h...@lucene.apache.org > > - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: Best practices for searcher memory usage?
On Tue, 2010-07-13 at 23:49 +0200, Christopher Condit wrote: > * 20 million documents [...] > * 140GB total index size > * Optimized into a single segment I take it that you do not have frequent updates? Have you tried to see if you can get by with more segments without significant slowdown? > The application will run with 10G of -Xmx but any less and it bails out. > It seems happier if we feed it 12GB. The searches are starting to bog > down a bit (5-10 seconds for some queries)... 10G sounds like a lot for that index. Two common memory-eaters are sorting by field value and faceting. Could you describe what you're doing in that regard? Similarly, the 5-10 seconds for some queries seems very slow. Could you give some examples of the queries that cause problems, together with some examples of fast queries and how long they take to execute? The standard silver bullet for an easy performance boost is to buy a couple of consumer grade SSDs and put them in the local machine. If you're gearing up to use more machines you might want to try this first. Regards, Toke - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: ShingleFilter failing with more terms than index phrase
Hi Steve, Thanks, wrapping with PositionFilter actually made the search and scoring work -- I had made a mistake while re-indexing last time. Trying to analyze PositionFilter: I didn't understand why the earlier search for 'Nina Simone I Put' failed, since at least the phrase 'Nina Simone' should have matched against the title_0 field. Any clue? I am also trying to understand the impact of PositionFilter on phrase search quality and score. Unfortunately there is not much literature/help to be found via Google. -Ethan - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: Best practices for searcher memory usage?
You can also set the termsIndexDivisor when opening the IndexReader. The terms index is an in-memory data structure and it can consume a LOT of RAM when your index has many unique terms. Flex (only on Lucene's trunk / next major release (4.0)) has reduced this RAM usage (as well as the RAM required when sorting by a string field with mostly ascii content) substantially -- see http://chbits.blogspot.com/2010/07/lucenes-ram-usage-for-searching.html Mike On Tue, Jul 13, 2010 at 6:09 PM, Paul Libbrecht wrote: > > > On 13-Jul-10 at 23:49, Christopher Condit wrote: > >> * are there performance optimizations that I haven't thought of? > > The first and most important one I'd think of is to get rid of NFS. > You can happily do a local copy which might, even for 10 Gb, take less than > 30 seconds at server start. > > paul > - > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org > > - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: ShingleFilter failing with more terms than index phrase
> Trying to analyze PositionFilter: didn't understand why earlier the > search of 'Nina Simone I Put' failed since atleast the phrase 'Nina > Simone' should have matched against title_0 field. Any clue? Please note that I have configured the ShingleFilter as bigrams without unigrams. [Honestly, I am still struggling to understand how this worked and why the earlier one didn't] -Ethan - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: How to create a fuzzy suggest
Hi, I had a similar need: to create something that acts not like a "filter" or "tokenizer" but only inserts self-generated tokens into the token stream (my purpose was to generate all kinds of word forms for German umlauts...). The following code base helped me a lot when creating it: http://207.44.206.178/message.jspa?messageID=91989#91991 The synonym filter also adds tokens into the token stream. regards, Alex On Wednesday 14 July 2010 01:11:02 Kai Weingärtner wrote: > Hello, > > > I am trying to create a suggest search (search results are displayed while > the user is entering the query) for names, but the search should also give > results if the given name just sounds like an indexed name. However a > perfect match should be ranked higher than a similar sounding match. > > > I looked at the SpellChecker contrib, but this AFAIK cannot handle > incomplete names (edge n-grams). > > > So I came up with this idea and it would be great if anyone could tell me > if that is sensible or if there is a better way: > > > I create an analyzer to be run on the full names, which does the following > - lowercase > - build edge n-grams > put these terms in the field (this would handle correctly spelled input) > > > - run soundex on the n-grams > put these soundexed n-grams in the field as well > > > The incoming query will then also run through this analyzer with an > or-default. So a correct spelling will match the normal n-grams plus the > soundexed n-grams leading to a good score. A misspelled name would still > match the soundexed n-grams, leading to a somewhat lower score. > > > My current problem is that I don't know how to duplicate the tokens in the > analyzer so I can add them as normal n-grams and soundexed n-grams. I > suppose the TeeSinkTokenFilter will get me there, but I could not figure > out how to add all tokens back in one stream. > > > To recap, my questions are: Could this approach work to create a "fuzzy > suggest"? 
How do I use the TeeSinkTokenFilter to separate and recombine the > tokenstream. > > > I hope that was clear, thanks for your help! > > > > Kai > > > > > Regelung im Bezug auf Paragraph 37a Absatz 4 HGB: WidasConcepts GmbH, > Geschaeftsfuehrer: Thomas Widmann und Christian Kappert, > Gerichtsstand Pforzheim, Registernummer: HRB 511442, > Umsatzsteueridentifikationsnummer: DE205851091 > > Diese E-Mail enthaelt vertrauliche und/oder rechtlich geschuetzte > Informationen. Wenn Sie nicht der richtige Adressat sind oder diese E-Mail > irrtuemlich erhalten haben, informieren Sie bitte sofort den Absender und > vernichten Sie diese Mail. Das unerlaubte Kopieren sowie die unbefugte > Weitergabe dieser Mail sind nicht gestattet. > > This e-mail may contain confidential and/or privileged information. > If you are not the intended recipient (or have received this e-mail in > error) please notify the sender immediately and destroy this e-mail. > Any unauthorized copying, disclosure or distribution of the material in > this e-mail is strictly forbidden. -- Alexander Rothenberg Fotofinder GmbH USt-IdNr. DE812854514 Software EntwicklungWeb: http://www.fotofinder.net/ Potsdamer Str. 96 Tel: +49 30 25792890 10785 BerlinFax: +49 30 257928999 Geschäftsführer:Ali Paczensky Amtsgericht:Berlin Charlottenburg (HRB 73099) Sitz: Berlin - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
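The first two steps of Kai's suggest analyzer (lowercase, then edge n-grams, with a phonetic form indexed alongside) can be illustrated in isolation. This is a pure-JDK sketch of the n-gram half only — not Lucene's EdgeNGram filter itself — and the soundex step would then be applied to the same tokens:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Lowercases a name and emits its edge n-grams (prefixes): the first two
 *  steps of the suggest analyzer described above. */
public class EdgeNgrams {
    static List<String> edgeNgrams(String term, int minGram, int maxGram) {
        String t = term.toLowerCase(Locale.ROOT);
        List<String> grams = new ArrayList<>();
        for (int n = minGram; n <= Math.min(maxGram, t.length()); n++) {
            grams.add(t.substring(0, n));  // prefix of length n
        }
        return grams;
    }
}
```

For "Nina" with grams 1..3 this yields [n, ni, nin], which is why a partially typed query like "Ni" can still match the indexed name.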
Best open source
Hello all, We have launched a new site, which provides the best open source products and libraries across all categories. This site is powered by Solr search. There are many open source products available in all categories and it is sometimes difficult to identify which is the best. The main problem in open source is that there are a lot of redundant products in a category and it is impossible for a user to try them all. We identify the best. As open source users, you might be using many open source products and libraries. It would be great if you helped us by adding information about the open source products you use. http://www.findbestopensource.com/addnew http://www.findbestopensource.com/ Regards Aditya
Out of memory problem in search
Hello Friends; Recently, I have had a problem with Lucene search - a memory problem arising because the index file is so big. (I have indexed some kinds of information and the index file's size is over 40 gigabytes.) I search the Lucene index with org.apache.lucene.search.Searcher.search(query, null, offset + limit, new Sort(new SortField("time", SortField.LONG, true))); (This returns the top (offset + limit) records.) I use searching by range. For example, in the web page I first search records which are in the [0, 100] range, then the second page [100, 200]. I have nearly 200,000 records in all. When I go to the last page, which means records between 199,900 and 200,000, there is a memory problem (I have 4GB RAM on the running machine) in the JVM (out of memory error). Is there a way to overcome this memory problem? Thanks -- ilkay POLAT Software Engineer TURKEY Gsm : (+90) 532 542 36 71 E-mail : ilkay_po...@yahoo.com
Re: Out of memory problem in search
Certainly it will. Either you need to increase your memory OR refine your query. Even though you display paginated results, the first couple of pages will display fine while pages towards the end may face problems. This is because 200,000 objects are created and iterated: 199,900 objects are skipped and the last 100 objects are returned. The memory is consumed in creating these objects. Regards Aditya www.findbestopensource.com On Wed, Jul 14, 2010 at 4:14 PM, ilkay polat wrote: > Hello Friends; > > Recently, I have problem with lucene search - memory problem on the basis > that indexed file is so big. (I have indexed some kinds of information and > this indexed file's size is nearly more than 40 gigabyte. ) > > I search the lucene indexed file with > org.apache.lucene.search.Searcher.search(query, null, offset + limit, new > Sort(new SortField("time", SortField.LONG, true))); > (This provides to find (offset + limit) records to back.) > > I use searching by range. For example, in web page I firstly search records > which are in [0, 100] range then second page [100, 200] > I have nearly 200,000 records at all. When I go to last page which means > records between 200,000 -100, 200,0, there is a memory problem(I have 4gb > ram on running machine) in jvm( out of memory error). > > Is there a way to overcome this memory problem? > > Thanks > > -- > ilkay POLAT Software Engineer > TURKEY > > Gsm : (+90) 532 542 36 71 > E-mail : ilkay_po...@yahoo.com > > >
RE: Out of memory problem in search
Reverse the query sorting to display the last page. - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de > -Original Message- > From: ilkay polat [mailto:ilkay_po...@yahoo.com] > Sent: Wednesday, July 14, 2010 12:44 PM > To: java-user@lucene.apache.org > Subject: Out of memory problem in search > > Hello Friends; > > Recently, I have problem with lucene search - memory problem on the basis > that indexed file is so big. (I have indexed some kinds of information and this > indexed file's size is nearly more than 40 gigabyte. ) > > I search the lucene indexed file with > org.apache.lucene.search.Searcher.search(query, null, offset + limit, new > Sort(new SortField("time", SortField.LONG, true))); (This provides to find > (offset + limit) records to back.) > > I use searching by range. For example, in web page I firstly search records > which are in [0, 100] range then second page [100, 200] I have nearly 200,000 > records at all. When I go to last page which means records between 200,000 - > 100, 200,0, there is a memory problem(I have 4gb ram on running machine) in > jvm( out of memory error). > > Is there a way to overcome this memory problem? > > Thanks > > -- > ilkay POLAT Software Engineer > TURKEY > > Gsm : (+90) 532 542 36 71 > E-mail : ilkay_po...@yahoo.com > > > - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
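Uwe's trick in concrete terms: to display a page near the end of N hits, flip the sort direction and fetch the mirrored page from the front, so the collector only keeps a small number of hits instead of nearly N. A sketch with invented names (the actual fetch would still go through the reversed Lucene Sort):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Helpers for the "reverse the sort to reach the last pages" trick. */
public class ReversePaging {
    /** Offset to request under the REVERSED sort for the page that starts
     *  at 'offset' under the original sort. */
    static int mirroredOffset(int totalHits, int offset, int pageSize) {
        return Math.max(0, totalHits - offset - pageSize);
    }

    /** The slice comes back in reversed order; flip it before display. */
    static <T> List<T> restoreOrder(List<T> reversedSlice) {
        List<T> copy = new ArrayList<>(reversedSlice);
        Collections.reverse(copy);
        return copy;
    }
}
```

For the last page of 200,000 hits (offset 199,900, page size 100) the mirrored offset is 0, so the reversed-sort search only has to collect 100 hits instead of 200,000.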
RE: Out of memory problem in search
Indeed, this is a good solution to that kind of problem. But the same problem can occur again in the future as logs are added to the index. For example, here 200,000 records cause the problem (these logs were collected over 13 days). With the reverse approach, the maximum search range becomes 100,000. But if there are 400,000 records the same problem will occur (the maximum search space is 200,000 again). Is there another way that does not consume so much memory - one that works within a fixed memory bound and spends time instead of memory? This restriction comes from our project's hardware limits (hardware memory is 8GB at maximum). --- On Wed, 7/14/10, Uwe Schindler wrote: From: Uwe Schindler Subject: RE: Out of memory problem in search To: java-user@lucene.apache.org Date: Wednesday, July 14, 2010, 3:25 PM Reverse the query sorting to display the last page. - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de > -Original Message- > From: ilkay polat [mailto:ilkay_po...@yahoo.com] > Sent: Wednesday, July 14, 2010 12:44 PM > To: java-user@lucene.apache.org > Subject: Out of memory problem in search > > Hello Friends; > > Recently, I have problem with lucene search - memory problem on the basis > that indexed file is so big. (I have indexed some kinds of information and this > indexed file's size is nearly more than 40 gigabyte. ) > > I search the lucene indexed file with > org.apache.lucene.search.Searcher.search(query, null, offset + limit, new > Sort(new SortField("time", SortField.LONG, true))); (This provides to find > (offset + limit) records to back.) > > I use searching by range. For example, in web page I firstly search records > which are in [0, 100] range then second page [100, 200] I have nearly 200,000 > records at all. When I go to last page which means records between 200,000 - > 100, 200,0, there is a memory problem(I have 4gb ram on running machine) in > jvm( out of memory error). > > Is there a way to overcome this memory problem? 
> > Thanks > > -- > ilkay POLAT Software Engineer > TURKEY > > Gsm : (+90) 532 542 36 71 > E-mail : ilkay_po...@yahoo.com > > > - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: Out of memory problem in search
Hi, We have hardware restrictions (max RAM is 8GB), so unfortunately increasing memory is not an option for us in today's situation. Yes, as you said, the problem appears when going to the last pages of the search screen, because the search method finds the top n records. In other words, this means "searching for everything returns everything". I am now researching whether there is a way that consumes time instead of memory in this search mechanism in Lucene. Any other ideas? Thanks --- On Wed, 7/14/10, findbestopensource wrote: From: findbestopensource Subject: Re: Out of memory problem in search To: java-user@lucene.apache.org Date: Wednesday, July 14, 2010, 2:59 PM Certainly it will. Either you need to increase your memory OR refine your query. Eventhough you display paginated result. The first couple of pages will display fine and going towards last may face problem. This is because, 200,000 objects is created and iterated, 190,900 objects are skipped and last100 objects are returned. The memory is consumed in creating these objects. Regards Aditya www.findbestopensource.com On Wed, Jul 14, 2010 at 4:14 PM, ilkay polat wrote: > Hello Friends; > > Recently, I have problem with lucene search - memory problem on the basis > that indexed file is so big. (I have indexed some kinds of information and > this indexed file's size is nearly more than 40 gigabyte. ) > > I search the lucene indexed file with > org.apache.lucene.search.Searcher.search(query, null, offset + limit, new > Sort(new SortField("time", SortField.LONG, true))); > (This provides to find (offset + limit) records to back.) > > I use searching by range. For example, in web page I firstly search records > which are in [0, 100] range then second page [100, 200] > I have nearly 200,000 records at all. When I go to last page which means > records between 200,000 -100, 200,0, there is a memory problem(I have 4gb > ram on running machine) in jvm( out of memory error). 
> > Is there a way to overcome this memory problem? > > Thanks > > -- > ilkay POLAT Software Engineer > TURKEY > > Gsm : (+90) 532 542 36 71 > E-mail : ilkay_po...@yahoo.com > > >
subset query: query filter or boolean query
Hi, I have 4 query search fields. Case 1: I use one search field to make a query filter and then use that query filter while searching on the other 3 fields, so as to reduce the subset of docs being searched. Case 2: I use all query parameters in a boolean query, so the whole index is searched. Which of the two approaches will give better performance? Or is there any other approach to do this? Also, can we use a subset of documents for searching? Let's say I have a hash map of P1 - 1, 2, 3, 4 P2 - 3, 4, 5 P3 - 7, 5, 3 Now I have documents in the Lucene index stored as 1 - P1 2 - P1 3 - P1, P2, P3 4 - P1, P2 5 - P2, P3 7 - P3 .. .. When I search docs with P2 I get 3, 4, 5. Now I want my search to be restricted to just docs 3, 4, 5 only, whereby I can search only these docs for further parameters. 1. How do I go about it? 2. Is there any other searching mechanism I should use, or is Lucene a better fit? 3. Should I keep my hash map in Lucene indexes too, and is there then a method to link it to the other Lucene indexes? regards, Suman
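Suman's subset idea is essentially what a Lucene filter does: resolve the cheap criterion once, then intersect the rest of the search with it (in Lucene this role is played by a cached filter such as QueryWrapperFilter wrapped in CachingWrapperFilter, which keeps a bitset of allowed doc ids). A pure-Java sketch using the hash-map data from the mail, with invented names:

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Restricting a search to a precomputed candidate set, as a set intersection. */
public class SubsetSearch {
    /** Docs matching 'label' that also satisfy the other criteria's matches. */
    static Set<Integer> restrict(Map<String, Set<Integer>> labelToDocs,
                                 String label, Set<Integer> otherMatches) {
        Set<Integer> candidates = new HashSet<>(labelToDocs.getOrDefault(label, Set.of()));
        candidates.retainAll(otherMatches);  // the "filter" step
        return candidates;
    }
}
```

With P2 -> {3, 4, 5} and the other three fields matching {4, 5, 7}, the restricted result is {4, 5}; only the candidate set is ever considered for the expensive criteria.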
Re: Out of memory problem in search
I am also confused about the memory management of Lucene. Does this out of memory problem mainly arise from Reason-1 or Reason-2? Reason-1: The problem comes from the search being done over a big index file (nearly 40 GB). If only 100 records (a small number) were returned from a search over a 60 GB index, would the problem arise again? OR Reason-2: The problem comes from finding so many records (nearly 200,000), i.e. 200,000 Java objects in the heap? If the file's size were 10 GB (a small file) but very many records were returned, would the problem arise again? Is there any document that describes the general memory management issues of searching in Lucene? Thanks ilkay POLAT Software Engineer Gsm : (+90) 532 542 36 71 E-mail : ilkay_po...@yahoo.com --- On Wed, 7/14/10, ilkay polat wrote: From: ilkay polat Subject: Re: Out of memory problem in search To: java-user@lucene.apache.org Date: Wednesday, July 14, 2010, 3:54 PM Hi, We have hardware restrictions(Max RAM can be 8GB). So, unfortunately, increasing memory can not be option for us for today's situation. Yes, as you said that problem is faced when goes to last pages of search screen because of using search method which is find top n records. In other way, this is meaning "searching all the thinngs returns all". I am now researching whether there is a way which consumes time instead of memory in this search mechanism in lucene? Any other ideas? Thanks --- On Wed, 7/14/10, findbestopensource wrote: From: findbestopensource Subject: Re: Out of memory problem in search To: java-user@lucene.apache.org Date: Wednesday, July 14, 2010, 2:59 PM Certainly it will. Either you need to increase your memory OR refine your query. Eventhough you display paginated result. The first couple of pages will display fine and going towards last may face problem. This is because, 200,000 objects is created and iterated, 190,900 objects are skipped and last100 objects are returned. 
The memory is consumed in creating these objects. Regards Aditya www.findbestopensource.com On Wed, Jul 14, 2010 at 4:14 PM, ilkay polat wrote: > Hello Friends; > > Recently, I have problem with lucene search - memory problem on the basis > that indexed file is so big. (I have indexed some kinds of information and > this indexed file's size is nearly more than 40 gigabyte. ) > > I search the lucene indexed file with > org.apache.lucene.search.Searcher.search(query, null, offset + limit, new > Sort(new SortField("time", SortField.LONG, true))); > (This provides to find (offset + limit) records to back.) > > I use searching by range. For example, in web page I firstly search records > which are in [0, 100] range then second page [100, 200] > I have nearly 200,000 records at all. When I go to last page which means > records between 200,000 -100, 200,0, there is a memory problem(I have 4gb > ram on running machine) in jvm( out of memory error). > > Is there a way to overcome this memory problem? > > Thanks > > -- > ilkay POLAT Software Engineer > TURKEY > > Gsm : (+90) 532 542 36 71 > E-mail : ilkay_po...@yahoo.com > > >
Re: Continuously iterate over documents in index
> You could have a field within each doc, say "Processed", and store a > value Yes/No; next run a searcher query which should give you the > collection of unprocessed ones. That sounds like a reasonable idea, and I just realized that I could have done that in a way specific to my application. However, I already tried doing something with a MatchAllDocsQuery with a custom collector and sort by date. I store the last date and time of a doc I processed and process only newer ones.
Re: Continuously iterate over documents in index
All, Issue: Unable to get the proper results after searching. I added the sample code which I use in the application. If I use a *numHitPerPage* value of 1000 it gives the expected results. ex: the expected result is 32 docs and it shows 32 docs. Instead, if I use *numHitPerPage* of 2^32-1 it does not give the expected results. ex: the expected result is 32 docs but it shows only 29 docs. Sample code below: StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT); QueryParser qp = new QueryParser(Version.LUCENE_CURRENT, defField, analyzer); Query q = qp.parse(queryString); TopDocsCollector tdc = TopScoreDocCollector.create(*numHitPerPage*, true); is.search(q, tdc); // is: an IndexSearcher ScoreDoc[] noDocs = tdc.topDocs().scoreDocs; Please let me know if there is any other way to search? Thanks. Kiran. M
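One thing worth ruling out first (my guess, not confirmed anywhere in the thread): 2^32-1 does not fit in a Java int, whose maximum is 2^31-1, so a collector size computed that way silently wraps to -1 and the collector receives a nonsense capacity. A safe pattern is to clamp to Integer.MAX_VALUE, or better, pass the real hit count:

```java
/** Why "2^32-1" is a dangerous collector size: int is a signed 32-bit type. */
public class CollectorSize {
    /** Clamp a requested size to the largest value an int can hold. */
    static int clampToInt(long requested) {
        return (int) Math.min(requested, Integer.MAX_VALUE);
    }

    public static void main(String[] args) {
        long huge = (1L << 32) - 1;           // 4294967295, i.e. 2^32-1
        System.out.println((int) huge);       // wraps to -1
        System.out.println(clampToInt(huge)); // 2147483647
    }
}
```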
RE: Best practices for searcher memory usage?
Hi Toke- > > * 20 million documents [...] > > * 140GB total index size > > * Optimized into a single segment > > I take it that you do not have frequent updates? Have you tried to see if you > can get by with more segments without significant slowdown? Correct - in fact there are no updates and no deletions. We index everything offline when necessary and just swap the new index in... By more segments do you mean not call optimize() at index time? > > The application will run with 10G of -Xmx but any less and it bails out. > > It seems happier if we feed it 12GB. The searches are starting to bog > > down a bit (5-10 seconds for some queries)... > > 10G sounds like a lot for that index. Two common memory-eaters are sorting > by field value and faceting. Could you describe what you're doing in that > regard? No faceting and no sorting (other than score) for this index... > Similarly, the 5-10 seconds for some queries seems very slow. Could you give > some examples on the queries that causes problems together with some > examples of fast queries and how long they take to execute? Typically just TermQueries or BooleanQueries: (Chip OR Nacho OR Foo) AND (Salsa OR Sauce) AND (This OR That) The latter is most typical. With a single keyword it will execute in < 1 second. In a case where there are 10 clauses it becomes much slower (which I understand, just looking for ways to speed it up)... Thanks, -Chris
Re: Out of memory problem in search
This doesn't make sense to me. Are you saying that you only have 200,000 documents in your index? Because keeping a score for 200K documents should consume a relatively trivial amount of memory. The fact that you're sorting by time is a red flag, but it's only a long, so 200K documents shouldn't strain memory due to sorting either. The critical thing here isn't necessarily the size of your index, but the number of documents in that index and the number of unique values you're sorting by. By the way, what happens if you don't sort? Since it doesn't make sense to me, that must mean I don't understand the problem very thoroughly. Could you provide some index characteristics? Saying it's 40G leaves a lot open to speculation. That could be 39G of stored text which is mostly irrelevant for searching. Or it could be entirely indexed, tokenized data which would be a different thing. How many documents do you have in your index? What does your query look like? You can get an idea of the amount of your index holding indexed tokens by NOT storing any of the fields, just indexing them (Field.Store.NO) What version of Lucene are you using? How do you start your process? If you start the application with java's default memory, that's not very much (64M if memory serves). You may be using nowhere near your hardware limits. Try specifying -Xmx512M and/or the -server option. Best Erick On Wed, Jul 14, 2010 at 9:27 AM, ilkay polat wrote: > I have also confused about the memory management of lucene. > > Where is this out of memory problem is mainly arised from Reason-1 or > Reason-2 reason? > > Reason-1 : Problem is sourced from searching is done in big indexed file > (nearly 40 GB) If there is 100(small number of records) records returned > from search in 60 GB indexed file, problem will again arised. > OR > Reason-2 : Problem is sourced from finding so many records(nearly 200,000 > records), so in memory 200, 000 java object in heap? 
If file's sizeis 10 > GB(small file size ) but returned records are so many, problem will again > arised. > > Is there any document which tells the general memory management issues in > searching in lucene? > > Thanks > > > ilkay POLAT Software Engineer Gsm : (+90) 532 542 36 71 > E-mail : ilkay_po...@yahoo.com > > --- On Wed, 7/14/10, ilkay polat wrote: > > From: ilkay polat > Subject: Re: Out of memory problem in search > To: java-user@lucene.apache.org > Date: Wednesday, July 14, 2010, 3:54 PM > > Hi, > We have hardware restrictions(Max RAM can be 8GB). So, unfortunately, > increasing memory can not be option for us for today's situation. > > Yes, as you said that problem is faced when goes to last pages of search > screen because of using search method which is find top n records. In other > way, this is meaning "searching all the thinngs returns all". > > I am now researching whether there is a way which consumes time instead of > memory in this search mechanism in lucene? Any other ideas? > > Thanks > > --- On Wed, 7/14/10, findbestopensource > wrote: > > From: findbestopensource > Subject: Re: Out of memory problem in search > To: java-user@lucene.apache.org > Date: Wednesday, July 14, 2010, 2:59 PM > > Certainly it will. Either you need to increase your memory OR refine your > query. Eventhough you display paginated result. The first couple of pages > will display fine and going towards last may face problem. This is because, > 200,000 objects is created and iterated, 190,900 objects are skipped and > last100 objects are returned. The memory is consumed in creating these > objects. > > Regards > Aditya > www.findbestopensource.com > > > > On Wed, Jul 14, 2010 at 4:14 PM, ilkay polat > wrote: > > > Hello Friends; > > > > Recently, I have problem with lucene search - memory problem on the basis > > that indexed file is so big. (I have indexed some kinds of information > and > > this indexed file's size is nearly more than 40 gigabyte. 
) > > > > I search the lucene indexed file with > > org.apache.lucene.search.Searcher.search(query, null, offset + limit, new > > Sort(new SortField("time", SortField.LONG, true))); > > (This provides to find (offset + limit) records to back.) > > > > I use searching by range. For example, in web page I firstly search > records > > which are in [0, 100] range then second page [100, 200] > > I have nearly 200,000 records at all. When I go to last page which means > > records between 200,000 -100, 200,0, there is a memory problem(I have 4gb > > ram on running machine) in jvm( out of memory error). > > > > Is there a way to overcome this memory problem? > > > > Thanks > > > > -- > > ilkay POLAT Software Engineer > > TURKEY > > > > Gsm : (+90) 532 542 36 71 > > E-mail : ilkay_po...@yahoo.com > > > > > > > > > > > > > >
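Erick's point above about default heap sizes is easy to check from inside the process: without -Xmx the JVM default (historically around 64M for the client VM) can be far below the machine's physical RAM, so an OutOfMemoryError can strike long before the hardware limit. A quick sanity check:

```java
/** Prints the heap ceiling the JVM actually received (roughly the -Xmx value). */
public class HeapInfo {
    public static void main(String[] args) {
        long maxHeapBytes = Runtime.getRuntime().maxMemory();
        System.out.println("max heap MB: " + (maxHeapBytes / (1024 * 1024)));
    }
}
```

If this prints a small number on a machine with gigabytes of RAM, raising -Xmx is the first thing to try.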
Re: Continuously iterate over documents in index
Kiran: Please start a new thread when asking a new question. From Hossman's
apache page:

When starting a new discussion on a mailing list, please do not reply to an
existing message, instead start a fresh email. Even if you change the
subject line of your email, other mail headers still track which thread you
replied to and your question is "hidden" in that thread and gets less
attention. It makes following discussions in the mailing list archives
particularly difficult.

See Also: http://en.wikipedia.org/wiki/User:DonDiego/Thread_hijacking

On Wed, Jul 14, 2010 at 10:56 AM, Kiran Kumar wrote:

> All,
>
> Issue: Unable to get the proper results after searching. I added the
> sample code I used in the application.
>
> If I use a *numHitPerPage* value of 1000, it gives the expected results,
> e.g. the expected result is 32 docs and it shows 32 docs.
> If instead I use a *numHitPerPage* of 2^32-1, it does not give the
> expected results, e.g. the expected result is 32 docs but it shows only
> 29 docs.
>
> Sample code below:
>
> StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT);
> QueryParser qp = new QueryParser(Version.LUCENE_CURRENT, defField,
>     analyzer);
> Query q = qp.parse(queryString);
> TopDocsCollector tdc = TopScoreDocCollector.create(numHitPerPage, true);
> is.search(q, tdc);
>
> ScoreDoc[] noDocs = tdc.topDocs().scoreDocs;
>
> Please let me know if there is any other way to search.
>
> Thanks.
> Kiran. M
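Whether or not it is the cause of the missing docs here, note that 2^32-1 does not fit in a Java int: Integer.MAX_VALUE is 2^31-1, and a value computed as 2^32-1 silently wraps to -1 when narrowed to int. The overflow itself is easy to demonstrate:

```java
// Demonstrates why sizing a collector with 2^32-1 is suspect in Java:
// int is a 32-bit signed type, so 2^32-1 (4294967295) cannot be represented.
public class IntOverflowDemo {
    public static void main(String[] args) {
        long asLong = (1L << 32) - 1;           // 4294967295: fine as a long
        int truncated = (int) asLong;           // narrowing keeps the low 32 bits
        System.out.println(asLong);             // 4294967295
        System.out.println(truncated);          // -1
        System.out.println(Integer.MAX_VALUE);  // 2147483647, i.e. 2^31-1
    }
}
```

Even a legal Integer.MAX_VALUE is risky as a hit count, since TopScoreDocCollector sizes its priority queue up front from the requested number of hits; requesting only as many hits as the page actually needs is the safer pattern.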
Re: Continuously iterate over documents in index
Hmmm, if you somehow know the last date you processed, why wouldn't using a
range query work for you? I.e. date:[ TO ]?

Best
Erick

On Wed, Jul 14, 2010 at 10:37 AM, Max Lynch wrote:

> > You could have a field within each doc say "Processed" and store a
> > value Yes/No, next run a searcher query which should give you the
> > collection of unprocessed ones.
>
> That sounds like a reasonable idea, and I just realized that I could have
> done that in a way specific to my application. However, I already tried
> doing something with a MatchAllDocsQuery with a custom collector and sort
> by date. I store the last date and time of a doc I processed and process
> only newer ones.
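For a date range query to work as a term range, the date field must be indexed in a lexicographically sortable form (e.g. yyyyMMddHHmmss), so that string order equals chronological order. A minimal sketch of such an encoding follows; the field name `date` and the sample instants are illustrative assumptions, and Lucene itself ships an equivalent ready-made encoding in DateTools.dateToString:

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

// Encode timestamps so that string comparison matches chronological order;
// a term range query (query-parser syntax: date:[<last> TO <now>]) then
// returns exactly the documents newer than the last one processed.
public class SortableDate {
    private static final SimpleDateFormat FMT;
    static {
        FMT = new SimpleDateFormat("yyyyMMddHHmmss", Locale.ROOT);
        FMT.setTimeZone(TimeZone.getTimeZone("UTC"));  // one fixed zone, or ordering breaks
    }

    static synchronized String encode(Date d) {  // SimpleDateFormat is not thread-safe
        return FMT.format(d);
    }

    public static void main(String[] args) {
        String earlier = encode(new Date(0L));             // the 1970 epoch
        String later = encode(new Date(1279100000000L));   // an instant in July 2010
        // Lexicographic order agrees with chronological order:
        System.out.println(earlier.compareTo(later) < 0);  // true
        System.out.println("date:[" + earlier + " TO " + later + "]");
    }
}
```

Coarser resolutions (yyyyMMdd, say) keep the term dictionary smaller when sub-second precision isn't needed.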
Re: Best practices for searcher memory usage?
There are a number of strategies, on the Java or OS side of things:

- Use huge pages[1], esp. on 64-bit with lots of RAM. For long-running,
large-memory (and GC-busy) applications, this has achieved significant
improvements - like 300% on EJBs. See [2],[3],[4]. For a great article
introducing and benchmarking huge pages, both in C and Java, see [5].
To see if huge pages might help you, do

  > cat /proc/meminfo

and check the "PageTables: 26480 kB" line. If PageTables is, say, more than
1-2 GB, you should consider using huge pages.

- Assuming multicore: there are times (very application dependent) when
having your application running on all cores turns out not to produce the
best performance. Taking one core out, making it available to look after
system things (I/O, etc.), sometimes improves performance. Use numactl[6]
to bind your application to n-1 cores, leaving one out.
  - numactl also allows you to restrict memory allocation to 1-n cores,
which may also be useful depending on your application.

- The Java VM from Sun-Oracle has a number of options[7]:
  - -XX:+AggressiveOpts [You should have this one on always...]
  - -XX:+StringCache
  - -XX:+UseFastAccessorMethods
  - -XX:+UseBiasedLocking [My experience has this helping some
applications, hindering others...]
  - -XX:ParallelGCThreads= [Usually this is #cores; try reducing it to n/2]
  - -Xss128k
  - -Xmn [Make this large, like 40% of your heap (-Xmx). If you do this,
use -XX:+UseParallelGC. See [8]]

You can also play with the many GC parameters. This is pretty arcane, but
can give you good returns.

And of course, I/O is important: data on multiple disks with multiple
controllers; RAID; filesystem tuning; turning off atime; readahead buffer
(change from 128k to 8MB on Linux: see [9]); OS tuning. See [9] for a
useful filesystem comparison (for Postgres).
-glen
http://zzzoot.blogspot.com/

[1] http://zzzoot.blogspot.com/2009/02/java-mysql-increased-performance-with.html
[2] http://andrigoss.blogspot.com/2008/02/jvm-performance-tuning.html
[3] http://kirkwylie.blogspot.com/2008/11/linux-fork-performance-redux-large.html
[4] http://orainternals.files.wordpress.com/2008/10/high_cpu_usage_hugepages.pdf
[5] http://lwn.net/Articles/374424/
[6] http://www.phpman.info/index.php/man/numactl/8
[7] http://java.sun.com/javase/technologies/hotspot/vmoptions.jsp#PerformanceTuning
[8] http://java.sun.com/performance/reference/whitepapers/tuning.html#section4.2.5
[9] http://assets.en.oreilly.com/1/event/27/Linux%20Filesystem%20Performance%20for%20Databases%20Presentation.pdf

On 15 July 2010 04:28, Christopher Condit wrote:

> Hi Toke-
>>> * 20 million documents [...]
>>> * 140GB total index size
>>> * Optimized into a single segment
>>
>> I take it that you do not have frequent updates? Have you tried to see
>> if you can get by with more segments without significant slowdown?
>
> Correct - in fact there are no updates and no deletions. We index
> everything offline when necessary and just swap the new index in...
> By more segments do you mean not call optimize() at index time?
>
>>> The application will run with 10G of -Xmx but any less and it bails
>>> out. It seems happier if we feed it 12GB. The searches are starting to
>>> bog down a bit (5-10 seconds for some queries)...
>>
>> 10G sounds like a lot for that index. Two common memory-eaters are
>> sorting by field value and faceting. Could you describe what you're
>> doing in that regard?
>
> No faceting and no sorting (other than score) for this index...
>
>> Similarly, the 5-10 seconds for some queries seems very slow. Could you
>> give some examples of the queries that cause problems, together with
>> some examples of fast queries and how long they take to execute?
>
> Typically just TermQueries or BooleanQueries:
> (Chip OR Nacho OR Foo) AND (Salsa OR Sauce) AND (This OR That)
> The latter is most typical.
>
> With a single keyword it will execute in < 1 second. In a case where
> there are 10 clauses it becomes much slower (which I understand, just
> looking for ways to speed it up)...
>
> Thanks,
> -Chris

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: Best practices for searcher memory usage?
Glen, thank you for this very thorough and informative post.

Lance Norskog