problems with SpellCheckComponent
Hi,

I downloaded the trunk version today and I'm having problems with the SpellCheckComponent. Is this a known bug? This is my configuration:

  <searchComponent name="spellcheck" class="org.apache.solr.handler.component.SpellCheckComponent">
    <lst name="defaults">
      <!-- omp = Only More Popular -->
      <str name="spellcheck.onlyMorePopular">false</str>
      <!-- exr = Extended Results -->
      <str name="spellcheck.extendedResults">false</str>
      <!-- The number of suggestions to return -->
      <str name="spellcheck.count">1</str>
    </lst>
    <str name="queryAnalyzerFieldType">text</str>
    <lst name="spellchecker">
      <str name="name">default</str>
      <str name="field">title</str>
      <str name="spellcheckIndexDir">spellchecker_defaultXX</str>
    </lst>
  </searchComponent>

  <queryConverter name="queryConverter" class="org.apache.solr.spelling.SpellingQueryConverter"/>

  <requestHandler name="/spellCheckCompRH" class="org.apache.solr.handler.component.SearchHandler">
    <arr name="last-components">
      <str>spellcheck</str>
    </arr>
  </requestHandler>

SCHEMA.XML:

  ...
  <field name="title" type="text" indexed="true" stored="true"/>
  ...

When I request:

  http://localhost:8080/solr/spellCheckCompRH?q=*:*&spellcheck.q=ruck&spellcheck=true

I get this exception:

HTTP Status 500 - null java.lang.NullPointerException
  at org.apache.solr.handler.component.SpellCheckComponent.getTokens(SpellCheckComponent.java:217)
  at org.apache.solr.handler.component.SpellCheckComponent.process(SpellCheckComponent.java:184)
  at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:156)
  at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:128)
  at org.apache.solr.core.SolrCore.execute(SolrCore.java:1025)
  at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
  at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:272)
  at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
  at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
  at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
  at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:175)
  at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128)
  at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
  at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
  at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:263)
  at org.apache.coyote.http11.Http11AprProcessor.process(Http11AprProcessor.java:852)
  at org.apache.coyote.http11.Http11AprProtocol$Http11ConnectionHandler.process(Http11AprProtocol.java:584)
  at org.apache.tomcat.util.net.AprEndpoint$Worker.run(AprEndpoint.java:1508)
  at java.lang.Thread.run(Unknown Source)

Any help will be very useful to me. Thanks for your attention.

Rober
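For reference, the SpellCheckComponent only serves suggestions once its spellchecker index has been built. A minimal sketch of a build request against the handler configured above, assuming the trunk version in use supports the spellcheck.build parameter (this may or may not be related to this particular NPE):

  http://localhost:8080/solr/spellCheckCompRH?q=*:*&spellcheck=true&spellcheck.q=ruck&spellcheck.build=true

After the build completes, the same request without spellcheck.build=true should return suggestions for "ruck" drawn from the title field.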
never deallocates RAM... during search
Hi users,

A few days ago I asked a question about RAM use during searches, but I couldn't solve my problem with the ideas that some expert users gave me. After running some tests I can ask a more specific question, hoping someone can help me.

My problem is that I need highlighting and I have quite big docs (text files of 40MB). The conclusion of my tests is that if I set rows to 10, the content of the first 10 results gets cached. That is probably normal, since it is likely needed for the highlighting, but this memory is never deallocated even though I set Solr's caches to 0. Because of this, the memory grows until it is close to the heap limit; then the GC starts to reclaim memory, but at that point the searches are quite slow.

Is this normal behavior? Can I configure some Solr parameter to force the results to be released after each search? [I'm using Solr 1.2]

Another thing I found is that although I comment out (in solrconfig.xml) all of these options: filterCache, queryResultCache, documentCache, enableLazyFieldLoading, useFilterForSortedQuery, boolTofilterOptimizer, the stats always show caching:true. I'm probably overlooking something stupid, but I can't find it.

If anyone can help me... I'm quite desperate.

Rober.
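For what it's worth, a common alternative to commenting the caches out is to leave them declared but sized down explicitly, so their behaviour is visible in the config. A minimal solrconfig.xml sketch; the sizes are purely illustrative:

  <!-- sizes are illustrative: effectively disable the query caches,
       keep just enough documentCache for one page of highlighted results -->
  <filterCache      class="solr.LRUCache" size="0"  initialSize="0"  autowarmCount="0"/>
  <queryResultCache class="solr.LRUCache" size="0"  initialSize="0"  autowarmCount="0"/>
  <documentCache    class="solr.LRUCache" size="10" initialSize="10" autowarmCount="0"/>

Note that even with tiny caches, a JVM normally keeps filling its heap up to -Xmx before collecting, so memory that looks "never deallocated" from the outside is often just uncollected garbage rather than something Solr is holding on to.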
Re: doubt about an index of 300GB
Hi Otis,

Thanks a lot for your interest. The main thing I can't understand very well is why, if I have for example 8 machines acting as searchers, they would need more expensive hardware with one big index. If I have 10 smaller indexes I will still need to search over all of them, so... wouldn't that require the same hardware? I understand that if I could search only a subset of the index it would be better to split it, but what if I must always search the entire index? I can add new searcher machines, so I think my hardware problem is the RAM; is that right?

I'm probably missing something; sorry if my question has an obvious answer.

2008/6/15 Otis Gospodnetic [EMAIL PROTECTED]:

Hi Roberto,

SAN is a fine choice, if that's what you were worried about. There is no way to tell exactly how fast your searches will be, as that depends on a lot of factors -- benchmarking with your own data, hardware and queries is the best way to go.

As for the cost of multiple smaller machines vs. one large one (if that's what's needed): I *think* the price of hw goes up significantly when you start working with high-end hw, and that cost may be higher than the cost of N smaller servers combined. That's the cost difference I was trying to point out. That's for your IT people to figure out after you tell them what type of hw you need and what the options are.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
From: Roberto Nieto [EMAIL PROTECTED]
To: solr-user@lucene.apache.org
Sent: Saturday, June 14, 2008 5:05:54 PM
Subject: Re: doubt about an index of 300GB

Hi Otis,

Thanks for your fast answer. I understand your points perfectly. I will explain my limitations...

--Multiple smaller indices you can split them across several servers, but you can't do that with a monolithic index.

The index will be allocated on a SAN that is not my choice. I can decide to split the index or use a monolithic one, but not where it lives.

--With multiple smaller indices you can choose to search only a subset of them, should that make sense for your app.
--How much does it cost to have 1 server with a LOT of RAM that serving this index will need? Maybe it's cheaper to have multiple smaller machines.

This index will be public and I will always need to search the entire index. I understand the RAM problem, but if I use multiple indexes and then search in all of them, will I really use less RAM? The index will have 10 fields; all of them except the content will be small, and I will only sort by score. If someone has any experience of how much RAM I will need, or about the response times with this kind of index, it would be very useful to me.

--How long does it take you to rebuild one big index, should it get corrupted, vs. rebuilding only a subset of your data?

This is a very important aspect, but my primary objective must be the response time. I thought about using different indexes with different Solr instances, but the problem is merging the results and how to sort them... so I think (but I'm not sure) that using only one index will be faster, given that I will always need to search the entire index.

Any help or suggestion will be very useful. Thank you very much for your attention.

2008/6/14 Otis Gospodnetic:

Roberto,

Here is some food for thought...

Multiple smaller indices you can split them across several servers, but you can't do that with a monolithic index.
With multiple smaller indices you can choose to search only a subset of them, should that make sense for your app.
How much does it cost to have 1 server with a LOT of RAM that serving this index will need? Maybe it's cheaper to have multiple smaller machines.
How long does it take you to rebuild one big index, should it get corrupted, vs. rebuilding only a subset of your data?
How long does it take you to copy the index around the network after you optimize it, vs. copying only a subset, or multiple subsets in parallel?
etc.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
From: Roberto Nieto
To: solr-user@lucene.apache.org
Sent: Saturday, June 14, 2008 7:31:28 AM
Subject: doubt about an index of 300GB

Hi users,

I'm going to create a big index of 300GB on a SAN where I have 4TB. I have read many entries in the mailing list about using multiple indexes with multicore. I would like to know what kind of benefit I can get from using multiple indexes instead of one big index if I don't have problems with disk space. I know that optimizes and commits would be faster with smaller indexes, but what about search? Would the RAM use be the same with 10 indexes of 30GB as with 1 index of 300GB?

Any suggestion or experience
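As background for the split-vs-monolithic question above: searching several smaller indexes from a single entry point is what distributed search does, assuming a Solr version that supports the shards request parameter. A hedged sketch, with hypothetical hosts searcher1/searcher2 and cores part0..part9:

  http://searcher1:8080/solr/part0/select?q=content:foo&shards=searcher1:8080/solr/part0,searcher1:8080/solr/part1,searcher2:8080/solr/part2

Each shard runs the query over its own smaller index and the frontend merges and re-sorts the partial results by score, so the client still sees one ranked list even though no single machine searches the whole 300GB index by itself.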
Re: Memory problems when highlighting with a not very big index
Hi Yonik,

I think you are right, it must be that. If I activate highlighting on a field that I'm not specifying in fl, will it use the same amount of RAM as if I returned it? Internally, is it as if I had added it to fl?

2008/6/13 Yonik Seeley [EMAIL PROTECTED]:

On Fri, Jun 13, 2008 at 3:30 PM, Roberto Nieto [EMAIL PROTECTED] wrote:

  The part that I can't understand very well is why the memory doesn't grow if I deactivate highlighting. Does it only use the doc cache if highlighting is used or if content retrieval is activated?

Perhaps you are highlighting some fields that you normally don't return? What is fl vs hl.fl?

-Yonik
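To make the fl vs hl.fl distinction concrete, a hedged example request (the field names are illustrative) that returns only small fields while highlighting a large content field:

  http://localhost:8080/solr/select?q=foo&rows=10&fl=id,title,score&hl=true&hl.fl=content

Here fl controls what is written to the response, while hl.fl controls which stored fields are read and analyzed to build snippets, so a big field listed only in hl.fl still has to be loaded for the documents being highlighted, much as if it were returned.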
Re: doubt about an index of 300GB
Hi Otis,

I think my questions were not very well formulated. We have dedicated machines for parsing, 2 machines (active/passive) for indexing, the index allocated on a SAN filesystem, and dedicated machines for searching.

All of my questions come from the fact that, with an index of 300GB, I don't know how much RAM I will need for searching it. I can't find documents anywhere about memory use in Solr, and I'm a bit worried because I don't know how much memory I will need to serve each search. I don't have much of a problem with concurrent searches because I can add machines to a cluster. I have read about the filterCache, queryResultCache and documentCache, but if I don't use those caches (set them to 0) I don't know how much memory Solr will still need (if any) to store the docSets, order them, etc. and serve a search. If some document explains this, it would be very useful to me.

2008/6/15 Otis Gospodnetic [EMAIL PROTECTED]:

Roberto,

All I was trying to say is that it *might* be cheaper to buy:
10 smaller servers with 4 GB RAM each, for a total of 40 GB RAM
than
1 big server with 40 GB RAM and a CPU matching the CPU power of 10 smaller servers

Of course, there are other things to consider, too - power usage, hosting space, management, etc. There is no single answer, you'll have to evaluate pros and cons yourself. I simply wanted to point out various factors that you and your IT team will need to consider.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
From: Roberto Nieto [EMAIL PROTECTED]
To: solr-user@lucene.apache.org
Sent: Sunday, June 15, 2008 8:38:15 AM
Subject: Re: doubt about an index of 300GB

Hi Otis,

Thanks a lot for your interest. The main thing I can't understand very well is why, if I have for example 8 machines acting as searchers, they would need more expensive hardware with one big index. If I have 10 smaller indexes I will still need to search over all of them, so... wouldn't that require the same hardware? I understand that if I could search only a subset of the index it would be better to split it, but what if I must always search the entire index? I can add new searcher machines, so I think my hardware problem is the RAM; is that right?

I'm probably missing something; sorry if my question has an obvious answer.

2008/6/15 Otis Gospodnetic:

Hi Roberto,

SAN is a fine choice, if that's what you were worried about. There is no way to tell exactly how fast your searches will be, as that depends on a lot of factors -- benchmarking with your own data, hardware and queries is the best way to go.

As for the cost of multiple smaller machines vs. one large one (if that's what's needed): I *think* the price of hw goes up significantly when you start working with high-end hw, and that cost may be higher than the cost of N smaller servers combined. That's the cost difference I was trying to point out. That's for your IT people to figure out after you tell them what type of hw you need and what the options are.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
From: Roberto Nieto
To: solr-user@lucene.apache.org
Sent: Saturday, June 14, 2008 5:05:54 PM
Subject: Re: doubt about an index of 300GB

Hi Otis,

Thanks for your fast answer. I understand your points perfectly. I will explain my limitations...

--Multiple smaller indices you can split them across several servers, but you can't do that with a monolithic index.

The index will be allocated on a SAN that is not my choice. I can decide to split the index or use a monolithic one, but not where it lives.

--With multiple smaller indices you can choose to search only a subset of them, should that make sense for your app.
--How much does it cost to have 1 server with a LOT of RAM that serving this index will need? Maybe it's cheaper to have multiple smaller machines.

This index will be public and I will always need to search the entire index. I understand the RAM problem, but if I use multiple indexes and then search in all of them, will I really use less RAM? The index will have 10 fields; all of them except the content will be small, and I will only sort by score. If someone has any experience of how much RAM I will need, or about the response times with this kind of index, it would be very useful to me.

--How long does it take you to rebuild one big index, should it get corrupted, vs. rebuilding only a subset of your data?

This is a very important aspect, but my primary objective must be the response time. I thought about using different indexes with different Solr instances, but the problem is merging the results and how to sort them
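As a rough way to reason about per-search memory independent of the configurable caches: a Lucene/Solr docSet is, for large result sets, essentially a bitset with one bit per document (small sets use a more compact hashed form), so its size is driven by the document count, not by the 300GB on disk. A small sketch in Python, with a purely assumed document count:

  # Rough docSet size estimate: a bitset needs one bit per document in the index.
  # The document count below is an assumption for illustration only.
  max_doc = 10_000_000           # assumed number of docs in the 300GB index
  bitset_bytes = max_doc / 8     # one bit per doc
  print(f"one docSet/filter bitset ~= {bitset_bytes / (1024 * 1024):.1f} MB")  # ~1.2 MB

In practice the bigger per-request memory consumers tend to be field sorting structures (the Lucene FieldCache, which sorting only by score avoids) and the stored content loaded for highlighting or for returning large fields.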
doubt about an index of 300GB
Hi users,

I'm going to create a big index of 300GB on a SAN where I have 4TB. I have read many entries in the mailing list about using multiple indexes with multicore. I would like to know what kind of benefit I can get from using multiple indexes instead of one big index if I don't have problems with disk space. I know that optimizes and commits would be faster with smaller indexes, but what about search? Would the RAM use be the same with 10 indexes of 30GB as with 1 index of 300GB?

Any suggestion or experience will be very useful to me. Thanks in advance.

Rober.
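For readers unfamiliar with the multicore setup mentioned above: multiple indexes inside one Solr webapp are declared in solr.xml, one <core> element per index. A minimal sketch, assuming a Solr 1.3-style multicore layout and hypothetical core names:

  <solr persistent="false">
    <cores adminPath="/admin/cores">
      <core name="part0" instanceDir="part0"/>
      <core name="part1" instanceDir="part1"/>
      <!-- ...one core per index slice... -->
    </cores>
  </solr>

Each core then gets its own URL (e.g. /solr/part0/select), its own caches and its own index directory, which is what makes it possible to split the 300GB into independently managed slices.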
Re: doubt about an index of 300GB
Hi Otis,

Thanks for your fast answer. I understand your points perfectly. I will explain my limitations...

--Multiple smaller indices you can split them across several servers, but you can't do that with a monolithic index.

The index will be allocated on a SAN that is not my choice. I can decide to split the index or use a monolithic one, but not where it lives.

--With multiple smaller indices you can choose to search only a subset of them, should that make sense for your app.
--How much does it cost to have 1 server with a LOT of RAM that serving this index will need? Maybe it's cheaper to have multiple smaller machines.

This index will be public and I will always need to search the entire index. I understand the RAM problem, but if I use multiple indexes and then search in all of them, will I really use less RAM? The index will have 10 fields; all of them except the content will be small, and I will only sort by score. If someone has any experience of how much RAM I will need, or about the response times with this kind of index, it would be very useful to me.

--How long does it take you to rebuild one big index, should it get corrupted, vs. rebuilding only a subset of your data?

This is a very important aspect, but my primary objective must be the response time. I thought about using different indexes with different Solr instances, but the problem is merging the results and how to sort them... so I think (but I'm not sure) that using only one index will be faster, given that I will always need to search the entire index.

Any help or suggestion will be very useful. Thank you very much for your attention.

2008/6/14 Otis Gospodnetic [EMAIL PROTECTED]:

Roberto,

Here is some food for thought...

Multiple smaller indices you can split them across several servers, but you can't do that with a monolithic index.
With multiple smaller indices you can choose to search only a subset of them, should that make sense for your app.
How much does it cost to have 1 server with a LOT of RAM that serving this index will need? Maybe it's cheaper to have multiple smaller machines.
How long does it take you to rebuild one big index, should it get corrupted, vs. rebuilding only a subset of your data?
How long does it take you to copy the index around the network after you optimize it, vs. copying only a subset, or multiple subsets in parallel?
etc.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
From: Roberto Nieto [EMAIL PROTECTED]
To: solr-user@lucene.apache.org
Sent: Saturday, June 14, 2008 7:31:28 AM
Subject: doubt about an index of 300GB

Hi users,

I'm going to create a big index of 300GB on a SAN where I have 4TB. I have read many entries in the mailing list about using multiple indexes with multicore. I would like to know what kind of benefit I can get from using multiple indexes instead of one big index if I don't have problems with disk space. I know that optimizes and commits would be faster with smaller indexes, but what about search? Would the RAM use be the same with 10 indexes of 30GB as with 1 index of 300GB?

Any suggestion or experience will be very useful to me. Thanks in advance.

Rober.
Memory problems when highlighting with a not very big index
Hi users/developers,

I'm new to Solr and I have been reading the list for a few hours, but I didn't find anything to resolve my doubt. I'm using a 5GB index on a machine with 2GB of RAM, and I'm trying to optimize the Solr configuration for searching. I get good search times, but when I activate highlighting the RAM use grows a lot; it grows as much as if I wanted to retrieve the content of the files found. I'm not sure whether, for highlighting, Solr needs to load the entire content of all the matching documents in order to highlight them. How does it work? Is it possible to load only the first 10 results, build snippets for only those, and use less memory?

Thanks in advance.

Rober.
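For context, the highlighting parameters most relevant to memory with large stored documents are hl.fl (which stored fields are read to build snippets) and hl.maxAnalyzedChars (how much of each field is analyzed). A hedged example request, assuming a "content" field and a Solr version that supports hl.maxAnalyzedChars:

  http://localhost:8080/solr/select?q=foo&rows=10&fl=id,title,score&hl=true&hl.fl=content&hl.fragsize=100&hl.maxAnalyzedChars=51200

Lowering hl.maxAnalyzedChars caps the per-document analysis work, but the full stored value of each highlighted field is still read from the index for the documents on the current page.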
Re: Memory problems when highlighting with a not very big index
Thanks for your fast answer.

I think I already tried setting the default size to 0 and the problem persisted, but I will try it again on Monday. The part that I can't understand very well is why the memory doesn't grow if I deactivate highlighting. Does it only use the doc cache if highlighting is used or if content retrieval is activated?

Thanks,
Rober.

2008/6/13 Yonik Seeley [EMAIL PROTECTED]:

On Fri, Jun 13, 2008 at 1:07 PM, Roberto Nieto [EMAIL PROTECTED] wrote:

  Is it possible to load only the first 10 results, build snippets for only those, and use less memory?

That's how it currently works. But there is a Document cache to make things more efficient. If you have large documents, you might want to decrease this from its default size (see solrconfig.xml), which is currently 512. Perhaps move it down to 60 (which would allow for 6 concurrent requests of 10 docs each without re-fetching the doc between highlighting and response writing).

-Yonik
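To make Yonik's suggestion concrete, a minimal solrconfig.xml sketch of a smaller documentCache (the element and attribute names are the standard ones; the size simply follows his example):

  <!-- ~6 concurrent requests x 10 docs each -->
  <documentCache class="solr.LRUCache" size="60" initialSize="60" autowarmCount="0"/>

With 40MB documents, even 60 cached entries can amount to roughly 2.4GB of text, which is more than the 2GB of RAM mentioned in this thread, so on a small heap it may be worth going lower still.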