Weighting categories
Hi, I have a table with products and their respective categories. Is it possible to weight categories, so that a user who searches for apple ipad doesn't get a magazine about the Apple iPad as the first result, but the Apple iPad hardware instead? I'm using DIH for indexing the data, but I don't know if there is any post-processing step to weight the categories I have. Thanks, Rmao
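One common way to get this kind of weighting is at query time rather than as a post-processing step, for example with a dismax boost query. A hedged sketch follows; the field name category and the value Hardware are assumptions and would have to match the actual schema:

  defType=dismax
  q=apple ipad
  qf=name^2 description
  bq=category:Hardware^10

The bq (boost query) raises the score of documents in the boosted category, so the hardware product tends to rank above the magazine while the magazine is still returned.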
Re: multiple values encountered for non multiValued field type:[text/html, text, html]
error message:

org.apache.solr.common.SolrException: ERROR: [http://bbs.dichan.com/] multiple values encountered for non multiValued field type: [text/html, text, html]
    at org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:242)
    at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:60)
    at org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:115)
    at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:158)
    at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:79)
    at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:67)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1368)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
    at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:298)
    at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:857)
    at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:588)
    at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:489)
    at java.lang.Thread.run(Thread.java:662)
Which Tokeniser (and/or filter)
Hi, I need to tokenise on whitespace, full-stop, and comma ONLY. Currently I'm using solr.WhitespaceTokenizerFactory with WordDelimiterFilterFactory, but this is also splitting on , /, new-line, etc. It seems such a simple setup, so what am I doing wrong? What do you use for such normal searching? Thanks, Rob -- IntelCompute Web Design Local Online Marketing http://www.intelcompute.com
Symbols in synonyms
Is it good practice, common, or even possible to put symbols in my list of synonyms? I'm having trouble indexing and searching for AE, with it being split on the "." character. We already convert .net to dotnet, but don't want to store every combination of 2 letters, AE, ME, etc. -- IntelCompute Web Design Local Online Marketing http://www.intelcompute.com
Replication problem on windows
Hello! We have Solr running on Windows. Once in a while we see a problem with replication failing. While the slave server replicates the index, it throws an exception like the following:

SEVERE: Unable to copy index file from: D:\web\solr\collection\data\index.2011102510\_3s.fdt to: D:\web\solr\Collection\data\index\_3s.fdt
java.io.FileNotFoundException: D:\web\solr\collection\data\index.2011102510\_3s.fdt (The system cannot find the file specified)

We've added commitReserveDuration to the master server configuration, but it didn't change the situation; the error still happens once in a while. Did anyone encounter such an error? -- Regards, Rafał Kuć Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Re: multiple values encountered for non multiValued field type:[text/html, text, html]
Hi, I am not sure if what you are doing is possible, i.e. having a schema other than the one provided by Nutch. The schema provided by Nutch in its nutch-dir\conf directory is meant to be used as the Solr schema.
Phonetic search and matching
Hi, I have a question on phonetic search and matching in Solr. In our application all the content of an article is written to a full-text search field, which provides stemming and a phonetic filter (Cologne phonetic for German). This is the relevant part of the configuration for the index analyzer (the search analyzer is analogous):

<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SnowballPorterFilterFactory" language="German2"/>
<filter class="solr.PhoneticFilterFactory" encoder="ColognePhonetic" inject="true"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>

Unfortunately this sometimes results in strange, but also explainable, matches. For example: the content field indexes the following string: "Donnerstag von 13 bis 17 Uhr". This results in a match if we search for "puf", because the result of the phonetic filter for "puf" is "13". (As a consequence the "13" is then also highlighted.) Does anyone have an idea how to handle this in a reasonable way, so that a search for "puf" does not match "13" in the content? Thanks in advance! Dirk
Improving performance for SOLR geo queries?
Hi, we need to perform fast geo lookups on an index of ~13M places, and were running into performance problems here with SOLR. We haven't done a lot of query optimization / SOLR tuning up until now so there's probably a lot of things we're missing. I was wondering if you could give me some feedback on the way we do things, whether they make sense, and especially why a supposed optimization we implemented recently seems to have no effect, when we actually thought it would help a lot. What we do is this: our API is built on a Rails stack and talks to SOLR via a Ruby wrapper. We have a few filters that almost always apply, which we put in filter queries. Filter cache hit rate is excellent, about 97%, and cache size caps at 10k filters (max size is 32k, but it never seems to reach that many, probably because we replicate / delta update every few minutes). Still, geo queries are slow, about 250-500msec on average. We send them with cache=false, so as to not flood the fq cache and cause undesirable evictions. Now our idea was this: while the actual geo queries are poorly cacheable, we could clearly identify geographical regions which are more often queried than others (naturally, since we're a user driven service). Therefore, we dynamically partition Earth into a static grid of overlapping boxes, where the grid size (the distance of the nodes) depends on the maximum allowed search radius. That way, for every user query, we would always be able to identify a single bounding box that covers it. This larger bounding box (200km edge length) we would send to SOLR as a cached filter query, along with the actual user query which would still be sent uncached. Ex: User asks for places in 10km around 49.14839,8.5691, then what we will send to SOLR is something like this: fq={!bbox cache=false d=10 sfield=location_ll pt=49.14839,8.5691} fq={!bbox cache=true d=100.0 sfield=location_ll pt=49.4684836290799,8.31165802979391} -- this one we derive automatically That way SOLR would intersect the two filters and return the same results as when only looking at the smaller bounding box, but keep the larger box in cache and speed up subsequent geo queries in the same regions. Or so we thought; unfortunately this approach did not help query execution times get better, at all. Question is: why does it not help? Shouldn't it be faster to search on a cached bbox with only a few hundred thousand places? Is it a good idea to make these kinds of optimizations in the app layer (we do this as part of resolving the SOLR query in Ruby), and does it make sense at all? We're not sure what kind of optimizations SOLR already does in its query planner. The documentation is (sorry) miserable, and debugQuery yields no insight into which optimizations are performed. So this has been a hit and miss game for us, which is very ineffective considering that it takes considerable time to build these kinds of optimizations in the app layer. Would be glad to hear your opinions / experience around this. Thanks! -- Matthias Käppler Lead Developer API Mobile Qype GmbH Großer Burstah 50-52 20457 Hamburg Telephone: +49 (0)40 - 219 019 2 - 160 Skype: m_kaeppler Email: matth...@qype.com Managing Director: Ian Brotherston Amtsgericht Hamburg HRB 95913 This e-mail and its attachments may contain confidential and/or privileged information. If you are not the intended recipient (or have received this e-mail in error) please notify the sender immediately and destroy this e-mail and its attachments. 
Re: Which Tokeniser (and/or filter)
I need to tokenise on whitespace, full-stop, and comma ONLY. Currently using solr.WhitespaceTokenizerFactory with WordDelimiterFilterFactory but this is also splitting on , /, new-line, etc. WDF is customizable via types=wdftypes.txt parameter. https://svn.apache.org/repos/asf/lucene/dev/trunk/solr/core/src/test-files/solr/conf/wdftypes.txt Alternatively you can convert . and , to whitespace (before tokenizer) by MappingCharFilterFactory. http://lucene.apache.org/solr/api/org/apache/solr/analysis/MappingCharFilterFactory.html
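For reference, a sketch of how these two suggestions could be wired into schema.xml; the field type name and the mapping/types file names are illustrative, not taken from this thread:

  <fieldType name="text_custom" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <!-- option 2: map "." and "," to a space before tokenizing -->
      <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-punct.txt"/>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <!-- option 1: override per-character types so WDF stops splitting on them -->
      <filter class="solr.WordDelimiterFilterFactory" types="wdftypes.txt"
              generateWordParts="1" generateNumberParts="1"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

Here mapping-punct.txt would contain lines such as "." => " " and "," => " ", while wdftypes.txt assigns a type (for example - => ALPHA) to each character that WordDelimiterFilterFactory should no longer treat as a delimiter.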
Re: Which Tokeniser (and/or filter)
My fear is what will then happen with highlighting if I use re-mapping? On Mon, 6 Feb 2012 03:33:03 -0800 (PST), Ahmet Arslan iori...@yahoo.com wrote: I need to tokenise on whitespace, full-stop, and comma ONLY. Currently using solr.WhitespaceTokenizerFactory with WordDelimiterFilterFactory but this is also splitting on , /, new-line, etc. WDF is customizable via types=wdftypes.txt parameter. https://svn.apache.org/repos/asf/lucene/dev/trunk/solr/core/src/test-files/solr/conf/wdftypes.txt Alternatively you can convert . and , to whitespace (before tokenizer) by MappingCharFilterFactory. http://lucene.apache.org/solr/api/org/apache/solr/analysis/MappingCharFilterFactory.html
Re: Which Tokeniser (and/or filter)
My fear is what will then happen with highlighting if I use re-mapping? What do you mean by re-mapping?
Re: Which Tokeniser (and/or filter)
mapping dots to spaces. I don't think that's workable anyway, since .net would cause issues. Trying out the wdftypes now... --- IntelCompute Web Design Local Online Marketing http://www.intelcompute.com On Mon, 6 Feb 2012 04:10:18 -0800 (PST), Ahmet Arslan iori...@yahoo.com wrote: My fear is what will then happen with highlighting if I use re-mapping? What do you mean by re-mapping?
multiple values encountered for non multiValued field type:[text/html, text, html]
Hi everyone: when I index my crawl results from a bbs site with Solr, I get that error. Is there someone who could help me? My Solr schema is:

<field name="id" type="string" indexed="true" stored="true" required="true"/>
<field name="sku" type="text_en_splitting_tight" indexed="true" stored="true" omitNorms="true"/>
<field name="name" type="text_general" indexed="true" stored="true"/>
<field name="alphaNameSort" type="alphaOnlySort" indexed="true" stored="false"/>
<field name="manu" type="text_general" indexed="true" stored="true" omitNorms="true"/>
<field name="cat" type="string" indexed="true" stored="true" multiValued="true"/>
<field name="features" type="text_general" indexed="true" stored="true" multiValued="true"/>
<field name="includes" type="text_general" indexed="true" stored="true" termVectors="true" termPositions="true" termOffsets="true"/>
<field name="weight" type="float" indexed="true" stored="true"/>
<field name="price" type="float" indexed="true" stored="true"/>
<field name="popularity" type="int" indexed="true" stored="true"/>
<field name="inStock" type="boolean" indexed="true" stored="true"/>
<field name="store" type="location" indexed="true" stored="true"/>
<field name="title" type="text_general" indexed="true" stored="true" multiValued="true"/>
<field name="subject" type="text_general" indexed="true" stored="true"/>
<field name="description" type="text_general" indexed="true" stored="true"/>
<field name="comments" type="text_general" indexed="true" stored="true"/>
<field name="author" type="text_general" indexed="true" stored="true"/>
<field name="keywords" type="text_general" indexed="true" stored="true"/>
<field name="category" type="text_general" indexed="true" stored="true"/>
<field name="content_type" type="string" indexed="true" stored="true" multiValued="true"/>
<field name="last_modified" type="date" indexed="true" stored="true"/>
<field name="links" type="string" indexed="true" stored="true" multiValued="true"/>
<field name="url" type="string" indexed="true" stored="true"/>
<field name="content" type="textMaxWord" indexed="true" stored="true" multiValued="true"/>
<field name="cache_content" type="text_cache" indexed="false" stored="true"/>
<field name="segment" type="string" indexed="false" stored="true"/>
<field name="boost" type="float" indexed="true" stored="true"/>
<field name="digest" type="string" indexed="false" stored="true"/>
<field name="host" type="string" indexed="true" stored="false"/>
<field name="cache" type="string" indexed="true" stored="false"/>
<field name="site" type="string" indexed="true" stored="false"/>
<field name="anchor" type="string" indexed="true" stored="false" multiValued="true"/>
<field name="tstamp" type="string" indexed="true" stored="true"/>
<field name="date" type="date" indexed="true" stored="true"/>
<field name="type" type="string" indexed="true" stored="true"/>
<field name="text" type="text_general" indexed="true" stored="true" multiValued="true"/>
<field name="simple" type="textSimple" indexed="true" stored="true"/>
<field name="complex" type="textComplex" indexed="true" stored="true"/>
<field name="text_rev" type="text_general_rev" indexed="true" stored="false" multiValued="true"/>
<field name="manu_exact" type="string" indexed="true" stored="false"/>
<field name="payloads" type="payloads" indexed="true" stored="true"/>
<dynamicField name="*_i" type="int" indexed="true" stored="true"/>
<dynamicField name="*_s" type="string" indexed="true" stored="true"/>
<dynamicField name="*_l" type="long" indexed="true" stored="true"/>
<dynamicField name="*_t" type="text_general" indexed="true" stored="true"/>
<dynamicField name="*_txt" type="text_general" indexed="true" stored="true"/>
<dynamicField name="*_b" type="boolean" indexed="true" stored="true"/>
<dynamicField name="*_f" type="float" indexed="true" stored="true"/>
<dynamicField name="*_d" type="double" indexed="true" stored="true"/>
<dynamicField name="*_coordinate" type="tdouble" indexed="true" stored="false"/>
<dynamicField name="*_dt" type="date" indexed="true" stored="true"/>
<dynamicField name="*_p" type="location" indexed="true" stored="true"/>
<dynamicField name="*_ti" type="tint" indexed="true" stored="true"/>
<dynamicField name="*_tl" type="tlong" indexed="true" stored="true"/>
<dynamicField name="*_tf" type="tfloat" indexed="true" stored="true"/>
<dynamicField name="*_td" type="tdouble" indexed="true" stored="true"/>
<dynamicField name="*_tdt" type="tdate" indexed="true" stored="true"/>
<dynamicField name="*_pi" type="pint" indexed="true" stored="true"/>
<dynamicField name="ignored_*" type="ignored" multiValued="true"/>
<dynamicField name="attr_*" type="text_general" indexed="true" stored="true" multiValued="true"/>
<dynamicField name="random_*" type="random"/>
</fields>
Searching context within a book
I'm very new to Solr and I'm evaluating it. My task is to look for words within a corpus of books and return them within a small context. So far, I'm storing the books in a database split by paragraphs (slicing the books by line breaks); I do a full-text search and return the row. In Solr, would I have to do the same, or can I add the whole book (in .txt format) and, whenever a match is found, return something like the match plus 100 words before and 100 words after, or something like that? Thanks
Re: Parallel indexing in Solr
See response below. Erick Erickson wrote: Unfortunately, the answer is it depends(tm). First question: How are you indexing things? SolrJ? post.jar? SolrJ, CommonsHttpSolrServer But some observations: 1 sure, using multiple cores will have some parallelism. So will using a single core but using something like SolrJ and StreamingUpdateSolrServer. So SolrJ with CommonsHttpSolrServer will not support handling several requests concurrently? Especially with trunk (4.0) and the Document Writer Per Thread stuff. We are using trunk (4.0). Can you provide me with a little more info on this Document Writer Per Thread stuff? A link or something? In 3.x, you'll see some pauses when segments are merged that you can't get around (per core). See: http://www.searchworkings.org/blog/-/blogs/gimme-all-resources-you-have-i-can-use-them!/ for an excellent writeup. But whether or not you use several cores should be determined by your problem space, certainly not by trying to increase the throughput. Indexing usually takes a back seat to search performance. We will have few searches, but a lot of indexing. 2 general settings are hard to come by. If you're sending structured documents that use Tika to parse the data behind the scenes, your performance will be much different (slower) than sending SolrInputDocuments (SolrJ). We are sending SolrInputDocuments. 3 The recommended servlet container is, generally, the one you're most comfortable with. Tomcat is certainly popular. That said, use whatever you're most comfortable with until you see a performance problem. Odds are you'll find your load on Solr is at its limit before your servlet container has problems. So is Jetty an easy-to-use, but not high-performance, container? 4 Monitor your CPU, fire more requests at it until it hits 100%. Note that there are occasions where the servlet container limits the number of outstanding requests it will allow and queues ones over that limit (find the magic setting to increase this if it's a problem, it differs by container). If you start to see your response times lengthen but the CPU not being fully utilized, that may be the cause. Actually right now, I am trying to find out what my bottleneck is. The setup is more complex than I would bother you with, but basically I have servers with 80-90% IO-wait and only 5-10% real CPU usage. It might not be a Solr-related problem, I am investigating different things, but just wanted to know a little more about how Jetty/Solr works in order to make a qualified guess. 5 How high is high performance? On a stock Solr with the Wikipedia dump (11M docs), all running on my laptop, I see 7K docs/sec indexed. I know of installations that see 60 docs/sec or even less. I'm sending simple docs with SolrJ locally and they're sending huge documents over the wire that Tika handles. There are just so many variables it's hard to say anything except try it and see.. Well, eventually we need to be able to index and delete about 50 million documents per day. We will need to keep a history of 2 years of data in our system; deletion will not start before we have been in production for 2 years. At that point in time the system needs to contain 2 years * 365 days/year * 50 million docs/day = 36.5 billion documents. At that point 50 million documents need to be deleted and indexed per day - before that we only need to index 50 million documents per day.
We are aware that we are probably going to need a certain amount of hardware for this, but the most important thing is that we make a scalable setup so that we can get to these kinds of numbers at all. Right now I am focusing on getting the most out of one Solr instance, potentially with several cores, though. Best Erick On Fri, Feb 3, 2012 at 3:55 AM, Per Steffensen st...@designware.dk wrote: Hi This topic has probably been covered before, but I haven't had the luck to find the answer. We are running Solr instances with several cores inside, Solr running out-of-the-box on top of Jetty. I believe Jetty is receiving all the HTTP requests about indexing new documents, and forwards them to the Solr engine. What kind of parallelism does this setup provide? Can more than one index request get processed concurrently? How many? How to increase the number of index requests that can be handled in parallel? Will I get better parallelism by running on another web container than Jetty - e.g. Tomcat? What is the recommended web container for high performance production systems? Thanks! Regards, Per Steffensen
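For reference, a minimal SolrJ sketch of the CommonsHttpSolrServer-to-StreamingUpdateSolrServer switch mentioned above; the URL, queue size and thread count are illustrative values, not from this thread:

  import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
  import org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer;
  import org.apache.solr.common.SolrInputDocument;

  public class IndexerSketch {
      public static void main(String[] args) throws Exception {
          // one request at a time, blocking per call
          CommonsHttpSolrServer plain = new CommonsHttpSolrServer("http://localhost:8983/solr");

          // queues up to 1000 documents and sends them with 4 background threads
          StreamingUpdateSolrServer streaming =
              new StreamingUpdateSolrServer("http://localhost:8983/solr", 1000, 4);

          SolrInputDocument doc = new SolrInputDocument();
          doc.addField("id", "doc-1");    // field names depend on the schema
          streaming.add(doc);             // returns quickly, indexed in the background

          streaming.blockUntilFinished(); // drain the queue before committing
          streaming.commit();
      }
  }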
Re: effect of continuous deletes on index's read performance
Your continuous deletes won't affect performance noticeably, that's true. But you're really doing bad things with the commit after every add or delete. You haven't said whether you have a master/ slave setup or not, but assuming you're searching on the same machine you're indexing to, each time you commit, you're forcing the underlying searcher to close and re-open and any attendant autowarming to occur. All to get a single document searchable. 20 times a second. If you have a master/ slave setup, you're forcing the slave to fetch the changed parts of the index every time it polls, which is better than what's happening on the master, but still rather often. 400K documents isn't very big by Solr standards, so unless you can show performance problems, I wouldn't be concerned about index size, as Otis says, your per-document commit is probably hurting you far more than any index size savings. I'd actually think carefully about whether you need even 10 second commits. If you can stretch that out to minutes, so much the better. But it all depends upon your problem space. Best Erick On Mon, Feb 6, 2012 at 2:59 AM, prasenjit mukherjee prasen@gmail.com wrote: Thanks Otis. commitWithin will definitely work for me ( as I currently am using 3.4 version, which doesnt have NRT yet ). Assuming that I use commitWithin=10secs, are you saying that the continuous deletes ( without commit ) wont have any affect on performance ? I was under the impression that deletes just mark the doc-ids ( essentially means that the index size will remain the same ) , but wont actually do the compaction till someone calls optimize/commit, is my assumption not true ? -Thanks, Prasenjit On Mon, Feb 6, 2012 at 1:13 PM, Otis Gospodnetic otis_gospodne...@yahoo.com wrote: Hi Prasenjit, It sounds like at this point your main enemy might be those per-doc-add commits. Don't commit until you need to see your new docs in results. And if you need NRT then use softCommit option with Solr trunk (http://search-lucene.com/?q=softcommitfc_project=Solr) or use commitWithin to limit commit's performance damage. Otis Performance Monitoring SaaS for Solr - http://sematext.com/spm/solr-performance-monitoring/index.html From: prasenjit mukherjee prasen@gmail.com To: solr-user solr-user@lucene.apache.org Sent: Monday, February 6, 2012 1:17 AM Subject: effect of continuous deletes on index's read performance I have a use case where documents are continuously added @ 20 docs/sec ( each doc add is also doing a commit ) and docs continuously getting deleted at the same rate. So the searchable index size remains the same : ~ 400K docs ( docs for last 6 hours ~ 20*3600*6). Will it have pauses when deletes triggers compaction. Or with every commits ( while adds ) ? How bad they will effect on search response time. -Thanks, Prasenjit
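For reference, the commitWithin variant mentioned here looks like this as a plain XML update message (the 10-second window matches the value discussed above; the document fields are illustrative):

  <add commitWithin="10000">
    <doc>
      <field name="id">doc-1</field>
      <field name="name">example document</field>
    </doc>
  </add>

Solr then commits within 10 seconds of receiving the add, so the client never issues a per-document commit and the searcher is not reopened 20 times a second.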
Re: effect of continuous deletes on index's read performance
You could also try Solr 3.4 with RankingAlgorithm as this offers NRT. You can get more information about NRT for Solr 3.4 from here: http://solr-ra.tgels.org/wiki/en/Near_Real_Time_Search_ver_3.x Regards, - Nagendra Nagarajayya http://solr-ra.tgels.org http://rankingalgorithm.tgels.org On 2/5/2012 11:59 PM, prasenjit mukherjee wrote: Thanks Otis. commitWithin will definitely work for me ( as I currently am using 3.4 version, which doesnt have NRT yet ). Assuming that I use commitWithin=10secs, are you saying that the continuous deletes ( without commit ) wont have any affect on performance ? I was under the impression that deletes just mark the doc-ids ( essentially means that the index size will remain the same ) , but wont actually do the compaction till someone calls optimize/commit, is my assumption not true ? -Thanks, Prasenjit On Mon, Feb 6, 2012 at 1:13 PM, Otis Gospodnetic otis_gospodne...@yahoo.com wrote: Hi Prasenjit, It sounds like at this point your main enemy might be those per-doc-add commits. Don't commit until you need to see your new docs in results. And if you need NRT then use softCommit option with Solr trunk (http://search-lucene.com/?q=softcommitfc_project=Solr) or use commitWithin to limit commit's performance damage. Otis Performance Monitoring SaaS for Solr - http://sematext.com/spm/solr-performance-monitoring/index.html From: prasenjit mukherjeeprasen@gmail.com To: solr-usersolr-user@lucene.apache.org Sent: Monday, February 6, 2012 1:17 AM Subject: effect of continuous deletes on index's read performance I have a use case where documents are continuously added @ 20 docs/sec ( each doc add is also doing a commit ) and docs continuously getting deleted at the same rate. So the searchable index size remains the same : ~ 400K docs ( docs for last 6 hours ~ 20*3600*6). Will it have pauses when deletes triggers compaction. Or with every commits ( while adds ) ? How bad they will effect on search response time. -Thanks, Prasenjit
Re: effect of continuous deletes on index's read performance
Pardon my ignorance, Why can't the IndexWriter and IndexSearcher share the same underlying in-memory datastructure so that IndexSearcher need not be reopened with every commit. On 2/6/12, Erick Erickson erickerick...@gmail.com wrote: Your continuous deletes won't affect performance noticeably, that's true. But you're really doing bad things with the commit after every add or delete. You haven't said whether you have a master/ slave setup or not, but assuming you're searching on the same machine you're indexing to, each time you commit, you're forcing the underlying searcher to close and re-open and any attendant autowarming to occur. All to get a single document searchable. 20 times a second. If you have a master/ slave setup, you're forcing the slave to fetch the changed parts of the index every time it polls, which is better than what's happening on the master, but still rather often. 400K documents isn't very big by Solr standards, so unless you can show performance problems, I wouldn't be concerned about index size, as Otis says, your per-document commit is probably hurting you far more than any index size savings. I'd actually think carefully about whether you need even 10 second commits. If you can stretch that out to minutes, so much the better. But it all depends upon your problem space. Best Erick On Mon, Feb 6, 2012 at 2:59 AM, prasenjit mukherjee prasen@gmail.com wrote: Thanks Otis. commitWithin will definitely work for me ( as I currently am using 3.4 version, which doesnt have NRT yet ). Assuming that I use commitWithin=10secs, are you saying that the continuous deletes ( without commit ) wont have any affect on performance ? I was under the impression that deletes just mark the doc-ids ( essentially means that the index size will remain the same ) , but wont actually do the compaction till someone calls optimize/commit, is my assumption not true ? -Thanks, Prasenjit On Mon, Feb 6, 2012 at 1:13 PM, Otis Gospodnetic otis_gospodne...@yahoo.com wrote: Hi Prasenjit, It sounds like at this point your main enemy might be those per-doc-add commits. Don't commit until you need to see your new docs in results. And if you need NRT then use softCommit option with Solr trunk (http://search-lucene.com/?q=softcommitfc_project=Solr) or use commitWithin to limit commit's performance damage. Otis Performance Monitoring SaaS for Solr - http://sematext.com/spm/solr-performance-monitoring/index.html From: prasenjit mukherjee prasen@gmail.com To: solr-user solr-user@lucene.apache.org Sent: Monday, February 6, 2012 1:17 AM Subject: effect of continuous deletes on index's read performance I have a use case where documents are continuously added @ 20 docs/sec ( each doc add is also doing a commit ) and docs continuously getting deleted at the same rate. So the searchable index size remains the same : ~ 400K docs ( docs for last 6 hours ~ 20*3600*6). Will it have pauses when deletes triggers compaction. Or with every commits ( while adds ) ? How bad they will effect on search response time. -Thanks, Prasenjit -- Sent from my mobile device
Re: effect of continuous deletes on index's read performance
On Mon, Feb 6, 2012 at 8:20 AM, prasenjit mukherjee prasen@gmail.com wrote: Pardon my ignorance: why can't the IndexWriter and IndexSearcher share the same underlying in-memory data structure, so that the IndexSearcher need not be reopened with every commit? Because the semantics of an IndexReader in Lucene guarantee an unchanging point-in-time view of the index, as of when that IndexReader was opened. That said, Lucene has near-real-time readers, which keep point-in-time semantics but are very fast to open after adding/deleting docs, and do not require a (costly) commit. EG see my blog post: http://blog.mikemccandless.com/2011/06/lucenes-near-real-time-search-is-fast.html The tests I ran there indexed at a highish rate (~1000 1KB sized docs per second, or 1 MB plain text per second, or ~2X Twitter's peak rate, at least as of last July), and the reopen latency was fast (~60 msec). Admittedly this was a fast machine, the index was on a good SSD, and I used NRTCachingDir and MemoryCodec for the id field. But net/net Lucene's NRT search is very fast. It should easily handle your 20 docs/second rate, unless your docs are enormous. Solr trunk has finally cut over to using these APIs, but unfortunately this has not been backported to Solr 3.x. You might want to check out ElasticSearch, an alternative to Solr, which does use Lucene's NRT APIs. Mike McCandless http://blog.mikemccandless.com
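For readers who have not used the API being described, a minimal sketch of NRT reopening at the Lucene level; this is written against the Lucene 3.5-era API, so treat exact class and method names as version-dependent:

  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.Field;
  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.index.IndexWriterConfig;
  import org.apache.lucene.search.IndexSearcher;
  import org.apache.lucene.store.RAMDirectory;
  import org.apache.lucene.util.Version;

  public class NrtSketch {
      public static void main(String[] args) throws Exception {
          RAMDirectory dir = new RAMDirectory();
          IndexWriter writer = new IndexWriter(dir,
              new IndexWriterConfig(Version.LUCENE_35, new StandardAnalyzer(Version.LUCENE_35)));

          // NRT reader taken directly from the writer: new docs visible without a commit
          IndexReader reader = IndexReader.open(writer, true);

          Document doc = new Document();
          doc.add(new Field("id", "1", Field.Store.YES, Field.Index.NOT_ANALYZED));
          writer.addDocument(doc);

          // cheap reopen; returns null if nothing changed
          IndexReader newReader = IndexReader.openIfChanged(reader, writer, true);
          if (newReader != null) {
              reader.close();
              reader = newReader;
          }
          IndexSearcher searcher = new IndexSearcher(reader);
          // ... search as usual, without ever calling writer.commit()
      }
  }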
Is Solr waiting for data to arrive
Hi. I have a setup where a lot is going on, but where there is about 80-90% IO-wait (%wa in top). I have a suspicion that this is due to slow networking. I would like someone to help me interpret thread dumps (retrieved using kill -3). Whenever I do thread dumps I see that most threads have this stack trace:

2036752846@qtp-1221696456-205 prio=10 tid=0x7f8f50102000 nid=0x3a31 runnable [0x7f90908e3000]
   java.lang.Thread.State: RUNNABLE
        at java.net.SocketInputStream.socketRead0(Native Method)
        at java.net.SocketInputStream.read(SocketInputStream.java:129)
        at org.mortbay.io.ByteArrayBuffer.readFrom(ByteArrayBuffer.java:382)
        at org.mortbay.io.bio.StreamEndPoint.fill(StreamEndPoint.java:114)
        at org.mortbay.jetty.bio.SocketConnector$Connection.fill(SocketConnector.java:198)
        at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:290)
        at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
        at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
        at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:228)
        at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)

I'm not sure if this indicates 1) that they are just hanging around waiting for the next request (doing nothing), or 2) that a request has been initiated and the thread is waiting to receive the data. I guess if it is 1) I haven't confirmed my suspicion, but if it is 2) I probably have. Can anyone help me with the interpretation? Thanks! Regards, Per Steffensen
Re: Replication problem on windows
On 2/6/2012 3:04 AM, Rafał Kuć wrote: Hello! We have Solr running on Windows. Once in a while we see a problem with replication failing. While slave server replicates the index, it throws exception like the following: SEVERE: Unable to copy index file from: D:\web\solr\collection\data\index.2011102510\_3s.fdt to: D:\web\solr\Collection\data\index\_3s.fdt java.io.FileNotFoundException: D:\web\solr\collection\data\index.2011102510\_3s.fdt (The system cannot find the file specified) We've added commitReserveDuration to the master server configuration, but it didn't change that situation, the error still happens once in a while. Did anyone encounter such error? I found another old mailing list entry by searching Google for your error message without the filename/path. It looked like they solved it by adding/updating the following config line, found in solrconfig.xml inside <deletionPolicy>, which is found inside <mainIndex>. Increasing that number will increase the on-disk size of the index on your master server. <str name="maxCommitsToKeep">2</str> The directory paths in their error messages look exactly like yours, down to the difference in case between the from and to strings, so I fear that I am pointing you at information that you already have. http://www.xkcd.com/979/ Thanks, Shawn
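For context, the stock solrconfig.xml places that setting as follows; the value 2 follows the suggestion above, and maxOptimizedCommitsToKeep is shown only because it sits in the same block:

  <mainIndex>
    ...
    <deletionPolicy class="solr.SolrDeletionPolicy">
      <!-- keeping an extra commit point means the files of the previous commit
           survive long enough for a slave to finish copying them -->
      <str name="maxCommitsToKeep">2</str>
      <str name="maxOptimizedCommitsToKeep">0</str>
    </deletionPolicy>
    ...
  </mainIndex>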
Re: Replication problem on windows
Hello! Thanks for the answer Shawn. -- Regards, Rafał Kuć Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch On 2/6/2012 3:04 AM, Rafał Kuć wrote: Hello! We have Solr running on Windows. Once in a while we see a problem with replication failing. While slave server replicates the index, it throws exception like the following: SEVERE: Unable to copy index file from: D:\web\solr\collection\data\index.2011102510\_3s.fdt to: D:\web\solr\Collection\data\index\_3s.fdt java.io.FileNotFoundException: D:\web\solr\collection\data\index.2011102510\_3s.fdt (The system cannot find the file specified) We've addedcommitReserveDuration to the master server configuration, but it didn't change that situation, the error still happens once in a while. Did anyone encounter such error ? I found another old mailing list entry by searching google for your error message without filename/path. It looked like they solved it by adding/updating the following config line, found in solrconfig.xml in deletionPolicy, which is found in mainIndex. Increasing that number will increase the on-disk size of the index on your master server. str name=maxCommitsToKeep2/str The directory paths in their error messages look exactly like yours, down the the difference in case between the from and to strings, so I fear that I am pointing you at information that you already have. http://www.xkcd.com/979/ Thanks, Shawn
Re: Searching context within a book
You are probably better off splitting up each book into separate SOLR documents, one document per paragraph (each document with same book ID, ISBN, etc.). Then you can use field-collapsing on the book ID to return a single document per book. And you can use highlighting to show the paragraph that matched the query. You will need to store the full-text in SOLR in order to use highlighting feature and/or to return the text in the search results. On Feb 6, 2012, at 2:13 AM, pistacchio wrote: I'm very new to Solr and I'm evaluating it. My task is to look for words within a corpus of books and return them within a small context. So far, I'm storing the books in a database split by paragraphs (slicing the books by line breaks), I do a fulltext search and return the row. In Solr, would I have to do the same, or can I add the whole book (in .txt format) and, whenever a match is found, return something like the match plus 100 words before and 100 words after or something like that? Thanks -- View this message in context: http://lucene.472066.n3.nabble.com/Searching-context-within-a-book-tp3718997p3718997.html Sent from the Solr - User mailing list archive at Nabble.com.
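A sketch of the request this answer describes, assuming one Solr document per paragraph; the field names book_id and paragraph_text are assumptions, and grouping requires Solr 3.3 or later:

  q=some search words
  &group=true
  &group.field=book_id
  &group.limit=1
  &hl=true
  &hl.fl=paragraph_text
  &hl.snippets=1
  &hl.fragsize=300

Each group then stands for one book, and the highlighted paragraph_text fragment supplies the surrounding context for the match.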
Re: Parallel indexing in Solr
Right. See below. On Mon, Feb 6, 2012 at 7:53 AM, Per Steffensen st...@designware.dk wrote: See response below Erick Erickson skrev: Unfortunately, the answer is it depends(tm). First question: How are you indexing things? SolrJ? post.jar? SolrJ, CommonsHttpSolrServer But some observations: 1 sure, using multiple cores will have some parallelism. So will using a single core but using something like SolrJ and StreamingUpdateSolrServer. So SolrJ with CommonsHttpSolrServer will not support handling several requests concurrently? Nope. Use StreamingUpdateSolrServer, it should be just a drop-in with a different constructor. Especially with trunk (4.0) and the Document Writer Per Thread stuff. We are using trunk (4.0). Can you provide me with a little more info on this Document Writer Per Thread stuff. A link or something? I already did, follow the link I provided. In 3.x, you'll see some pauses when segments are merged that you can't get around (per core). See: http://www.searchworkings.org/blog/-/blogs/gimme-all-resources-you-have-i-can-use-them!/ for an excellent writeup. But whether or not you use several cores should be determined by your problem space, certainly not by trying to increase the throughput. Indexing usually take a back seat to search performance. We will have few searches, but a lot of indexing. Hmmm, this the inverse of most installations, so it's good to know. 2 general settings are hard to come by. If you're sending structured documents that use Tika to parse the data behind the scenes, your performance will be much different (slower) than sending SolrInputDocuments (SolrJ). We are sending SolrInputDocuments 3 The recommended servlet container is, generally, The one you're most comfortable with. Tomcat is certainly popular. That said, use whatever you're most comfortable with until you see a performance problem. Odds are you'll find your load on Solr is a at its limit before your servlet container has problems. So Jetty in not a easy to use, but non-performance-container? Again, test and see. Lots of commercial systems use Jetty. Consider that you're just sending sets of documents at Solr, the container is doing very little work. You are batching up your Solr documents aren't you? 4 Monitor you CPU, fire more requests at it until it hits 100%. Note that there are occasions where the servlet container limits the number of outstanding requests it will allow and queues ones over that limit (find the magic setting to increase this if it's a problem, it differs by container). If you start to see your response times lengthen but the CPU not being fully utilized, that may be the cause. Actually right now, I am trying to find our what my bottleneck is. The setup is more complex, than I would bother you with, but basically I have servers with 80-90% IO-wait and only 5-10% real CPU usage. It might not be a Solr-related problem, I am investigating different things, but just wanted to know a little more about how Jetty/Solr works in order to make a qualified guess. You should see this differ with StreamingUpdateSolrServer assuming your client can feed documents fast enough. You can consider having multiple clients feed the same solr indexer if necessary. 5 How high is high performance? On a stock solr with the Wikipedia dump (11M docs), all running on my laptop, I see 7K docs/sec indexed. I know of installations that see 60 docs/sec or even less. I'm sending simple docs with SolrJ locally and they're sending huge documents over the wire that Tika handles. 
There are just so many variables it's hard to say anything except try it and see.. Well eventaually we need to be able to index and delete about 50mio documents per day. We will need to keep a history of 2 years of data in our system, deletion will not start before we have been in production for 2 years. At that point in time the system needs to contain 2 year * 365 days/year * 50mio docs/day = 36,5billion documents. At that point 50mio documents need to be deleted and index per day - before that we only need to index 50mio documents per day. We are aware that we are probably going to need a certain amout of hardware for this, but most important thing is that we make a scalable setup so that we can get to this kind of numbers at all. Right now I am focusing on getting most out of one Solr instance potentially with several cores, though. My off-the-top-of-my-head feeling is that this will be a LOT of hardware. You'll without doubt be sharding the index. NOTE: Shards are cores, just special purpose ones, i.e. they're all use the same schema. When Solr folks see cores, we assume that the several cores that may have different schemas and handle unrelated queries. It sounds like you're talking about a sharded
Re: Parallel indexing in Solr
On Mon, Feb 6, 2012 at 2:53 PM, Per Steffensen st...@designware.dk wrote: Actually right now, I am trying to find our what my bottleneck is. The setup is more complex, than I would bother you with, but basically I have servers with 80-90% IO-wait and only 5-10% real CPU usage. It might not be a Solr-related problem, I am investigating different things, but just wanted to know a little more about how Jetty/Solr works in order to make a qualified guess. What kind of/how many discs do you have for your shards? ..also what kind of server are you experimenting with? -- Sami Siren
Re: Parallel indexing in Solr
So SolrJ with CommonsHttpSolrServer will not support handling several requests concurrently? Nope. Use StreamingUpdateSolrServer, it should be just a drop-in with a different constructor. I will try to do that. It is a little bit difficult for me, as we are actually not dealing with Solr ourselves. We are using Lily, but I will modify Lily, compile and try to see how goes. Especially with trunk (4.0) and the Document Writer Per Thread stuff. We are using trunk (4.0). Can you provide me with a little more info on this Document Writer Per Thread stuff. A link or something? I already did, follow the link I provided. Ahh ok, didnt get it the first time, that the link below was about that http://www.searchworkings.org/blog/-/blogs/gimme-all-resources-you-have-i-can-use-them!/ So Jetty in not a easy to use, but non-performance-container? Again, test and see. Lots of commercial systems use Jetty. Consider that you're just sending sets of documents at Solr, the container is doing very little work. You are batching up your Solr documents aren't you? Havnt looked into Lily to see whether or not documents are batched, but I will. I didnt expect Jetty to be the problem, basically just wanted to know that is was not a stupid everything-in-a-single-thread container, almost designed to not perform (because the focus might be different, e.g. providing an easy-to-use/understand container for testing etc.) Actually right now, I am trying to find our what my bottleneck is. You should see this differ with StreamingUpdateSolrServer assuming your client can feed documents fast enough. You can consider having multiple clients feed the same solr indexer if necessary. Thanks! 5 How high is high performance? On a stock solr with the Wikipedia dump (11M docs), all running on my laptop, I see 7K docs/sec indexed. I know of installations that see 60 docs/sec or even less. I'm sending simple docs with SolrJ locally and they're sending huge documents over the wire that Tika handles. There are just so many variables it's hard to say anything except try it and see.. 50mio documents need to be deleted and indexed per day. 2 years history = 36 billion docs in store My off-the-top-of-my-head feeling is that this will be a LOT of hardware. Well it takes what it takes. Someone else will buy the hardware. My first concern is to make sure we have a system that scales, so that we can buy us out of problems by buying more hardware. On the other hand of course I want to privide at system that makes the most of the hardware. You'll without doubt be sharding the index. NOTE: Shards are cores, just special purpose ones, i.e. they're all use the same schema. When Solr folks see cores, we assume that the several cores that may have different schemas and handle unrelated queries. It sounds like you're talking about a sharded system rather than independent cores, is that so? Yes that is correct. We only have one single schema/config shared by all cores through ZK. So the many cores are just for sharding, because I do not expect that it will work very well with 20 billion docs in the same core/shard :-) You should have no trouble indexing 50M documents/day, even assuming that the ingestion rate is not evenly distributed. The link I referenced talks about indexing 10M documents in a little over 6 minutes. YMMV however. I think you're going along the right path when trying to push a single indexer to the max. 
My setup uses Jetty and is getting 5-7K docs/second so I doubt it's inherently a Jetty problem, although there may be configuration tweaks getting in your way. Bottom line: I doubt it's a Jetty issue at this point but I've been wrong on too many occasions to count. I'd be looking other places first though. Start with the streaming update solr server though, and also whether your clients can spit out documents fast enough... I will have a look at all that. Thanks! Best Erick
Re: Parallel indexing in Solr
Sami Siren wrote: On Mon, Feb 6, 2012 at 2:53 PM, Per Steffensen st...@designware.dk wrote: Actually right now, I am trying to find out what my bottleneck is. The setup is more complex than I would bother you with, but basically I have servers with 80-90% IO-wait and only 5-10% real CPU usage. It might not be a Solr-related problem, I am investigating different things, but just wanted to know a little more about how Jetty/Solr works in order to make a qualified guess. What kind of/how many discs do you have for your shards? ..also what kind of server are you experimenting with? Grrr, that's where I have a little fight with operations. For now they gave me one (fairly big) machine with XenServer. I create my machines as Xen VMs on top of that. One of the things I don't like about this (besides that I don't trust Xen to do its virtualization right, or at least not to provide me with correct readings on IO) is that disk space is assigned from an iSCSI-connected SAN that they all share (including the line out there). But for now it actually doesn't look like a disk IO problem. It looks like network bottlenecks (but to some extent they all also share the network) among all the components in our setup - our client plus the Lily stack (HDFS, HBase, ZK, Lily Server, Solr etc). Well, it is complex, but anyway ... -- Sami Siren
Commit call - ReadTimeoutException - usage scenario for big update requests and the ioexception case
Hi, I wonder if it is possible to commit data to Solr without having to catch socket read timeout exceptions. I am calling commit(false, false) using a streaming server instance, but I still have to wait 30 seconds and catch the timeout from the HTTP method. It does not matter if it's 30 or 60; it will fail depending on how long it takes until the update request is processed, or can I tweak things here? So what's the way to go here? Any other option, or must I catch those exceptions and carry on like I do now? The operation itself does finish successfully later on when it's done on the server side, and all the data is committed and searchable. regards Torsten
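One knob that does exist on the SolrJ side is the client read timeout itself. A hedged sketch (SolrJ 3.x, illustrative values) of raising it so that a slow commit does not trip the socket timeout:

  import org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer;

  public class CommitTimeoutSketch {
      public static void main(String[] args) throws Exception {
          StreamingUpdateSolrServer server =
              new StreamingUpdateSolrServer("http://localhost:8983/solr", 100, 2);

          server.setConnectionTimeout(10000); // 10s to establish the connection
          server.setSoTimeout(300000);        // 5 minute read timeout; 0 means wait indefinitely

          // waitFlush=false, waitSearcher=false; the HTTP request may still take a while
          // on the server side, and that is what the read timeout applies to
          server.commit(false, false);
      }
  }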
Re: SolrCell maximum file size
Thanks for the tips Erick, i'm really talking about 2.5GB files full of data to be indexed. Like .csv files or .xls, .ods and so on. I guess I will try to do a great increase on the memory the JVM will be able to use. Regards, Augusto Erick Erickson erickerick...@gmail.com 1/27/2012 1:22 pm Hmmm, I'd go considerably higher than 2.5G. Problem is you the Tika processing will need memory, I have no idea how much. Then you'll have a bunch of stuff for Solr to index it etc. But I also suspect that this will be about useless to index (assuming you're talking lots of data, not say just the meta-data associated with a video or something). How do you provide a meaningful snippet of such a huge amount of data? If it *is* say a video or whatever where almost all of the data won't make it into the index anyway, you're probably better off using tika directly on the client and only sending the bits to Solr that you need in the form of a SolrInputDocument (I'm thinking that you'll be doing this in SolrJ) rather than transmit 2.5G over the network and throwing almost all of it away If the entire 2.5G is data to be indexed, you'll probably want to consider breaking it up into smaller chunks in order to make it useful. Best Erick On Fri, Jan 27, 2012 at 3:43 AM, Augusto Camarotti augu...@prpb.mpf.gov.br wrote: I'm talking about 2 GB files. It means that I'll have to allocate something bigger than that for the JVM? Something like 2,5 GB? Thanks, Augusto Camarotti Erick Erickson erickerick...@gmail.com 1/25/2012 1:48 pm Mostly it depends on your container settings, quite often that's where the limits are. I don't think Solr imposes any restrictions. What size are we talking about anyway? There are implicit issues with how much memory parsing the file requires, but you can allocate lots of memory to the JVM to handle that. Best Erick On Tue, Jan 24, 2012 at 10:24 AM, Augusto Camarotti augu...@prpb.mpf.gov.br wrote: Hi everybody Does anyone knows if there is a maximum file size that can be uploaded to the extractingrequesthandler via http request? Thanks in advance, Augusto Camarotti
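A sketch of the client-side approach Erick describes: parse with Tika locally and send only the extracted fields. The file path and field names are illustrative, and for multi-gigabyte inputs you would chunk the extracted text rather than build one giant string:

  import java.io.File;
  import org.apache.tika.Tika;
  import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
  import org.apache.solr.common.SolrInputDocument;

  public class ClientSideExtractSketch {
      public static void main(String[] args) throws Exception {
          // extract plain text on the client instead of shipping the raw file to Solr
          Tika tika = new Tika();
          String text = tika.parseToString(new File("/data/big-export.csv"));

          SolrInputDocument doc = new SolrInputDocument();
          doc.addField("id", "big-export-1"); // field names must match your schema
          doc.addField("content", text);

          CommonsHttpSolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");
          solr.add(doc);
          solr.commit();
      }
  }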
Re: Parallel indexing in Solr
grin. I've had recurring discussions with executive level folks that no matter how many VMs you host on a machine, and no matter how big that machine is, there really, truly, *is* some hardware underlying it all that really, truly, *does* have some limits. And adding more VMs doesn't somehow get around those limits.. Good Luck! Erick On Mon, Feb 6, 2012 at 10:55 AM, Per Steffensen st...@designware.dk wrote: Sami Siren skrev: On Mon, Feb 6, 2012 at 2:53 PM, Per Steffensen st...@designware.dk wrote: Actually right now, I am trying to find our what my bottleneck is. The setup is more complex, than I would bother you with, but basically I have servers with 80-90% IO-wait and only 5-10% real CPU usage. It might not be a Solr-related problem, I am investigating different things, but just wanted to know a little more about how Jetty/Solr works in order to make a qualified guess. What kind of/how many discs do you have for your shards? ..also what kind of server are you experimenting with? Grrr, thats where I have a little fight with operations. For now they gave me one (fairly big) machine with XenServer. I create my machines as Xen VM's on top of that. One of the things I dont like about this (besides that I dont trust Xen to do its virtualization right, or at least not provide me with correct readings on IO) is that disk space is assigned from an iSCSI connected SAN that they all share (including the line out there). But for now actually it doesnt look like disk IO problems. It looks like networks-bottlenecks (but to some extend they all also shard network) among all the components in our setup - our client plus Lily stack (HDFS, HBase, ZK, Lily Server, Solr etc). Well it is complex, but anyways ... -- Sami Siren
solrcore.properties
Looking at SOLR-1335 and the wiki, I'm not quite sure of the final behavior for this. These properties are per-core, and not visible in other cores, right? Are variables substituted in solr.xml, so I can swap in different properties files for dev, test, and prod? Like this: <core name="mary" properties="conf/solrcore-${env:dev}.properties"/> If that does not work, what are the best practices for managing dev/test/prod configs for Solr? wunder -- Walter Underwood wun...@wunderwood.org Search Guy, Chegg.com
Re: Performance degradation with distributed search
Yonik, Thanks for your reply. Yeah, that's the first thing I tried (adding fsv=true to the query) and it surprised me too. Could it be because we're using many complex sortings (20 sortings with dismax, and, or...)? Is there anything that can be optimized? It looks like it's calculated twice in Solr? XJ
Re: Performance degradation with distributed search
BTW we just upgraded to Solr 3.5 from Solr 1.4. Thats why we want to explore the improvements/new features of distributed search. On Mon, Feb 6, 2012 at 12:30 PM, oleole oleol...@gmail.com wrote: Yonik, Thanks for your reply. Yeah that's the first thing I tried (adding fsv=true to the query) and it surprised me too. Could it due to we're using many complex sortings (20 sortings with dismax, and, or...). Any thing it can be optimized? Looks like it's calculated twice in solr? XJ -- View this message in context: http://lucene.472066.n3.nabble.com/Performance-degradation-with-distributed-search-tp3715060p3720739.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Performance degradation with distributed search
On Mon, Feb 6, 2012 at 3:30 PM, oleole oleol...@gmail.com wrote: Thanks for your reply. Yeah that's the first thing I tried (adding fsv=true to the query) and it surprised me too. Could it due to we're using many complex sortings (20 sortings with dismax, and, or...). Any thing it can be optimized? Looks like it's calculated twice in solr? It currently does calculate it twice... but only for those documents being returned (which should not be significant). What is rows set to? -Yonik lucidimagination.com
Re: Performance degradation with distributed search
hm.. just looked at the log only 112 matched, and start=0, rows=30 On Mon, Feb 6, 2012 at 1:33 PM, Yonik Seeley yo...@lucidimagination.comwrote: On Mon, Feb 6, 2012 at 3:30 PM, oleole oleol...@gmail.com wrote: Thanks for your reply. Yeah that's the first thing I tried (adding fsv=true to the query) and it surprised me too. Could it due to we're using many complex sortings (20 sortings with dismax, and, or...). Any thing it can be optimized? Looks like it's calculated twice in solr? It currently does calculate it twice... but only for those documents being returned (which should not be significant). What is rows set to? -Yonik lucidimagination.com
Re: Performance degradation with distributed search
On Mon, Feb 6, 2012 at 5:35 PM, XJ oleol...@gmail.com wrote: hm.. just looked at the log only 112 matched, and start=0, rows=30 Are any of the sort criteria sort-by-function with anything complex (like an embedded relevance query)? -Yonik lucidimagination.com On Mon, Feb 6, 2012 at 1:33 PM, Yonik Seeley yo...@lucidimagination.com wrote: On Mon, Feb 6, 2012 at 3:30 PM, oleole oleol...@gmail.com wrote: Thanks for your reply. Yeah that's the first thing I tried (adding fsv=true to the query) and it surprised me too. Could it due to we're using many complex sortings (20 sortings with dismax, and, or...). Any thing it can be optimized? Looks like it's calculated twice in solr? It currently does calculate it twice... but only for those documents being returned (which should not be significant). What is rows set to? -Yonik lucidimagination.com
Re: Performance degradation with distributed search
Yes as I mentioned in previous email, we do dismax queries(with different mm values), solr function queries (map, etc) math calculations (sum, product, log). I understand those are expensive. But worst case it should only double the time not going from 200ms to 1200ms right? XJ On Mon, Feb 6, 2012 at 2:37 PM, Yonik Seeley yo...@lucidimagination.comwrote: On Mon, Feb 6, 2012 at 5:35 PM, XJ oleol...@gmail.com wrote: hm.. just looked at the log only 112 matched, and start=0, rows=30 Are any of the sort criteria sort-by-function with anything complex (like an embedded relevance query)? -Yonik lucidimagination.com On Mon, Feb 6, 2012 at 1:33 PM, Yonik Seeley yo...@lucidimagination.com wrote: On Mon, Feb 6, 2012 at 3:30 PM, oleole oleol...@gmail.com wrote: Thanks for your reply. Yeah that's the first thing I tried (adding fsv=true to the query) and it surprised me too. Could it due to we're using many complex sortings (20 sortings with dismax, and, or...). Any thing it can be optimized? Looks like it's calculated twice in solr? It currently does calculate it twice... but only for those documents being returned (which should not be significant). What is rows set to? -Yonik lucidimagination.com
Re: Performance degradation with distributed search
On Mon, Feb 6, 2012 at 5:53 PM, XJ oleol...@gmail.com wrote: Yes as I mentioned in previous email, we do dismax queries(with different mm values), solr function queries (map, etc) math calculations (sum, product, log). I understand those are expensive. But worst case it should only double the time not going from 200ms to 1200ms right? You mention dismax... but I assume that's as the main query and you sort by score (which is fine). The only issue with relevancy queries is if you sorted by one that was not the main query - this is not yet optimized. But for straight function queries that don't contain embedded relevancy queries, I would definitely not expect the degradation you are seeing - hence we should try to get to the bottom of this. -Yonik lucidimagination.com XJ On Mon, Feb 6, 2012 at 2:37 PM, Yonik Seeley yo...@lucidimagination.com wrote: On Mon, Feb 6, 2012 at 5:35 PM, XJ oleol...@gmail.com wrote: hm.. just looked at the log only 112 matched, and start=0, rows=30 Are any of the sort criteria sort-by-function with anything complex (like an embedded relevance query)? -Yonik lucidimagination.com On Mon, Feb 6, 2012 at 1:33 PM, Yonik Seeley yo...@lucidimagination.com wrote: On Mon, Feb 6, 2012 at 3:30 PM, oleole oleol...@gmail.com wrote: Thanks for your reply. Yeah that's the first thing I tried (adding fsv=true to the query) and it surprised me too. Could it due to we're using many complex sortings (20 sortings with dismax, and, or...). Any thing it can be optimized? Looks like it's calculated twice in solr? It currently does calculate it twice... but only for those documents being returned (which should not be significant). What is rows set to? -Yonik lucidimagination.com
spell check - preserve case in suggestions
Hi, Say that the field name has the following terms: Giants Manning New York When someone searches for gants or Gants, I need the suggestion to be returned as Giants (capital G - same case as in the content that was indexed). Using lowercase filter in both index and query analyzers I get the suggestion giants, but all the letters are in smaller case. Is it possible to preserve the case in suggestions, yet get suggestions for input term in upper or lower or mixed case? Thanks, Satish
Re: Solr with Scala
I have created a solr plugin using scala. It works without problems. I wouldn't go as far as using scala improve solr performance but you can definitely use scala to add a missing functionality or custom query parsing. Just build a jar using maven/sbt and put it in solr's lib directory. On Sun, Feb 5, 2012 at 4:06 PM, deniz denizdurmu...@gmail.com wrote: Hi all, I have a question about scala and solr... I am curious if we can use solr with scala (plugins etc) to improve performance. anybody used scala on solr? could you tell me opinions about them? - Zeki ama calismiyor... Calissa yapar... -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-with-Scala-tp3718539p3718539.html Sent from the Solr - User mailing list archive at Nabble.com. -- Tommy Chheng
Re: multiple values encountered for non multiValued field type:[text/html, text, html]
Thank you for your reply, it is very helpful to me!
Re: Performance degradation with distributed search
Yonik, thanks for your explanation. I've created a ticket here https://issues.apache.org/jira/browse/SOLR-3104 On Mon, Feb 6, 2012 at 4:28 PM, Yonik Seeley yo...@lucidimagination.comwrote: On Mon, Feb 6, 2012 at 6:16 PM, XJ oleol...@gmail.com wrote: Sorry I didn't make this clear. Yeah we use dismax in main query, as well as in sort orders (different from main queries). Because of our complicated business logic, we need many different relevancy queries in different sort orders (other than sort by score, we also have around 20 other different sort orders, some of them are dismax queries). However, this is something we can not get away from right now. What kind of optimization I can try to do there? OK, so basically it's slow because functions with embedded relevancy queries are forward only - if you request the value for a docid previous to the last, we need to reboot the query (re-weight, ask for the scorer, etc). This means that for your 30 documents, that will require rebooting the query about 15 times (assuming that roughly half of the time the next docid will be less than the previous one). Unfortunately there's not much you can do externally... we need to implement optimizations at the Solr level for this. Can you open a JIRA issue for this? -Yonik lucidimagination.com
Re: summing facets on a specific field
you can use the StatsComponent http://wiki.apache.org/solr/StatsComponent with stats=true&stats.price=category&stats.facet=category and pull the sum fields from the resulting stats facets. Johannes 2012/2/5 Paul Kapla paul.ka...@gmail.com: Hi everyone, I'm pretty new to solr and I'm not sure if this can even be done. Is there a way to sum a specific field per each item in a facet. For example, you have an ecommerce site that has the following documents: id,category,name,price 1,books,'solr book', $10.00 2,books,'lucene in action', $12.00 3,video, 'cool video', $20.00 so instead of getting (when faceting on category) books(2) video(1) I'd like to get: books ($22) video ($20) Is this something that can be even done? Any feedback would be much appreciated. -- Dipl.-Ing.(FH) Johannes Goll 211 Curry Ford Lane Gaithersburg, Maryland 20878 USA
Re: summing facets on a specific field
I meant stats=true&stats.field=price&stats.facet=category 2012/2/6 Johannes Goll johannes.g...@gmail.com: you can use the StatsComponent http://wiki.apache.org/solr/StatsComponent with stats=true&stats.price=category&stats.facet=category and pull the sum fields from the resulting stats facets. Johannes 2012/2/5 Paul Kapla paul.ka...@gmail.com: Hi everyone, I'm pretty new to solr and I'm not sure if this can even be done. Is there a way to sum a specific field per each item in a facet. For example, you have an ecommerce site that has the following documents: id,category,name,price 1,books,'solr book', $10.00 2,books,'lucene in action', $12.00 3,video, 'cool video', $20.00 so instead of getting (when faceting on category) books(2) video(1) I'd like to get: books ($22) video ($20) Is this something that can be even done? Any feedback would be much appreciated. -- Dipl.-Ing.(FH) Johannes Goll 211 Curry Ford Lane Gaithersburg, Maryland 20878 USA -- Dipl.-Ing.(FH) Johannes Goll 211 Curry Ford Lane Gaithersburg, Maryland 20878 USA
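Put together as a full request, the suggestion reads something like this (core URL and field names follow the example in the question):

  http://localhost:8983/solr/select?q=*:*&rows=0&stats=true&stats.field=price&stats.facet=category

The per-category sums then appear in the response under stats / stats_fields / price / facets / category, with one entry (sum, count, min, max, ...) for each category value such as books and video.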