Re: Query foreign language synonyms / words of equivalent meaning?
As far as I know, there is no built-in functionality for language translation. I would propose writing one, but there are many, many pitfalls. If you want to translate from one language to another you have to know the source language, otherwise you get translation problems:
- "Not" (German) -> "distress", "affliction" (English) - a word in one language may be a stopword in another.
- You don't have a one-to-one mapping; it's more like 1 to n: "toilette" (French) -> "bathroom", "rest room" / "restroom", "powder room".
These are just two points which jump to mind, but there are tons of pitfalls. We use a multilingual thesaurus as our synonym dictionary: http://en.wikipedia.org/wiki/Eurovoc It holds translations in 22 official languages of the European Union. So a search for "europäischer währungsfonds" also gives results with "european monetary fund", "fonds monétaire européen", ... Regards, Bernd

Am 10.10.2012 04:54, schrieb onlinespend...@gmail.com: Hi, English is going to be the predominant language used in my documents, but there may be a smattering of words in other languages (such as Spanish or French). What I'd like is to initiate a query for something like "bathroom", for example, and for Solr to return documents that not only contain "bathroom" but also "baño" (Spanish). And the same goes when searching for "baño": I'd like Solr to return documents that contain either "bathroom" or "baño". One possibility is to pre-translate all indexed documents to a common language, in this case English; and if someone were to search using a foreign word, I'd need to translate that to English before issuing a query to Solr. This appears to be problematic, since I'd have to know whether the indexed words and the query are even in a foreign language, which is not trivial. Another possibility is to pre-build a list of foreign word synonyms, so "baño" would be listed as a synonym for "bathroom". But I'd need to include other languages (such as "toilette" in French) and other words. This requires that I know in advance all possible words I'd need foreign language versions of (not to mention needing to know which languages to include). This isn't trivial either. I'm assuming there's no built-in functionality that supports foreign language translation on the fly, so what do people propose? Thanks!

--
Bernd Fehling, Dipl.-Inform. (FH)
Universitätsbibliothek Bielefeld
LibTec - Bibliothekstechnologie und Wissensmanagement
Universitätsstr. 25, 33615 Bielefeld
Tel. +49 521 106-4060
bernd.fehling(at)uni-bielefeld.de
BASE - Bielefeld Academic Search Engine - www.base-search.net
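For reference, a thesaurus like that can be wired in as a plain Solr synonym dictionary. A minimal sketch, assuming a hand-built synonyms.txt; the field type name, analyzer chain and example entries below are illustrative assumptions, not EuroVoc itself:

```xml
<!-- schema.xml: expand cross-language equivalents at query time.
     synonyms.txt holds one equivalence group per line, e.g.
       bathroom, baño, toilette, restroom
       european monetary fund, europäischer währungsfonds, fonds monétaire européen
-->
<fieldType name="text_multilang" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
  </analyzer>
</fieldType>
```

Note that multi-word entries such as "european monetary fund" behave better when the synonym filter is applied at index time instead; multi-word query-time synonyms are a known weak spot.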
Re: Solrcloud dataimport failed at first time after restart
I have found the reason: I am using a JBoss JNDI datasource, and the Oracle driver was placed in WEB-INF/lib. This is a very common error; the driver should be placed in %JBOSS_HOME%\server\default\lib.

2012/10/10 jun Wang wangjun...@gmail.com: Hi, all. I found that dataimport fails the first time after a restart; the log is here. It seems like a bug.

2012-10-09 20:00:08,848 ERROR dataimport.DataImporter - Full Import failed: java.lang.RuntimeException: java.lang.RuntimeException: org.apache.solr.handler.dataimport.DataImportHandlerException: Unable to execute query: select a.id, a.subject, a.keywords, a.category_id, to_number((a.gmt_modified - to_date('1970-01-01','yyyy-mm-dd'))*24*60*60) as gmt_modified, a.member_seq, b.standard_attr_desc, b.custom_attr_desc, decode(a.product_min_price, null, 0, a.product_min_price)/100 as min_price, sign(a.ws_offline_date - sysdate) + 1 as is_offline from ws_product_draft a, ws_product_attribute_draft b where a.id = b.product_id(+) Processing Document # 1
  at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:273)
  at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:382)
  at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:448)
  at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:429)
Caused by: java.lang.RuntimeException: org.apache.solr.handler.dataimport.DataImportHandlerException: Unable to execute query: select a.id, a.subject, a.keywords, a.category_id, to_number((a.gmt_modified - to_date('1970-01-01','yyyy-mm-dd'))*24*60*60) as gmt_modified, a.member_seq, b.standard_attr_desc, b.custom_attr_desc, decode(a.product_min_price, null, 0, a.product_min_price)/100 as min_price, sign(a.ws_offline_date - sysdate) + 1 as is_offline from ws_product_draft a, ws_product_attribute_draft b where a.id = b.product_id(+) Processing Document # 1
  at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:413)
  at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:326)
  at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:234)
  ... 3 more
Caused by: org.apache.solr.handler.dataimport.DataImportHandlerException: Unable to execute query: select a.id, a.subject, a.keywords, a.category_id, to_number((a.gmt_modified - to_date('1970-01-01','yyyy-mm-dd'))*24*60*60) as gmt_modified, a.member_seq, b.standard_attr_desc, b.custom_attr_desc, decode(a.product_min_price, null, 0, a.product_min_price)/100 as min_price, sign(a.ws_offline_date - sysdate) + 1 as is_offline from ws_product_draft a, ws_product_attribute_draft b where a.id = b.product_id(+) Processing Document # 1
  at org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:71)
  at org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.init(JdbcDataSource.java:252)
  at org.apache.solr.handler.dataimport.JdbcDataSource.getData(JdbcDataSource.java:209)
  at org.apache.solr.handler.dataimport.JdbcDataSource.getData(JdbcDataSource.java:38)
  at org.apache.solr.handler.dataimport.SqlEntityProcessor.initQuery(SqlEntityProcessor.java:59)
  at org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:73)
  at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:243)
  at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:472)
  at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:411)
  ... 5 more
Caused by: java.lang.ClassNotFoundException: Unable to load null or org.apache.solr.handler.dataimport.null
  at org.apache.solr.handler.dataimport.DocBuilder.loadClass(DocBuilder.java:899)
  at org.apache.solr.handler.dataimport.JdbcDataSource$1.call(JdbcDataSource.java:159)
  at org.apache.solr.handler.dataimport.JdbcDataSource$1.call(JdbcDataSource.java:127)
  at org.apache.solr.handler.dataimport.JdbcDataSource.getConnection(JdbcDataSource.java:362)
  at org.apache.solr.handler.dataimport.JdbcDataSource.access$200(JdbcDataSource.java:38)
  at org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.init(JdbcDataSource.java:239)
  ... 12 more
Caused by: java.lang.NullPointerException
  at java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:768)
  at org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:387)
  at org.apache.solr.handler.dataimport.DocBuilder.loadClass(DocBuilder.java:889)
  ... 17 more

-- from Jun
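For reference, the DataImportHandler can also be pointed at a container-managed pool directly via JNDI; a minimal data-config.xml sketch under that assumption (the JNDI name, entity and columns are placeholders, and the driver jar still has to live where the container can see it, e.g. %JBOSS_HOME%\server\default\lib rather than WEB-INF/lib):

```xml
<dataConfig>
  <!-- look up the JBoss-managed Oracle pool instead of configuring driver/url here -->
  <dataSource type="JdbcDataSource" jndiName="java:OracleDS"/>
  <document>
    <entity name="product"
            query="select id, subject, keywords from ws_product_draft">
      <field column="id" name="id"/>
      <field column="subject" name="subject"/>
    </entity>
  </document>
</dataConfig>
```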
Re: search by multiple 'LIKE' operator connected with 'AND' operator
I'm also unable to configure that type of search through schema.xml. As I use Solr in Drupal, I've implemented it in hook_search_api_solr_query_alter by exploding my search string into two (or more) chunks, and now search works well. Strange that I couldn't do it through Solr configuration alone.
Form too large error in SOLR4.0
Hi, recently we upgraded from Solr 1.4 to 4.0. After upgrading we are experiencing unusual behavior in Solr 4.0: the same query works properly in Solr 1.4 but throws "SEVERE: null:java.lang.IllegalStateException: Form too large 161138720" in Solr 4.0. I have increased the maxFormContentSize value in jetty.xml:

<Call name="setAttribute">
  <Arg>org.eclipse.jetty.server.Request.maxFormContentSize</Arg>
  <Arg>500000</Arg>
</Call>

But I am still facing the same issue. Can someone please help me resolve it?

Full stack trace:

Oct 10, 2012 3:20:43 AM org.apache.solr.common.SolrException log
SEVERE: null:java.lang.IllegalStateException: Form too large 161138720
  at org.eclipse.jetty.server.Request.extractParameters(Request.java:279)
  at org.eclipse.jetty.server.Request.getParameterMap(Request.java:705)
  at org.apache.solr.request.ServletSolrParams.init(ServletSolrParams.java:29)
  at org.apache.solr.servlet.StandardRequestParser.parseParamsAndFillStreams(SolrRequestParsers.java:394)
  at org.apache.solr.servlet.SolrRequestParsers.parse(SolrRequestParsers.java:115)
  at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:240)
  at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1337)
  at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:484)
  at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:119)
  at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:524)
  at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:233)
  at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1065)
  at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:413)
  at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:192)
  at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:999)
  at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:117)
  at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:250)
  at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:149)
  at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:111)
  at org.eclipse.jetty.server.Server.handle(Server.java:351)
  at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:454)
  at org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:47)
  at org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:900)
  at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:954)
  at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:857)
  at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
  at org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:66)
  at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:254)
  at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:599)
  at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:534)
  at java.lang.Thread.run(Thread.java:662)
Oct 10, 2012 3:20:43 AM org.apache.solr.common.SolrException log
SEVERE: org.apache.solr.common.SolrException: Server at http://localhost:8983/solr/core0 returned non ok status:500, message:Server Error
  at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:373)
  at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:182)
  at org.apache.solr.handler.component.HttpShardHandler$1.call(Htt

Thanks, Ravi
Using additional dictionary with DirectSolrSpellChecker
Is there some way to supplement the DirectSolrSpellChecker with a dictionary? (In some cases terms are not used because of the threshold, but should be offered as spellcheck suggestions.)
Solr - Make Exact Search on Field with Fuzzy Query
We are using Solr 3.6 and have a field named Description. We want search both with stemming and without stemming (exact word/phrase search), with highlighting in both cases. After a lot of research we came to the conclusion to use a copy field whose data type does not include a stemming factory, and it is working fine now (the main field is stemmed, the copy field is not). The data in that field is very large and we have millions of documents; since we want both searching and highlighting on them, we need to keep the copy field both stored and indexed, which increases the index size a lot. We need to eliminate this duplication if at all possible. From recent research, we read that combining fuzzy search with dismax might fulfill our requirement (we have tried a bit but without success). Please let me know if this is possible, or of any other solution to make this happen. Thanks in advance.
Re: Installing Solr on a shared hosting server?
Some time back I used DreamHost for a Solr-based project. It looks as though all their offerings, including shared hosting, have Java support - see http://wiki.dreamhost.com/What_We_Support. I was very happy with their service and support. -Simon

On Tue, Oct 9, 2012 at 10:44 AM, Michael Della Bitta michael.della.bi...@appinions.com wrote: Bluehost doesn't seem to support Java processes, so unfortunately the answer seems to be no. You might want to look into getting a Linode or some other similar VPS hosting. Solr needs RAM to function well, though, so you're not going to be able to go with the cheapest option. Michael Della Bitta, Appinions, 18 East 41st Street, 2nd Floor, New York, NY 10017-6271, www.appinions.com - Where Influence Isn't a Game

On Tue, Oct 9, 2012 at 9:27 AM, caiod ca...@me.com wrote: I was wondering if I can install Solr on Bluehost's shared hosting to use as a website search, and also how do I do so? Thank you...
RE: Wild card searching - well sort of
Hi - The WordDelimiterFilter can help you get *-BAAN-* for A100-BAAN-C20, but only because BAAN is surrounded by characters the filter splits and combines on.

-----Original message----- From: Kissue Kissue kissue...@gmail.com Sent: Wed 10-Oct-2012 14:20 To: solr-user@lucene.apache.org Subject: Wild card searching - well sort of

Hi, I am wondering if there is a way I can get Solr to do this: I have added the string *-BAAN-* to the index in a field called "pattern", which is a string type. Now I want to be able to search for A100-BAAN-C20 or ZA20-BAAN-300 and have Solr return *-BAAN-*. Any ideas how I can accomplish something like this? I am currently using Solr 3.5 with SolrJ. Thanks.
Re: Form too large error in SOLR4.0
Hi, check the Jetty configs, this looks like an error from the container. Otis -- Performance Monitoring - http://sematext.com/spm

On Oct 10, 2012 4:50 AM, ravicv ravichandra...@gmail.com wrote: Hi, recently we upgraded from Solr 1.4 to 4.0. The same query works properly in Solr 1.4 but throws "SEVERE: null:java.lang.IllegalStateException: Form too large 161138720" in Solr 4.0. I have increased the maxFormContentSize value in jetty.xml but am still facing the same issue. Can someone please help me resolve it? [...]
Re: Form too large error in SOLR4.0
1611387 is 1,611,387, which is clearly greater than your revised limit of 500000 (500,000). Try setting the limit to 2000000 (2,000,000), or maybe even 5000000 (5,000,000). -- Jack Krupansky

-----Original Message----- From: ravicv Sent: Wednesday, October 10, 2012 4:49 AM To: solr-user@lucene.apache.org Subject: Form too large error in SOLR4.0

Hi, recently we upgraded from Solr 1.4 to 4.0. The same query works properly in Solr 1.4 but throws "SEVERE: null:java.lang.IllegalStateException: Form too large 161138720" in Solr 4.0. I have increased the maxFormContentSize value in jetty.xml but am still facing the same issue. Can someone please help me resolve it? [...]
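For reference, a well-formed version of that jetty.xml setting with the larger limit might look like the following sketch (placement inside the Server Configure block follows the stock Jetty example; adjust the value to your payload size):

```xml
<Configure id="Server" class="org.eclipse.jetty.server.Server">
  <!-- raise the maximum accepted POST form size to 2,000,000 bytes -->
  <Call name="setAttribute">
    <Arg>org.eclipse.jetty.server.Request.maxFormContentSize</Arg>
    <Arg>2000000</Arg>
  </Call>
</Configure>
```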
Re: Questions about query times
OK, so I solved the question about the query that returns no results and still takes time - I needed to add the facet.mincount=1 parameter, and this reduced the time to 200-300 ms instead of seconds. I still couldn't figure out why a query that returns very few results (like query number 2) still takes seconds to return, even with the facet.mincount=1 parameter. I couldn't understand why the facet pivot takes so much time on 299 docs. Does anyone have any idea?

Example Query: (2)
q=*:*&fq=(trimTime:[2012-09-04T15:23:48Z TO *])&fq=(Severity:(High Critical))&fq=(trimTime:[2012-09-04T15:23:48Z TO *])&fq=(Confidence_Level:(N/A)) OR (Confidence_Level:(Medium-High)) OR (Confidence_Level:(High))&f.product.facet.sort=index&f.product.facet.limit=-1&f.Severity.facet.sort=index&f.Severity.facet.limit=-1&f.trimTime.facet.sort=index&f.trimTime.facet.limit=-1&facet=true&f.product.facet.method=enum&facet.pivot=product,Severity,trimTime
NumFound: 299
Times(ms): Qtime: 2,756 Query: 307 Facet: 2,449

On Thu, Sep 20, 2012 at 5:24 PM, Yuval Dotan yuvaldo...@gmail.com wrote: Hi, we have a system that inserts logs continuously (real-time). We have been using the Solr facet pivot feature for querying and have been experiencing slow query times, and we were hoping to gain some insights with your help. Schema and solrconfig are attached. Here are our questions (data below):
1. Why is facet time so long in (3) and (5) - in cases where there are 0 or very few results?
2. We ran two queries that differ only in the time limit (for the second query the time range is very small) - we got the same time for both queries although the second one returned very few results - again, why is that?
3. Is there a way to improve pivot facet time?

System data:
Index size: 63 GB
RAM: 4 GB
CPU: 2 x Xeon E5410 2.33GHz
Num of documents: 109,278,476

Query examples:

(1) Query:
q=*:*&fq=(trimTime:[2012-09-04T14:29:24Z TO *])&fq=(trimTime:[2012-09-04T14:29:24Z TO *])&f.product.facet.sort=index&f.product.facet.limit=-1&f.Severity.facet.sort=index&f.Severity.facet.limit=-1&f.trimTime.facet.sort=index&f.trimTime.facet.limit=-1&facet=true&f.product.facet.method=enum&facet.pivot=product,Severity,trimTime
NumFound: 11,407,889
Times (ms): Qtime: 3,239 Query: 353 Facet: 2,885

(2) Query:
q=*:*&fq=(trimTime:[2012-09-04T15:23:48Z TO *])&fq=(Severity:(High Critical))&fq=(trimTime:[2012-09-04T15:23:48Z TO *])&fq=(Confidence_Level:(N/A)) OR (Confidence_Level:(Medium-High)) OR (Confidence_Level:(High))&f.product.facet.sort=index&f.product.facet.limit=-1&f.Severity.facet.sort=index&f.Severity.facet.limit=-1&f.trimTime.facet.sort=index&f.trimTime.facet.limit=-1&facet=true&f.product.facet.method=enum&facet.pivot=product,Severity,trimTime
NumFound: 299
Times(ms): Qtime: 2,756 Query: 307 Facet: 2,449

(3) Query:
q=*:*&fq=(trimTime:[2012-09-11T12:55:00Z TO *])&fq=(Severity:(High Critical))&fq=(trimTime:[2012-09-04T15:23:48Z TO *])&fq=(Confidence_Level:(N/A)) OR (Confidence_Level:(Medium-High)) OR (Confidence_Level:(High))&f.product.facet.sort=index&f.product.facet.limit=-1&f.Severity.facet.sort=index&f.Severity.facet.limit=-1&f.trimTime.facet.sort=index&f.trimTime.facet.limit=-1&facet=true&f.product.facet.method=enum&facet.pivot=product,Severity,trimTime
NumFound: 7
Times(ms): Qtime: 2,798 Query: 312 Facet: 2,485

(4) Query:
q=*:*&fq=(trimTime:[2012-09-04T15:43:16Z TO *])&fq=(trimTime:[2012-09-04T15:43:16Z TO *])&fq=(product:(Application Control)) OR (product:(URL Filtering))&f.appi_name.facet.sort=index&f.appi_name.facet.limit=-1&f.app_risk.facet.sort=index&f.app_risk.facet.limit=-1&f.matched_category.facet.sort=index&f.matched_category.facet.limit=-1&f.trimTime.facet.sort=index&f.trimTime.facet.limit=-1&facet=true&f.appi_name.facet.method=enum&facet.pivot=appi_name,app_risk,matched_category,trimTime
NumFound: more than 30M
Times(ms): Qtime: 23,288

(5) Query:
q=*:*&fq=(trimTime:[2012-09-05T06:03:55Z TO *])&fq=(Severity:(High Critical))&fq=(trimTime:[2012-09-05T06:03:55Z TO *])&fq=(product:(IPS)) OR (product:(SmartDefense))&fq=(action:(Detect)) OR
Re: Wild card searching - well sort of
1. What is your specific motivation for wanting to do this? (Sounds like yet another XY problem!) 2. What specific rules are you expecting to use for synthesis of patterns from the raw data? For the latter, do you expect to index hand-coded specific patterns to be returned or do you have some sort of machine learning method in mind that will generate the patterns by examining all of the values? -- Jack Krupansky -Original Message- From: Kissue Kissue Sent: Wednesday, October 10, 2012 8:15 AM To: solr-user@lucene.apache.org Subject: Wild card searching - well sort of Hi, I am wondering if there is a way i can get Solr to do this: I have added the string: *-BAAN-* to the index to a field called pattern which is a string type. Now i want to be able to search for A100-BAAN-C20 or ZA20-BAAN-300 and have Solr return *-BAAN-*. Any ideas how i can accomplish something like this? I am currently using Solr 3.5 with solrJ. Thanks.
Re: Solr - Make Exact Search on Field with Fuzzy Query
There's nothing really built into Solr to allow this. Are you absolutely sure you can't just use the copyField? Have you actually tried it?

But I don't think you need to store the contents twice. Just store it once and always highlight on that field, whether you search it or not. Since it's the raw text, you should be fine. You'll have two tokenized versions of the field of course, but that should take less space than you might think. You probably want to store the version with the stemming turned on...

That said, storing twice only uses up some disk space; it doesn't require additional memory for searching. So unless you're running out of disk space, you can just keep two stored versions around.

But if none of that works, you might write a custom filter that emits two tokens for each input token at indexing time, similar to what synonyms do. The original should have some special character appended, say $, and the second should be the result of stemming (note: there will be two tokens even if there is no stemming done). So indexing "running" would index "running$" and "run". Now, when you need to search for an exact match on "running", you search for "running$". This works in reverse too: since the rule is "append $ to all original tokens", "run" gets indexed as "run$" and "run". Now, searching for "run" matches, as does "run$". But "run$" does not match the doc that had "running", since the two tokens emitted in that case are "run" and "running$". (A sketch of such a filter follows below.)

But look at what's happened here. You're indexing two tokens for every one token in the input. Furthermore, you're adding a bunch of unique tokens to the index. It's hard to see how this results in any savings over just using copyField. You have to index the two tokens, since you have to distinguish between the stemmed and un-stemmed versions.

You might be able to do something really exotic with payloads. This is _really_ out of left field, but it just occurred to me. You'd have to define a transformation from the original word into the stemmed word that created a unique value. Something like:
no stemming -> 0
removing "ing" -> 1
removing "s" -> 2
etc. Actually, this would have to be some kind of function on the letters removed, so that removing "ing" mapped to, say, the ordinal of each removed letter weighted by 100^position. So "ing" would map to ('i' - 'a') + ('n' - 'a') * 100 + ('g' - 'a') * 10000, etc... (you'd have to take considerable care to get this right for any code sets that had more than 100 possible code points)... Now you've included the information about what the original word was, and could use the payload to fail to match in the exact-match case. Of course, the other issue would be figuring out the syntax to get the fact that you wanted an exact match down into your custom scorer.

But as you can see, any scheme is harder than just flipping a switch, so I'd _really_ verify that you can't just use copyField.

Best,
Erick

On Wed, Oct 10, 2012 at 7:38 AM, meghana meghana.rav...@amultek.com wrote: We are using Solr 3.6 and have a field named Description. We want search both with stemming and without stemming (exact word/phrase search), with highlighting in both cases. [...]
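A rough sketch of the two-token filter Erick describes (the class name, the $ marker, and the Snowball stemmer choice are all assumptions; this is illustrative, not a drop-in Solr factory):

```java
import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.util.AttributeSource;
import org.tartarus.snowball.ext.EnglishStemmer;

/** Emits "&lt;original&gt;$" (exact-match marker) plus the stemmed form at the same position. */
public final class ExactPlusStemFilter extends TokenFilter {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final PositionIncrementAttribute posIncAtt =
      addAttribute(PositionIncrementAttribute.class);
  private final EnglishStemmer stemmer = new EnglishStemmer();
  private AttributeSource.State pending; // saved state for the stemmed twin
  private String stemmed;

  public ExactPlusStemFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (pending != null) {
      // second call for this word: emit the stemmed form, stacked at the same position
      restoreState(pending);
      pending = null;
      termAtt.setEmpty().append(stemmed);
      posIncAtt.setPositionIncrement(0);
      return true;
    }
    if (!input.incrementToken()) {
      return false;
    }
    String original = termAtt.toString();
    stemmer.setCurrent(original);
    stemmer.stem();
    stemmed = stemmer.getCurrent();
    pending = captureState();
    // first call: emit the exact token with the '$' marker appended
    termAtt.setEmpty().append(original).append('$');
    return true;
  }

  @Override
  public void reset() throws IOException {
    super.reset();
    pending = null;
  }
}
```

Exact-match queries would then search for the marked form, e.g. running$, while stemmed searches use the plain term.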
Re: segment number during optimize of index
Guys, thanks for all the inputs. I was continuing my research to learn more about segments in Lucene. Below are my conclusions; please correct me if I'm wrong.
1. Segments are independent sub-indexes in separate files. While indexing, it's better to create a new segment because that avoids modifying an existing file; whereas while searching, the fewer segments the better, since with x segments you have to open a number of physical files roughly proportional to x.
2. Since Lucene uses memory mapping, for each file/segment in the index a new mmap is created and mapped to the physical file on disk. Can someone explain or correct this in detail? I'm sure a lot of people wonder how mmap behaves while you merge or optimize index segments.

On 6 October 2012 07:41, Otis Gospodnetic otis.gospodne...@gmail.com wrote: If I were you, and not knowing all your details... I would optimize indices that are static (not being modified) and would optimize down to 1 segment. I would do it when search traffic is low. Otis -- Search Analytics - http://sematext.com/search-analytics/index.html Performance Monitoring - http://sematext.com/spm/index.html

On Fri, Oct 5, 2012 at 4:27 PM, jame vaalet jamevaa...@gmail.com wrote: Hi Erick, I am in a major dilemma with my index now. I have 8 cores, each around 300 GB in size, and half of the documents in them are deleted; on top of that, each core has around 100 segments. Do I issue an expungeDeletes and let the merge policy take care of the segments, or optimize them into a single segment? Search performance is not on par with the usual Solr speed. If I have to optimize, what segment number should I choose? My RAM is around 120 GB and the JVM heap is around 45 GB (oldGen being 30 GB). Please advise! Thanks.

On 6 October 2012 00:00, Erick Erickson erickerick...@gmail.com wrote: Because eventually you'd run out of file handles. Imagine a long-running server with 100,000 segments - totally unmanageable. I think Shawn was emphasizing that RAM requirements don't depend on the number of segments; there are other resources that files consume, however. Best, Erick

On Fri, Oct 5, 2012 at 1:08 PM, jame vaalet jamevaa...@gmail.com wrote: Hi Shawn, thanks for the detailed explanation. I have one doubt: you said it doesn't matter how many segments an index has, but then why does Solr have this merge policy which merges segments frequently? Why can't it leave the segments as they are, rather than merging smaller ones into bigger ones? Thanks.

On 5 October 2012 05:46, Shawn Heisey s...@elyograg.org wrote: On 10/4/2012 3:22 PM, jame vaalet wrote: "So imagine I have merged the 150 GB index into a single segment. This would make a single segment of 150 GB in memory. When new docs are indexed it wouldn't alter this 150 GB index unless I update or delete the older docs, right? Will a 150 GB single segment have problems with memory swapping at the OS level?" Supplement to my previous reply: the "real memory" mentioned in the last paragraph does not include the memory that the OS uses to cache disk access. If more memory is needed and all the free memory is being used by the disk cache, the OS will throw away part of the disk cache (a near-instantaneous operation that should never involve disk I/O) and give that memory to the application that requests it. Here's a very good breakdown of how memory gets used with MMapDirectory in Solr. It's applicable to any program that uses memory mapping, not just Solr: http://java.dzone.com/articles/use-lucene%E2%80%99s-mmapdirectory Thanks, Shawn

--
-JAME
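For reference, optimizing down to a specific segment count (rather than 1) can be requested with the maxSegments parameter, and deletes can be purged without a full optimize via expungeDeletes. Hedged examples (host and core name are placeholders):

```
curl 'http://localhost:8983/solr/core0/update?optimize=true&maxSegments=8'
curl 'http://localhost:8983/solr/core0/update?commit=true&expungeDeletes=true'
```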
Re: Using additional dictionary with DirectSolrSpellChecker
I don't want to tweak the thresholds; for the majority of cases they work fine. It's for cases where a term has a low frequency but is spelled correctly. If you lowered the threshold you would also get incorrectly spelled terms as suggestions.

Robert Muir wrote: These thresholds are adjustable: read the javadocs and tweak them. On Wed, Oct 10, 2012 at 5:59 AM, O. Klein klein@... wrote: Is there some way to supplement the DirectSolrSpellChecker with a dictionary? (In some cases terms are not used because of the threshold, but should be offered as spellcheck suggestions.)
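One workaround sometimes used is to configure a second, file-based dictionary alongside the direct one and select it per request with spellcheck.dictionary. A sketch (the component layout follows the example solrconfig.xml; the field name and spellings.txt contents are assumptions):

```xml
<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <!-- index-based checker, subject to the frequency thresholds -->
  <lst name="spellchecker">
    <str name="name">direct</str>
    <str name="classname">solr.DirectSolrSpellChecker</str>
    <str name="field">text</str>
  </lst>
  <!-- supplemental dictionary of known-good terms, one per line in spellings.txt -->
  <lst name="spellchecker">
    <str name="name">file</str>
    <str name="classname">solr.FileBasedSpellChecker</str>
    <str name="sourceLocation">spellings.txt</str>
    <str name="characterEncoding">UTF-8</str>
    <str name="spellcheckIndexDir">./spellcheckerFile</str>
  </lst>
</searchComponent>
```

A request can then choose the dictionary with spellcheck.dictionary=file (or direct).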
Re: Wild card searching - well sort of
On Wed, 2012-10-10 at 14:15 +0200, Kissue Kissue wrote: "I have added the string *-BAAN-* to the index in a field called pattern, which is a string type. Now I want to be able to search for A100-BAAN-C20 or ZA20-BAAN-300 and have Solr return *-BAAN-*." That sounds a lot like the problem presented in the thread "Indexing wildcard patterns": http://web.archiveorange.com/archive/v/AAfXfcuIJY9BQJL3mjty The short answer is no, Solr does not support this in the general form. But maybe you can make it work anyway. In your example, the two queries A100-BAAN-C20 and ZA20-BAAN-300 share the form [4 random characters]-[4 significant characters]-[3 random characters], so a little bit of pre-processing would rewrite that to *-[4 significant characters]-*, which would match *-BAAN-*. If you describe the patterns and common elements of your indexed terms and your queries, we might come up with something.
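A sketch of that pre-processing step in Java (the regex encodes the assumed [chars]-[token]-[chars] shape and would need adjusting to the real data):

```java
// rewrite "A100-BAAN-C20" -> "*-BAAN-*" so it matches the indexed pattern term
String query = "A100-BAAN-C20";
String rewritten = query.replaceAll("^[A-Za-z0-9]+-([A-Za-z0-9]+)-[A-Za-z0-9]+$", "*-$1-*");
// then search the string field for the literal value, e.g. pattern:"*-BAAN-*"
// (quoted, so the asterisks are treated as characters, not wildcards)
```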
Re: Synonym Filter: Removing all original tokens, retain matched synonyms
The synonym filter does set the type attribute to TYPE_SYNONYM for synonyms, so you could write your own token filter that keeps only tokens with that type. Try the Solr Admin analysis page to see how various terms are analyzed by the synonym filter. It will show TYPE_SYNONYM. -- Jack Krupansky -Original Message- From: Daniel Rosher Sent: Wednesday, October 10, 2012 8:34 AM To: solr-user@lucene.apache.org Subject: Synonym Filter: Removing all original tokens, retain matched synonyms Hi, Is there a way to do this? Token_Input: the fox jumped over the lazy dog Synonym_Map: fox = vulpes dog = canine Token_Output: vulpes canine So remove all tokens, but retain those matched against the synonym map Cheers, Dan
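A minimal sketch of such a filter (the class name is made up; SynonymFilter marks injected tokens with the type string "SYNONYM", which is what SynonymFilter.TYPE_SYNONYM holds):

```java
import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;

/** Keeps only tokens injected by the SynonymFilter, dropping the originals. */
public final class KeepSynonymsOnlyFilter extends TokenFilter {
  private final TypeAttribute typeAtt = addAttribute(TypeAttribute.class);

  public KeepSynonymsOnlyFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    while (input.incrementToken()) {
      if ("SYNONYM".equals(typeAtt.type())) { // SynonymFilter.TYPE_SYNONYM
        return true;
      }
    }
    return false;
  }
}
```

Position increments of the dropped tokens are ignored here; a production version would accumulate them the way FilteringTokenFilter does.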
Re: SolrJ 4.0 Beta maxConnectionsPerHost
On Wed, Oct 10, 2012 at 12:02 AM, Briggs Thompson w.briggs.thomp...@gmail.com wrote: *Sami* The client IS instantiated only once and not for every request. I was curious if this was part of the problem. Do I need to re-instantiate the object for each request made? No, it is expensive if you instantiate the client every time. When the client seems to be hanging, can you still access the Solr instance normally and execute updates/searches from other clients? -- Sami Siren
Re: Synonym Filter: Removing all original tokens, retain matched synonyms
"Token_Input: the fox jumped over the lazy dog; Synonym_Map: fox = vulpes, dog = canine; Token_Output: vulpes canine - so remove all tokens, but retain those matched against the synonym map." Maybe you can make use of http://lucene.apache.org/solr/api-4_0_0-ALPHA/org/apache/solr/analysis/KeepWordFilterFactory.html. You need to copy the entries (vulpes, canine) from synonyms.txt into the keepwords.txt file.
Re: Synonym Filter: Removing all original tokens, retain matched synonyms
Ah ha .. good thinking ... thanks! Dan On Wed, Oct 10, 2012 at 2:39 PM, Ahmet Arslan iori...@yahoo.com wrote: Token_Input: the fox jumped over the lazy dog Synonym_Map: fox = vulpes dog = canine Token_Output: vulpes canine So remove all tokens, but retain those matched against the synonym map May be you can make use of http://lucene.apache.org/solr/api-4_0_0-ALPHA/org/apache/solr/analysis/KeepWordFilterFactory.html . You need to copy entries (vulpes, canine) from synonym.txt into keepwords.txt file.
Re: SolrJ 4.0 Beta maxConnectionsPerHost
There are other updates that happen on the server that do not fail, so the answer to your question is yes. On Wed, Oct 10, 2012 at 8:12 AM, Sami Siren ssi...@gmail.com wrote: On Wed, Oct 10, 2012 at 12:02 AM, Briggs Thompson w.briggs.thomp...@gmail.com wrote: *Sami* The client IS instantiated only once and not for every request. I was curious if this was part of the problem. Do I need to re-instantiate the object for each request made? No, it is expensive if you instantiate the client every time. When the client seems to be hanging, can you still access the Solr instance normally and execute updates/searches from other clients? -- Sami Siren
Re: SolrJ 4.0 Beta maxConnectionsPerHost
On Wed, Oct 10, 2012 at 5:36 PM, Briggs Thompson w.briggs.thomp...@gmail.com wrote: There are other updates that happen on the server that do not fail, so the answer to your question is yes. The other updates are using solrj or something else? It would be helpful if you could prepare a simple java program that uses solrj to demonstrate the problem. Based on the available information it is really difficult try to guess what's happening. -- Sami Siren
Re: SolrJ 4.0 Beta maxConnectionsPerHost
They are both SolrJ. What is happening is that I have a batch indexer application that does a full re-index once per day. I also have an incremental indexer that takes items off a queue when they are updated. The problem only happens when both are running at the same time - they also run from the same machine. I am going to dig into this today and see what I find - I didn't get around to it yesterday. Question: I don't seem to see a StreamingUpdateSolrServer class in the 4.0 beta, but I did see ConcurrentUpdateSolrServer - this seems like a similar choice. Is this correct? On Wed, Oct 10, 2012 at 9:43 AM, Sami Siren ssi...@gmail.com wrote: On Wed, Oct 10, 2012 at 5:36 PM, Briggs Thompson w.briggs.thomp...@gmail.com wrote: There are other updates that happen on the server that do not fail, so the answer to your question is yes. The other updates are using SolrJ or something else? It would be helpful if you could prepare a simple Java program that uses SolrJ to demonstrate the problem. Based on the available information it is really difficult to guess what's happening. -- Sami Siren
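Yes - ConcurrentUpdateSolrServer is the 4.x replacement for StreamingUpdateSolrServer. A minimal sketch of instantiating it (URL, queue size and thread count are placeholder values):

```java
import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer;

// buffers up to 10000 documents and drains them with 4 background threads
ConcurrentUpdateSolrServer server =
    new ConcurrentUpdateSolrServer("http://localhost:8983/solr/core0", 10000, 4);
```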
Unique terms without faceting
Hi, I know that you can use a facet query to get the unique terms for a field, taking into account any q or fq parameters, but for our use case the counts are not needed. So is there a more efficient way of finding just the unique terms for a field? Phil
Re: Unique terms without faceting
The Solr TermsComponent: http://wiki.apache.org/solr/TermsComponent -- Jack Krupansky -Original Message- From: Phil Hoy Sent: Wednesday, October 10, 2012 11:45 AM To: solr-user@lucene.apache.org Subject: Unique terms without faceting Hi, I know that you can use a facet query to get the unique terms for a field taking account of any q or fq parameters but for our use case the counts are not needed. So is there a more efficient way of finding just unique terms for a field? Phil
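For example, a request against the example configs' /terms handler might look like this (handler path and field name are assumptions from the default setup):

```
http://localhost:8983/solr/terms?terms=true&terms.fl=myfield&terms.limit=-1
```

Note that TermsComponent reads raw index terms, so it does not take q or fq parameters into account.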
Re: Faceted search question (Tokenizing)
Here is another, simpler example of what I am trying to achieve:
Multi-Valued Field 1: Data 1, Data 2, Data 3, Data 4
Multi-Valued Field 2: Data 11, Data 12, Data 13, Data 14
Multi-Valued Field 3: Data 21, Data 22, Data 23, Data 24
How can I specify that Data 1, Data 11 and Data 21 are all related? And if I facet on Data 1 + Data 11, I only want to see Data 21.
Problem with delete by query in Solr 4.0 beta
I cannot seem to get delete-by-query working in my simple setup in Solr 4.0 beta. I have a single collection and I want to delete old documents from it. There is a single Solr node in the config (no replication, not distributed). This is something I previously did in Solr 3.x. My collection is called dine, so I do:

curl "http://localhost:8080/solr/dine/update" -s -H 'Content-type:text/xml; charset=utf-8' -d '<delete><query>timestamp_dt:[2012-09-01T00:00:00Z TO 2012-09-27T00:00:00Z]</query></delete>'

and then a commit. The problem is that the documents are not deleted; when I run the same query in the admin page, it still returns documents. I walked through the code and found the code in DistributedUpdateProcessor::doDeleteByQuery to be suspicious. Specifically, vinfo is not null, but I have no version field, so versionsStored is false. So it gets to line 786, which looks like:

if (versionsStored) {

That then skips to line 813 (the finally clause), skipping all calls to doLocalDelete. Now, I do confess I don't understand exactly how this code should work; however, in the add code, the check for versionsStored does not skip the call to doLocalAdd. Any suggestions would be welcome. Andrew
Faceted search question (Tokenizing)
Hey there, we have the following data structure:
- Person
-- Interest 1
--- Subinterest 1
--- Subinterest 1 Description
--- Subinterest 1 ID
-- Interest 2
--- Subinterest 2
--- Subinterest 2 Description
--- Subinterest 2 ID
.
-- Interest 99
--- Subinterest 99
--- Subinterest 99 Description
--- Subinterest 99 ID
Interest, Subinterest, Subinterest Description and Subinterest ID are all multivalued fields. A person can have any number of subinterests, descriptions and IDs. How could we facet/search based on this data structure? Right now we have tokenized everything into a separate multivalued column in the following fashion:
|Interest='Interest 1',Subinterest='Subinterest 1',Subinterest='Another Subinterest 1',Description='Interest 1 Description',ID='Interest 1 ID'|
|Interest='Interest 2',Description='Interest 2 Description',ID='Interest 2 ID'|
I have a feeling this is the wrong approach to this problem.
RE: Unique terms without faceting
Hi, I don't think you can use that component whilst taking into account any fq or q parameters. Phil

-----Original Message----- From: Jack Krupansky [mailto:j...@basetechnology.com] Sent: 10 October 2012 16:51 To: solr-user@lucene.apache.org Subject: Re: Unique terms without faceting

The Solr TermsComponent: http://wiki.apache.org/solr/TermsComponent -- Jack Krupansky [...]
PointType doc reindex issue
Hello, I have a weird problem: whenever I read a doc from Solr and then index the same doc that already exists in the index (i.e. reindexing), I get the following error. Can somebody tell me what I am doing wrong? I use Solr 3.6 and the field definitions are given below.

<fieldType name="latlong" class="solr.LatLonType" subFieldSuffix="_coordinate"/>
<dynamicField name="*_coordinate" type="tdouble" indexed="true" stored="true"/>

Exception in thread main org.apache.solr.client.solrj.SolrServerException: Server at http://testsolr:8080/solr/mycore returned non ok status:400, message:ERROR: [doc=1182684] multiple values encountered for non multiValued field geolocation_0_coordinate: [39.017608, 39.017608]
  at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:328)
  at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:211)
  at com.wpost.search.indexing.MyTest.main(MyTest.java:31)

The data in the index looks as follows:

<str name="geolocation">39.017608,-77.375239</str>
<arr name="geolocation_0_coordinate">
  <double>39.017608</double>
  <double>39.017608</double>
</arr>
<arr name="geolocation_1_coordinate">
  <double>-77.375239</double>
  <double>-77.375239</double>
</arr>

Thanks, Ravi Kiran Bhaskar
Re: PointType doc reindex issue
You need to remove the field after reading the Solr doc. When you add a new field it is appended to a list, so when you try to commit the updated doc the field becomes multi-valued, while in your schema it is single-valued. On Oct 10, 2012 9:26 AM, Ravi Solr ravis...@gmail.com wrote: Hello, I have a weird problem: whenever I read a doc from Solr and then index the same doc that already exists in the index (i.e. reindexing), I get the following error. [...]
Memory Cost of group.cache.percent parameter
Does anyone have a clear understanding of how group.caching achieves its performance improvements, memory-wise? "Percent" means percent of maxDoc, so it's a function of that, but is it a function of that *per* item in the cache (like filterCache) or altogether? The speed improvement looks pretty dramatic for our maxDoc=25M index, but it would be helpful to understand what the costs are. Mike
Filter results based on custom scoring and _val_
I'm using Solr function queries to generate my own custom score. I achieve this using something along these lines:

q=_val_:my_custom_function()

This populates the score field as expected, but it also includes documents that score 0. I need a way to filter the results so that scores below zero are not included. I realize that I'm using score in a non-standard way and that normally the score that lucene/solr produces is not absolute; however, producing my own score works really well for my needs. I've tried using {!frange l=0}, but this causes the score for all documents to be 1.0. I've found that I can do the following:

q=*:*&fl=foo:my_custom_function()&fq={!frange l=1}my_custom_function()

This puts my custom score into foo, but it requires me to list all the logic twice, and sometimes my logic is very long.
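One way to avoid repeating the function is Solr's local-param dereferencing: define the function once as a request parameter and reference it via $param in both places. A hedged sketch (my_custom_function() stands in for the real logic):

```
q={!func v=$myfn}&fq={!frange l=0 incl=false v=$myfn}&myfn=my_custom_function()
```

Here {!func} keeps the function's value as the relevance score, while the frange filter (with incl=false making the lower bound exclusive) drops documents whose value is 0 or below.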
Re: PointType doc reindex issue
Gopal, I did in fact test that, and it worked when I deleted the geolocation_0_coordinate and geolocation_1_coordinate fields. But that seems weird, so I was wondering if there is something else I need to do to avoid this awkward workaround. Ravi Kiran Bhaskar. On Wed, Oct 10, 2012 at 12:36 PM, Gopal Patwa gopalpa...@gmail.com wrote: You need to remove the field after reading the Solr doc. When you add a new field it is appended to a list, so when you try to commit the updated doc the field becomes multi-valued, while in your schema it is single-valued. [...]
Re: PointType doc reindex issue
Instead of the addField method, use setField. On Oct 10, 2012 9:54 AM, Ravi Solr ravis...@gmail.com wrote: Gopal, I did in fact test that, and it worked when I deleted the geolocation_0_coordinate and geolocation_1_coordinate fields. But that seems weird, so I was wondering if there is something else I need to do to avoid this awkward workaround. [...]
Re: PointType doc reindex issue
I am using DirectXmlRequest to index XML. This is just a test case, as my client would be sending me Solr-compliant XML, so I was trying to simulate that by reading a doc from an existing core and reindexing it.

HttpSolrServer server = new HttpSolrServer("http://testsolr:8080/solr/mycore");
QueryResponse resp = server.query(new SolrQuery("contentid:(1184911 OR 1182684)"));
SolrDocumentList list = resp.getResults();
if (list != null && !list.isEmpty()) {
    for (SolrDocument doc : list) {
        SolrInputDocument iDoc = ClientUtils.toSolrInputDocument(doc);
        String contentid = (String) iDoc.getFieldValue("egcontentid");
        String name = (String) iDoc.getFieldValue("name");
        iDoc.setField("name", DigestUtils.md5Hex(name));
        String xml = ClientUtils.toXML(iDoc);
        DirectXmlRequest up = new DirectXmlRequest("/update", "<add>" + xml + "</add>");
        server.request(up);
        server.commit();
        System.out.println("Updated name in contentid - " + contentid);
    }
}

Ravi Kiran. On Wed, Oct 10, 2012 at 1:02 PM, Gopal Patwa gopalpa...@gmail.com wrote: Instead of the addField method, use setField. [...]
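If the round trip has to stay, the workaround Ravi describes could be folded in right after converting the doc; a sketch (assuming these are the only derived sub-fields in the schema):

```java
// LatLonType regenerates the *_coordinate sub-fields from the stored
// "geolocation" value at index time, so strip the stored copies before re-adding
iDoc.removeField("geolocation_0_coordinate");
iDoc.removeField("geolocation_1_coordinate");
```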
Re: Problem with delete by query in Solr 4.0 beta
Do you have a _version_ field in your schema? I believe Solr 4.0 Beta requires that field. He is probably hitting this: https://issues.apache.org/jira/browse/SOLR-3432
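For reference, a minimal sketch of the _version_ field definition as it ships in the Solr 4.0 example schema.xml (copy it verbatim if your schema lacks it):

<!-- required by the update log; also used for optimistic concurrency -->
<field name="_version_" type="long" indexed="true" stored="true"/>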
Creating a new Collection through API
Hi, what is the best way to create a new Collection through the API so that the created core gets its own conf folder with schema.xml and solrconfig.xml? When I just create a Collection, only the data folder is created, and the conf folder with schema.xml and solrconfig.xml is taken from another Collection. Even when I add the conf folder later, I have to reload the core on every server to pick up the changes :( Do I have to create a default core somewhere, copy it inside my solr folder, rename it and then add this as a Collection, or is there a better way to do this? Thanks, Markus
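Not an official recipe, but one common approach in Solr 4.x SolrCloud is to upload a named config set to ZooKeeper first and then point the Collections API at it. A minimal sketch, where the host names, paths and the config name "myconf" are all illustrative:

# 1) upload a config directory to ZooKeeper (zkcli.sh ships under example/cloud-scripts)
./zkcli.sh -zkhost localhost:2181 -cmd upconfig -confdir /path/to/conf -confname myconf

# 2) create the collection against that named config
curl "http://localhost:8983/solr/admin/collections?action=CREATE&name=mycollection&numShards=2&collection.configName=myconf"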
Re: anyone have any clues about this exception
Something timed out and the other end closed the connection; this end tried to write to the closed pipe and died, then something tried to catch that exception, write its own error, and died even worse? Just making it up really, but it sounds plausible (plus a 3-year Java tech-support hunch). If it happens often enough, see if you can run Wireshark on that machine's network interface and catch the whole network conversation in action. Often, there are enough clues in the tcp packets and/or the transmitted payloads. Wireshark is a power-tool, so it takes a little while the first time, but the learning will pay for itself over and over again. Regards, Alex. Personal blog: http://blog.outerthoughts.com/ LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch - Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book) On Wed, Oct 10, 2012 at 11:31 PM, Petersen, Robert rober...@buy.com wrote: Tomcat localhost log (not the catalina log) for my solr 3.6.1 (master) instance contains lots of these exceptions but solr itself seems to be doing fine... any ideas? I'm not seeing these exceptions being logged on my slave servers btw, just the master where we do our indexing only.

Oct 9, 2012 5:34:11 PM org.apache.catalina.core.StandardWrapperValve invoke
SEVERE: Servlet.service() for servlet default threw exception
java.lang.IllegalStateException
at org.apache.catalina.connector.ResponseFacade.sendError(ResponseFacade.java:407)
at org.apache.solr.servlet.SolrDispatchFilter.sendError(SolrDispatchFilter.java:389)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:291)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128)
at com.googlecode.psiprobe.Tomcat60AgentValve.invoke(Tomcat60AgentValve.java:30)
at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:293)
at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:849)
at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:583)
at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:454)
at java.lang.Thread.run(Unknown Source)
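If running the Wireshark GUI on the server is impractical, a plain tcpdump capture can be taken there and opened in Wireshark later. A minimal sketch, assuming the Solr master listens on port 8080 and the interface is eth0 (adjust both for your host):

# capture full-size packets on the Solr port into a file Wireshark can open
tcpdump -i eth0 -s 0 -w solr-master.pcap port 8080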
Re: SolrJ 4.0 Beta maxConnectionsPerHost
On 10/9/2012 3:02 PM, Briggs Thompson wrote: *Otis* - jstack is a great suggestion, thanks! The problem didn't happen this morning but next time it does I will certainly get the dump to see exactly where the app is swimming around. I haven't used StreamingUpdateSolrServer but I will see if that makes a difference. Are there any major drawbacks of going this route? One caveat -- when using the Streaming/Concurrent object, your application will not be notified when there is a problem indexing. I've been told there is a way to override a method in the object to allow trapping errors, but I have not seen sample code and haven't figured out how to do it. I've filed an issue and a patch to fix this. It's received some comments, but so far nobody has decided to commit it. https://issues.apache.org/jira/browse/SOLR-3284 Thanks, Shawn
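For what it's worth, a minimal sketch of the override Shawn describes, assuming the SolrJ 3.6-era StreamingUpdateSolrServer(url, queueSize, threadCount) constructor and its handleError(Throwable) hook; the failure counter is a hypothetical way to surface errors to the caller:

import java.util.concurrent.atomic.AtomicInteger;
import org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer;

final AtomicInteger failures = new AtomicInteger();
StreamingUpdateSolrServer server =
    new StreamingUpdateSolrServer("http://localhost:8080/solr/core", 100, 2) {
        @Override
        public void handleError(Throwable ex) {
            failures.incrementAndGet(); // record that a queued update failed
            super.handleError(ex);      // keep the default logging behavior
        }
    };
// ... server.add(...) / server.addBeans(...) calls ...
server.blockUntilFinished();
if (failures.get() > 0) {
    // hypothetical recovery: re-send the batch or alert an operator
}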
Re: PriorityQueue:initialize consistently showing up as hot spot while profiling
Hi Mikhail, On Fri, Oct 5, 2012 at 7:15 AM, Mikhail Khludnev mkhlud...@griddynamics.com wrote: okay. huge rows value is the no.1 way to kill Lucene. It's absolutely not possible. You need to rethink the logic of your component. Check Solr's FieldCollapsing code; IIRC it makes a second search to achieve a similar goal. Also check the PostFilter and DelegatingCollector classes; their approach can also be handy for your task. This sounds like it could be a much saner way to handle the task; however, I'm not sure what I should be looking at for the 'FieldCollapsing code' you mention - can you point me to a class? Also, is there anything more you can say about the PostFilter and DelegatingCollector classes? I reviewed them, but it was not obvious to me how they would let me reduce the large rows param we use to ensure all relevant docs are included in the grouping, so that limiting happens at the group level rather than pre-grouping... Thanks again, Aaron
Re: PointType doc reindex issue
: I have a weird problem, Whenever I read the doc from solr and : then index the same doc that already exists in the index (aka : reindexing) I get the following error. Can somebody tell me what I am : doing wrong. I use solr 3.6 and the definition of the field is given : below When you use the LatLonType field type, you get synthetic *_coordinate fields automatically constructed under the covers from each of your fields that use the latlong fieldType. Because you have configured the *_coordinate fields to be stored, they are included in the response when you request the doc. This means that unless you explicitly remove those synthetically constructed values before reindexing, they will still be there in addition to the new (possibly redundant) synthetic values created while indexing. This is why the *_coordinate dynamicField in the Solr example schema.xml is marked stored="false" -- so that this field doesn't come back in the response; it's not meant for end users.

: <fieldType name="latlong" class="solr.LatLonType" subFieldSuffix="_coordinate"/>
: <dynamicField name="*_coordinate" type="tdouble" indexed="true" stored="true"/>

-Hoss
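In other words, the client-side fix is to drop the synthetic subfields before resubmitting. A minimal sketch against the SolrJ code earlier in this thread (field names taken from that example; imports as in that snippet, nothing else assumed):

SolrInputDocument iDoc = ClientUtils.toSolrInputDocument(doc);
// remove the synthetic subfields; LatLonType regenerates them from "geolocation"
iDoc.removeField("geolocation_0_coordinate");
iDoc.removeField("geolocation_1_coordinate");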
RE: anyone have any clues about this exception
You could be right. Going back in the logs, I noticed it used to happen less frequently and always towards the end of an optimize operation. It is probably my indexer timing out while waiting for updates to occur during optimizes. The errors grew recently because I upped the indexer thread count to 22 threads, so there are a lot more timeouts occurring now. Also, our index has grown to double the old size, so the optimize operation has started taking a lot longer, also contributing to what I'm seeing. I have just changed my optimize frequency from three times a day to once a day after reading the following, where they are talking about completely deprecating the optimize command in the next version of Solr: https://issues.apache.org/jira/browse/SOLR-3141 -Original Message- From: Alexandre Rafalovitch [mailto:arafa...@gmail.com] Sent: Wednesday, October 10, 2012 11:10 AM To: solr-user@lucene.apache.org Subject: Re: anyone have any clues about this exception
Re: SolrJ 4.0 Beta maxConnectionsPerHost
Thanks for the heads up. I just tested this and you are right. I am making a call to addBeans and it succeeds without any issue even when the server is down. That sucks. A big part of this process relies on knowing exactly what has made it into the index and what has not, so this is a difficult problem to solve when you can't catch exceptions. I was thinking I could execute a ping request first to determine if the Solr server is still operational, but that doesn't help if the updateRequestHandler fails.
Re: Why is SolrDispatchFilter using 90% of the Time?
When I look at the distribution of the Response-time I notice 'SolrDispatchFilter.doFilter()' is taking up 90% of the time. That's pretty much the top-level entry point to Solr (from the servlet container), so it's normal. -Yonik http://lucidworks.com
RE: Faceted search question (Tokenizing)
What do you want the results to be, persons? And should the facets be interests or subinterests? Why are there two layers of interests anyway? Can there be many subinterests under one interest? Is one of those two a name of the interest which would look nice as a facet? Anyway, have you read these pages yet? They should get you started in the right direction. http://wiki.apache.org/solr/SolrFacetingOverview http://wiki.apache.org/solr/HierarchicalFaceting Hope that helps, Robi -Original Message- From: Grapes [mailto:mkloub...@gmail.com] Sent: Wednesday, October 10, 2012 8:52 AM To: solr-user@lucene.apache.org Subject: Faceted search question (Tokenizing) Hey There, We have the following data structure:

- Person
-- Interest 1
--- Subinterest 1
--- Subinterest 1 Description
--- Subinterest 1 ID
-- Interest 2
--- Subinterest 2
--- Subinterest 2 Description
--- Subinterest 2 ID
...
-- Interest 99
--- Subinterest 99
--- Subinterest 99 Description
--- Subinterest 99 ID

Interest, Subinterest, Subinterest Description and Subinterest ID are all multivalued fields. A person can have any number of subinterests, descriptions and IDs. How could we facet/search based on this data structure? Right now we tokenize everything into a separate multivalued column in the following fashion:

|Interest='Interest 1',Subinterest='Subinterest 1',Subinterest='Another Subinterest 1',Description='Interest 1 Description',ID='Interest 1 ID'|
|Interest='Interest 2',Description='Interest 2 Description',ID='Interest 2 ID'|

I have a feeling this is the wrong approach to the problem.
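One concrete pattern from the HierarchicalFaceting wiki page is to index each interest as a depth-prefixed path token and drill down with facet.prefix. A minimal sketch with a hypothetical multivalued string field interest_path (field name and values are illustrative):

Indexed values for one person:
  0/Fishing
  1/Fishing/Fly Fishing
  1/Fishing/Ice Fishing

Top-level facets:  q=*:*&facet=true&facet.field=interest_path&facet.prefix=0/
Drill into one:    q=*:*&facet=true&facet.field=interest_path&facet.prefix=1/Fishing/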
Re: Query foreign language synonyms / words of equivalent meaning?
Hi, We are using Google Translate to do something like what you (onlinespending) want to do, so maybe it will help. During indexing, we store the searchable fields from documents into fields suffixed _en, _fr, _es, etc. So assuming we capture title and body from each document, the fields are (title_en, body_en), (title_fr, body_fr), etc., each with their own analyzer chains. These documents come from a controlled source (i.e. not the web), so we know the language they are authored in. During searching, a custom component intercepts the client language and the query. The query is sent to Google Translate for language detection. The largest share of docs in the corpus is English, so if the detected language is either English or the client language, we call Google Translate again to find the translated query in the other (English or client) language. Another custom component constructs an OR query between the two languages, one part aimed at the _en field set and the other at the _xx (client language) field set. -sujit
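A minimal sketch of the per-language field layout Sujit describes, assuming the language-specific text types (text_en, text_fr, text_es) that ship in the Solr example schema; the suffix convention is the only thing assumed beyond that:

<dynamicField name="*_en" type="text_en" indexed="true" stored="true"/>
<dynamicField name="*_fr" type="text_fr" indexed="true" stored="true"/>
<dynamicField name="*_es" type="text_es" indexed="true" stored="true"/>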
Re: add shard to index
That is what is being discussed already. The thing is, at present, Solr requires an even distribution of documents across shards, so you can't just add another shard, assign it to a hash range, and be done with it. The reason comes down to the scoring mechanism used - TF/IDF (term frequency/inverse document frequency). The IDF portion asks: how many documents in the whole index contain this term? If there are only two documents in the index, the IDF will be very different from when there are 2 million docs, resulting in different scores for equivalent documents depending on which shard they are in. Currently, the only solution is to distribute your documents evenly, which would mean, if you have four shards and you create a fifth, you'd need to send 1/4 of your documents from each shard to the new shard, which is not really a trivial task. I believe the JIRA ticket covering this was mentioned earlier in this thread. Upayavira On Mon, Oct 8, 2012, at 04:33 PM, Radim Kolar wrote: Do it as it is done in the Cassandra database. Adding a new node and redistributing data can be done in a live system without problems. It looks like this: every Cassandra node has a key range assigned. Instead of assigning keys to nodes via hash(key) mod nodes, every node owns a portion of the hash keyspace. The portions do not need to be equal; some nodes can own a larger portion of the keyspace than others. Say the hash function's max possible value is 12:

shard1 - 1-4
shard2 - 5-8
shard3 - 9-12

Now let's add a new shard. In Cassandra, adding a new shard by default cuts an existing one in half, so you will have:

shard1 - 1-2
shard2 - 3-4
shard3 - 5-8
shard4 - 9-12

See? You needed to move only documents from the old shard1. Usually you are adding more than one shard during a reorganization, so you do not need to rebalance the cluster by moving every node to a different position in the hash keyspace that much.
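To make the contrast concrete, here is a small illustration of mod routing versus range routing (a pure sketch, not Solr or Cassandra code; the names are made up):

class ShardRouting {
    // mod routing: adding a shard changes the target shard of almost every key
    static int byMod(String key, int numShards) {
        return (key.hashCode() & 0x7fffffff) % numShards;
    }

    // range routing: shard i owns all slots up to upperBounds[i]; splitting
    // one range only relocates the keys that fall inside that range
    static int byRange(String key, int[] upperBounds) {
        int max = upperBounds[upperBounds.length - 1];
        int slot = (key.hashCode() & 0x7fffffff) % (max + 1);
        for (int i = 0; i < upperBounds.length; i++) {
            if (slot <= upperBounds[i]) return i;
        }
        return upperBounds.length - 1; // unreachable when bounds end at max
    }
}

// e.g. byRange(id, new int[]{4, 8, 12}) mirrors the three-shard layout above;
// splitting shard1 just means calling byRange(id, new int[]{2, 4, 8, 12}).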
Re: Wild card searching - well sort of
Have you looked at the WordDelimiterFilterFactory that was mentioned earlier? Try a fieldType in the admin/analysis page that has WDFF as part of the analysis chain. It would do exactly what you've described so far. WDFF splits the input into tokens on non-alphanumeric characters, alpha/numeric transitions and case transitions (you can configure these). Then searching will match these split-out tokens. Best Erick On Wed, Oct 10, 2012 at 10:28 AM, Kissue Kissue kissue...@gmail.com wrote: It is really not fixed. It could also be *-*-BAAN or BAAN-CAN20-*. In each case I just want the fixed character(s) to match; the * can match any characters. On Wed, Oct 10, 2012 at 2:05 PM, Toke Eskildsen t...@statsbiblioteket.dk wrote: On Wed, 2012-10-10 at 14:15 +0200, Kissue Kissue wrote: I have added the string *-BAAN-* to the index in a field called pattern, which is a string type. Now I want to be able to search for A100-BAAN-C20 or ZA20-BAAN-300 and have Solr return *-BAAN-*. That sounds a lot like the problem presented in the thread Indexing wildcard patterns: http://web.archiveorange.com/archive/v/AAfXfcuIJY9BQJL3mjty The short answer is no, Solr does not support this in the general form. But maybe you can make it work anyway. In your example, the two queries A100-BAAN-C20 and ZA20-BAAN-300 share the form [4 random characters]-[4 significant characters]-[3 random characters], so a little bit of pre-processing would rewrite that to *-[4 significant characters]-*, which would match *-BAAN-*. If you describe the patterns and common elements of your indexed terms and your queries, we might come up with something.
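A minimal sketch of such a fieldType, assuming the stock Solr 3.x/4.x factories (the parameter values shown are common choices, not the only ones):

<fieldType name="text_wdf" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- split on '-', letter/digit transitions and case changes; keep the original token too -->
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            splitOnCaseChange="1" splitOnNumerics="1" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>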
Re: Using additional dictionary with DirectSolrSpellChecker
On Wed, Oct 10, 2012 at 9:02 AM, O. Klein kl...@octoweb.nl wrote: I don't want to tweak the threshold. For the majority of cases it works fine. It's for cases where a term has a low frequency but is spelled correctly. If you lowered the threshold you would also get incorrectly spelled terms as suggestions. Yeah, there is no real magic here when the corpus contains typos. This existing docFreq heuristic was just borrowed from the old index-based spellchecker. I do wonder if using the number of occurrences (totalTermFreq) instead of the number of documents containing the term (docFreq) would improve the heuristic. In any case, I think if you want to also integrate a dictionary or something, it seems like this could be done with the file-based spellchecker?
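For reference, a minimal sketch of running a file-based dictionary alongside the direct checker in the SpellCheckComponent; the field and file names are illustrative, and you select a checker at query time with spellcheck.dictionary:

<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <lst name="spellchecker">
    <str name="name">direct</str>
    <str name="classname">solr.DirectSolrSpellChecker</str>
    <str name="field">text</str>
  </lst>
  <!-- curated word list, one term per line -->
  <lst name="spellchecker">
    <str name="name">file</str>
    <str name="classname">solr.FileBasedSpellChecker</str>
    <str name="sourceLocation">spellings.txt</str>
    <str name="characterEncoding">UTF-8</str>
    <str name="spellcheckIndexDir">./spellcheckerFile</str>
  </lst>
</searchComponent>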
Re: segment number during optimize of index
I have another question: does the number of segments affect the speed of index updates? 2012/10/10 jame vaalet jamevaa...@gmail.com Guys, thanks for all the inputs. I was continuing my research to learn more about segments in Lucene. Below are my conclusions; please correct me if I am wrong. 1. Segments are independent sub-indexes in separate files. While indexing, it is better to create a new segment, as that avoids modifying an existing file; whereas while searching, the fewer segments the better, since you open roughly x physical files (not exactly x, but a number proportional to x) if you have x segments in the index. 2. Since Lucene uses memory mapping, for each file/segment in the index a new m-mapped region is created and mapped to the physical file on disk. Can someone explain or correct this in detail? I am sure there are a lot of people wondering how m-map works while you merge or optimize index segments. On 6 October 2012 07:41, Otis Gospodnetic otis.gospodne...@gmail.com wrote: If I were you, and not knowing all your details... I would optimize indices that are static (not being modified) and would optimize down to 1 segment. I would do it when search traffic is low. Otis -- Search Analytics - http://sematext.com/search-analytics/index.html Performance Monitoring - http://sematext.com/spm/index.html On Fri, Oct 5, 2012 at 4:27 PM, jame vaalet jamevaa...@gmail.com wrote: Hi Eric, I am in a major dilemma with my index now. I have got 8 cores, each around 300 GB in size, and half of each is deleted documents, and on top of that each has got around 100 segments as well. Do I issue an expungeDeletes and allow the merge policy to take care of the segments, or optimize them into a single segment? Search performance is not at par with usual Solr speed. If I have to optimize, what segment number should I choose? My RAM size is around 120 GB and the JVM heap is around 45 GB (oldGen being 30 GB). Please advise! Thanks. On 6 October 2012 00:00, Erick Erickson erickerick...@gmail.com wrote: Because eventually you'd run out of file handles. Imagine a long-running server with 100,000 segments. Totally unmanageable. I think Shawn was emphasizing that RAM requirements don't depend on the number of segments. There are other resources that files consume, however. Best Erick On Fri, Oct 5, 2012 at 1:08 PM, jame vaalet jamevaa...@gmail.com wrote: Hi Shawn, thanks for the detailed explanation. I have got one doubt: you said it doesn't matter how many segments an index has, but then why does Solr have this merge policy which merges segments frequently? Why can't it leave the segments as they are, rather than merging smaller ones into bigger ones? Thanks. On 5 October 2012 05:46, Shawn Heisey s...@elyograg.org wrote: On 10/4/2012 3:22 PM, jame vaalet wrote: So imagine I have merged the 150 GB index into a single segment; this would make a single segment of 150 GB in memory. When new docs are indexed it wouldn't alter this 150 GB index unless I update or delete the older docs, right? Will a 150 GB single segment have problems with memory swapping at the OS level? Supplement to my previous reply: the real memory mentioned in the last paragraph does not include the memory that the OS uses to cache disk access. If more memory is needed and all the free memory is being used by the disk cache, the OS will throw away part of the disk cache (a near-instantaneous operation that should never involve disk I/O) and give that memory to the application that requests it. Here's a very good breakdown of how memory gets used with MMapDirectory in Solr. It's applicable to any program that uses memory mapping, not just Solr: http://java.dzone.com/articles/use-lucene%E2%80%99s-mmapdirectory Thanks, Shawn -- -JAME -- from Jun Wang
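On the "what segment number should I choose" question: an optimize does not have to go all the way down to one segment. A minimal sketch of a partial optimize in both common forms (the target of 10 segments is illustrative):

<!-- via the XML update handler -->
<optimize maxSegments="10"/>

// via SolrJ: optimize(waitFlush, waitSearcher, maxSegments)
server.optimize(true, true, 10);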
Re: Query foreign language synonyms / words of equivalent meaning?
I want an update processor that runs Translation Party. http://translationparty.com/ http://downloadsquad.switched.com/2009/08/14/translation-party-achieves-hilarious-results-using-google-transl/ - Original Message - | From: SUJIT PAL sujit@comcast.net | To: solr-user@lucene.apache.org | Sent: Wednesday, October 10, 2012 2:51:37 PM | Subject: Re: Query foreign language synonyms / words of equivalent meaning?
Re: Auto Correction?
So other than commercial solutions, it seems like I need to write a plugin, right? I couldn't find any open source solutions yet... - Zeki ama calismiyor... Calissa yapar...
Re: Using additional dictionary with DirectSolrSpellChecker
Hapax legomena (terms with a DF of 1) are very often typos. You can automatically build a stopword file from them. If you want to be picky, use only words with a very small edit distance from words with a much larger DF. - Original Message - | From: Robert Muir rcm...@gmail.com | To: solr-user@lucene.apache.org | Sent: Wednesday, October 10, 2012 5:40:23 PM | Subject: Re: Using additional dictionary with DirectSolrSpellChecker
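A minimal sketch of harvesting those DF=1 terms into a stopword file with the Lucene 3.x API (the index path and field name are placeholders):

import java.io.File;
import java.io.PrintWriter;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;
import org.apache.lucene.store.FSDirectory;

public class HapaxDump {
    public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open(FSDirectory.open(new File("/path/to/index")));
        PrintWriter out = new PrintWriter("hapax-stopwords.txt", "UTF-8");
        TermEnum terms = reader.terms();
        while (terms.next()) {
            Term t = terms.term();
            // a term seen in exactly one document is a stopword candidate
            if ("text".equals(t.field()) && terms.docFreq() == 1) {
                out.println(t.text());
            }
        }
        terms.close();
        out.close();
        reader.close();
    }
}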
Re: segment number during optimize of index
Study index merging. This is awesome: http://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html Jame - opening lots of segments is not a problem. A major performance factor you will find is 'Large Pages'. This is an operating-system strategy for managing servers with tens of gigabytes of memory. Without it, all large programs run much more slowly than they could. It is not a Solr or JVM problem. - Original Message - | From: jun Wang wangjun...@gmail.com | To: solr-user@lucene.apache.org | Sent: Wednesday, October 10, 2012 6:36:09 PM | Subject: Re: segment number during optimize of index