Performance Drop from 1.3 to 1.4
Hello,

We recently began migrating a few of our applications from 1.3 to 1.4 in order to take advantage of the replication and performance improvements. In practice, however, we are noticing that our instances which make use of LocalSolr have experienced some performance degradation from 1.3 to 1.4. This mostly appears to be due to dramatically different GC patterns between the two, despite using the same JDK, data set, GC parameters, and Solr cache sizes.

Versions
==
LocalSolr: HEAD as of August 27
Solr: 1.4-DEV from 6/10/09
JDK: Sun Hotspot 1.6.0_14

Has anyone else seen this type of behavior with Solr/LocalSolr queries? Some graphs of our results (transactions per second) for both versions are available at the following link. While performance is relatively similar when using UseParallelGC rather than UseConcMarkSweepGC, we do still notice some relatively long pause times in 1.4 when compared with 1.3 during GC windows.

https://dl.getdropbox.com/u/162474/performance.png

The queries run for this test make use of LocalSolr sorting and filtering based on distances between 2 sets of lat/long coordinates. We are currently using LocalSolr HEAD with the 6/10/09 revision of Solr 1.4. We selected that revision because it is the one LocalSolr was last released against. We do, however, plan to spend some time this week testing newer builds of Solr 1.4 to see if similar behavior exists.

Regards,
Ilan

--
Ilan Rabinovitch
i...@fonz.net
Re: solr and approximate string matching
Hi,

On Sun, Aug 30, 2009 at 9:32 PM, Shalin Shekhar Mangar shalinman...@gmail.com wrote:
The best way to debug these kinds of problems is to look at analysis.jsp and/or use debugQuery=on on the query to see exactly how it is being parsed. Can you post the output of your query with debugQuery=on?

Thanks a lot for your answer. Fortunately, I've managed to deal with the problem by myself, and it turned out to be mostly unrelated to the schema. I was using AND as the default operator, and that didn't play nicely with ngrams.

--
RS
--
http://gryziemy.net
http://robimy.net
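For anyone hitting the same ngram problem: in Solr 1.x the default operator for the standard query parser is declared in schema.xml, and OR is usually the safer choice for ngram fields, since AND requires every generated ngram to be present in a matching document. A minimal sketch of the relevant fragment (the surrounding schema is assumed):

```xml
<!-- schema.xml: with OR, a document matches if any of the query's
     ngram tokens match, instead of requiring all of them (AND) -->
<solrQueryParser defaultOperator="OR"/>
```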
sql server indexing using dih problem
Hi,

I am trying to index a SQL Server table using DIH. My data-config.xml configuration:

<dataConfig>
  <dataSource driver="com.microsoft.sqlserver.jdbc.SQLServerDriver" type="JdbcDataSource"
              url="jdbc:sqlserver://10.232.6.38:1433;databaseName=Rames" user="sa" password="password-1"/>
  <document name="customers">
    <entity name="customers"
            query="select CustomerID,Title,Forename,Surname,Address_1,Address_2,Town,Postcode from customers">
      <field column="CustomerID" name="CustomerID"/>
      <field column="Title" name="Title"/>
      <field column="Forename" name="Forename"/>
      <field column="Surname" name="Surname"/>
      <field column="Address_1" name="Address_1"/>
      <field column="Address_2" name="Address_2"/>
      <field column="Town" name="Town"/>
      <field column="Postcode" name="Postcode"/>
    </entity>
  </document>
</dataConfig>

When I tried to debug I got the following error:

<?xml version="1.0" encoding="UTF-8"?>
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">29672</int>
  </lst>
  <lst name="initArgs">
    <lst name="defaults">
      <str name="config">db-data-config.xml</str>
    </lst>
  </lst>
  <str name="command">full-import</str>
  <str name="mode">debug</str>
  <null name="documents"/>
  <lst name="verbose-output">
    <lst name="entity:customers">
      <lst name="document#1">
        <str name="query">select CustomerID,Title,Forename,Surname,Address_1,Address_2,Town,Postcode from customers</str>
        <str name="EXCEPTION">org.apache.solr.handler.dataimport.DataImportHandlerException: Unable to execute query: select CustomerID,Title,Forename,Surname,Address_1,Address_2,Town,Postcode from customers Processing Document # 1
  at org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.<init>(JdbcDataSource.java:186)
  at org.apache.solr.handler.dataimport.JdbcDataSource.getData(JdbcDataSource.java:143)
  at org.apache.solr.handler.dataimport.JdbcDataSource.getData(JdbcDataSource.java:43)
  at org.apache.solr.handler.dataimport.DebugLogger$2.getData(DebugLogger.java:183)
  at org.apache.solr.handler.dataimport.SqlEntityProcessor.initQuery(SqlEntityProcessor.java:59)
  at
org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:74)
  at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:285)
  at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:178)
  at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:136)
  at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:334)
  at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:386)
  at org.apache.solr.handler.dataimport.DataImportHandler.handleRequestBody(DataImportHandler.java:190)
  at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
  at org.apache.solr.core.SolrCore.execute(SolrCore.java:1204)
  at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:303)
  at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:232)
  at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
  at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
  at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
  at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
  at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
  at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
  at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:211)
  at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
  at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)
  at org.mortbay.jetty.Server.handle(Server.java:285)
  at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502)
  at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:821)
  at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:513)
  at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:208)
  at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:378)
  at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:226)
  at org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)
Caused by: com.microsoft.sqlserver.jdbc.SQLServerException: The TCP/IP connection to the host 10.232.6.38, port 1433 has failed. Error: Connection refused: connect. Verify the connection properties, check that an instance of SQL Server is running on the host and accepting TCP/IP connections at the port, and that no firewall is blocking TCP connections to the port.
  at com.microsoft.sqlserver.jdbc.SQLServerException.makeFromDriverError(SQLServerException.java:170)
  at com.microsoft.sqlserver.jdbc.SQLServerConnection.connectHelper(SQLServerConnection.java:1049)
  at com.microsoft.sqlserver.jdbc.SQLServerConnection.login(SQLServerConnection.java:833)
  at com.microsoft.sqlserver.jdbc.SQLServerConnection.connect(SQLServerConnection.java:716)
  at
How to set similarity to catch more results ?
Hello, I'm new to Solr and can't find in the documentation how to set similarity. I want matching to be more flexible, so that if I make a mistake with some letters, results are still found, like with Google. Thank you in advance.
Re: How to set similarity to catch more results ?
There are fuzzy searches which might be able to help a bit. There could be more but I am just a newbie. Regards Rajan On Mon, Aug 31, 2009 at 3:30 PM, Kaoul kaoul@gmail.com wrote: Hello, I'm new to Solr and don't find in documentation how-to to set similarity. I want it more flexible, as if I make a mistake with letters, results are found like with google. Thank you in advance.
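To make the fuzzy-search suggestion concrete: the standard query parser spells fuzzy matching with a trailing tilde, optionally followed by a minimum-similarity threshold between 0 and 1 (historically defaulting to 0.5). Illustrative queries; the title field name is an assumption, not taken from any schema in this thread:

```
q=title:solr~        fuzzy match with the default minimum similarity
q=title:solr~0.7     only accept closer matches
```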
Re: sql server indexing using dih problem
On Mon, Aug 31, 2009 at 3:25 PM, rameshgalla ramesh.ga...@cognizant.comwrote: Caused by: com.microsoft.sqlserver.jdbc.SQLServerException: The TCP/IP connection to the host 10.232.6.38, port 1433 has failed. Error: Connection refused: connect. Verify the connection properties, check that an instance of SQL Server is running on the host and accepting TCP/IP connections at the port, and that no firewall is blocking TCP connections to the port.. at com.microsoft.sqlserver.jdbc.SQLServerException.makeFromDriverError(SQLServerException.java:170) at com.microsoft.sqlserver.jdbc.SQLServerConnection.connectHelper(SQLServerConnection.java:1049) at com.microsoft.sqlserver.jdbc.SQLServerConnection.login(SQLServerConnection.java:833) at The reason is given in the exception itself. The driver is not able to connect to your server at port 1433. Either your server is down or the host/port is incorrect or there is a firewall which is blocking access. -- Regards, Shalin Shekhar Mangar.
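A quick way to confirm this diagnosis is to test raw TCP reachability from the machine running Solr, using the host and port from the data-config above (any port-probing tool works; telnet is just the common one):

```
telnet 10.232.6.38 1433
```

If this cannot connect either, the problem is the network or SQL Server itself rather than DIH; note that some SQL Server editions ship with the TCP/IP listener disabled, so it may need to be enabled in SQL Server Configuration Manager.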
Hierarchical schema design
Hi all,

Is there a possibility to have a hierarchical schema in Solr, meaning can we have objects under objects? For example, for a doc like:

<doc>
  <a1>
    <b1/> <b2/> <b3/>
  </a1>
  <a2>
    <b1/> <b2/> <b3/>
  </a2>
  ...
</doc>

I need to make a schema with 3 types of such objects, all of them having different field values for each. Please reply if there exists such a possibility.

Regards,
Pooja
Re: filtering facets
Hi Olivier,

Are the facet counts on the urls you don't want 0? If so, you can use facet.mincount to only return results greater than 0.

-Mike

Olivier H. Beauchesne wrote:
Hi, Long time lurker, first time poster. I have a multi-valued field, let's call it article_outlinks, containing all outgoing urls from a document. I want to get all matching urls sorted by counts. For example, I want to get all outgoing wikipedia urls in my documents sorted by counts. So I execute a query like this: q=article_outlinks:http*wikipedia.org* and I facet on article_outlinks. But I get facets containing the other urls in the documents. I can get something close by using facet.prefix=http://en.wikipedia.org but I want to include other subdomains on wikipedia (ex: fr.wikipedia.org). Is there a way to do a search and get facets only matching my query? I know facet.prefix isn't a query, but is there a way to get that behavior? Is it easy to extend solr to do something like that? Thank you, Olivier. Sorry for my english.

--
my public key can be found by: gpg --keyserver pgp.mit.edu --recv-keys 26A5C87F
Re: Performance Drop from 1.3 to 1.4
I don't know exactly how the local solr stuff currently works (it's not currently part of Solr), but it's possible to get worse memory performance if you're not careful. Solr and Lucene now do per-segment searching and sorting in a single index... and that means fieldcache entries are populated at the segment level instead of at the top-level multireader. It's possible/probable that some elements of LocalSolr use a top-level reader and other elements use per-segment (via Lucene or Solr) for geo fields, thus doubling the memory footprint from before. -Yonik http://www.lucidimagination.com On Mon, Aug 31, 2009 at 3:24 AM, Ilan Rabinovitch i...@fonz.net wrote: Hello, We recently began migrating a few of our applications from 1.3 to 1.4 in order to take advantage of the replication and performance improvements. In practice however, we are noticing that our instances which make use of LocalSolr have experienced some performance degradation from 1.3 to 1.4. This mostly appears to be due to some dramatically different GC patterns between the two despite using the same JDK, data set, and GC parameters, and solr cache sizes. Versions == LocalSolr: HEAD as of August 27 Solr: 1.4-DEV from 6/10/09 JDK: Sun Hotspot 1.6.0_14 Has anyone else seen this type of behavior with Solr/LocalSolr queries? Some graphs of our results (transactions per second) between both versions are available at the following link. While performance is relatively similar when using UseParallelGC rather than UseConcMarkSweepGC, we do still notice some relatively long pause times in 1.4 when compared with 1.3 during GC windows. https://dl.getdropbox.com/u/162474/performance.png The queries run for this test make use of LocalSolr sorting and filtering based on distances between 2 sets of lat/long coordinates. We are currently using LocalSolr HEAD with 6/10/09 revision of Solr 1.4. The reason we selected that revision was due to it being the revision that LocalSolr was last released against.
We do however plan to spend some time this week testing newer builds of Solr 1.4 to see if similar behavior exists. Regards, Ilan -- Ilan Rabinovitch i...@fonz.net
Re: filtering facets
Hi Mike,

No, my problem is that the field article_outlinks is multivalued, thus it contains several urls not related to my search. I would like to facet only urls matching my query. For example (only one document here, but my search targets over 1M docs):

Doc1: article_url: url1.com/1 url2.com/2 url1.com/1 url1.com/3

And my query is article_url:url1.com* and I facet by article_url, and I want it to give me:

url1.com/1 (2)
url1.com/3 (1)

But right now, because url2.com/2 is contained in a multivalued field together with the matching urls, I get this:

url1.com/1 (2)
url1.com/3 (1)
url2.com/2 (1)

I can use facet.prefix to filter, but it's not very flexible if my url contains a subdomain, as facet.prefix doesn't support wildcards.

Thank you,
Olivier

Mike Topper wrote:
Hi Olivier, are the facet counts on the urls you don't want 0? If so you can use facet.mincount to only return results greater than 0. -Mike

Olivier H. Beauchesne wrote:
Hi, Long time lurker, first time poster. I have a multi-valued field, let's call it article_outlinks, containing all outgoing urls from a document. I want to get all matching urls sorted by counts. For example, I want to get all outgoing wikipedia urls in my documents sorted by counts. So I execute a query like this: q=article_outlinks:http*wikipedia.org* and I facet on article_outlinks. But I get facets containing the other urls in the documents. I can get something close by using facet.prefix=http://en.wikipedia.org but I want to include other subdomains on wikipedia (ex: fr.wikipedia.org). Is there a way to do a search and get facets only matching my query? I know facet.prefix isn't a query, but is there a way to get that behavior? Is it easy to extend solr to do something like that? Thank you, Olivier. Sorry for my english.
Help! Issue with tokens in custom synonym filter
Hi all,

I've been writing some custom synonym filters and have run into an issue with returning a list of tokens. I have a synonym filter that uses the WordNet database to extract synonyms. My problem is how to define the offsets and position increments in the new Tokens I'm returning.

For an input token, I get a list of synonyms from the WordNet database. I then create a List<Token> of those results. Each Token is created with the same startOffset, endOffset and positionIncrement as the input Token. Is this correct? My understanding from looking at the Lucene codebase is that the startOffset/endOffset should be the same, as we are referring to the same term in the original text. However, I don't quite get the positionIncrement. I understand that it is relative to the previous term ... does this mean all my synonyms should have a positionIncrement of 0? But whether I use 0 or the positionIncrement of the original input Token, Solr seems to ignore the returned tokens ...

This is a summary of what is in my filter:

private Iterator<Token> output;
private ArrayList<Token> synonyms = null;

public Token next(Token in) throws IOException {
    if (output != null) {
        // Here we are just outputting matched synonyms
        // that we previously created from the input token.
        // The input token has already been returned.
        if (output.hasNext()) {
            return output.next();
        } else {
            return null;
        }
    }
    synonyms = new ArrayList<Token>();
    Token t = input.next(in);
    if (t == null) return null;
    String value = new String(t.termBuffer(), 0, t.termLength()).toLowerCase();
    // Get list of WordNet synonyms (code removed)
    // Iterate thru WordNet synonyms
    for (String wordNetSyn : wordNetSyns) {
        Token synonym = new Token(t.startOffset(), t.endOffset(), t.type());
        synonym.setPositionIncrement(t.getPositionIncrement());
        synonym.setTermBuffer(wordNetSyn.toCharArray(), 0, wordNetSyn.length());
        synonyms.add(synonym);
    }
    output = synonyms.iterator();
    // Return the original word, we want it
    return t;
}
Re: Help! Issue with tokens in custom synonym filter
I've been writing some custom synonym filters and have run into an issue with returning a list of tokens. I have a synonym filter that uses the WordNet database to extract synonyms. My problem is how to define the offsets and position increments in the new Tokens I'm returning. For an input token, I get a list of synonyms from the WordNet database. I then create a List<Token> of those results. Each Token is created with the same startOffset, endOffset and positionIncrement of the input Token. Is this correct? My understanding from looking at the Lucene codebase is that the startOffset/endOffset should be the same, as we are referring to the same term in the original text. However, I don't quite get the positionIncrement. I understand that it is relative to the previous term ... does this mean all my synonyms should have a positionIncrement of 0? But whether I use 0 or the positionIncrement of the original input Token, Solr seems to ignore the returned tokens ...

You can look at the source code of SynonymTokenFilter[1] and SynonymMap[2] in Lucene.

[1] http://lucene.apache.org/java/2_4_1/api/org/apache/lucene/index/memory/SynonymTokenFilter.html
[2] http://lucene.apache.org/java/2_4_1/api/org/apache/lucene/index/memory/SynonymMap.html
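To make the position-increment convention concrete without pulling in Lucene: the usual convention is that the original token keeps its own increment (normally 1), and each injected synonym gets an increment of 0 so it stacks at the same position as the term it replaces. The toy model below sketches that bookkeeping; the Tok class is an invented stand-in for a Lucene Token, not the real API:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PositionModelSketch {
    // (term, positionIncrement) pair; a toy stand-in for a Lucene Token.
    public static class Tok {
        public final String term;
        public final int posIncr;
        public Tok(String term, int posIncr) {
            this.term = term;
            this.posIncr = posIncr;
        }
    }

    // Emit the original term with its original increment, then each
    // synonym stacked at the same position (increment 0).
    public static List<Tok> expand(String original, int posIncr, List<String> synonyms) {
        List<Tok> out = new ArrayList<>();
        out.add(new Tok(original, posIncr));
        for (String s : synonyms) {
            out.add(new Tok(s, 0));
        }
        return out;
    }

    // Resolve absolute token positions from the increments, the way a
    // consumer of the token stream would.
    public static List<Integer> positions(List<Tok> toks) {
        List<Integer> out = new ArrayList<>();
        int pos = -1;
        for (Tok t : toks) {
            pos += t.posIncr;
            out.add(pos);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Tok> toks = expand("car", 1, Arrays.asList("auto", "automobile"));
        System.out.println(positions(toks));
    }
}
```

With this convention all three terms resolve to the same position, which is what lets a phrase query match through either the original term or any of its stacked synonyms.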
Is caching worth it when my whole index is in RAM?
Hi, If I've got my entire 20G 4MM document index in RAM (on a ramdisk), do I have a need for the document cache? Or should I set it to 0 items, because pulling field values from an index in RAM is so fast that the document cache would be a duplication of effort? Are there any other caches that I should turn off if I can get my entire index in RAM? Filter cache, query results cache, etc? Thanks! Michael
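The document cache size is set in solrconfig.xml, so this is cheap to experiment with. A sketch of an effectively disabled document cache (the sizes are illustrative, and whether this actually helps should be measured: even with the index on a ramdisk, the document cache avoids re-parsing stored fields, and the query result cache avoids re-running the search computation, not just disk reads):

```xml
<documentCache class="solr.LRUCache"
               size="0" initialSize="0" autowarmCount="0"/>
```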
Re: Help! Issue with tokens in custom synonym filter
Although this is not a direct answer to your question, you may want to consider generating a synonyms file from WordNet. Then, you can use the standard synonym filter in Solr. The only downside to this is that the synonym file might be pretty large... but you've probably got some large file for the WordNet data anyway.

~ David Smiley
Author: http://www.packtpub.com/solr-1-4-enterprise-search-server

On 8/31/09 10:32 AM, Lajos la...@protulae.com wrote:
Hi all, I've been writing some custom synonym filters and have run into an issue with returning a list of tokens. I have a synonym filter that uses the WordNet database to extract synonyms. My problem is how to define the offsets and position increments in the new Tokens I'm returning. For an input token, I get a list of synonyms from the WordNet database. I then create a List<Token> of those results. Each Token is created with the same startOffset, endOffset and positionIncrement of the input Token. Is this correct? My understanding from looking at the Lucene codebase is that the startOffset/endOffset should be the same, as we are referring to the same term in the original text. However, I don't quite get the positionIncrement. I understand that it is relative to the previous term ... does this mean all my synonyms should have a positionIncrement of 0? But whether I use 0 or the positionIncrement of the original input Token, Solr seems to ignore the returned tokens ...

This is a summary of what is in my filter:

private Iterator<Token> output;
private ArrayList<Token> synonyms = null;

public Token next(Token in) throws IOException {
    if (output != null) {
        // Here we are just outputting matched synonyms
        // that we previously created from the input token.
        // The input token has already been returned.
        if (output.hasNext()) {
            return output.next();
        } else {
            return null;
        }
    }
    synonyms = new ArrayList<Token>();
    Token t = input.next(in);
    if (t == null) return null;
    String value = new String(t.termBuffer(), 0, t.termLength()).toLowerCase();
    // Get list of WordNet synonyms (code removed)
    // Iterate thru WordNet synonyms
    for (String wordNetSyn : wordNetSyns) {
        Token synonym = new Token(t.startOffset(), t.endOffset(), t.type());
        synonym.setPositionIncrement(t.getPositionIncrement());
        synonym.setTermBuffer(wordNetSyn.toCharArray(), 0, wordNetSyn.length());
        synonyms.add(synonym);
    }
    output = synonyms.iterator();
    // Return the original word, we want it
    return t;
}
Re: filtering facets
You could post-process the response and remove urls that don't match your domain pattern.

On Mon, Aug 31, 2009 at 9:45 AM, Olivier H. Beauchesne oliv...@olihb.com wrote:
Hi Mike, No, my problem is that the field article_outlinks is multivalued, thus it contains several urls not related to my search. I would like to facet only urls matching my query. For example (only one document, but my search targets over 1M docs): Doc1: article_url: url1.com/1 url2.com/2 url1.com/1 url1.com/3 And my query is: article_url:url1.com* and I facet by article_url and I want it to give me: url1.com/1 (2) url1.com/3 (1) But right now, because url2.com/2 is contained in a multivalued field with the matching urls, I get this: url1.com/1 (2) url1.com/3 (1) url2.com/2 (1) I can use facet.prefix to filter, but it's not very flexible if my url contains a subdomain, as facet.prefix doesn't support wildcards. Thank you, Olivier

Mike Topper wrote:
Hi Olivier, are the facet counts on the urls you don't want 0? If so you can use facet.mincount to only return results greater than 0. -Mike

Olivier H. Beauchesne wrote:
Hi, Long time lurker, first time poster. I have a multi-valued field, let's call it article_outlinks, containing all outgoing urls from a document. I want to get all matching urls sorted by counts. For example, I want to get all outgoing wikipedia urls in my documents sorted by counts. So I execute a query like this: q=article_outlinks:http*wikipedia.org* and I facet on article_outlinks. But I get facets containing the other urls in the documents. I can get something close by using facet.prefix=http://en.wikipedia.org but I want to include other subdomains on wikipedia (ex: fr.wikipedia.org). Is there a way to do a search and get facets only matching my query? I know facet.prefix isn't a query, but is there a way to get that behavior? Is it easy to extend solr to do something like that? Thank you, Olivier. Sorry for my english.
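A sketch of that post-processing step, assuming the facet counts have already been read out of the Solr response into a map (the response-parsing step and the class name are mine, not from the thread):

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

public class FacetPostFilter {
    // Keep only facet values whose key matches the given pattern,
    // preserving the original (count-sorted) order of the entries.
    public static Map<String, Integer> filter(Map<String, Integer> counts, Pattern keep) {
        Map<String, Integer> out = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (keep.matcher(e.getKey()).find()) {
                out.put(e.getKey(), e.getValue());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("http://en.wikipedia.org/wiki/Solr", 2);
        counts.put("http://fr.wikipedia.org/wiki/Solr", 1);
        counts.put("http://example.com/a", 5);
        // Any *.wikipedia.org subdomain survives; everything else is dropped.
        Pattern keep = Pattern.compile("\\.wikipedia\\.org");
        System.out.println(filter(counts, keep));
    }
}
```

Since the unwanted values are dropped after faceting, this does not fix the counts Solr computes; it only hides the entries that a facet.prefix with wildcard support would have excluded.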
Re: Hierarchical schema design
Hi,

The search index is flat. There are no hierarchies in there. Now, I'm not sure what you're referring to with this type of objects. But if you refer to having different types of documents in one index (and schema), that's certainly possible. You can define all the fields that you expect in all the different document types in one schema and have one special field (called type) to distinguish between the document types (it will hold a unique value for each document type). The only drawback of this solution is that you cannot (in most cases) define the fields as required. Another solution would be to deploy the different documents on different cores, where each core has its own schema (and index). The drawback here, however, is that you will not be able to search across the different document types.

Cheers,
Uri

Pooja Verlani wrote:
Hi all, Is there a possibility to have a hierarchical schema in solr, meaning can we have objects under objects. For example, for a doc like:

<doc>
  <a1>
    <b1/> <b2/> <b3/>
  </a1>
  <a2>
    <b1/> <b2/> <b3/>
  </a2>
  ...
</doc>

I need to make schema with 3 types of such objects and all of them having different field values for each. Please reply if there exists such a possibility. Regards, Pooja
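A sketch of the first suggestion in schema.xml terms: one flat schema holding the union of all fields, plus a discriminator field. The field names here are illustrative, chosen to echo the example document above:

```xml
<!-- discriminator: holds "a1", "a2", etc. for each flattened document -->
<field name="type" type="string" indexed="true" stored="true"/>
<!-- union of the fields used by the different object types -->
<field name="b1" type="string" indexed="true" stored="true"/>
<field name="b2" type="string" indexed="true" stored="true"/>
<field name="b3" type="string" indexed="true" stored="true"/>
```

Queries then restrict themselves to one object type with a filter query such as fq=type:a1.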
Re: How to set similarity to catch more results ?
hi Kaoul, There are multiple ways that you can use to get the desired results. - Stemming - this makes all forms of a word (e.g. Run, Running, Runner) match to its stem or root word Run. - Synonyms - this will take a list of synonyms from you and would match veg = vegetarian and even tiger = lion if you map so. - PhoneticFilterFactory - As the name suggests, it would do all your soundex matches. Apart from these FilterFactories, using a StandardTokenizer would match mickey mouse to mouse mickey, as you would expect from google. There are still tens of other Filters and Tokenizers that you can use, depending on your need. I would suggest you to go through http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters to get more understanding of available options. regards, aakash On Mon, Aug 31, 2009 at 3:33 PM, rajan chandi chandi.ra...@gmail.comwrote: There are fuzzy searches which might be able to help a bit. There could be more but I am just a newbie. Regards Rajan On Mon, Aug 31, 2009 at 3:30 PM, Kaoul kaoul@gmail.com wrote: Hello, I'm new to Solr and don't find in documentation how-to to set similarity. I want it more flexible, as if I make a mistake with letters, results are found like with google. Thank you in advance.
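The filters mentioned above are declared per field type in schema.xml. A sketch of an analyzer chain combining lowercasing, synonyms, and stemming (the field type name is invented, and the synonyms.txt file is assumed to exist in the conf directory):

```xml
<fieldType name="text_loose" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>
```

The same chain runs at index and query time here; phonetic matching would be added with a PhoneticFilterFactory declaration in the same list.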
Re: Release Date Solr 1.4
Many of you probably know that Lucene went into code-freeze last Thursday... which puts a probable Lucene release date at the end of this week. My day-job colleagues and I are all traveling this week (company get-together) so that may slow things down a bit for some of us, and perhaps cause the goal of releasing Solr 1 week after Lucene to slip a little. Still, if there are any issues that are assigned to you, and that you can't get to this week (including the weekend) please un-assign yourself as a signal that someone else should try and take it up. -Yonik http://www.lucidimagination.com ps: I've extended my stay so I can make the Lucene/Solr meetup this Thursday... hope to see some of you there! On Fri, Aug 21, 2009 at 11:34 PM, Yonik Seeleyyo...@lucidimagination.com wrote: FYI, I'm on vacation in Ocean City MD starting tomorrow - but I will have internet access. The goal of releasing a week after 2.9 still seems very realistic - we just need to decide to finish all open issues one week from Lucene's code freeze. And all of a sudden, Lucene went from 0 open issues, back to 16... but most of those may be resolved rapidly. -Yonik http://www.lucidimagination.com On Tue, Aug 18, 2009 at 1:37 PM, Yonik Seeleyyo...@lucidimagination.com wrote: On Tue, Aug 18, 2009 at 9:02 AM, Mark Millermarkrmil...@gmail.com wrote: The last note I saw said we hope to release 1.4 a week or so after Lucene 2.9 (though of course a week may not end up being enough). Yep, I think this is still doable. -Yonik http://www.lucidimagination.com
Re: Dismax Wildcard Queries
Hi Kurt. I'm the author of those JIRA issues. I'm glad you have interest in them. Please vote for them if you have not done so already. I updated SOLR-758 and I hope it works out okay for you. If you have further questions, please comment on the relevant issues. ~ David Smiley Author: http://www.packtpub.com/solr-1-4-enterprise-search-server On 8/30/09 3:21 PM, Shalin Shekhar Mangar shalinman...@gmail.com wrote: On Tue, Aug 25, 2009 at 3:00 AM, Kurt N. kurt.nordst...@unt.edu wrote: Hello all. We have a situation in the requirements for our project that make it desireable to be able to perform a DisMax query with wildcard (* and ?) characters in it. We are using the standard release (not nightly) of Solr 1.3. Our first thought was to apply the SOLR-756 patch (http://issues.apache.org/jira/browse/SOLR-756), which aimed to make DisMax support all query types. The patch was installed, and Solr recompiled without trouble. Upon passing a query with the * wildcard in it, we did not get a result set that indicated that the query was working. Our next thought was to apply SOLR-758 (http://issues.apache.org/jira/browse/SOLR-758), to see if that solved our problem. In doing so, we had to install the SOLR-757 (http://issues.apache.org/jira/browse/SOLR-757) patch as well. Unfortunately, at this point, Solr refused to compile. From the error messages that Ant gave, it seemed that the new code from SOLR-758 was looking for a function called getNonLocalParams(), which, after grep'ing the source, doesn't seem to exist in the Solr codebase. I can't find QParser ever having a method named getNonLocalParams. However, looking at the way that method is being used in the patch, using params instead of getNonLocalParams() should work. Questions on patches are best asked on the respective issue. -- Regards, Shalin Shekhar Mangar.
Re: Why can't have & sign in the text?
I use text as my field type, but whenever my field has an '&' sign, the post.jar will error out. What can I do to work around this? Thanks.

The files - that you are posting - must be valid XML. Escape special XML characters, e.g. replace & with &amp;
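If the update XML is being built by hand rather than with an XML library, a small helper along these lines covers the five special characters (this is a sketch; letting an XML library do the escaping for you is the more robust fix):

```java
public class XmlEscapeSketch {
    // Replace the five XML special characters with entities.
    // & must be handled first, or the other entities get double-escaped.
    public static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;")
                .replace("'", "&apos;");
    }

    public static void main(String[] args) {
        System.out.println(escape("Barnes & Noble <b>"));
    }
}
```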
Re: WordDelimiterFilter to QueryParser to MultiPhraseQuery?
This is mostly my misunderstanding of catenateAll=1 as I thought it would break down with an OR using the full concatenated word. Thus: Jokers Wild - { jokers, wild } OR { jokerswild } But really it becomes: { jokers, {wild, jokerswild}} which will not match. And if you have a mistyped camel case like: jOkerswild - { j, {okerswild, jokerswild}} again no match. So it really requires some way to append the full word as an OR so: {j, {okerswild}} OR {jokerswild} severalTokensAtSamePosition=true is in the source code (QueryParser) as a boolean flag which always ends up true in these cases and triggers creating a MultiPhraseQuery as in the examples above. To really get this right I'll need to do a custom QueryParser IMO.
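For context, the catenateAll behavior discussed above is configured on WordDelimiterFilterFactory in the field type's analyzer in schema.xml; a representative declaration (the attribute values here are illustrative, not taken from the poster's config):

```xml
<filter class="solr.WordDelimiterFilterFactory"
        generateWordParts="1" generateNumberParts="1"
        catenateWords="0" catenateNumbers="0"
        catenateAll="1" splitOnCaseChange="1"/>
```

The concatenated token is injected at the same position as one of the sub-tokens rather than as an alternative to the whole sequence, which is why the query parser builds a MultiPhraseQuery instead of the OR the poster expected.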
Date Faceting and Double Counting
If we do date faceting and start at 2009-01-01T00:00:00Z, end at 2009-01-03T00:00:00Z, with a gap of +1DAY, then documents that occur at exactly 2009-01-02T00:00:00Z will be included in both the returned counts (2009-01-01T00:00:00Z and 2009-01-02T00:00:00Z). At the moment, this is quite bad for us, as we only index the day-level, so all of our documents are exactly on the line between each facet-range. Because we know our data is indexed as being exactly at midnight each day, I think we can simply always start from 1 second prior and get the results we want (start=2008-12-31T23:59:59Z, end=2009-01-02T23:59:59Z), but I think this problem would affect everyone, even if usually more subtly (instead of all documents being counted twice, only a few on the fencepost between ranges). Is this a known behavior people are happy with, or should I file an issue asking for ranges in date-facets to be constructed to subtract one second from the end of each range (so that the effective range queries for my case would be: [2009-01-01T00:00:00Z TO 2009-01-01T23:59:59Z] [2009-01-02T00:00:00Z TO 2009-01-02T23:59:59Z])? Alternatively, is there some other suggested way of using the date faceting to avoid this problem? -- Stephen Duncan Jr www.stephenduncanjr.com
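The one-second shift described above is easy to compute programmatically when building the facet parameters. A sketch (the class and method names are mine; the assumption is that one second is the finest granularity stored, as in the day-level index described above):

```java
import java.time.Instant;

public class FacetRangeSketch {
    // Shift a facet.date.start/end value back one second so that
    // documents indexed exactly at midnight fall into exactly one
    // +1DAY bucket instead of two.
    public static String shiftBackOneSecond(String isoInstant) {
        return Instant.parse(isoInstant).minusSeconds(1).toString();
    }

    public static void main(String[] args) {
        System.out.println(shiftBackOneSecond("2009-01-01T00:00:00Z"));
    }
}
```

The shifted values then go into facet.date.start and facet.date.end unchanged; the gap stays +1DAY.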
Re: Help! Issue with tokens in custom synonym filter
Hi David Ahmet, I hadn't seen the SynonymTokenFilter from Lucene, so that helped. Ultimately, however, it seems I was pretty much doing the right thing, although my token type might have been wrong. Unfortunately, while the tokens are being returned properly (AFAIK), when I do a query using one of the synonyms, I can't get any results. This is not the case if I just directly code in the synonym into the synonyms file with the standard solr synonym filter. So I'll have to keep on hacking away ;) Regarding generating the file from WordNet, we'd considered that but our requirements essentially mean we have to do the heavy lifting within the filter itself. Not that I'm opposed, it is just that I'm apparently missing something simple still. Thanks for the replies. Lajos Smiley, David W. wrote: Although this is not a direct answer to your question, you may want to consider generating a synonyms file from wordnet. Then, you can use the standard synonym filter in Solr. The only downside to this is that the synonym file might be pretty large... but you've probably got some large file for wordnet data any way. ~ David Smiley Author: http://www.packtpub.com/solr-1-4-enterprise-search-server On 8/31/09 10:32 AM, Lajos la...@protulae.com wrote: Hi all, I've been writing some custom synonym filters and have run into an issue with returning a list of tokens. I have a synonym filter that uses the WordNet database to extract synonyms. My problem is how to define the offsets and position increments in the new Tokens I'm returning. For an input token, I get a list of synonyms from the WordNet database. I then create a ListToken of those results. Each Token is created with the same startOffset, endOffset and positionIncrement of the input Token. Is this correct? My understanding from looking at the Lucene codebase is that the startOffset/endOffset should be the same, as we are referring to the same term in the original text. However, I don't quite get the positionIncrement. 
I understand that it is relative to the previous term... does this mean all my synonyms should have a positionIncrement of 0? But whether I use 0 or the positionIncrement of the original input Token, Solr seems to ignore the returned tokens...

This is a summary of what is in my filter:

    private Iterator<Token> output;
    private ArrayList<Token> synonyms = null;

    public Token next(Token in) throws IOException {
        if (output != null) {
            // Here we are just outputting matched synonyms
            // that we previously created from the input token.
            // The input token has already been returned.
            if (output.hasNext()) {
                return output.next();
            } else {
                return null;
            }
        }
        synonyms = new ArrayList<Token>();
        Token t = input.next(in);
        if (t == null) return null;
        String value = new String(t.termBuffer(), 0, t.termLength()).toLowerCase();
        // Get list of WordNet synonyms (code removed)
        // Iterate through the WordNet synonyms
        for (String wordNetSyn : wordNetSyns) {
            Token synonym = new Token(t.startOffset(), t.endOffset(), t.type());
            synonym.setPositionIncrement(t.getPositionIncrement());
            synonym.setTermBuffer(wordNetSyn.toCharArray(), 0, wordNetSyn.length());
            synonyms.add(synonym);
        }
        output = synonyms.iterator();
        // Return the original word; we want it indexed too
        return t;
    }
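For what it's worth, Lucene's SynonymTokenFilter stacks each synonym at the same position as the original word by giving it a position increment of 0, while the original token keeps its own increment. Here is a minimal, self-contained sketch of how increments map tokens to positions; the `Tok` class and `positions()` helper are illustrative models, not Lucene API:

```java
import java.util.*;

public class PositionIncrementDemo {
    // Minimal model of how position increments place tokens: each token
    // advances the position by its increment, so an increment of 0 stacks
    // a synonym on the same position as the preceding word.
    static class Tok {
        final String term;
        final int posInc;
        Tok(String term, int posInc) { this.term = term; this.posInc = posInc; }
    }

    static Map<Integer, List<String>> positions(List<Tok> stream) {
        Map<Integer, List<String>> out = new TreeMap<>();
        int pos = -1;
        for (Tok t : stream) {
            pos += t.posInc;
            out.computeIfAbsent(pos, k -> new ArrayList<>()).add(t.term);
        }
        return out;
    }

    public static void main(String[] args) {
        // "quick" is a synonym of "fast": same offsets, increment 0.
        List<Tok> stream = Arrays.asList(
            new Tok("fast", 1),
            new Tok("quick", 0),
            new Tok("car", 1));
        System.out.println(positions(stream));
        // {0=[fast, quick], 1=[car]}
    }
}
```

With the synonym at increment 0, a query for either "fast" or "quick" hits position 0; if the synonym instead carried the original token's increment of 1, it would occupy its own position and shift every later term.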
Why can't I have an '&' sign in the text?
Hi,

I use text as my field type, but whenever my field has an '&' sign, post.jar will error out. What can I do to work around this? Thanks.

solr returned an error:
com.ctc.wstx.exc.WstxLazyException: Unexpected character ' ' (code 32; missing name?)
 at javax.xml.stream.SerializableLocation@587f587f
com.ctc.wstx.exc.WstxLazyException: com.ctc.wstx.exc.WstxUnexpectedCharException: Unexpected character ' ' (code 32; missing name?)
 at javax.xml.stream.SerializableLocation@587f587f
 at com.ctc.wstx.exc.WstxLazyException.throwLazily(WstxLazyException.java:45)
 at com.ctc.wstx.sr.StreamScanner.throwLazyError(StreamScanner.java:729)
 at com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3659)
 at com.ctc.wstx.sr.BasicStreamReader.getText(BasicStreamReader.java:809)
 at org.apache.solr.handler.XmlUpdateRequestHandler.readDoc(XmlUpdateRequestHandler.java:327)
 ...

<fieldType name="mytext" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

Please advise.
Elaine
Re: Why can't I have an '&' sign in the text?
Thanks a lot! Really helped.

On Mon, Aug 31, 2009 at 2:21 PM, AHMET ARSLAN iori...@yahoo.com wrote:

I use text as my field type, but whenever my field has an '&' sign, post.jar will error out. What can I do to work around this? Thanks.

The files that you are posting must be valid XML. Escape special XML characters, e.g. replace & with &amp;
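To make the escaping concrete, here is a minimal sketch of the replacement Ahmet describes, applied to all five special XML characters before a document is posted to Solr (the class and method names are illustrative, not part of any Solr tooling):

```java
public class XmlEscape {
    // Replace the five special XML characters with their predefined entities.
    // The ampersand must be handled first, or it would re-escape the
    // entities introduced by the later replacements.
    static String escapeXml(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;")
                .replace("'", "&apos;");
    }

    public static void main(String[] args) {
        System.out.println(escapeXml("AT&T <field>"));
        // AT&amp;T &lt;field&gt;
    }
}
```

Alternatively, wrapping the field body in a CDATA section avoids escaping entirely, as long as the body never contains the sequence `]]>`.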
Re: filtering facets
yeah, but then I would have to retrieve *a lot* of facets. I think for now I'll retrieve all the subdomains with facet.prefix and then merge those queries. Not ideal, but when I have more motivation, I will submit a patch to Solr :-)

Michael wrote:
You could post-process the response and remove urls that don't match your domain pattern.

On Mon, Aug 31, 2009 at 9:45 AM, Olivier H. Beauchesne oliv...@olihb.com wrote:
Hi Mike,

No, my problem is that the field article_outlinks is multivalued, thus it contains several urls not related to my search. I would like to facet only on urls matching my query. For example (only one document here, but my search targets over 1M docs):

Doc1: article_url: url1.com/1 url2.com/2 url1.com/1 url1.com/3

And my query is article_url:url1.com* and I facet by article_url, and I want it to give me:

url1.com/1 (2)
url1.com/3 (1)

But right now, because url2.com/2 is contained in a multivalued field along with the matching urls, I get this:

url1.com/1 (2)
url1.com/3 (1)
url2.com/2 (1)

I can use facet.prefix to filter, but it's not very flexible if my url contains a subdomain, as facet.prefix doesn't support wildcards.

Thank you,
Olivier

Mike Topper wrote:
Hi Olivier,

Are the facet counts on the urls you don't want 0? If so, you can use facet.mincount to only return results greater than 0.

-Mike

Olivier H. Beauchesne wrote:
Hi,

Long time lurker, first time poster. I have a multi-valued field, let's call it article_outlinks, containing all outgoing urls from a document. I want to get all matching urls sorted by counts. For example, I want to get all outgoing wikipedia urls in my documents sorted by counts. So I execute a query like this: q=article_outlinks:http*wikipedia.org* and I facet on article_outlinks. But I get facets containing the other urls in the documents. I can get something close by using facet.prefix=http://en.wikipedia.org but I want to include other subdomains on wikipedia (ex: fr.wikipedia.org).
Is there a way to do a search and get facets matching only my query? I know facet.prefix isn't a query, but is there a way to get that behavior? Is it easy to extend Solr to do something like that?

Thank you,
Olivier

Sorry for my English.
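Michael's post-processing suggestion can be sketched like this: a hypothetical helper that keeps only the facet entries whose URL matches any *.wikipedia.org subdomain, since facet.prefix cannot express that wildcard. The method name and data shapes are illustrative, not Solr API:

```java
import java.util.*;
import java.util.regex.Pattern;

public class FacetPostFilter {
    // Keep only facet entries whose value matches the given pattern,
    // preserving the order (and therefore the count sorting) of the
    // original facet response.
    static Map<String, Integer> filterFacets(Map<String, Integer> facets, Pattern p) {
        Map<String, Integer> out = new LinkedHashMap<>();
        facets.forEach((url, count) -> {
            if (p.matcher(url).find()) out.put(url, count);
        });
        return out;
    }

    public static void main(String[] args) {
        Map<String, Integer> facets = new LinkedHashMap<>();
        facets.put("http://en.wikipedia.org/wiki/Solr", 2);
        facets.put("http://fr.wikipedia.org/wiki/Solr", 1);
        facets.put("http://example.com/page", 1);

        // Any subdomain of wikipedia.org, which facet.prefix cannot express.
        Pattern wiki = Pattern.compile("^https?://[^/]*\\.wikipedia\\.org/");
        System.out.println(filterFacets(facets, wiki).keySet());
        // [http://en.wikipedia.org/wiki/Solr, http://fr.wikipedia.org/wiki/Solr]
    }
}
```

The cost Olivier points out still applies: to make post-filtering reliable you may need to request many more facet entries than you finally keep.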
Re: Sorting by Unindexed Fields
Hi Erik,

Sorry it took me a while to get back to your response. I appreciate any help I can get. The number of documents will start out small, but if we do well we'll have a lot. The fields would all be numeric (we'll map categorical fields to integers), and I would imagine the number of fields will be between 2 and 5, but we're not going to limit it. I think for this particular issue we may try to keep the solution in the database, so that that particular information can live in as few places as possible.

Thanks as always for the help.

iSac

On Wed, Aug 26, 2009 at 9:32 PM, Erik Hatcher erik.hatc...@gmail.com wrote:
Solr sorts on indexed fields only, currently. And only a single value per document per sort field (careful with analyzed fields, and no multiValued fields).

"Unwise and impossible" - of course this depends on the scale you're speaking of. How many documents? What types of fields? How small is "fairly small number of fields"?

Erik

On Aug 26, 2009, at 6:33 PM, Isaac Foster wrote:
Hi,

I have a situation where a particular kind of document can be categorized in different ways, and depending on the categories it is in, it will have different fields that describe it (in practice the number of fields will be fairly small, but whatever). These documents will each have a full-text field that Solr is perfect for, and it seems like Solr's dynamic fields ability makes it an even more perfect solution. I'd like to be able to sort by any of the fields, but indexing them all seems somewhere between unwise and impossible. Will Solr sort by fields that are unindexed?

iSac
Re: Field names with whitespaces
This seems to work: ?q=field\ name:something

Probably not a good idea to have field names with whitespace, though.

-Jay

2009/8/28 Marcin Kuptel marcinkup...@gmail.com
Hi,

Is there a way to query Solr for fields whose names contain whitespace? Indexing such data does not cause any problems, but I have been unable to retrieve it.

Regards,
Marcin Kuptel
Re: How to set similarity to catch more results ?
I want it more flexible, so that if I make a mistake with letters, results are still found, like with Google.

You are talking about spelling mistakes? http://wiki.apache.org/solr/SpellCheckComponent

Cheers
Avlesh

On Mon, Aug 31, 2009 at 3:30 PM, Kaoul kaoul@gmail.com wrote:
Hello,

I'm new to Solr and can't find in the documentation how to set the similarity. I want it more flexible, so that if I make a mistake with letters, results are still found, like with Google. Thank you in advance.
Re: Date Faceting and Double Counting
I don't think this behavior needs to be fixed. It is justified for the data you have indexed. "Date minus 1 second" should definitely work for you.

Cheers
Avlesh

On Mon, Aug 31, 2009 at 11:37 PM, Stephen Duncan Jr stephen.dun...@gmail.com wrote:
If we do date faceting and start at 2009-01-01T00:00:00Z, end at 2009-01-03T00:00:00Z, with a gap of +1DAY, then documents that occur at exactly 2009-01-02T00:00:00Z will be included in both of the returned counts (2009-01-01T00:00:00Z and 2009-01-02T00:00:00Z). At the moment, this is quite bad for us, as we only index at the day level, so all of our documents are exactly on the line between facet ranges.

Because we know our data is indexed as being exactly at midnight each day, I think we can simply always start from 1 second prior and get the results we want (start=2008-12-31T23:59:59Z, end=2009-01-02T23:59:59Z), but I think this problem would affect everyone, even if usually more subtly (instead of all documents being counted twice, only the few on the fencepost between ranges).

Is this a known behavior people are happy with, or should I file an issue asking for the ranges in date facets to be constructed to subtract one second from the end of each range (so that the effective range queries for my case would be [2009-01-01T00:00:00Z TO 2009-01-01T23:59:59Z] and [2009-01-02T00:00:00Z TO 2009-01-02T23:59:59Z])? Alternatively, is there some other suggested way of using date faceting to avoid this problem?

--
Stephen Duncan Jr
www.stephenduncanjr.com
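Concretely, the one-second-shifted request described above would carry date-facet parameters like these (the field name date_field is a placeholder; the timestamps are the ones proposed in the message):

```text
facet=true
facet.date=date_field
facet.date.start=2008-12-31T23:59:59Z
facet.date.end=2009-01-02T23:59:59Z
facet.date.gap=+1DAY
```

Each +1DAY bucket then runs from 23:59:59Z to 23:59:59Z, so documents indexed at exactly midnight fall strictly inside one bucket instead of on the boundary between two.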
Re: Hierarchical schema design
As Uri has already replied, there is no concept of a hierarchical schema in Solr. My gut feeling says you might be talking about multiple cores (http://www.google.co.in/search?q=multiple+core+solr).

Cheers
Avlesh

On Mon, Aug 31, 2009 at 5:26 PM, Pooja Verlani pooja.verl...@gmail.com wrote:
Hi all,

Is there a possibility to have a hierarchical schema in Solr, meaning can we have objects under objects? For example, for a doc like:

<doc>
  <a1>
    b1
    b2
    b3
  </a1>
  <a2>
    b1
    b2
    b3
  </a2>
  ...
</doc>

I need to make a schema with 3 types of such objects, all of them having different field values. Please reply if there exists such a possibility.

Regards,
Pooja
Re: Is caching worth it when my whole index is in RAM?
Good question! The application-level cache, say the filter cache, would still help, because it caches not only values but also the underlying computation. Even with all the data in RAM you would still end up redoing the computation on every query. Looking forward to responses from the more knowledgeable.

Cheers
Avlesh

On Mon, Aug 31, 2009 at 8:25 PM, Michael solrco...@gmail.com wrote:
Hi,

If I've got my entire 20G, 4MM-document index in RAM (on a ramdisk), do I have a need for the document cache? Or should I set it to 0 items, because pulling field values from an index in RAM is so fast that the document cache would be a duplication of effort? Are there any other caches that I should turn off if I can get my entire index in RAM? Filter cache, query results cache, etc.?

Thanks!
Michael
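Avlesh's point, that a filter cache saves recomputation rather than just disk I/O, can be sketched with a toy in-memory model. None of this is Solr API; the array-backed "index" and scan counter are purely illustrative:

```java
import java.util.*;
import java.util.function.Function;

public class FilterCacheDemo {
    static int scans = 0;

    // Computing the set of matching docs costs a full scan of the field,
    // even when the field lives entirely in RAM.
    static BitSet computeFilter(int[] field, int value) {
        scans++;
        BitSet docs = new BitSet(field.length);
        for (int doc = 0; doc < field.length; doc++)
            if (field[doc] == value) docs.set(doc);
        return docs;
    }

    public static void main(String[] args) {
        int[] field = {1, 2, 1, 3, 1};
        Map<Integer, BitSet> cache = new HashMap<>();
        // A toy "filter cache": memoizes the computed doc set per value.
        Function<Integer, BitSet> cachedFilter =
            v -> cache.computeIfAbsent(v, x -> computeFilter(field, x));

        cachedFilter.apply(1);
        cachedFilter.apply(1); // second lookup is a cache hit, no rescan
        System.out.println(scans);                 // 1
        System.out.println(cachedFilter.apply(1)); // {0, 2, 4}
    }
}
```

The document cache is the weaker case with a RAM-resident index, since fetching stored fields is closer to a plain read, but the filter and query-result caches still skip genuine CPU work.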
Re: filtering facets
when I will have more motivation, I will submit a patch to solr :-)

Do you want to add more here? https://issues.apache.org/jira/browse/SOLR-1387

Cheers
Avlesh

On Tue, Sep 1, 2009 at 2:51 AM, Olivier H. Beauchesne oliv...@olihb.com wrote:
yeah, but then I would have to retrieve *a lot* of facets. I think for now I'll retrieve all the subdomains with facet.prefix and then merge those queries. Not ideal, but when I have more motivation, I will submit a patch to Solr :-)