How to do a reverse distance search?
Hi everybody,

Let's say we have 10,000 traveling sales-people spread throughout the country. Each of them has their own territory, and most of the territories overlap (e.g. 100 sales-people in a particular city alone). Each of them also has a maximum distance they can travel. Some can travel country-wide; others don't have a car and are limited to a 10mi radius.

Given that we have a client at a particular location, how do we construct a query in Solr that finds all the sales-people who can reach that client? We think we have a solution for this, but I want to know what you think. In SQL this is relatively easy:

    select * from salespeople
    where calc_distance(CLIENT_LAT, CLIENT_LONG, lat, long) < maxTravelDist

But a problem is that calc_distance() is fairly expensive. If it were our client that specified the distance, it would be easy to include it as part of the search criteria in the Solr query, but unfortunately it's each individual sales-person that specifies a distance.

Sincerely,

Daryl.
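For reference, the calc_distance() above is presumably something like the haversine great-circle formula. A minimal Java sketch (the class and method names here are mine, not from Solr), showing both the expensive distance computation and the per-salesperson reachability test:

```java
public class GeoUtil {
    private static final double EARTH_RADIUS_MILES = 3958.8;

    /** Great-circle distance in miles between two lat/long points (haversine formula). */
    public static double haversineMiles(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                 + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                 * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return EARTH_RADIUS_MILES * 2 * Math.atan2(Math.sqrt(a), Math.sqrt(1 - a));
    }

    /** The per-salesperson test: can this person reach the client? */
    public static boolean canReach(double clientLat, double clientLon,
                                   double spLat, double spLon, double maxTravelDist) {
        return haversineMiles(clientLat, clientLon, spLat, spLon) <= maxTravelDist;
    }
}
```

One common way to avoid evaluating this for every document is to index a coarse pre-computed bounding box per sales-person and apply the exact distance only as a second-pass filter on the candidates that survive the box check.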
Re: Suggestions needed: Lots of updates for tiny changes
Hi Otis,

Thanks for your reply and for giving it some thought. Actually, we have considered using something that lives outside of the main index... We've looked into using the ExternalFileField, but abandoned that when it became clear we'd have to use a function to use it, and that limited how we could use the field in our searches. For another more-real-time data problem we're having, we've considered writing a search handler and search component to handle it as a filter-query. This is equivalent to the data structure outside of the main index that you have proposed. The problem with it is that getting it to be *part of the index* is difficult.

Well... any more ideas would be appreciated. But thanks for your help so far.

- Daryl.

On Fri, Jul 3, 2009 at 9:34 PM, Otis Gospodnetic otis_gospodne...@yahoo.com wrote:

I don't have a very specific suggestion, but I wonder if you could have a data structure that lives outside of the main index and keeps only these dates. Presumably this smaller data structure would be simpler/faster to update, and you'd just have to remain in sync with the main index (document-document mapping). I think ParallelReader in Lucene is a similar approach, as is Solr's ExternalFileField.

Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

- Original Message From: Development Team dev.and...@gmail.com To: solr-user@lucene.apache.org Sent: Friday, July 3, 2009 4:46:37 PM Subject: Suggestions needed: Lots of updates for tiny changes

Hi everybody,

Let's say I had an index with 10M large-ish documents, and as people logged into a website and viewed them, the last-viewed-date was updated to the current time. We index a document's last-viewed-date because we allow users to a) search on this last-viewed-date alongside all other searchable criteria, and b) order the results of any search by the last-viewed-date.

The problem is that in a given 5-minute period we may have many thousands of updated documents (due to this simple last-viewed-date). We have a task that looks for changed documents, loads the full documents, and then feeds them into Solr to update the index, but unfortunately reading these changed documents and continually feeding them to Solr generates *far* more load on our system (both Solr and the database) than any of the searches. In a given day, *we may have more updates to documents than we have total documents indexed*. (Databases don't handle this well either; the contention on rows for updates slows the database down significantly.)

How should we approach this problem? It seems like such a waste of resources to be doing so much work in applications/database/Solr only for last-viewed-dates. Solutions we've looked at include:

1) Update only a partial document. --Apparently this isn't supported in Solr yet (we're using nightly Solr 1.4 builds currently).

2) Use near-real-time updates. --Not supported yet. Also, the freshness of the data isn't as much a concern as the sheer volume of changes that we have to make. For example, we could update Solr less frequently, but then we'd just have many more documents to update. The data only has to be, say, fresh to within 30 minutes.

3) Use a separate index for the last-viewed-date. --This won't work because we need to search on the last-viewed-date alongside other criteria, and we use it as scoring criteria for all our searches.

Any suggestions?

Sincerely,

Daryl.
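Since the data only needs to be fresh to within 30 minutes, one mitigation worth noting (a sketch of my own, not a Solr feature) is to round the last-viewed-date down to a 30-minute bucket and only re-index a document when its bucket actually changes. Repeated views of the same document inside one window then cost nothing:

```java
import java.util.concurrent.TimeUnit;

public class ViewDateBucketer {
    private static final long BUCKET_MILLIS = TimeUnit.MINUTES.toMillis(30);

    /** Round a timestamp down to the start of its 30-minute bucket. */
    public static long bucket(long epochMillis) {
        return (epochMillis / BUCKET_MILLIS) * BUCKET_MILLIS;
    }

    /** Only re-index when the bucketed value actually changes. */
    public static boolean needsReindex(long lastIndexedMillis, long newViewMillis) {
        return bucket(newViewMillis) != bucket(lastIndexedMillis);
    }
}
```

Indexing the bucketed value instead of the raw timestamp caps the update rate per document at one write per 30 minutes, at the cost of date searches and sorts being 30-minute-granular.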
Suggestions needed: Lots of updates for tiny changes
Hi everybody,

Let's say I had an index with 10M large-ish documents, and as people logged into a website and viewed them, the last-viewed-date was updated to the current time. We index a document's last-viewed-date because we allow users to a) search on this last-viewed-date alongside all other searchable criteria, and b) order the results of any search by the last-viewed-date.

The problem is that in a given 5-minute period we may have many thousands of updated documents (due to this simple last-viewed-date). We have a task that looks for changed documents, loads the full documents, and then feeds them into Solr to update the index, but unfortunately reading these changed documents and continually feeding them to Solr generates *far* more load on our system (both Solr and the database) than any of the searches. In a given day, *we may have more updates to documents than we have total documents indexed*. (Databases don't handle this well either; the contention on rows for updates slows the database down significantly.)

How should we approach this problem? It seems like such a waste of resources to be doing so much work in applications/database/Solr only for last-viewed-dates. Solutions we've looked at include:

1) Update only a partial document. --Apparently this isn't supported in Solr yet (we're using nightly Solr 1.4 builds currently).

2) Use near-real-time updates. --Not supported yet. Also, the freshness of the data isn't as much a concern as the sheer volume of changes that we have to make. For example, we could update Solr less frequently, but then we'd just have many more documents to update. The data only has to be, say, fresh to within 30 minutes.

3) Use a separate index for the last-viewed-date. --This won't work because we need to search on the last-viewed-date alongside other criteria, and we use it as scoring criteria for all our searches.

Any suggestions?

Sincerely,

Daryl.
Re: Solr Jetty confusion
Hi Brett,

Well, I'm running Solr in Jetty with JBoss, so I used the JBoss method of specifying properties (properties-service.xml). However, you can supply the solr-home on the command line when you start Jetty by using a parameter like -Dsolr.solr.home=C:\solr. You can do it like how they do it for Tomcat: http://wiki.apache.org/solr/SolrTomcat?highlight=(solr.home)

You mention your code is not compiling... the code should be able to compile whether or not you can actually start Solr with the right solr-home. It should also compile regardless of what container you deploy Solr into. What exactly are you trying to do besides getting Solr to start in Jetty?

- Daryl.

On Thu, Jun 18, 2009 at 9:58 PM, pof melbournebeerba...@gmail.com wrote:

Development Team wrote: To specify the solr-home I use a Java system property (instead of the JNDI way) since I already have other necessary system properties for my apps.

Could you please give me a concrete example of how you did this? There is no example code or command-line examples to be found.

Cheers, Brett.

-- View this message in context: http://www.nabble.com/Solr-Jetty-confusion-tp24087264p24104378.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Does Solr 1.4 really work nicely on Jboss 4?
Hi Giovanni,

Solr 1.4 does work fine in JBoss (all of the features, including all of the admin pages). For example, I am running it in JBoss 4.0.5.GA on JDK 1.5.0_18 without problems. I am using Jetty instead of Tomcat; however, instructions for getting it to work in JBoss with Tomcat can be found here: http://wiki.apache.org/solr/SolrJBoss

It should work fine on JBoss 4.0.1.

- Daryl.

On Thu, Jun 18, 2009 at 8:57 AM, Giovanni De Stefano giovanni.destef...@gmail.com wrote:

Hello all, I have a simple question :-) In my project it is mandatory to use Jboss 4.0.1 SP3 and Java 1.5.0_06/08. The software relies on Solr 1.4. Now, I am aware that some JSP admin pages will not be displayed due to some Java5/6 dependency, but this is not a problem because by rewriting some of the JSPs it is possible to have everything up and running. The real question is: is anybody aware of any feature that might not work when deploying the Solr-based software in Jboss 4? I look forward to hearing about your experience.

Cheers, Giovanni
Re: Solr Jetty confusion
Hey,

So... I'm assuming your problem is that you're having trouble deploying Solr in Jetty? Or is your problem that it's deploying just fine but your code throws an exception when you try to run it?

I am running Solr in Jetty, and I just copied the war into the webapps directory and it worked. It was accessible under /solr, and it was accessible under the port that Jetty has as its HTTP listener (which is probably 8080 by default, but probably won't be 8983). To specify the solr-home I use a Java system property (instead of the JNDI way) since I already have other necessary system properties for my apps. So if your problem turns out to be with the JNDI, sorry, I won't be of much help.

Hope that helps...

- Daryl.

On Thu, Jun 18, 2009 at 2:44 AM, pof melbournebeerba...@gmail.com wrote:

Hi, I am currently trying to write a Jetty-embedded Java app that implements SOLR and uses SOLRJ by accepting posts telling it to do a batch index, or a deletion, or what have you. At this point I am completely lost trying to follow http://wiki.apache.org/solr/SolrJetty . In my constructor I am doing the following call:

    Server server = new Server();
    XmlConfiguration configuration = new XmlConfiguration(new FileInputStream("solrjetty.xml"));

My xml has two calls: an addConnector to configure the port etc., and the addWebApplication as specified on the Solr wiki. When running the app I get this:

    Exception in thread "main" java.lang.IllegalStateException: No Method:
    <Call name="addWebApplication">
      <Arg>/solr/*</Arg>
      <Arg>/webapps/solr.war</Arg>
      <Set name="extractWAR">true</Set>
      <Set name="defaultsDescriptor">org/mortbay/jetty/servlet/webdefault.xml</Set>
      <Call name="addEnvEntry">
        <Arg>solr/home</Arg>
        <Arg type="String">/solr/home</Arg>
      </Call>
    </Call>
    on class org.mortbay.jetty.Server

Can anyone point me in the right direction? Thanks.

-- View this message in context: http://www.nabble.com/Solr-Jetty-confusion-tp24087264p24087264.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Problem getting Solr statistics
So, for all those wondering what the problem was: It turns out I can't just initialize my own CoreContainer; that just gives me a *new* set of cores, and since those are not the cores being used by the SolrDispatchFilter, they're never accessed and thus the stats remain the same (such as having 2 queries performed on the core) throughout the life of the server. What I had to do was extend the SolrDispatchFilter to gain access to its protected CoreContainer and use that one.

(Should Solr expose the core to those who are servicing requests that are not HTTP-based? The SolrDispatchFilter puts the core into the request, but not all things that need access to the core are servlets or filters. For example, I'm using an MBean whose actions are called through SNMP.)

- Daryl.

On Tue, Jun 16, 2009 at 2:42 PM, Development Team dev.and...@gmail.com wrote:

Hi all,

I am stumped trying to get statistics from the Solr server. It seems that every time I get the correct SolrInfoMBean, when I look up the proper value (by name) in the NamedList, I get the exact same number back each time. For example, upon start-up the server reports that 2 queries have been performed, and any time I pull the value out of the MBean after that it says 2, even though stats.jsp reports an increasing number of queries over time. What am I doing wrong?

Here is my sample code:

    public class SolrUtil {
        protected static final CoreContainer coreContainer;
        protected static final String DEFAULT_CORE_NAME = "";

        static {
            CoreContainer.Initializer initializer = new CoreContainer.Initializer();
            try {
                coreContainer = initializer.initialize();
            } catch (Exception e) {
                throw new ExceptionInInitializerError("Can't initialize core container: " + e.getMessage());
            }
            initialize();
        }

        private static SolrCore getCore() {
            return getCore(DEFAULT_CORE_NAME);
        }

        private static SolrCore getCore(String name) {
            try {
                return coreContainer.getCore(name);
            } catch (Exception e) {
                e.printStackTrace();
            }
            return null;
        }

        public static String getSolrInfoMBeanValue(SolrInfoMBean.Category category,
                String entryName, String statName) {
            Map<String, SolrInfoMBean> registry = getCore().getInfoRegistry();
            for (Map.Entry<String, SolrInfoMBean> entry : registry.entrySet()) {
                String key = entry.getKey();
                SolrInfoMBean solrInfoMBean = entry.getValue();
                if ((solrInfoMBean.getCategory() != category) || (!entryName.equals(key.trim()))) {
                    continue;
                }
                NamedList<?> nl = solrInfoMBean.getStatistics();
                if ((nl != null) && (nl.size() > 0)) {
                    for (int i = 0; i < nl.size(); i++) {
                        if (nl.getName(i).equals(statName)) {
                            return nl.getVal(i).toString();
                        }
                    }
                }
            }
            return null;
        }

        // [...I have other methods that also get the value as a long, etc.]
    }

This code is modeled after SolrDispatchFilter.java, _info.jsp and stats.jsp. I'd appreciate any help. (And yes, my core is named "".)

Sincerely,

Daryl.
Problem getting Solr statistics
Hi all,

I am stumped trying to get statistics from the Solr server. It seems that every time I get the correct SolrInfoMBean, when I look up the proper value (by name) in the NamedList, I get the exact same number back each time. For example, upon start-up the server reports that 2 queries have been performed, and any time I pull the value out of the MBean after that it says 2, even though stats.jsp reports an increasing number of queries over time. What am I doing wrong?

Here is my sample code:

    public class SolrUtil {
        protected static final CoreContainer coreContainer;
        protected static final String DEFAULT_CORE_NAME = "";

        static {
            CoreContainer.Initializer initializer = new CoreContainer.Initializer();
            try {
                coreContainer = initializer.initialize();
            } catch (Exception e) {
                throw new ExceptionInInitializerError("Can't initialize core container: " + e.getMessage());
            }
            initialize();
        }

        private static SolrCore getCore() {
            return getCore(DEFAULT_CORE_NAME);
        }

        private static SolrCore getCore(String name) {
            try {
                return coreContainer.getCore(name);
            } catch (Exception e) {
                e.printStackTrace();
            }
            return null;
        }

        public static String getSolrInfoMBeanValue(SolrInfoMBean.Category category,
                String entryName, String statName) {
            Map<String, SolrInfoMBean> registry = getCore().getInfoRegistry();
            for (Map.Entry<String, SolrInfoMBean> entry : registry.entrySet()) {
                String key = entry.getKey();
                SolrInfoMBean solrInfoMBean = entry.getValue();
                if ((solrInfoMBean.getCategory() != category) || (!entryName.equals(key.trim()))) {
                    continue;
                }
                NamedList<?> nl = solrInfoMBean.getStatistics();
                if ((nl != null) && (nl.size() > 0)) {
                    for (int i = 0; i < nl.size(); i++) {
                        if (nl.getName(i).equals(statName)) {
                            return nl.getVal(i).toString();
                        }
                    }
                }
            }
            return null;
        }

        // [...I have other methods that also get the value as a long, etc.]
    }

This code is modeled after SolrDispatchFilter.java, _info.jsp and stats.jsp. I'd appreciate any help. (And yes, my core is named "".)

Sincerely,

Daryl.
Re: Solr query performance issue
Yes, those terms are important in calculating the relevancy scores, so they are not in the filter queries. I was hoping that if I could cache everything about a field, any combination of the field values would be read from cache. Then it would not matter whether I query for field1:(02 04 05), field1:(01 02), or field1:03; the response time would be equally quick. Is there any way to achieve that?

Yeah, the range queries are a bottleneck too; I will give the TrieRange fields a try. Thanks for your advice.

Best Regards, Shi Quan He

On Tue, May 26, 2009 at 3:55 PM, Yonik Seeley yo...@lucidimagination.com wrote:

On Tue, May 26, 2009 at 3:42 PM, Larry He shiqua...@gmail.com wrote: We have about 100 different fields and 1 million documents indexed with Solr. Many of the fields are multi-valued, and some are numbers (for range search). We expect to perform Solr queries containing over 30 terms, and often the response time is well over a second. I found that the caches in Solr such as the queryResultCache and filterCache do not help us much in this case, as most of the queries have combinations of terms that are unlikely to repeat. An example of our query would look like: field1:(02 04 05) field2:(01 02 03) field2:(01 02 03) ... My question is how can we improve performance of these queries?

Filters are independently cached... but they are currently only ANDed filters, so you could only split it up like so:

    fq=field1:(02 04 05)&fq=field2:(01 02 03)&fq=field2:(01 02 03)

But that won't help unless some of the individual fq params are repeated across different queries. Range search can also be sped up a lot via the new TrieRange fields, or via the frange (function range query) capabilities in Solr 1.4. It's not clear if the range queries or the term queries are your current bottleneck. If the range queries aren't your bottleneck and separate filters don't work, then a query type could be developed that would help your situation by caching matches on term queries.

Are relevancy scores important for clauses like field1:(02 04 05), or do you sort by some other criteria?

-Yonik http://www.lucidimagination.com
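To make the fq suggestion concrete: each independent clause moves out of q and into its own fq parameter, so Solr can cache each filter's DocSet separately in the filterCache. A small sketch of my own that builds the raw URL query string by hand (in practice SolrJ's SolrQuery with addFilterQuery() does this more conveniently):

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.List;

public class FilterQueryBuilder {
    /** Build a request query string with each independent clause as its own fq,
        so each filter gets its own filterCache entry. */
    public static String build(String q, List<String> filterQueries) {
        StringBuilder sb = new StringBuilder("q=").append(encode(q));
        for (String fq : filterQueries) {
            sb.append("&fq=").append(encode(fq));
        }
        return sb.toString();
    }

    private static String encode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }
}
```

As Yonik notes, this only pays off when individual fq values repeat across queries; each distinct fq value is cached independently of the others.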
How to manage real-time (presence) data in a large index?
Hi everybody,

I have a relatively large index (it will eventually contain ~4M documents and be about 3G in size, I think) that indexes user data, settings, and the like. The documents represent a community of users, a subset of whom may be online at any time. Also, we want to score our search results across searches that span the whole index by the online (i.e. presence) status. Right now the list of online members is kept in a database table; however, we very often need to search on these users. The problem is, we're using Solr for our searches and we don't know how to approach setting up a search system for a large amount of highly volatile data. How do people typically go about this? Do they do one of the following:

1) Set up a second core and only index the online members in there? (Then we could not score normal search results by online status.)

2) Index the online status in our regular Solr index and not worry about it? (If it's fast to update docs in a large index, then why not maintain real-time data in the main index?)

3) Just use a database for the presence data and forget about using Solr for the presence-related searches?

Is there anything in Solr that I should be looking into to help with this problem? I'd appreciate any help.

Sincerely,

Daryl.
Sort by distance from location?
Hi everybody,

My index has latitude/longitude values for locations. I am required to do a search based on a set of criteria and order the results by how far the lat/long location is from the current user's location. Currently we are emulating such a search by adding criteria of ever-widening bounding boxes; the more of those boxes match a document, the higher its score, and thus the closer ones appear at the start of the results. The query looks something like this (newlines between each search term):

    +criteriaOne:1
    +criteriaTwo:true
    +latitude:[-90.0 TO 90.0]
    +longitude:[-180.0 TO 180.0]
    (latitude:[40.52 TO 40.81] longitude:[-74.17 TO -73.79])
    (latitude:[40.30 TO 41.02] longitude:[-74.45 TO -73.51])
    (latitude:[39.94 TO 41.38] longitude:[-74.93 TO -73.03])
    [[...etc...about 10 times...]]

Naturally this is quite slow (the query is approximately 6x slower than normal), and... I can't help but feel that there's a more elegant way of sorting by distance. Does anybody know how to do this or have any suggestions?

Sincerely,

Daryl.
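For what it's worth, the ever-widening boxes above can be generated rather than hand-written. A sketch (the class and method names are mine; ~69 miles per degree of latitude is an approximation, and the longitude span is widened by 1/cos(latitude) to keep the box roughly square on the ground):

```java
import java.util.Locale;

public class BoundingBox {
    final double minLat, maxLat, minLon, maxLon;

    BoundingBox(double minLat, double maxLat, double minLon, double maxLon) {
        this.minLat = minLat; this.maxLat = maxLat;
        this.minLon = minLon; this.maxLon = maxLon;
    }

    /** Approximate box of the given radius (miles) around a center point. */
    static BoundingBox around(double lat, double lon, double radiusMiles) {
        double dLat = radiusMiles / 69.0;                                   // ~69 miles per degree of latitude
        double dLon = radiusMiles / (69.0 * Math.cos(Math.toRadians(lat))); // degrees of longitude shrink with latitude
        return new BoundingBox(lat - dLat, lat + dLat, lon - dLon, lon + dLon);
    }

    /** Render as an optional, score-boosting clause like the ones in the query above. */
    String toSolrClause() {
        return String.format(Locale.US, "(latitude:[%.2f TO %.2f] longitude:[%.2f TO %.2f])",
                minLat, maxLat, minLon, maxLon);
    }
}
```

Doubling the radius on each iteration (e.g. 10, 20, 40, ... miles) reproduces the roughly-doubling boxes in the query above, so nearer documents match more clauses and score higher.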
Re: Sort by distance from location?
Ah, good question: Yes, we've tried it... and it was slower. To give some average times:

    Regular non-distance searches:   100ms
    Our expanding-criteria solution: 600ms
    LocalSolr:                       800ms

(We also had problems with LocalSolr in that the results didn't seem to be cached in Solr upon doing a search. So each page of results meant another 800ms.)

- Daryl.

On Tue, Apr 14, 2009 at 5:34 PM, Smiley, David W. dsmi...@mitre.org wrote:

Have you tried LocalSolr? http://www.gissearch.com/localsolr (I haven't but looks cool)
How to create a query directly (bypassing the query-parser)?
Hi everybody, after reading the documentation on the Solr site, I have the following newbie-ish question:

On the Lucene query parser syntax page (http://lucene.apache.org/java/2_4_0/queryparsersyntax.html) linked to from the Solr query syntax page, they mention: "If you are programmatically generating a query string and then parsing it with the query parser then you should seriously consider building your queries directly with the query API. In other words, the query parser is designed for human-entered text, not for program-generated text."

What do they mean by using the API? If I use SolrJ to construct a SolrQuery, doesn't that get processed by the query parser? How do I bypass the query parser to set up a query directly? Especially for token-values (values that fit a defined set, such as Enum values), it seems silly for me to continually be appending +tokenField:(1 2 3) to my query. Why should I write code to construct the query string, then send this to the parser to parse the string into an object? Can't I set these query parameters directly? If so, how?

- Daryl.
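As far as I can tell (this is my reading, not from the Solr docs), that advice is aimed at embedded Lucene, where you can assemble BooleanQuery/TermQuery objects directly and never touch the parser. Over HTTP with SolrJ the query still arrives at the server as a string and is parsed there, so the best a client can do is generate well-formed clauses. A tiny helper of my own for the token-field case (note Lucene's parenthesized-group syntax separates values with spaces, not commas):

```java
public class TokenClause {
    /** Build a required "+field:(v1 v2 v3)" clause matching any of the token values. */
    static String anyOf(String field, String... values) {
        StringBuilder sb = new StringBuilder("+").append(field).append(":(");
        for (int i = 0; i < values.length; i++) {
            if (i > 0) sb.append(' ');
            sb.append(values[i]);
        }
        return sb.append(')').toString();
    }
}
```

This is safe only because token values come from a closed set (e.g. Enum names); free-form user input would still need escaping of the query parser's special characters.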
Birthday (that's day not date) search query?
Hi everyone,

I have an index that stores birth-dates, and I would like to search for anybody whose birth-date is within X days of a certain month/day. For example, I'd like to know if anybody's birthday is coming up within a certain number of days, regardless of what year they were born. How would I do this using Solr?

As a follow-up, assuming this query is executed very often, should I maybe be indexing something other than the birth-date? Such as just the month-day pair? What is the most efficient way to do such a query?

Sincerely,

Daryl.
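To illustrate the month-day-pair idea: if the index stores a day-of-year alongside the birth-date, the "within X days, ignoring year" test becomes simple circular arithmetic. A sketch with names of my own (Feb 29 is folded into Mar 1, and the wrap past Dec 31 is handled by adding 365):

```java
public class BirthdayWindow {
    // Cumulative days before each month in a non-leap reference year.
    private static final int[] CUM = {0, 31, 59, 90, 120, 151, 181, 212, 243, 273, 304, 334};

    /** Day-of-year (1-365) for a month/day pair; Feb 29 maps to Mar 1. */
    static int dayOfYear(int month, int day) {
        if (month == 2 && day == 29) { month = 3; day = 1; }
        return CUM[month - 1] + day;
    }

    /** True if the birthday (month/day) falls within `days` days after today's month/day. */
    static boolean isUpcoming(int todayMonth, int todayDay, int birthMonth, int birthDay, int days) {
        int diff = dayOfYear(birthMonth, birthDay) - dayOfYear(todayMonth, todayDay);
        if (diff < 0) diff += 365;   // wrap past Dec 31 into the new year
        return diff <= days;
    }
}
```

With a day-of-year field indexed, the common (non-wrapping) case is a single range query like dayOfYear:[152 TO 159]; near year-end it becomes two ranges ORed together, which this same arithmetic tells you how to split.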