Re: solr as nosql - pulling all docs vs deep paging limitations
You can do range queries without an upper bound and just limit the number of results. Then you look at the last result to obtain the new lower bound. -- Jens

On 17/12/13 20:23, Petersen, Robert wrote:
My use case is basically to do a dump of all contents of the index with no ordering needed. It's actually to be a product data export for third parties. The unique key is the product SKU. I could take the min SKU and range query up to the max SKU, but the SKUs are not contiguous because some get turned off and only some are valid for export, so each range would return a different number of products (which may or may not be acceptable, and I might be able to hide that with some code).

-----Original Message-----
From: Mikhail Khludnev [mailto:mkhlud...@griddynamics.com]
Sent: Tuesday, December 17, 2013 10:41 AM
To: solr-user
Subject: Re: solr as nosql - pulling all docs vs deep paging limitations

Hoss,
What about SELECT * FROM ... WHERE ... -- is that misusing Solr? I'm sure you've been asked about that many times. What if clients don't need to rank results, but just want an unordered filtering result like they are used to in an RDBMS? Do you feel that will never be considered a reasonable use case for Solr, or is there a well-known approach for dealing with it?

On Tue, Dec 17, 2013 at 10:16 PM, Chris Hostetter hossman_luc...@fucit.org wrote:
: Then I remembered we currently don't allow deep paging in our current
: search indexes as performance declines the deeper you go. Is this still
: the case?

Coincidentally, I'm working on a new cursor-based API to make this much more feasible as we speak: https://issues.apache.org/jira/browse/SOLR-5463

I did some simple perf testing of the strawman approach and posted the results last week: http://searchhub.org/coming-soon-to-solr-efficient-cursor-based-iteration-of-large-result-sets/

...current iterations on the patch are to eliminate the strawman code to improve performance even more and beef up the test cases.
: If so, is there another approach to make all the data in a collection
: easily available for retrieval? The only thing I can think of is to ...
: Then I was thinking we could have a field with an incrementing numeric
: value which could be used to perform range queries as a substitute for
: paging through everything. Ie queries like 'IncrementalField:[1 TO
: 100]' 'IncrementalField:[101 TO 200]' but this would be difficult to
: maintain as we update the index unless we reindex the entire collection
: every time we update any docs at all.

As I mentioned in the blog above, as long as you have a uniqueKey field that supports range queries, bulk exporting of all documents is fairly trivial: sort on your uniqueKey field and use an fq that also filters on your uniqueKey field, modifying the fq each time to change the lower bound to match the highest ID you got on the previous page.

This approach works really well in simple cases where you want to fetch all documents matching a query and then process/sort them by some other criteria on the client -- but it's not viable if it's important to you that the documents come back from Solr in score order before your client gets them, because you want to stop fetching once some criterion is met in your client. Example: you have billions of documents matching a query, you want to fetch all of them sorted by score desc and crunch them on your client to compute some stats, and once your client-side stat crunching tells you that you have enough results (which might be after the 1000th result, or might be after the millionth result) you want to stop.

SOLR-5463 will help even in that latter case. The bulk of the patch should be easy to use in the next day or so (having other people try it out and test it in their applications would be *very* helpful) and hopefully show up in Solr 4.7

-Hoss
http://www.lucidworks.com/

--
Sincerely yours
Mikhail Khludnev
Principal Engineer, Grid Dynamics
http://www.griddynamics.com
mkhlud...@griddynamics.com
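The uniqueKey-cursor loop Hoss describes can be sketched in a few lines. This is a minimal illustration, not SolrJ code: `fetch_page` stands in for a real Solr request of the form `q=*:*&sort=id asc&rows=N&fq=id:{last_id TO *]` and is simulated here against an in-memory list of IDs.

```python
# Sketch of the "raise the lower bound each page" bulk-export loop.
# fetch_page simulates a Solr query sorted by uniqueKey with an fq
# that excludes everything up to and including the last seen ID.

PAGE = 3

def fetch_page(all_ids, last_id, rows=PAGE):
    """Return up to `rows` IDs strictly greater than last_id, in order."""
    ids = sorted(all_ids)
    if last_id is not None:
        ids = [i for i in ids if i > last_id]
    return ids[:rows]

def export_all(all_ids):
    """Walk the whole collection by repeatedly raising the lower bound."""
    seen, last_id = [], None
    while True:
        page = fetch_page(all_ids, last_id)
        if not page:
            break
        seen.extend(page)
        last_id = page[-1]  # new exclusive lower bound for the next "fq"
    return seen

if __name__ == "__main__":
    # Non-contiguous SKUs, as in Robert's product-export use case.
    skus = ["sku-0099", "sku-0007", "sku-0042", "sku-0001", "sku-0013"]
    print(export_all(skus))
```

Note that this works regardless of gaps in the key space, which addresses Robert's concern about non-contiguous SKUs: each page simply returns the next batch of whatever keys exist.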
Re: Querying a transitive closure?
Exactly, you should usually design your schema to fit your queries, and if you need to retrieve all ancestors then you should index all ancestors so you can query for them easily. If that doesn't work for you then either Solr is not the right tool for the job, or you need to rethink your schema.

The description of doing lookups within a tree structure doesn't sound at all like what you would use a text retrieval engine for, so you might want to rethink why you want to use Solr for this. But if that transitive closure is something you can calculate at indexing time, then the correct solution is the one Upayavira provided.

If you want people to be able to help you, you need to actually describe your problem (i.e. what is my data, and what are my queries) instead of diving into technical details like reducing HTTP round trips. My guess is that if you need to reduce HTTP round trips you're probably doing it wrong.

HTH,
Jens

On 03/28/2013 08:15 AM, Upayavira wrote:
Why don't you index all ancestor classes with the document, as a multivalued field? Then you could get it in one hit. Am I missing something?

Upayavira

On Thu, Mar 28, 2013, at 01:59 AM, Jack Park wrote:
Hi Otis,
That's essentially the answer I was looking for: each shard (are we talking master + replicas?) has the plug-in custom query handler. I need to build it to find out.

What I mean is that there is a taxonomy, say one with a single root for the sake of illustration, which grows all the classes, subclasses, and instances. If I have an object that is somewhere in that taxonomy, then it has a zigzag chain of parents up that tree (I've seen that called a transitive closure). If class B is way up that tree from M, there's no telling how many queries it will take to find it. Hmmm... recursive ascent, I suppose.
Many thanks
Jack

On Wed, Mar 27, 2013 at 6:52 PM, Otis Gospodnetic otis.gospodne...@gmail.com wrote:
Hi Jack,
I don't fully understand the exact taxonomy structure and your needs, but in terms of reducing the number of HTTP round trips, you can do it by writing a custom SearchComponent that, upon getting the initial request, does everything locally, meaning that it talks to the local/specified shard before returning to the caller. In a SolrCloud setup with N shards, each of these N shards could be queried in such a way in parallel, running the query/queries on their local shards.

Otis
--
Solr & ElasticSearch Support
http://sematext.com/

On Wed, Mar 27, 2013 at 3:11 PM, Jack Park jackp...@topicquests.org wrote:
Hi Otis,
I fully expect to grow to SolrCloud -- many shards. For now, it's solo. But my thinking relates to cloud. I look for ways to reduce the number of HTTP round trips through SolrJ. Maybe you have some ideas?

Thanks
Jack

On Wed, Mar 27, 2013 at 10:04 AM, Otis Gospodnetic otis.gospodne...@gmail.com wrote:
Hi Jack,
Is this really about HTTP and Solr vs. SolrCloud, or more about whether Solr(Cloud) is the right tool for the job and, if so, how to structure the schema and queries to make such lookups efficient?

Otis
--
Solr & ElasticSearch Support
http://sematext.com/

On Wed, Mar 27, 2013 at 12:53 PM, Jack Park jackp...@topicquests.org wrote:
This is a question about isA: we want to know if M isA B, i.e. isA?(M, B).

For some M, one might be able to look into M to see its type or which class(es) it is a subclass of. We're talking taxonomic queries now. But for some M, one might need to ripple up the transitive closure, looking at all the superclasses, etc., recursively. It seems unreasonable to do that over HTTP; it seems more reasonable to grab a core and write a custom isA query handler. But how do you do that in a SolrCloud?

Really curious... Many thanks in advance for ideas.
Jack
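Upayavira's suggestion can be sketched outside of Solr: compute the transitive closure of parent links at indexing time and store it as a multivalued `ancestors` field, so `isA?(M, B)` collapses to a single query like `fq=ancestors:B`. The `parents` map below is a made-up taxonomy purely for illustration.

```python
# Sketch: precompute all ancestors at indexing time so the taxonomic
# "isA" question becomes one field match instead of recursive queries.

def ancestors(node, parents):
    """All classes reachable by following parent links from `node`."""
    out, stack = set(), list(parents.get(node, ()))
    while stack:
        p = stack.pop()
        if p not in out:          # guards against cycles in the graph
            out.add(p)
            stack.extend(parents.get(p, ()))
    return out

# Hypothetical taxonomy: M -> SubClassX -> {ClassA, ClassB} -> Root
parents = {
    "M": ["SubClassX"],
    "SubClassX": ["ClassA", "ClassB"],
    "ClassA": ["Root"],
    "ClassB": ["Root"],
}

if __name__ == "__main__":
    # The document you would send to Solr for M; isA?(M, ClassB) is then
    # simply a query on the multivalued ancestors field.
    doc = {"id": "M", "ancestors": sorted(ancestors("M", parents))}
    print(doc)
```

The trade-off, as Jens notes in the reply, is that the closure must be recomputable at indexing time; if the taxonomy above M changes, documents below the change point need reindexing.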
Re: SOLR 4.2 SNAPSHOT There exists no core with name x
On 03/01/2013 07:46 PM, Neal Ensor wrote:
Again, it appears to work fine on Safari hitting the same container, so it must be something Chrome-specific (perhaps something I have disabled?)

This sounds like it might just be a browser cache issue (if you used Chrome to access the same URL previously with the old Solr version installed). It might just not be refreshing everything.

Jens
Re: SOLR 4.2 SNAPSHOT There exists no core with name x
Yes, we've had quite a few surprises with outdated information (and mixtures of old and new information) in the admin UI, so I'd definitely be in favor of getting rid of caching.

Jens

On 03/04/2013 04:03 PM, Stefan Matheis wrote:
Thanks Jens! Didn't think about caching .. :/ Perhaps we should change the requests in favor of https://issues.apache.org/jira/browse/SOLR-4311 to avoid any caching in the UI? That may result in a few more (real) requests, but I guess that would be okay?

Stefan

On Monday, March 4, 2013 at 2:21 PM, Neal Ensor wrote:
Actually, I just updated Chrome this morning, and it all appears to work. Flushed the cache as well, so that could be part of it. All's well that ends well, I suppose.

neal

On Mon, Mar 4, 2013 at 4:44 AM, Jens Grivolla j+...@grivolla.net wrote:
On 03/01/2013 07:46 PM, Neal Ensor wrote:
Again, it appears to work fine on Safari hitting the same container, so it must be something Chrome-specific (perhaps something I have disabled?)

This sounds like it might just be a browser cache issue (if you used Chrome to access the same URL previously with the old Solr version installed). It might just not be refreshing everything.

Jens
Re: configuring schema to match database
On 01/11/2013 06:14 PM, Gora Mohanty wrote:
On 11 January 2013 22:30, Jens Grivolla j+...@grivolla.net wrote:
[...]
Actually, that is what you would get when doing a join in an RDBMS, the cross-product of your tables. This is NOT AT ALL what you typically do in Solr. Best start the other way around: think of Solr as a retrieval system, not a storage system. What are your queries? What do you want to find, and what criteria do you use to search for it?
[...]

Um, he did describe his desired queries, and there was a reason that I proposed the above schema design. He said he wants queries such as "users who have taken courseA and are fluent in English", which is exactly one case I was describing.

UserA has taken courseA, courseB and courseC, and has writingskill good, verbalskill good for English and writingskill excellent, verbalskill excellent for Spanish. UserB has taken courseA, courseF, courseG and courseH, and has writingskill fluent, verbalskill fluent for English and writingskill good, verbalskill good for Italian.

Unless the index is becoming huge, I feel that it is better to flatten everything out rather than combine fields, and post-process the results.

Then please show me the query to find users that are fluent in Spanish AND English. Bonus points if you manage to not retrieve the same user several times. (Hint: your schema stores only one language skill per row.)

Regards,
Jens
Re: configuring schema to match database
On 01/14/2013 12:50 PM, Gora Mohanty wrote:
On 14 January 2013 16:59, Jens Grivolla j+...@grivolla.net wrote:
[...]
Then please show me the query to find users that are fluent in Spanish and English. Bonus points if you manage to not retrieve the same user several times. (Hint: your schema stores only one language skill per row.)

Doh! You are right, of course. Brainfart from my side.

Ok, I was starting to wonder if I was the one missing something. Re-reading what I wrote, I see I may have sounded a bit rude; that was not my intention, sorry.

Best,
Jens
Re: configuring schema to match database
On 01/11/2013 05:23 PM, Gora Mohanty wrote:
You are still thinking of Solr as an RDBMS, where you should not be. In your case, it is easiest to flatten out the data. This increases the size of the index, but that should not really be of concern. As your courses and languages tables are connected only to user, the schema that I described earlier should suffice. To extend my earlier example, given:
* userA with courses c1, c2, c3, and languages l1, l2
* userB with c2, c3, and l2
you should flatten it such that you get the following Solr documents:
* userA, c1 name, c1 startdate, ..., l1, l1 writing skill, ...
* userA, c1 name, c1 startdate, ..., l2, l2 writing skill, ...
* userA, c2 name, c2 startdate, ..., l1, l1 writing skill, ...
* ...
* userB, c2 name, c2 startdate, ..., l2, l2 writing skill, ...
* userB, c3 name, c3 startdate, ..., l2, l2 writing skill, ...
i.e., a total of 3 courses x 2 languages = 6 documents for userA, and 2 courses x 1 language = 2 documents for userB.

Actually, that is what you would get when doing a join in an RDBMS, the cross-product of your tables. This is NOT AT ALL what you typically do in Solr. Best start the other way around: think of Solr as a retrieval system, not a storage system. What are your queries? What do you want to find, and what criteria do you use to search for it?

If your intention is to find users that match certain criteria, each entry should be a user (with ALL associated information, e.g. all courses, all language skills, etc.); if you want to retrieve courses, each entry should be a course.

Let's say you want to find users who have certain language skills. You would have a schema that describes a user:
- user id
- user name
- languages
- ...

In languages, you could store e.g. things like: en|reading|high es|writing|low, etc. It could be a multivalued field, or just have everything separated by spaces and a tokenizer that splits on whitespace.
Now you can query:
- language:es* -- return all users with some Spanish skills
- language:en|writing|high -- return all users with high English writing skills
- +(language:es* language:fr*) +language:en|writing|high -- return users with high English writing skills and some knowledge of French or Spanish

If you want to avoid wildcard queries (they are more costly) you can just add plain "en", "es", etc. to your field, so language:es will match anybody with Spanish skills.

Best,
Jens
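The encoded-field idea above can be sketched independently of Solr: each user document gets one token per (language, skill, level) triple, and the queries become exact-token or prefix matches over those tokens. The field layout and level names below are just the illustrative ones from the email, not a fixed convention.

```python
# Sketch of the lang|skill|level encoding and the two query styles it
# supports: exact token match, and prefix ("wildcard") match like es*.

def language_tokens(skills):
    """skills: {lang: {skill: level}} -> flat 'lang|skill|level' tokens."""
    return [f"{lang}|{skill}|{level}"
            for lang, by_skill in skills.items()
            for skill, level in by_skill.items()]

def matches(tokens, query):
    """Emulate term/prefix matching: a trailing '*' is a prefix query."""
    if query.endswith("*"):
        return any(t.startswith(query[:-1]) for t in tokens)
    return query in tokens

user_a = language_tokens({"en": {"writing": "high", "reading": "high"},
                          "es": {"writing": "low"}})

if __name__ == "__main__":
    print(matches(user_a, "es*"))              # some Spanish skills
    print(matches(user_a, "en|writing|high"))  # high English writing
    print(matches(user_a, "fr*"))              # no French at all
```

Because all of a user's skills live in one multivalued field of one document, the "fluent in Spanish AND English" query from earlier in the thread is a simple conjunction of two token matches and returns each user exactly once.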
Re: Multicore solr
So are you even doing text search in Solr at all, or just using it as a key-value store? If the latter, do you have your schema configured so that only the search_id field is indexed (with a keyword tokenizer) and everything else only stored? Also, are you sure that Solr is the best option as a key-value store?

Jens

On 05/23/2012 04:34 AM, Amit Jha wrote:
Hi,
Thanks for your advice. It is basically a meta-search application. Users can perform a search on N data sources at a time. We broadcast a parallel search to each selected data source and write the data to Solr using a custom-built API (the API and Solr are deployed on separate machines; the API's job is to perform the parallel search and write data to Solr). The API responds to the application that some results are available, and then the application fires a search query to display the results (the query would be q=unique_search_id). Meanwhile the API keeps writing data to Solr, and the user can fire a search to Solr to view all results.

In the current scenario we are using a single Solr server, performing real-time indexing and search. Performing these operations on a single Solr instance makes the process slow as the index size increases. So we are planning to use multicore Solr, where each user will have its own core. All cores will have the same schema. Please suggest if this approach has any issues.

Rgds
AJ

On 22-May-2012, at 20:14, Sohail Aboobaker sabooba...@gmail.com wrote:
It would help if you provide your use case. What are you indexing for each user, and why would you need a separate core for each user? How do you decide the schema for each user? It might be better to describe your use case and desired results. People on the list will be able to advise on the best approach.

Sohail
Re: Wildcard-Search Solr 3.5.0
Maybe a filter like ISOLatin1AccentFilter that doesn't get applied when using wildcards? How do the terms actually appear in the index?

Jens

On 05/23/2012 01:19 PM, spr...@gmx.eu wrote:
No one an idea? Thx.

The text may contain "FooBar". When I do a wildcard search like this: Foo* - no hits. When I do a wildcard search like this: foo* - doc is found.

Please see http://wiki.apache.org/solr/MultitermQueryAnalysis

Well, it works in 3.6. With one exception: if I use German umlauts it does not work anymore.
Text: Bär
Bä* - no hits
Bär - hits
What can I do in this case?

Thank you
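Jens's diagnosis can be illustrated without Solr: if an accent-folding filter runs at index time but is not part of the multiterm (wildcard) analysis chain, the indexed term is "bar" while the query prefix stays "bä", so the prefix never matches. This is only a simulation of that behavior; the folding here uses Unicode decomposition as a stand-in for ISOLatin1AccentFilter, which is not exactly what Lucene does.

```python
# Sketch: accent folding applied at index time but (optionally) not at
# wildcard-query time, showing why 'Bä*' can miss a document containing
# 'Bär' while the exact query 'Bär' still hits.

import unicodedata

def fold(term):
    """Lowercase and strip combining accents, e.g. 'Bär' -> 'bar'."""
    norm = unicodedata.normalize("NFKD", term.lower())
    return "".join(c for c in norm if not unicodedata.combining(c))

indexed = fold("Bär")  # what actually sits in the index: 'bar'

def prefix_match(indexed_term, prefix, fold_query):
    """Emulate a prefix query, with or without multiterm analysis."""
    p = fold(prefix) if fold_query else prefix.lower()
    return indexed_term.startswith(p)

if __name__ == "__main__":
    print(prefix_match(indexed, "Bä", fold_query=False))  # False: the bug
    print(prefix_match(indexed, "Bä", fold_query=True))   # True: folded query
```

The fix in Solr terms is to make sure the folding filter is multiterm-aware (or listed in the analyzer chain used for multiterm queries), so both sides of the comparison are folded the same way.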
Re: mysolr python client
On 11/30/2011 05:40 PM, Marco Martinez wrote:
For anyone interested, recently I've been using a new Solr client for Python. It's easy to use and pretty well documented. If you're interested, its site is: http://mysolr.redtuna.org/

Do you know what advantages it has over pysolr or solrpy? The page only says that mysolr "was born to be a fast and easy-to-use client for Apache Solr's API and because existing Python clients didn't fulfill these conditions".

Thanks,
Jens
Re: MoreLikeThis and two field in mlt.fl
On 11/25/2010 10:06 AM, Damien Fontaine wrote:
I have a problem with MoreLikeThis on Solr 1.4.1. I can't put two fields in mlt.fl. Example: with text and title, only terms from text appear in interestingTerms.

It should work. My guess is that the terms from the title simply don't make the cut due to mlt.mintf, which is often set so that only terms appearing multiple times are considered.

HTH,
Jens
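The mlt.mintf effect Jens describes can be sketched as a simple frequency cutoff: terms whose frequency in the source document falls below mintf never become "interesting terms", and title terms usually occur only once. This is a deliberately simplified model; real MoreLikeThis also applies mlt.mindf, boosts, stopword filtering, and per-field analysis.

```python
# Sketch: a minimum-term-frequency cutoff over a document's fields,
# showing why single-occurrence title terms drop out of the
# "interesting terms" set when mintf >= 2.

from collections import Counter

def interesting_terms(text_fields, mintf=2):
    """Keep terms appearing at least `mintf` times across the fields."""
    counts = Counter(tok for field in text_fields
                     for tok in field.lower().split())
    return {t for t, n in counts.items() if n >= mintf}

if __name__ == "__main__":
    title = "solr paging"
    body = "deep paging in solr solr cursors make paging cheap"
    # 'deep', 'cursors', etc. occur once and are filtered out; only the
    # repeated terms survive the cutoff.
    print(sorted(interesting_terms([title, body], mintf=2)))
```

Lowering mlt.mintf to 1 (or raising the weight of the title field) is the usual way to let such terms through, at the cost of noisier matches.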