Re: Fuzzy searching documents over multiple fields using Solr
I didn't mention it, but I'd like individual fields to contribute to the overall score on a continuum instead of 1 (match) and 0 (no match), which will lead to more fine-grained scoring. A contrived example: all other things being equal, a 40-inch TV should score higher than a 38-inch TV when searching for a 42-inch TV. This is based on some distance modeling on the 'size' field (e.g.: score(42,40) = 0.6 and score(42,38) = 0.4). Other qualitative fields may be modeled in the same way (e.g.: restaurants with a field 'price' with values 'budget', 'mid-range', 'expensive', ...). Any way to incorporate this?

2013/5/9 Jack Krupansky j...@basetechnology.com

A simple OR boolean query will boost documents that have more matches. You can also selectively boost individual OR terms to control importance. And use AND for the required terms, like tv.

-- Jack Krupansky

-Original Message- From: britske Sent: Thursday, May 09, 2013 11:21 AM To: solr-user@lucene.apache.org Subject: Fuzzy searching documents over multiple fields using Solr

Not sure if this has ever come up (or perhaps it is even implemented without me knowing), but I'm interested in doing fuzzy search over multiple fields using Solr. What I mean is the ability to return documents based on some 'distance calculation' without documents having to match 100% to the query.

Use case: a user is searching for a TV with a couple of filters selected. No TV matches all filters. How to come up with a bunch of suggestions that match the selected filters as closely as possible? The hard part is to determine what 'closely' means in this context, etc. This relates to (approximate) nearest neighbor, kd-trees, etc.

Has anyone ever tried to do something similar? Any plugins, etc.? Or reasons Solr/Lucene would/wouldn't be the correct system to build on?
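For what it's worth, this kind of graded per-field scoring can be approximated in Solr with function-query boosts of the shape recip(abs(sub(field, target)), m, a, b), and the curve is easy to prototype offline. A minimal sketch of the idea (pure Python; the field name 'size' and the constants are illustrative, not from the thread):

```python
def field_score(value, target, m=1.0, a=2.0, b=2.0):
    # Reciprocal distance: a / (m * |value - target| + b).
    # Mirrors the shape of Solr's recip(abs(sub(field, target)), m, a, b)
    # function query: 1.0 at an exact match, decaying smoothly with distance.
    return a / (m * abs(value - target) + b)

def combined_score(doc, targets, weights):
    # Weighted sum of per-field continuous scores.
    return sum(weights[f] * field_score(doc[f], t) for f, t in targets.items())

# All other things equal, a 40-inch TV outscores a 38-inch one when the
# user searched for 42 inch:
tvs = [{"size": 40}, {"size": 38}]
s40, s38 = (combined_score(tv, {"size": 42}, {"size": 1.0}) for tv in tvs)
assert s40 > s38
```

Ordinal fields like 'budget'/'mid-range'/'expensive' can be mapped to integers first and scored with the same function.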
Thanks -- View this message in context: http://lucene.472066.n3.nabble.com/Fuzzy-searching-documents-over-multiple-fields-using-Solr-tp4061867.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: modeling prices based on daterange using multipoints
2012/12/12 David Smiley (@MITRE.org) dsmi...@mitre.org

britske wrote:
Hi David, Yeah, interesting (as well as problematic as far as implementing goes) use-case indeed :)

1. You mention there are no special caches / memory requirements inherent in this. For a given user query this would mean all hotels would have to be searched for all point.x each time, right? What would be a good plugin point to build in some custom cached-filter code for this (perhaps using the Solr filter cache)? As I see it, determining all hotels that have a particular point.x value is probably: A) pretty costly to do on each user query; B) static, and can be cached easily without a lot of memory (relatively speaking), i.e.: 20,000 filters (representing all of the 20,000 different point.x values, that is, <date, duration, nr persons, roomtype> combos) with a bitset per filter representing the ids of hotels that have the said point.x.

I think you're over-thinking the complexity of this query. I bet it's faster than you think, and even then, putting this in a filter query 'fq' is going to be cached by Solr anyway, making it lightning fast at subsequent queries.

Ah! Didn't realize such a spatial query could be dropped in an fq. Nice, that solves this part indeed.

britske wrote:
2. I'm not sure I explained C (sorting) well, since I believe you're talking about implementing custom code to sort multiple point.y's per hotel, correct? That's not what I need. Instead, for every user query at most 1 point ever matches. I.e.: a hotel has a price for a particular <date, duration, nr persons, roomtype> combo (P.x) or it hasn't. Say a user queries for the <date, duration, nr persons, roomtype> combo: 21 Dec 2012, 3 days, 2 persons, double. This might be encoded into a value, say: 12345. Now, for the hotels that do match that query (i.e.: those hotels that have a point P for which P.x = 12345) I want to sort those hotels on P.y (the price for the requested P.x).

Ah; ok. But still, my first suggestion is still what I think you could do, except that the algorithm is simpler -- return the first matching 'y' in the document where the point matches the query. Alternatively, if you're confident the number of matching documents (hotels) is going to be small-ish, say less than a couple hundred, then you could simply sort it client-side. You'd have to get back all the values, or maybe write a DocTransformer to find the specific one.

~ David

Writing something similar to ShapeFieldCacheDistanceValueSource, being a ValueSource, would enable me to expose it by name to the frontend? What I'm saying is: let's say I want to call this implementation 'pricesort' and chain it with other sorts, like: 'sort=pricesort asc, popularity desc, name asc'. Or use it by name in a function query. That would be possible, right?

Geert-Jan

- Author: http://www.packtpub.com/apache-solr-3-enterprise-search-server/book -- View this message in context: http://lucene.472066.n3.nabble.com/modeling-prices-based-on-daterange-using-multipoints-tp4026011p4026256.html Sent from the Solr - User mailing list archive at Nabble.com.
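The core trick in this exchange -- collapsing a <date, duration, nr persons, roomtype> combo into a single point.x value and sorting matching hotels on the paired point.y price -- can be sketched offline. A minimal illustration (pure Python; the key scheme and data are invented for the example, not from the thread):

```python
def combo_key(date, duration, persons, roomtype):
    # Collapse a <date, duration, nr persons, roomtype> combo into one
    # comparable key, standing in for the encoded point.x value.
    return f"{date}|{duration}|{persons}|{roomtype}"

def price_sort(hotels, key):
    # Keep only hotels that have a price point for this exact combo,
    # then sort them on the paired point.y (the price).
    matches = [(h["prices"][key], h["name"]) for h in hotels if key in h["prices"]]
    return [name for price, name in sorted(matches)]

hotels = [
    {"name": "A", "prices": {combo_key("2012-12-21", 3, 2, "double"): 200}},
    {"name": "B", "prices": {combo_key("2012-12-21", 3, 2, "double"): 150}},
    {"name": "C", "prices": {combo_key("2012-12-22", 3, 2, "double"): 100}},
]
key = combo_key("2012-12-21", 3, 2, "double")
# Hotel C has no point for this combo and drops out; B is cheaper than A.
assert price_sort(hotels, key) == ["B", "A"]
```

In Solr terms the filtering half is the fq on point.x and the sorting half is the custom ValueSource ('pricesort') discussed above.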
Re: social123 Data Appending Service
No thanks, and I'm not sure which site you're talking about, btw. But anyway, no thanks.

On 26 January 2012 19:41, Aaron Biddar aaron.bid...@social123.com wrote:

Hi there- I was on your site today and was not sure who to reach out to. My company, Social123, provides Social Data Appending for companies that provide lists. In a nutshell, we add Facebook, LinkedIn and Twitter contact information to your current lists. It's a great way to easily offer a new service or add on to your current offerings. Providing social media contact information to your customers will allow them to interact with their customers on a whole new level. If you are the right person to speak with, please let me know your availability for a quick 5-minute demo or check out our tour at www.social123.com. If you are not the right person, would you mind passing this e-mail along? Thanks in advance.

-- Aaron Biddar Founder, CEO aaron.bid...@social123.com www.social123.com 78 Alexander St. #K Charleston SC 29403 M 678 925 3556 P 800.505.7295 ex101
Re: multiple dateranges/timeslots per doc: modeling openinghours.
Interesting! Reading your previous blog posts, I gather that the to-be-posted 'implementation approaches' include a way of making the SpanQueries available within Solr? Also, with your approach, would (numeric) RangeQueries be possible, as Hoss suggests? Looking forward to that 'implementation post'.

Cheers, Geert-Jan

On 1 October 2011 19:57, Mikhail Khludnev mkhlud...@griddynamics.com wrote:

I agree about SpanQueries. It's a viable measure against false-positive matches on multivalue fields. We've implemented this approach some time ago. Pls find details at http://blog.griddynamics.com/2011/06/solr-experience-search-parent-child.html and http://blog.griddynamics.com/2011/07/solr-experience-search-parent-child.html . We are going to publish a third post about implementation approaches.

-- Mikhail Khludnev

On Sat, Oct 1, 2011 at 6:25 AM, Chris Hostetter hossman_luc...@fucit.org wrote:

: Another, faulty, option would be to model opening/closing hours in 2
: multivalued date-fields, i.e: open, close. and insert open/close for each
: day, e.g:
:
: open: 2011-11-08:1800 - close: 2011-11-09:0300
: open: 2011-11-09:1700 - close: 2011-11-10:0500
: open: 2011-11-10:1700 - close: 2011-11-11:0300
:
: And queries would be of the form:
:
: 'open now close now+3h'
:
: But since there is no way to indicate that 'open' and 'close' are pairwise
: related I will get a lot of false positives, e.g the above document would be
: returned for:

This isn't possible out of the box, but the general idea of position-linked queries is possible using the same approach as the FieldMaskingSpanQuery...

https://lucene.apache.org/java/3_4_0/api/core/org/apache/lucene/search/spans/FieldMaskingSpanQuery.html
https://issues.apache.org/jira/browse/LUCENE-1494

..implementing something like this that would work with (Numeric)RangeQueries however would require some additional work, but it should certainly be doable -- I've suggested this before but no one has taken me up on it...

http://markmail.org/search/?q=hoss+FieldMaskingSpanQuery

If we take it as a given that you can do multiple ranges at the same position, then you can imagine supporting all of your regular hours using just two fields (open and close) by encoding the day+time of each range of open hours into them -- even if a store is open for multiple sets of ranges per day (ie: closed for siesta)...

open: mon_12_30, tue_12_30, wed_07_30, wed_3_30, ...
close: mon_20_00, tue_20_30, wed_12_30, wed_22_30, ...

then asking for stores open now and for the next 3 hours on wed at 2:13PM becomes a query for...

sameposition(open:[* TO wed_14_13], close:[wed_17_13 TO *])

For the special-case part of your problem, when there are certain dates that a store will be open atypical hours, I *think* that could be solved using some special docs and the new join QParser in a filter query...

https://wiki.apache.org/solr/Join

Imagine you have your regular docs with all the normal data about a store, and the open/close fields I describe above. But in addition to those, for any store that you know is closed on Dec 25, or only open 12:00-15:00 on Jan 01, you add an additional small doc encapsulating the information about the store's closures on that special date - so that each special case would be its own doc, even if one store had 5 days where there was a special case...

specialdoc1:
  store_id: 42
  special_date: Dec-25
  status: closed

specialdoc2:
  store_id: 42
  special_date: Jan-01
  status: irregular
  open: 09_30
  close: 13_00

Then when you are executing your query, you use an fq to constrain to stores that are (normally) open right now (like I mentioned above) and you use another fq to find all docs *except* those resulting from a join against these special-case docs based on the current date. So if your query is open now and for the next 3 hours and now == sunday, 2011-12-25 @ 10:17AM, your query would be something like...

q=...user input...
time=sameposition(open:[* TO sun_10_17], close:[sun_13_17 TO *])
fq={!v=time}
fq={!join from=store_id to=unique_key v=$vv}
vv=-(+special_date:Dec-25 +(status:closed OR _query_:{v=$time}))

That join-based approach for dealing with the special dates should work regardless of whether someone implements a way to do pairwise sameposition() range queries ... so if you can live w/o the multiple open/close pairs per day, you can just use the one-field-per-day-of-the-week type approach you mentioned, combined with the join for special-case days of the year, and everything you need should already work w/o any code (on trunk).

(disclaimer: obviously I haven't tested that query, the exact syntax may be off, but the principle for modeling the special docs and using them in a join should work)

-Hoss --
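Hoss's encoding can be prototyped without Solr: turn each opening range into comparable day+time tokens and check, pairwise, that a single range satisfies both bounds. A minimal sketch of the idea (pure Python, illustrative only; a real implementation would need the position-linked range query he describes):

```python
DAYS = ["mon", "tue", "wed", "thu", "fri", "sat", "sun"]

def token(day, hour, minute):
    # Encode e.g. ("wed", 14, 13) as a comparable (day_index, hour, minute)
    # tuple, the moral equivalent of Hoss's wed_14_13 term.
    return (DAYS.index(day), hour, minute)

def open_for(ranges, start, hours):
    # ranges: list of (open_token, close_token) pairs for one store.
    # True if some single range opens at or before `start` and closes at or
    # after start + hours -- the pairwise
    # sameposition(open:[* TO start], close:[end TO *]) check.
    d, h, m = start
    end = (d, h + hours, m)  # simplistic: no midnight wrap-around
    return any(o <= start and c >= end for o, c in ranges)

# A store with a siesta gap: open 07:30-12:30 and 15:30-22:30 on Wednesday.
store = [(token("wed", 7, 30), token("wed", 12, 30)),
         (token("wed", 15, 30), token("wed", 22, 30))]
# Open now and for the next 3 hours at Wed 2:13 PM? No: falls in the gap.
assert not open_for(store, token("wed", 14, 13), 3)
# At Wed 4:00 PM it works: the evening range covers 16:00-19:00.
assert open_for(store, token("wed", 16, 0), 3)
```

Checking both bounds against the *same* range is exactly what a naive query over two independent multivalued fields cannot do, hence the FieldMaskingSpanQuery-style machinery.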
Re: multiple dateranges/timeslots per doc: modeling openinghours.
Thanks Hoss for that in-depth walkthrough. I like your solution of using (something akin to) FieldMaskingSpanQuery: https://lucene.apache.org/java/3_4_0/api/core/org/apache/lucene/search/spans/FieldMaskingSpanQuery.html

Conceptually the join approach looks like it would work on paper, although I'm not a big fan of introducing a lot of complexity to the frontend / querying part of the solution.

As an alternative, what about using your FieldMaskingSpanQuery approach solely (without the join approach) and encoding open/close on a per-day basis? I didn't mention it, but I 'only' need 100 days of data, which would lead to 100 open and 100 close values, not counting the POIs with multiple opening hours per day, which are pretty rare. The index is rebuilt each night, refreshing the date data. I'm not sure what the performance implications would be like, but somehow that feels doable. Perhaps it even offsets the extra time needed for doing the joins; only 1 way to find out, I guess. A disadvantage would be fewer cache hits when using fq.

Data then becomes:

open: 20111020_12_30, 20111021_12_30, 20111022_07_30, ...
close: 20111020_20_00, 20111021_26_30, 20111022_12_30, ...

Notice the 20111021_26_30, which indicates close at 2AM the next day; that would work (in contrast to encoding it like 20111022_02_30).

Alternatively, how would you compare your suggested approach with the approach by David Smiley using either SOLR-2155 (Geohash prefix query filter) or LSP: https://issues.apache.org/jira/browse/SOLR-2155?focusedCommentId=13115244#comment-13115244 . That would work right now, and the LSP approach seems pretty elegant to me. FQ-style caching is probably not possible though.

Geert-Jan

On 1 October 2011 04:25, Chris Hostetter hossman_luc...@fucit.org wrote:
[quoted text from Hoss's reply above trimmed]
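The 20111021_26_30 trick -- letting the hour run past 24 so a close at 2AM stays on the same date prefix as its open -- is easy to sanity-check. One way to read the contrast with 20111022_02_30 is that a day's query window stays bounded within that day's prefix; a small sketch of that reading (pure Python; the date values are the ones from the mail, the prefix-bounded range is my assumption):

```python
def tok(date, hour, minute):
    # Zero-padded day+time token; the hour may exceed 24 so that a close at
    # 2:30 AM the next morning (26_30) stays on the open's date prefix.
    return f"{date}_{hour:02d}_{minute:02d}"

open_t = tok("20111021", 17, 0)    # opens 5 PM
close_t = tok("20111021", 26, 30)  # closes 2:30 AM the next morning

# "Open now and for the next 3 hours" at 11 PM on 2011-10-21, keeping the
# close range on that day's prefix:
#   open:[* TO 20111021_23_00]  close:[20111021_26_00 TO 20111021_99_99]
start = tok("20111021", 23, 0)
end_lo, end_hi = tok("20111021", 26, 0), "20111021_99_99"
assert open_t <= start
assert end_lo <= close_t <= end_hi

# Encoding the same close on the next day's real date falls off the prefix
# and would never match a range scoped to 2011-10-21:
assert not (end_lo <= tok("20111022", 2, 30) <= end_hi)
```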
Re: copyField destination does not exist
The error is saying you have a copyField directive in schema.xml that wants to copy the value of a field to the destination field 'text', which doesn't exist (which indeed is the case, given your supplied fields). Search your schema.xml for 'copyField'. There's probably something configured related to copyField functionality that you don't want. Perhaps you uncommented the copyField portion of schema.xml by accident?

hth, Geert-Jan

2011/3/28 Merlin Morgenstern merli...@fastmail.fm

Hi there, I am trying to get Solr to index MySQL tables. It seems like I have misconfigured schema.xml:

HTTP ERROR: 500
Severe errors in solr configuration.
org.apache.solr.common.SolrException: copyField destination :'text' does not exist
at org.apache.solr.schema.IndexSchema.registerCopyField(IndexSchema.java:685)

My config looks like this:

<fields>
  <field name="id" type="string" indexed="true" stored="true" required="true"/>
  <field name="phrase" type="text" indexed="true" stored="true" required="true"/>
  <field name="country" type="text" indexed="true" stored="true" required="true"/>
</fields>
<uniqueKey>id</uniqueKey>
<!-- field for the QueryParser to use when an explicit fieldname is absent -->
<defaultSearchField>phrase</defaultSearchField>

What is wrong with this config? The types should be OK.

-- http://www.fastmail.fm - Choose from over 50 domains or use your own
Re: working with collection : Where is default schema.xml
Changing the default schema.xml to what you want is the way to go for most of us. It's a good learning experience as well, since it contains a lot of documentation about the options that may be of interest to you.

Cheers, Geert-Jan

2011/3/22 geag34 sac@gmail.com

Ok, thanks. It was my fault. I had created the collection with a Lucid Imagination perl script. I will erase the schema.xml. Thanks

-- View this message in context: http://lucene.472066.n3.nabble.com/working-with-collection-Where-is-default-schema-xml-tp2700455p2712496.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Adding the suggest component
2011-03-18 14:11:02.284:INFO::Started SocketConnector@0.0.0.0:8983

Solr started on port 8983, so instead of this: http://localhost/solr/admin/ try this: http://localhost:8983/solr/admin/

Cheers, Geert-Jan

2011/3/18 Brian Lamb brian.l...@journalexperts.com

That does seem like a better solution. I downloaded a recent version and there were the following files/folders: build.xml, dev-tools, LICENSE.txt, lucene, NOTICE.txt, README.txt, solr. So I did cp -r solr/* /path/to/solr/stuff/ and started Solr. I didn't get any error message, but I only got the following messages:

2011-03-18 14:11:02.016:INFO::Logging to STDERR via org.mortbay.log.StdErrLog
2011-03-18 14:11:02.240:INFO::jetty-6.1-SNAPSHOT
2011-03-18 14:11:02.284:INFO::Started SocketConnector@0.0.0.0:8983

Whereas before I got a bunch of messages indicating various libraries had been loaded. Additionally, when I go to http://localhost/solr/admin/, I get the following message:

HTTP ERROR: 404
Problem accessing /solr/admin. Reason: NOT_FOUND

What did I do incorrectly?

Thanks, Brian Lamb

On Fri, Mar 18, 2011 at 9:04 AM, Erick Erickson erickerick...@gmail.com wrote:

What do you mean you copied the contents...to the right place? If you checked out trunk and copied the files into 1.4.1, you have mixed source files between disparate versions. All bets are off. Or do you mean jar files? Or??? I'd build the source you checked out (at the Solr level) and use that rather than try to mix-n-match. BTW, if you're just starting (as in not in production), you may want to consider using 3.1, as it's being released even as we speak and has many improvements over 1.4. You can get a nightly build from here: https://builds.apache.org/hudson/view/S-Z/view/Solr/

Best, Erick

On Thu, Mar 17, 2011 at 3:36 PM, Brian Lamb brian.l...@journalexperts.com wrote:

Hi all, When I installed Solr, I downloaded the most recent version (1.4.1) I believe. I wanted to implement the Suggester ( http://wiki.apache.org/solr/Suggester ). I copied and pasted the information there into my solrconfig.xml file, but I'm getting the following error:

Error loading class 'org.apache.solr.spelling.suggest.Suggester'

I read up on this error and found that I needed to check out a newer version from SVN. I checked out a full version and copied the contents of src/java/org/apache/spelling/suggest to the same location on my setup. However, I am still receiving this error. Did I not put the files in the right place? What am I doing incorrectly?

Thanks, Brian Lamb
Re: Solr query POST and not in GET
Yes, it's possible. Assuming you're using SolrJ as a client library, set the method on the request:

QueryRequest req = new QueryRequest(params); // params = your query parameters
req.setMethod(SolrRequest.METHOD.POST);

Any other client library should have a similar option.

hth, Geert-Jan

2011/3/15 Gastone Penzo gastone.pe...@gmail.com

Hi, is it possible to change the Solr query method from GET to POST? Because my query has a lot of OR..OR..OR, and the log says to me: Request URI too large. Where can I change it? thanx

-- Gastone Penzo www.solr-italia.it The first Italian blog about SOLR
Re: Solr Query
But it returns all resuts with MSRP = 1 and doesnt consider 2nd query at all.

I believe you mean: 'it returns all results with RetailPriceCodeID = 1 while ignoring the 2nd clause'? If so, please check that your default operator is set to AND in your schema config. Other than that, your syntax seems correct.

Hth, Geert-Jan

2011/3/15 Vishal Patel lin...@gmail.com

I am a bit new to Solr. I am running the below query in the query browser of the admin interface:

+RetailPriceCodeID:1 +MSRP:[16001.00 TO 32000.00]

I think it should return only results with RetailPriceCodeID = 1 and MSRP between 16001 and 32000. But it returns all results with RetailPriceCodeID = 1 and doesn't consider the 2nd clause at all. Am I doing something wrong here? Please help
Re: Solr and Permissions
Ahh yes, sorry about that. I assumed ExternalFileField would work for filtering as well. Note to self: never assume.

Geert-Jan

2011/3/12 Koji Sekiguchi k...@r.email.ne.jp

(11/03/12 10:28), go canal wrote:
Looking at the API doc, it seems that only floating-point values are currently supported; is that true?

Right. And it is just for changing the score by using the float values in the file, so it cannot be used for filtering.

Koji -- http://www.rondhuit.com/en/
Re: Getting Category ID (primary key)
If it works, is performant, and not too messy, it's a good way :-). You can also consider just faceting on id, and using the id to fetch the category name through SQL / NoSQL. That way your logic is separated from your presentation, which makes extending (think internationalizing, etc.) easier. Not sure if that's appropriate for your 'category' field, but anyway.

I believe you were asking this because you already had 2 multivalued fields, 'id' and 'category', which you wanted to reuse for this particular use case. In short: you can't link a particular value in a multivalued field (e.g. 'id') to a particular value in another multivalued field (e.g. 'category'), so just give up on that route and go with what you had, or use the suggestion above.

hth, Geert-Jan

2011/3/11 Prav Buz buz.p...@gmail.com

Hi, Thanks Erick, yes that's what I've done for now, but was wondering if it's the best way :) thanks Praveen

On Fri, Mar 11, 2011 at 6:06 PM, Erick Erickson erickerick...@gmail.com wrote:

Thinking out loud here, but would it work to just have ugly categories? Instead of splitting them up, just encode them like 1|a 2|b 3|c or some such. Then split them back up again, and display the name to the user and use the ID in the URL.

Best, Erick

On Fri, Mar 11, 2011 at 4:17 AM, Prav Buz buz.p...@gmail.com wrote:

Hi, Yes I already have different fields for category and category id, and they are in the same order when retrieved from Solr. For eg: IDs 1 3 4 5, names a b c d e: id 1 is of name a and id 5 is of name e. But when I sort the category names, I lose this order, as they are not related in any manner in the Solr docs.

Thanks, Praveen

On Fri, Mar 11, 2011 at 2:35 PM, Gora Mohanty g...@mimirtech.com wrote:

On Fri, Mar 11, 2011 at 2:32 PM, Prav Buz buz.p...@gmail.com wrote: [...] I need to show facets on Category, and then I need the category id in the href link. For this, what I'm trying to do is create a field which will store ID|Category in the schema and split it in the UI. Also I have Category and category id's indexed. [...]

Why not have two different fields for category, and for category ID?

Regards, Gora
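Erick's 'ugly categories' trick -- indexing one multivalued field of id|name pairs so the pairing survives -- takes only a few lines to sketch. A minimal illustration (pure Python; the separator and values are just examples):

```python
def encode(pairs):
    # Store each (id, name) pair as a single "id|name" token in one
    # multivalued field, so the pairing is preserved per value.
    return [f"{cid}|{name}" for cid, name in pairs]

def decode(token):
    # Split back up for display: name for the user, id for the href.
    cid, name = token.split("|", 1)
    return cid, name

field = encode([("1", "a"), ("3", "b"), ("5", "e")])
# Sorting on the name part no longer loses track of which id goes with it:
by_name = sorted(field, key=lambda t: decode(t)[1])
assert [decode(t)[0] for t in by_name] == ["1", "3", "5"]
```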
Re: Solr and Permissions
About the 'having to reindex when permissions change' problem: have a look at ExternalFileField ( http://lucene.apache.org/solr/api/org/apache/solr/schema/ExternalFileField.html ), which enables you to reload a file without having to reindex all the documents.

Thinking out loud: a multivalued field 'roles' of type ExternalFileField:
- assign each person 1 or multiple roles.
- each document has multiple roles assigned to it (which are entitled to view it).

Not sure if it (the ExternalFileField approach) scales though.

Geert-Jan

2011/3/11 Bill Bell billnb...@gmail.com

Why not just add a security field in Solr and use fq to limit to the user's permissions?

Bill Bell, sent from mobile

On Mar 11, 2011, at 10:27 AM, Walter Underwood wun...@wunderwood.org wrote:

On Mar 10, 2011, at 10:48 PM, go canal wrote:
But in the real world, any content management system needs full-text search; so the question is how to support search with permission control. I have yet to see a search engine that provides some sort of content management features like we are discussing here (Solr, Elastic Search?)

It isn't free, but MarkLogic can do this. It is an XML database with security support and search. Changing permissions is an update transaction, not a reload. Permissions can be part of a search, just like any other constraint. The search is not the usual crappy search you get in a database. MarkLogic is built with search engine technology, so the search is fast and good. We do offer a community license for personal, not-for-profit use. See details here: http://developer.marklogic.com/licensing

wunder -- Walter Underwood Lead Engineer, MarkLogic
Re: Solr
Start by reading http://wiki.apache.org/solr/FrontPage and the provided links (introduction, tutorial, etc.)

2011/3/10 yazhini.k vini yazhini@gmail.com

Hi, I need notes and details about Solr, because I am now working with Solr and need help.

Regards, Yazhini K, NCSI, M.Sc (Software Engineering)
Re: how would you design schema?
Would having a Solr document represent a 'product purchase per account' solve your problem? You could then easily link the date of purchase to the document, as well as the account number. E.g. fields: orderid (key), productid, product characteristics, order characteristics (including date of purchase).

Or, in case an order can contain multiple products sharing an orderid: fields: concat(orderid, productid) (key), orderid, productid, product characteristics, order characteristics (including date of purchase).

The difference to your setup (i.e. one document per account) is that the suggested setup above may return multiple documents when you search by account number, which may or may not be what you're after.

hth, Geert-Jan

2011/3/9 dan whelan d...@adicio.com

Hi, I'm investigating how to set up a schema like this: I want to index accounts and the products purchased (multiValued) by each account, but I also need the ability to search by the date the product was purchased. It would be easy if the purchase date wasn't part of the requirements. How would the schema be designed? Is there a better approach?

Thanks, Dan
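The reshaping suggested here -- one flat document per purchase instead of one document per account with parallel multivalued fields -- is easy to sketch. A minimal illustration (pure Python; the field names mirror the mail, the data is invented):

```python
def flatten(accounts):
    # Emit one Solr-style doc per purchase; the composite key stands in
    # for the concat(orderid, productid) key from the mail.
    docs = []
    for acct in accounts:
        for p in acct["purchases"]:
            docs.append({
                "key": f'{p["orderid"]}_{p["productid"]}',
                "account": acct["account"],
                "productid": p["productid"],
                "purchase_date": p["date"],
            })
    return docs

accounts = [{"account": "A1", "purchases": [
    {"orderid": "o1", "productid": "p1", "date": "2011-03-01"},
    {"orderid": "o1", "productid": "p2", "date": "2011-03-05"},
]}]
docs = flatten(accounts)
# Searching by account now returns one doc per purchase, each carrying
# its own purchase date, so date-range filtering works per product:
assert len([d for d in docs if d["account"] == "A1"]) == 2
assert [d["key"] for d in docs] == ["o1_p1", "o1_p2"]
```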
Re: [ANNOUNCE] Web Crawler
Hi Dominique,

This looks nice. In the past, I've been interested in (semi-)automatically inducing a scheme/wrapper from a set of example webpages (often called 'wrapper induction' in the scientific field). This would allow for fast scheme creation, which could be used as a basis for extraction. Lately I've been looking for crawlers that incorporate this technology, but without success. Any plans on incorporating this?

Cheers, Geert-Jan

2011/3/2 Dominique Bejean dominique.bej...@eolya.fr

Rosa, In the pipeline, there is a stage that extracts the text from the original document (PDF, HTML, ...). It is possible to plug in scripts (Java 6 compliant) in order to keep only relevant parts of the document. See http://www.wiizio.com/confluence/display/CRAWLUSERS/DocTextExtractor+stage

Dominique

On 02/03/11 09:36, Rosa (Anuncios) wrote:

Nice job! It would be good to be able to extract specific data from a given page via XPath though.

Regards,

On 02/03/2011 01:25, Dominique Bejean wrote:

Hi, I would like to announce Crawl Anywhere. Crawl-Anywhere is a Java web crawler. It includes:

* a crawler
* a document processing pipeline
* a Solr indexer

The crawler has a web administration in order to manage the web sites to be crawled. Each web site crawl is configured with a lot of possible parameters (not all mandatory):

* number of simultaneous items crawled per site
* recrawl period rules based on item type (HTML, PDF, …)
* item type inclusion / exclusion rules
* item path inclusion / exclusion / strategy rules
* max depth
* web site authentication
* language
* country
* tags
* collections
* ...

The pipeline includes various ready-to-use stages (text extraction, language detection, Solr-ready-to-index XML writer, ...). All is very configurable and extendable, either by scripting or Java coding. With scripting technology, you can help the crawler handle javascript links, or help the pipeline extract relevant titles and clean up the HTML pages (remove menus, headers, footers, ...). With Java coding, you can develop your own pipeline stage.

The Crawl Anywhere web site provides good explanations and screenshots. All is documented in a wiki. The current version is 1.1.4. You can download and try it out from here: www.crawl-anywhere.com

Regards, Dominique
Re: Efficient boolean query
If you often query X as part of several other queries (e.g. X | X AND Y | X AND Z), you might consider putting X in a filter query ( http://wiki.apache.org/solr/CommonQueryParameters#fq ), leading to:

q=*:*&fq=X
q=Y&fq=X
q=Z&fq=X

Filter queries are cached separately, which means that after the first query involving X, X should be returned quickly. So your FIRST query will probably still be in the 'few seconds' range, but all following queries involving X will return much quicker.

hth, Geert-Jan

2011/3/2 Ofer Fort ofer...@gmail.com

Hey all, I have an index with a lot of documents with the term X and no documents with the term Y. If I query for X, it takes a few seconds and returns the results. If I query for Y, it takes a millisecond and returns an empty set. If I query for Y AND X, it takes a few seconds and returns an empty set. I'm guessing that it evaluates both X and Y and only then tries to intersect them? Am I wrong? Is there another way to run this query more efficiently?

thanks for any input
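The speedup described here comes from Solr caching each fq's document set independently and intersecting it with the main query's result. A toy model of that behavior (pure Python sets standing in for bitsets; the class and names are invented for illustration):

```python
class FilterCache:
    # Toy model of Solr's filterCache: each filter query's matching doc-id
    # set is computed once, then reused by every later query that names it.
    def __init__(self, index):
        self.index = index   # term -> set of matching doc ids
        self.cache = {}
        self.misses = 0

    def docs(self, term):
        if term not in self.cache:
            self.misses += 1  # the expensive first evaluation
            self.cache[term] = self.index.get(term, set())
        return self.cache[term]

    def search(self, q, fq):
        # Main query result intersected with the (cached) filter set.
        return self.index.get(q, set()) & self.docs(fq)

index = {"X": {1, 2, 3, 4}, "Y": {3, 4, 5}, "Z": {4, 6}}
fc = FilterCache(index)
assert fc.search("Y", fq="X") == {3, 4}
assert fc.search("Z", fq="X") == {4}
assert fc.misses == 1  # X was evaluated only once, then served from cache
```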
Re: Problem with sorting using functions.
Sort by function query is only available from Solr 3.1 (see: http://wiki.apache.org/solr/FunctionQuery#Sort_By_Function ).

2011/2/28 John Sherwood j...@storecrowd.com

This works: /select/?q=*:*&sort=price desc
This throws a 400 error: /select/?q=*:*&sort=sum(1, 1) desc -- Missing sort order.

I'm using 1.4.2. I've tried all sorts of different numbers, functions, and fields, but nothing seems to change that error. Any ideas?
Re: Sort Stability With Date Boosting and Rounding
You could always use a secondary sort as a tie-breaker, i.e. something unique like 'documentid'. That would ensure a stable sort.

2011/2/23 Stephen Duncan Jr stephen.dun...@gmail.com

I'm trying to use http://wiki.apache.org/solr/SolrRelevancyFAQ#How_can_I_boost_the_score_of_newer_documents as a bf parameter to my dismax handler. The problem is, the value of NOW can cause documents in a similar range (date value within a few seconds of each other) to sometimes round to be equal, and sometimes not, changing their sort order (when equal, falling back to a secondary sort). This, in turn, screws up paging. The problem is that the score is rounded to a lower level of precision than what the suggested formula produces as a difference between two values within seconds of each other.

It seems to me that if I could round the value to minutes or hours, where the difference will be large enough to not be rounded out, then I wouldn't have problems with the order changing on me. But it's not legal syntax to specify something like:

recip(ms(NOW,manufacturedate_dt/HOUR),3.16e-11,1,1)

Is this a problem anyone has faced and solved? Anyone have suggested solutions, other than indexing a copy of the date field that's rounded to the hour?

-- Stephen Duncan Jr www.stephenduncanjr.com
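The instability is easy to reproduce numerically: the recip curve over raw milliseconds moves with NOW, while snapping NOW to a fixed boundary first keeps scores identical between requests made within the same bucket. A small sketch (pure Python; the constants are the ones from the SolrRelevancyFAQ formula, the bucketing is my illustration of the rounding idea):

```python
HOUR_MS = 3_600_000

def boost(now_ms, doc_ms, m=3.16e-11, a=1, b=1):
    # The SolrRelevancyFAQ freshness boost: recip(ms(NOW, date), m, a, b).
    return a / (m * (now_ms - doc_ms) + b)

def boost_rounded(now_ms, doc_ms):
    # Same formula, but NOW snapped down to the hour boundary first, so two
    # requests within the same hour produce identical scores.
    return boost(now_ms - now_ms % HOUR_MS, doc_ms)

doc = 1_000 * HOUR_MS  # some document timestamp
# Two page requests 8 seconds apart, both within the same hour:
t1 = doc + 5 * HOUR_MS + 1_000
t2 = doc + 5 * HOUR_MS + 9_000

assert boost(t1, doc) != boost(t2, doc)                  # raw NOW: scores drift
assert boost_rounded(t1, doc) == boost_rounded(t2, doc)  # bucketed: stable
```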
Re: Index Not Matching
Make sure your index is completely committed. curl 'http://localhost:8983/solr/update?commit=true' http://wiki.apache.org/solr/UpdateXmlMessages#A.22commit.22_and_.22optimize.22 for an overview: http://lucene.apache.org/solr/tutorial.html hth, Geert-Jan http://techgurulive.com/2010/11/22/apache-solr-commit-and-optimize/ 2011/2/3 Esclusa, Will william.escl...@bonton.com Both the application and the SOLR GUI match (with the incorrect number of course :-) ) At first I thought it could be a schema problem, but we went through it with a fine-tooth comb and compared it to the one in our stage environment. What is really weird is that I grabbed one of the product IDs that are not showing up in SOLR from the DB, searched through the SOLR GUI and it found it. -Original Message- From: Savvas-Andreas Moysidis [mailto:savvas.andreas.moysi...@googlemail.com] Sent: Thursday, February 03, 2011 4:57 PM To: solr-user@lucene.apache.org Subject: Re: Index Not Matching that's odd.. are you viewing the results through your application or the admin console? if you aren't, I'd suggest you use the admin console just to eliminate the possibility of an application bug. We had a similar problem in the past and it turned out to be a mixup of our dev/test instances.. On 3 February 2011 21:41, Esclusa, Will william.escl...@bonton.com wrote: Hello Savvas, I am 100% sure we are not updating the DB after we index the data. We are specifying the same fields on both queries. Our prod boxes do not have access to QA or DEV, so I would expect a connection error when indexing if this is the case. No connection errors in the logs. -Original Message- From: Savvas-Andreas Moysidis [mailto:savvas.andreas.moysi...@googlemail.com] Sent: Thursday, February 03, 2011 4:26 PM To: solr-user@lucene.apache.org Subject: Re: Index Not Matching Hello, Are you definitely positive your database isn't updated after you index your data?
Are you querying against the same field(s) specifying the same criteria both in Solr and in the database? Any chance you might be pointing to a dev/test instance of Solr? Regards, - Savvas On 3 February 2011 20:17, Esclusa, Will william.escl...@bonton.com wrote: Greetings! My organization is new to SOLR, so please bear with me. At times, we experience an out-of-sync condition between SOLR index files and our database. We resolved that by clearing the index file and performing a full crawl of the database. Last time we noticed an out-of-sync condition, we went through our procedure of deleting and crawling, but this time it did not fix it. For example, search for swim on the DB and we get 440 products, yet SOLR states we have 214 products. Has anyone experienced anything like this? Does anyone have any suggestions on a trace we can turn on? Again, we are new to SOLR so any help you can provide is greatly appreciated. Thanks! Will
Re: Faceting Question
fq={!tag=tag1}tags:( |1003| |1007|) AND tags:( |10015|)&version=2.2&start=0&rows=10&indent=on&facet=on&facet.field={!ex=tag1}category&facet.field=capacity&facet.field=brand I'm just guessing here, but perhaps {!tag=tag1} is only picking up the 'tags:( |1003| |1007|)' part. If so, {!ex=tag1} would only exclude 'tags:( |1003| |1007|)' but it wouldn't exclude 'tags:( |10015|)'. I believe this would 100% explain what you're seeing. Assuming my guess is correct you could try a couple of things (none of which I'm absolutely certain will work, but you could try them out easily): 1. put the fq in quotes: fq="{!tag=tag1}tags:( |1003| |1007|) AND tags:( |10015|)" -- this might instruct {!tag=tag1} to tag the whole fq-filter. 2. make multiple fq's, and exclude them all (not sure if you can exclude multiple tags): fq={!tag=tag1}tags:( |1003| |1007|)&fq={!tag=tag2}tags:( |10015|)&facet.field={!ex=tag1,tag2}category... hth, Geert-Jan 2011/1/24 beaviebugeater mbro...@cox.net I am attempting to do facets on products similar to how Hayneedle does it on their online stores (they do NOT use Solr). See: http://www.clockstyle.com/wall-clocks/antiqued/1359+1429+4294885075.cfm So simple example, my left nav might contain categories and 2 attributes, brand and capacity: Categories - Cat1 (23) selected - Cat2 (16) - Cat3 (5) Brand - Brand1 (18) - Brand2 (10) - Brand3 (0) Capacity - Capacity1 (14) - Capacity2 (9) Each category or attribute value is represented with a checkbox and can be selected or deselected. The initial entry into this page has one category selected. Other categories can be selected, which might change the number of products related to each attribute value. The number of products in each category never changes. I should also be able to select one or more attributes. Logically this would look something like: (Cat1 OR Cat2) AND (Value1 OR Value2) AND (Value4) Behind the scenes I have each category and attribute value represented by a tag, which is just a numeric value.
So I search on the tags field only and then facet on category, brand and capacity fields which are stored separately. My current Solr query ends up looking something like: fq={!tag=tag1}tags:( |1003| |1007|) AND tags:( |10015|)&version=2.2&start=0&rows=10&indent=on&facet=on&facet.field={!ex=tag1}category&facet.field=capacity&facet.field=brand This shows 2 categories being selected (1003 and 1007) and one attribute value (10015). This partially works - the categories work fine. The problem is, if I select, say, a brand attribute (as in the above example the 10015 tag) it does filter to the selected categories AND the selected attribute BUT I'm not able to broaden the search by selecting another attribute value. I want the display of products to be filtered to what I select, but I want to be able to broaden the filter without having to back up. I feel like I'm close but still missing something. Is there a way to specify 2 tags that should be excluded from facet fields? I hope this example makes sense. Any help greatly appreciated. -- View this message in context: http://lucene.472066.n3.nabble.com/Faceting-Question-tp2320542p2320542.html Sent from the Solr - User mailing list archive at Nabble.com.
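Suggestion 2 from the reply, spelled out as request parameters (a sketch; whether {!ex=...} accepts a comma-separated tag list should be verified on the Solr version in use):

```python
from urllib.parse import urlencode

# Each selection lives in its own tagged fq; the category facet excludes
# both tags so its counts ignore the current selections.
params = urlencode([
    ("q", "*:*"),
    ("fq", "{!tag=tag1}tags:( |1003| |1007|)"),
    ("fq", "{!tag=tag2}tags:( |10015|)"),
    ("facet", "on"),
    ("facet.field", "{!ex=tag1,tag2}category"),
    ("facet.field", "capacity"),
    ("facet.field", "brand"),
])
```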
Re: one last question on dynamic fields
Yep, you can. Although I'm not sure you can use a wildcard prefix (perhaps you can, I'm just not sure). I always use wildcard suffixes. Cheers, Geert-Jan 2011/1/23 Dennis Gearon gear...@sbcglobal.net Is it possible to use ONE definition of a dynamic field type for inserting multiple dynamic fields of that type with different names? Or do I need a separate dynamic field definition for each eventual field? Can I do this? in schema.xml: <field name="ALL_OTHER_STANDARD_FIELDS" type="OTHER_TYPES" indexed="SOME_TIMES" stored="USUALLY"/> <dynamicField name="*_i" type="int" indexed="true" stored="true"/> ... and then doing for insert: <add> <doc> <field name="ALL_OTHER_STANDARD_FIELDS">all their values</field> <field name="customA_i">9802490824908</field> <field name="customB_i">9809084</field> <field name="customC_i">09845970011</field> <field name="customD_i">09874523459870</field> </doc> </add> Dennis Gearon Signature Warning It is always a good idea to learn from your own mistakes. It is usually a better idea to learn from others' mistakes, so you do not have to make them yourself. from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036' EARTH has a Right To Life, otherwise we all die.
Re: Search on two core and two schema
Schemas are very different, I can't group them. In contrast to what you're saying above, you may rethink the option of combining both types of documents in a single core. It's a perfectly valid approach to combine heterogeneous documents in a single core in Solr (and use a specific field - say 'type' - to distinguish between them when needed). Geert-Jan 2011/1/18 Jonathan Rochkind rochk...@jhu.edu Solr can't do that. Two cores are two separate cores, you have to do two separate queries, and get two separate result sets. Solr is not an RDBMS. On 1/18/2011 12:24 PM, Damien Fontaine wrote: I want to execute this query: Schema 1: <field name="id" type="string" indexed="true" stored="true" required="true" /> <field name="title" type="string" indexed="true" stored="true" required="true" /> <field name="UUID_location" type="string" indexed="true" stored="true" required="true" /> Schema 2: <field name="UUID_location" type="string" indexed="true" stored="true" required="true" /> <field name="label" type="string" indexed="true" stored="true" required="true" /> <field name="type" type="string" indexed="true" stored="true" required="true" /> Query: select?facet=true&fl=title&q=title:*&facet.field=UUID_location&rows=10&qt=standard Result: <?xml version="1.0" encoding="UTF-8"?> <response> <lst name="responseHeader"> <int name="status">0</int> <int name="QTime">0</int> <lst name="params"> <str name="facet">true</str> <str name="fl">title</str> <str name="q">title:*</str> <str name="facet.field">UUID_location</str> <str name="qt">standard</str> </lst> </lst> <result name="response" numFound="1889" start="0"> <doc> <str name="title">titre 1</str> </doc> <doc> <str name="title">Titre 2</str> </doc> </result> <lst name="facet_counts"> <lst name="facet_queries"/> <lst name="facet_fields"> <lst name="UUID_location"> <int name="Japan">998</int> <int name="China">891</int> </lst> </lst> <lst name="facet_dates"/> </lst> </response> On 18/01/2011 17:55, Stefan Matheis wrote: Okay .. and .. now .. you're trying to do what? perhaps you could give us an example, w/ real data .. sample queries - results.
because actually I cannot imagine what you want to achieve, sorry On Tue, Jan 18, 2011 at 5:24 PM, Damien Fontaine dfonta...@rosebud.fr wrote: On my first schema, there is information about a document like title, lead, text etc. and many UUIDs (each UUID is a taxon's ID). My second schema contains my taxonomies with auto-complete and facets. On 18/01/2011 17:06, Stefan Matheis wrote: Search on two cores but combine the results afterwards to present them in one group, or what exactly are you trying to do Damien? On Tue, Jan 18, 2011 at 5:04 PM, Damien Fontaine dfonta...@rosebud.fr wrote: Hi, I would like to make a search on two cores with different schemas. Sample: Schema Core1 - ID - Label - IDTaxon ... Schema Core2 - IDTaxon - Label - Hierarchy ... Schemas are very different, I can't group them. Have you an idea how to realize this search? Thanks, Damien
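The single-core suggestion from the reply can be sketched as two filtered queries against one core (the discriminator field name 'type' and its values are illustrative):

```python
from urllib.parse import urlencode

# Both document kinds live in one core; a discriminator field selects
# which kind a given query should see.
doc_query = urlencode({"q": "title:*", "fq": "type:document"})
taxon_query = urlencode({"q": "label:*", "fq": "type:taxon"})
```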
Re: Sub query using SOLR?
Bbarani probably wanted to be able to create the query without having to prefetch the ids at the client side first. But I agree this is the only stable solution I can think of (so excluding possible patches) Geert-Jan 2011/1/5 Grijesh.singh pintu.grij...@gmail.com Why think so complex? Just use the result of the first query as a filter for your second query, like fq=related_id:(id1 OR id2 OR id3)&q="type:IT AND manager_12:dave", something like that - Grijesh -- View this message in context: http://lucene.472066.n3.nabble.com/Sub-query-using-SOLR-tp2193251p2197490.html Sent from the Solr - User mailing list archive at Nabble.com.
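Building that filter from the first query's ids at the client might look like this (a minimal sketch; the ids and field name come from the example above):

```python
# ids returned by the first query, fetched client-side
ids = ["id1", "id2", "id3"]

# join them into a single filter query for the second request
fq = "related_id:(" + " OR ".join(ids) + ")"
```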
Re: Consequences for using multivalued on all fields
You should be aware that the behavior of sorting on a multi-valued field is undefined. After all, which of the multiple values should be used for sorting? So if you need sorting on the field, you shouldn't make it multi-valued. Geert-Jan 2010/12/21 J.J. Larrea j...@panix.com Someone please correct me if I am wrong, but as far as I am aware the index format is identical in either case. One benefit of allowing one to specify a field as single-valued is similar to specifying that a field is required: providing a safeguard that index data conforms to requirements. So making all fields multivalued forgoes that integrity check for fields which by definition should be singular. Also, depending on the response writer and for the XMLResponseWriter the requested response version (see http://wiki.apache.org/solr/XMLResponseFormat), the multi-valued setting can determine whether the document values returned from a query will be scalars (e.g. <str name="year">2010</str>) or arrays of scalars (<arr name="year"><str>2010</str></arr>), regardless of how many values are actually stored. But the most significant gotcha of not specifying the actual arity (1 or N) arises if any of those fields is used for field-faceting: by default the field-faceting logic chooses a different algorithm depending on whether the field is multi-valued, and the default choice for multi-valued is only appropriate for a small set of enumerated values since it creates a filter query for each value in the set. And this can have a profound effect on Solr memory utilization. So if you are not relying on the field arity setting to select the algorithm, you or your users might need to specify it explicitly with the f.<field>.facet.method argument; see http://wiki.apache.org/solr/SolrFacetingOverview for more info. So while all-multivalued isn't a showstopper, if it were up to me I'd want to give users the option to specify arity and whether the field is required. - J.J.
At 2:13 PM +0100 12/21/10, Tim Terlegård wrote: In our application we use dynamic fields and there can be about 50 of them and there can be up to 100 million documents. Are there any disadvantages to having multivalued=true on all fields in the schema? An admin of the application can specify dynamic fields and if they should be indexed or stored. The question is if we gain anything by letting them choose multivalued as well, or if it just adds complexity to the user interface? Thanks, Tim
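If field arity can't be declared in the schema, the faceting algorithm can still be picked per field explicitly, as the reply suggests. A sketch of the request parameters (the field name 'year' is illustrative, and the right method value depends on the field's value distribution):

```python
from urllib.parse import urlencode

# Explicitly pick the field-cache algorithm ('fc') for one field instead
# of relying on the default chosen from the multiValued setting.
params = urlencode([
    ("q", "*:*"),
    ("facet", "true"),
    ("facet.field", "year"),
    ("f.year.facet.method", "fc"),
])
```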
Re: Search based on images
Well-known algorithms for detecting 'highly descriptive features' in images that can cope with scaling and rotation (up to a certain degree, of course) are SIFT and SURF (SURF is generally considered the more mature of the two afaik): http://en.wikipedia.org/wiki/Scale-invariant_feature_transform http://en.wikipedia.org/wiki/SURF That link comes with links to the original papers as well as a list of open-source implementations, e.g: http://code.google.com/p/javasurf/ I don't have experience with the open-source code myself, and you probably have to make a similarity-like method based on the more low-level methods that implement these algorithms. So this is perhaps a more 'down in the trenches' approach, but at least it should give you some solid background on how this is done. Geert-Jan 2010/12/11 Dennis Gearon gear...@sbcglobal.net Tried one, of Perry Mason's secretary when she was young (and HOOOT), Barbara Hale. http://www.skylighters.org/ggparade/index8.html Didn't find it. 1.8 billion images indexed is probably a DROP in the bucket of what's out there. Dennis Gearon - Original Message From: Dennis Gearon gear...@sbcglobal.net To: solr-user@lucene.apache.org Sent: Fri, December 10, 2010 9:24:53 PM Subject: Re: Search based on images There is actually some image recognition search engine software somewhere I heard about. Take a picture of something, say a poster, upload it, and it will adjust for some lighting/angle/distortion, and try to find it on the web somewhere. You hear about crazy stuff like this at dev camps.
Basically, hand-me-downs from Homeland Security and the military ;-) Dennis Gearon
Re: finding exact case insensitive matches on single and multiword values
When you went from StrField to TextField in your config you enabled tokenizing (which I believe splits on spaces by default), which is why you see separate 'words' / terms in the debugQuery explanation. I believe you want to keep your old StrField config and try quoting: fq=city:"den+haag" or fq=city:"den haag" Concerning the lower-casing: wouldn't it be easiest to do that at the client? (I'm not sure at the moment how to do lowercasing with a StrField). Geert-Jan 2010/12/3 PeterKerk vettepa...@hotmail.com You are right, this is what I see when I append the debug query (very very useful btw!!!) in the old situation: <arr name="parsed_filter_queries"> <str>city:den title:haag</str> <str>PhraseQuery(themes:"hotel en restaur")</str> </arr> I then changed the schema.xml to: <fieldType name="myField" class="solr.TextField" sortMissingLast="true" omitNorms="true"> <analyzer> <tokenizer class="solr.KeywordTokenizerFactory"/> <filter class="solr.LowerCaseFilterFactory"/> </analyzer> </fieldType> <field name="city" type="myField" indexed="true" stored="true"/> <!-- used to be string --> I then tried adding parentheses: http://localhost:8983/solr/db/select/?indent=on&facet=true&fq=city:(den+haag)&q=*:*&start=0&rows=25&fl=id,title,friendlyurl,city&facet.field=city also tried (without +): http://localhost:8983/solr/db/select/?indent=on&facet=true&fq=city:(den haag)&q=*:*&start=0&rows=25&fl=id,title,friendlyurl,city&facet.field=city Then I get: <arr name="parsed_filter_queries"> <str>city:den city:haag</str> </arr> And still 0 results. But as you can see the query is split up into 2 separate words, I don't think that is what I need? -- View this message in context: http://lucene.472066.n3.nabble.com/finding-exact-case-insensitive-matches-on-single-and-multiword-values-tp2012207p2012509.html Sent from the Solr - User mailing list archive at Nabble.com.
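The suggested approach, lowercasing at the client and quoting the multi-word value, could be sketched as (the field name follows the thread; whether quoting alone suffices depends on the field type being a single untokenized term):

```python
from urllib.parse import urlencode

# Lowercase client-side, then quote the value so the whole phrase is
# matched against the single untokenized term of a string-like field.
city = "Den Haag"
fq = 'city:"%s"' % city.lower()
params = urlencode({"q": "*:*", "fq": fq})
```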
Re: schema design for related fields
if first is selected in the user interface and we have 10 price ranges the query would be 120 clauses (12 months * 10 price ranges) What would you intend to do with the returned facet results in this situation? I doubt you want to display 12 categories (1 for each month)? When a user hasn't selected a date, perhaps it would be more useful to show the cheapest fare regardless of month and facet on that? This would involve introducing 2 new fields: FareDateDontCareStandard, FareDateDontCareFirst Populate these fields at indexing time, by calculating the cheapest fares over all months. This then results in every query having to support at most 20 price ranges (10 for normal and 10 for first class) HTH, Geert-Jan 2010/12/1 lee carroll lee.a.carr...@googlemail.com Hi Erick, so if I understand you we could do something like: if Jan is selected in the user interface and we have 10 price ranges the query would be 20 clauses in the query (10 * 2 fare classes) if first is selected in the user interface and we have 10 price ranges the query would be 120 clauses (12 months * 10 price ranges) if first and jan selected with 10 price ranges the query would be 10 clauses if we required facets to be returned for all price combinations we'd need to supply 240 clauses the user interface would also need to collate the individual fields into meaningful aggregates for the user (i.e. numbers by month, numbers by fare class) have I understood or missed the point (I usually have) On 1 December 2010 15:00, Erick Erickson erickerick...@gmail.com wrote: I'd think that facet.query would work for you, something like: facet=true&facet.query=FareJanStandard:[price1 TO price2]&facet.query=FareJanStandard:[price2 TO price3] You can string as many facet.query clauses as you want, across as many fields as you want; they're all independent and will get their own sections in the response.
Best Erick On Wed, Dec 1, 2010 at 4:55 AM, lee carroll lee.a.carr...@googlemail.com wrote: Hi I've built a schema for a proof of concept and it is all working fairly fine, naive maybe but fine. However I think we might run into trouble in the future if we ever use facets. The data models train destination city routes from an origin city: Doc:City Name: cityname [uniq key] CityType: city type values [nine possible values so good for faceting] ... [other city attributes which relate directly to the doc unique key] all have limited vocab so good for faceting FareJanStandard: cheapest standard fare in January (float value) FareJanFirst: cheapest first class fare in January (float value) FareFebStandard: cheapest standard fare in February (float value) FareFebFirst: cheapest first fare in February (float value) . etc The question is how would I best facet fare price? The desire is to return the number of cities with January prices in a set of ranges, the number of cities with first prices in a set of ranges, etc. The install is 1.4.1 running in WebLogic Any ideas? Lee C
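Erick's facet.query suggestion, generated for a few example buckets (a sketch; the bucket boundaries are illustrative, the field names follow the thread's schema):

```python
from urllib.parse import urlencode

# One facet.query per price bucket per fare field; each range gets its
# own independent count in the facet_queries section of the response.
buckets = [(0, 25), (25, 50), (50, 100)]
params = [("q", "*:*"), ("facet", "true")]
for field in ("FareJanStandard", "FareJanFirst"):
    for lo, hi in buckets:
        params.append(("facet.query", f"{field}:[{lo} TO {hi}]"))
query_string = urlencode(params)
```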
Re: schema design for related fields
Ok, longer answer than anticipated (and good conceptual practice ;-) Yeah, I believe that would work if I understand correctly that: 'in Jan [9] in feb [10] in march [1]' has nothing to do with pricing, but only with availability? If so you could separate it out as two separate issues: 1.) showing pricing (based on context) 2.) showing availabilities (based on context) For 1.) you get 39 price fields ([jan,feb,..,dec,dc] * [standard,first,dc]) note: 'dc' indicates 'don't care'. Depending on the context you query the correct price field to populate the price facet-values. For discussion let's call the fields: _p[fare][date]. In other words the price field for no preference at all would become: _pdcdc For 2.) define a multivalued field 'FaresPerDate' which indicates availability, which is used to display: A) Standard fares [10] First fares [3] B) in Jan [9] in feb [10] in march [1] A) depends on your selection (or not caring) about a month B) vice versa depends on your selection (or not caring) about a fare type given all possible date values: [jan,feb,..dec,dontcare] given all possible fare values: [standard,first,dontcare] FaresPerDate consists of multiple values per document where each value indicates the availability of a combination of 'fare' and 'date': (standardJan,firstJan,DCjan,...,standardDec,firstDec,DCdec,standardDC,firstDC,DCDC) Note that the nr of possible values = 39. Example: 1.) the user hasn't selected any preference: q=*:*&facet.field=FaresPerDate&facet.query=_pdcdc:[0 TO 20]&facet.query=_pdcdc:[20 TO 40], etc.
In the client you have to make sure to select the correct values of 'FaresPerDate' for display, in this case: Standard fares [10] <-- FaresPerDate.standardDC First fares [3] <-- FaresPerDate.firstDC in Jan [9] <- FaresPerDate.DCJan in feb [10] <- FaresPerDate.DCFeb in march [1] <- FaresPerDate.DCMarch 2) the user has selected January: q=*:*&facet.field=FaresPerDate&fq=FaresPerDate:DCJan&facet.query=_pDCJan:[0 TO 20]&facet.query=_pDCJan:[20 TO 40] Standard fares [10] <-- FaresPerDate.standardJan First fares [3] <-- FaresPerDate.firstJan in Jan [9] <- FaresPerDate.DCJan in feb [10] <- FaresPerDate.DCFeb in march [1] <- FaresPerDate.DCMarch Hope that helps, Geert-Jan 2010/12/1 lee carroll lee.a.carr...@googlemail.com Sorry Geert, missed off the price value bit from the user interface so we'd display Facet price Standard fares [10] First fares [3] When traveling in Jan [9] in feb [10] in march [1] Fare Price 0 - 25: [20] 25 - 50: [10] 50 - 100: [2] cheers lee c On 1 December 2010 17:00, lee carroll lee.a.carr...@googlemail.com wrote: Geert The UI would be something like: user selections for the facet price max price: £100 fare class: any city attributes facet cityattribute1 etc: xxx results displayed something like Facet price Standard fares [10] First fares [3] in Jan [9] in feb [10] in march [1] etc is this compatible with your approach? Erick the price is an interval scale, i.e. a fare can be any value (not high, low, medium etc) How sensible would the following approach be: index city docs with fields only related to the city unique key; in the same index also index fare docs which would be something like: Fare: cityID: xxx Fareclass: standard FareMonth: Jan FarePrice: 100 the query would be something like: q=FarePrice:[* TO 100] FareMonth:Jan fl=cityID returning facets for FareClass and FareMonth. hold on, this will not facet city docs correctly. sorry, that's not going to work.
On 1 December 2010 16:25, Erick Erickson erickerick...@gmail.com wrote: Hmmm, that's getting to be a pretty clunky query sure enough. Now you're going to have to ensure that HTTP requests that long get through and stuff like that... I'm reaching a bit here, but you can facet on a tokenized field. Although that's not often done there's no prohibition against it. So, what if you had just one field for each city that contained some abstract information about your fares etc. Something like janstdfareclass1 jancheapfareclass3 febstdfareclass6 Now just facet on that field? Not #values# in that field, just the field itself. You'd then have to make those into human-readable text, but that would considerably simplify your query. Probably only works if your user is selecting from pre-defined ranges; if they expect to put in arbitrary ranges this scheme probably wouldn't work... Best Erick On Wed, Dec 1, 2010 at 10:22 AM, lee carroll lee.a.carr...@googlemail.com wrote: Hi Erick, so if I understand you we could do something like: if Jan is selected in the user interface and we have 10 price ranges the query would be 20 clauses in the query (10 * 2 fare classes) if first is selected in the user interface and we have 10 price ranges the query would be 120 clauses (12 months * 10 price ranges) if first and jan selected
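The 'January selected' case from the proposal in this thread, written out as request parameters (a sketch; the field names _pDCJan and FaresPerDate follow the thread's naming convention):

```python
from urllib.parse import urlencode

# Filter availability on the don't-care-fare/January value and bucket
# prices on the matching price field.
params = urlencode([
    ("q", "*:*"),
    ("fq", "FaresPerDate:DCJan"),
    ("facet", "true"),
    ("facet.field", "FaresPerDate"),
    ("facet.query", "_pDCJan:[0 TO 20]"),
    ("facet.query", "_pDCJan:[20 TO 40]"),
])
```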
Re: schema design for related fields
Also, filtering and sorting on price can be done as well. Just be sure to use the correct price field. Geert-Jan 2010/12/1 Geert-Jan Brits gbr...@gmail.com Ok, longer answer than anticipated (and good conceptual practice ;-) [...]
Re: schema design for related fields
Indeed, selecting the best price for January OR April OR November and sorting on it isn't possible with this solution (if that's what you mean). However, any combination of selecting 1 month and/or 1 price-range and/or 1 fare-type IS possible. 2010/12/1 lee carroll lee.a.carr...@googlemail.com Hi Geert, Ok I think I follow. the magic is in the multi-valued field. The only danger would be complexity if we allow users to multi select months/prices/fare classes. For example they can search for first prices in jan, april and november. I think what you describe is possible in this case just complicated. I'll see if i can hack some facets into the proto type tommorrow. Thanks for your help Lee C On 1 December 2010 17:57, Geert-Jan Brits gbr...@gmail.com wrote: Ok longer answer than anticipated (and good conceptual practice ;-) Yeah I belief that would work if I understand correctly that: 'in Jan [9] in feb [10] in march [1]' has nothing to do with pricing, but only with availability? If so you could seperate it out as two seperate issues: 1. ) showing pricing (based on context) 2. ) showing availabilities (based on context) For 1.) you get 39 pricefields ([jan,feb,..,dec,dc] * [standard,first,dc]) note: 'dc' indicates 'don't care. depending on the context you query the correct pricefield to populate the price facet-values. for discussion lets call the fields: _p[fare][date]. IN other words the price field for no preference at all would become: _pdcdc For 2.) 
define a multivalued field 'FaresPerDate 'which indicate availability, which is used to display: A) Standard fares [10] First fares [3] B) in Jan [9] in feb [10] in march [1] A) depends on your selection (or dont caring) about a month B) vice versa depends on your selection (or dont caring) about a fare type given all possible date values: [jan,feb,..dec,dontcare] given all possible fare values:[standard,first,dontcare] FaresPerDate consists of multiple values per document where each value indicates the availability of a combination of 'fare' and 'date': (standardJan,firstJan,DCjan...,standardJan,firstDec,DCdec,standardDC,firstDC,DCDC) Note that the nr of possible values = 39. Example: 1. ) the user hasn't selected any preference: q=*:*facet.field:FaresPerDatefacet.query=_pdcdc:[0 TO 20]facet.query=_pdcdc:[20 TO 40], etc. in the client you have to make sure to select the correct values of 'FaresPerDate' for display: in this case: Standard fares [10] -- FaresPerDate.standardDC First fares [3] -- FaresPerDate.firstDC in Jan [9] - FaresPerDate.DCJan in feb [10] - FaresPerDate.DCFeb in march [1]- FaresPerDate.DCMarch 2) the user has selected January q=*:*facet.field:FaresPerDatefq=FaresPerDate:DCJanfacet.query=_pDCJan:[0 TO 20]facet.query=_pDCJan:[20 TO 40] Standard fares [10] -- FaresPerDate.standardJan First fares [3] -- FaresPerDate.firstJan in Jan [9] - FaresPerDate.DCJan in feb [10] - FaresPerDate.DCFeb in march [1]- FaresPerDate.DCMarch Hope that helps, Geert-Jan 2010/12/1 lee carroll lee.a.carr...@googlemail.com Sorry Geert missed of the price value bit from the user interface so we'd display Facet price Standard fares [10] First fares [3] When traveling in Jan [9] in feb [10] in march [1] Fare Price 0 - 25 : [20] 25 - 50: [10] 50 - 100 [2] cheers lee c On 1 December 2010 17:00, lee carroll lee.a.carr...@googlemail.com wrote: Geert The UI would be something like: user selections for the facet price max price: £100 fare class: any city attributes facet 
cityattribute1 etc: xxx

results displayed something like:
Facet price
Standard fares [10] First fares [3]
in Jan [9] in feb [10] in march [1] etc

is this compatible with your approach?

Erick, the price is an interval scale, i.e. a fare can be any value (not high, low, medium etc). How sensible would the following approach be: index city docs with fields only related to the city unique key; in the same index also index fare docs which would be something like:

Fare: cityID: xxx Fareclass: standard FareMonth: Jan FarePrice: 100

the query would be something like: q=FarePrice:[* TO 100] FareMonth:Jan&fl=cityID returning facets for FareClass and FareMonth. hold on, this will not facet city docs correctly. sorry, that's not going to work.

On 1 December 2010 16:25, Erick Erickson erickerick...@gmail.com wrote:

Hmmm, that's getting to be a pretty clunky query sure enough. Now you're going to have to ensure that HTTP requests that long get through and stuff like that. I'm reaching a bit here, but you can facet on a tokenized field. Although
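To make the two cases in the FaresPerDate scheme concrete, here is a sketch (Python, stdlib only) of how a client might assemble the facet requests; the field-naming scheme (_p[fare][date], FaresPerDate) is the one from the thread above, while the helper function itself is purely illustrative:

```python
from urllib.parse import urlencode

def fare_facet_params(month=None, fare=None):
    """Build Solr facet params for the FaresPerDate scheme.

    month/fare of None means "don't care" (the DC variants)."""
    m = month if month else "DC"
    f = fare if fare else "DC"
    price_field = "_p%s%s" % (f, m)   # e.g. _pDCJan, _pDCDC
    params = [
        ("q", "*:*"),
        ("facet", "on"),
        ("facet.field", "FaresPerDate"),
        ("facet.query", "%s:[0 TO 20]" % price_field),
        ("facet.query", "%s:[20 TO 40]" % price_field),
    ]
    if month:
        # restrict results to docs available in the chosen month
        params.append(("fq", "FaresPerDate:DC%s" % month))
    return urlencode(params)

print(fare_facet_params())             # case 1: no preference
print(fare_facet_params(month="Jan"))  # case 2: January selected
```

The client would then pick the matching FaresPerDate facet values (standardDC/firstDC or standardJan/firstJan) for display, exactly as described above.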
Re: Is this sort order possible in a single query?
You could do it with sorting on a functionquery (which is supported from Solr 1.5): http://wiki.apache.org/solr/FunctionQuery#Sort_By_Function

Consider the search: http://localhost:8093/solr/select?q=author:'j.k.rowling'

Sorting like you specified would involve:

1. introducing an extra field 'author_exact' of type 'string' which takes care of the exact matching. (You can populate it by defining it as a copyField of Author so your indexing code doesn't change.)

2. setting sortMissingLast=true for 'num_copies' and 'num_comments', like: <fieldType name="num_copies" sortMissingLast="true" ...> — this makes sure that documents which don't have the value set end up at the end of the sort when sorted on that particular field.

3. constructing a functionquery that scores either 0 (no match) or x (not sure what x is (1?), but it should always be the same for all exact matches). This gives:

http://localhost:8093/solr/select?q=author:'j.k.rowling'&sort=query({!dismax qf=author_exact v='j.k.rowling'}) desc

which scores all exact matches before all partial matches.

4. now just concatenate the other sorts, giving:

http://localhost:8093/solr/select?q=author:'j.k.rowling'&sort=query({!dismax qf=author_exact v='j.k.rowling'}) desc, num_copies desc, num_comments desc

That should do it. Please note that 'num_copies' and 'num_comments' still kick in to break the tie for documents that exactly match on 'author_exact'. I assume this is ok. I can't see a way to do it without functionqueries at the moment, which doesn't mean there isn't any.

Hope that helps, Geert-Jan

*query({!dismax qf=text v='solr rocks'})*

2010/11/24 Robert Gründler rob...@dubture.com

Hi, we have a requirement for one of our search results which has a quite complex sorting strategy. Let me explain the document first, using an example: the document is a book. It has several indexed text fields: Title, Author, Distributor.
It has two integer columns, where one reflects the number of sold copies (num_copies), and the other reflects the number of comments on the website (num_comments). The requirement for the relevancy looks like this:

* Documents which have exact matches in the Author field should be ranked highest, disregarding their values in the num_copies and num_comments fields
* After the exact matches, the sorting should be based on the value in the field num_copies, but only for documents where this field is set
* After the num_copies matches, the sorting should be based on num_comments

I'm wondering if this kind of sort order can be implemented in a single query, or if I need to break it down into several queries and merge the results on application level.

-robert
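As a sketch of steps 3 and 4 above, here is how a client could assemble that sort parameter before URL-encoding it (Python, stdlib only; the helper function is illustrative, not part of any Solr client API):

```python
from urllib.parse import urlencode

def exact_first_sort(author):
    # query() yields 0 for non-matches on author_exact, so exact matches
    # sort first; num_copies/num_comments (with sortMissingLast=true in
    # the schema) then break ties among the exact matches
    sort = ("query({!dismax qf=author_exact v='%s'}) desc, "
            "num_copies desc, num_comments desc" % author)
    return urlencode([("q", "author:'%s'" % author), ("sort", sort)])

print(exact_first_sort("j.k.rowling"))
```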
Re: How to get facet counts without fields that are constrained by themselves?
http://wiki.apache.org/solr/SimpleFacetParameters#Tagging_and_excluding_Filters

2010/11/24 Petrov Sergey geoco...@yandex.ua

I need to retrieve the result of a query and facet counts for all searchable document fields. I can't get correct results in the case when facet counts are calculated for a field that is in the search query. Facet counts are calculated to match the whole query, but for this field I need to get values that are constrained by all query params except the query on the current field (so facet values must be constrained by all query values except the current field itself). The variant with performing one full query plus as many queries as the count of search fields gives me what I need, but I think that there must be a better way to solve this problem. P.S. Sorry for my English.
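The linked wiki page solves this with filter tagging and exclusion: each fq gets a {!tag=...} prefix, and each facet.field excludes its own field's filter via {!ex=...}. A hypothetical sketch of a client building such a request (Python, stdlib only):

```python
from urllib.parse import urlencode

def facets_excluding_own_filter(filters, facet_fields):
    """filters: dict of field -> selected value.

    Each fq is tagged with its field name; each facet.field for a
    filtered field excludes that tag, so its counts ignore its own
    constraint but honor all the others."""
    params = [("q", "*:*"), ("facet", "on")]
    for field, value in filters.items():
        params.append(("fq", "{!tag=%s}%s:%s" % (field, field, value)))
    for field in facet_fields:
        if field in filters:
            params.append(("facet.field", "{!ex=%s}%s" % (field, field)))
        else:
            params.append(("facet.field", field))
    return urlencode(params)

print(facets_excluding_own_filter({"color": "red"}, ["color", "size"]))
```

This gives all the facet counts in a single query, instead of one extra query per search field.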
Re: Is this sort order possible in a single query?
hmm, sorry about that. I haven't used the 'sort by functionquery' option myself, but I remembered it existed. Indeed Solr 1.5 was never released (as you've read in the link you pointed out). The relevant JIRA issue: https://issues.apache.org/jira/browse/SOLR-1297 There's some recent activity and a final post suggesting the patch works (presumably under either 3.1 and/or 4.x). Both branches are not released at the moment, although 3.1 should be pretty close (and perhaps stable enough). I'm just not sure. Your best bet is to start a new thread asking at what branch to patch SOLR-1297 and asking the subjective 'is it stable enough?'.

Hope that helps some, Geert-Jan

2010/11/24 Robert Gründler rob...@dubture.com

thanks a lot for the explanation. i'm a little confused about solr 1.5, especially after finding this wiki page: http://wiki.apache.org/solr/Solr1.5 Is there a stable build available for version 1.5, so i can test your suggestion using functionquery? -robert

On Nov 24, 2010, at 1:53 PM, Geert-Jan Brits wrote:

You could do it with sorting on a functionquery (which is supported from Solr 1.5): http://wiki.apache.org/solr/FunctionQuery#Sort_By_Function Consider the search: http://localhost:8093/solr/select?q=author:'j.k.rowling' Sorting like you specified would involve: 1. introducing an extra field 'author_exact' of type 'string' which takes care of the exact matching. (You can populate it by defining it as a copyField of Author so your indexing code doesn't change.) 2. setting sortMissingLast=true for 'num_copies' and 'num_comments', like: <fieldType name="num_copies" sortMissingLast="true" ...> — this makes sure that documents which don't have the value set end up at the end of the sort when sorted on that particular field. 3. constructing a functionquery that scores either 0 (no match) or x (not sure what x is (1?)
, but it should always be the same for all exact matches). This gives http://localhost:8093/solr/select?q=author:'j.k.rowling'&sort=query({!dismax qf=author_exact v='j.k.rowling'}) desc which scores all exact matches before all partial matches. 4. now just concatenate the other sorts, giving: http://localhost:8093/solr/select?q=author:'j.k.rowling'&sort=query({!dismax qf=author_exact v='j.k.rowling'}) desc, num_copies desc, num_comments desc That should do it. Please note that 'num_copies' and 'num_comments' still kick in to break the tie for documents that exactly match on 'author_exact'. I assume this is ok. I can't see a way to do it without functionqueries at the moment, which doesn't mean there isn't any. Hope that helps, Geert-Jan *query({!dismax qf=text v='solr rocks'})*

2010/11/24 Robert Gründler rob...@dubture.com

Hi, we have a requirement for one of our search results which has a quite complex sorting strategy. Let me explain the document first, using an example: the document is a book. It has several indexed text fields: Title, Author, Distributor. It has two integer columns, where one reflects the number of sold copies (num_copies), and the other reflects the number of comments on the website (num_comments). The requirement for the relevancy looks like this:

* Documents which have exact matches in the Author field should be ranked highest, disregarding their values in the num_copies and num_comments fields
* After the exact matches, the sorting should be based on the value in the field num_copies, but only for documents where this field is set
* After the num_copies matches, the sorting should be based on num_comments

I'm wondering if this kind of sort order can be implemented in a single query, or if I need to break it down into several queries and merge the results on application level. -robert
Re: SOLR and secure content
When making a query these fields should be required. Is it possible to configure handlers on the Solr server so that these fields are required with each type of query? So for adding documents, deleting and querying?

Have a look at 'invariants' (and 'appends') in the example solrconfig. They can be defined per requestHandler and do exactly what you describe (at least for the search side of things). Cheers, Geert-Jan

2010/11/23 Jos Janssen j...@websdesign.nl

Hi everyone, this is how we think we should set it up.

Situation:
- Multiple websites indexed on 1 Solr server
- Results should be separated for each website
- Search results should be filtered on group access

Solution I think is possible with Solr:
- The Solr server should only be accessed through an API which we will write in PHP.
- Solr server authentication will be defined through IP address on the server side, and a username and password will be sent through the API for each different website.
- Extra document fields in Solr will contain:
1. a website hash to identify and filter results for each different website (website authentication)
2. a list of groups who can access the document (group authentication)

When making a query these fields should be required. Is it possible to configure handlers on the Solr server so that these fields are required with each type of query? So for adding documents, deleting and querying? Am I correct? Any further advice is welcome. regards, Jos

-- View this message in context: http://lucene.472066.n3.nabble.com/SOLR-and-secure-content-tp1945028p1953071.html Sent from the Solr - User mailing list archive at Nabble.com.
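For the search side, a sketch of what such a request handler could look like in solrconfig.xml. The handler name and the field names (website_hash, groups) are examples made up for this thread, not fixed names; 'appends' adds the filter queries to every request the handler serves, while 'invariants' would go further and prevent clients from overriding a parameter at all:

```xml
<requestHandler name="/select" class="solr.SearchHandler">
  <!-- these fq's are appended to every query sent to this handler,
       regardless of what the client supplies -->
  <lst name="appends">
    <str name="fq">website_hash:YOUR_SITE_HASH</str>
    <str name="fq">groups:(group1 OR group2)</str>
  </lst>
</requestHandler>
```

Note this only enforces the filters at query time; add/delete restrictions would still have to be enforced in the PHP API layer.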
Re: How to Facet on a price range
Ah I see: like you said, it's part of the facet range implementation. The frontend is already working, I just need the 'update-on-slide' behavior. Thanks, Geert-Jan

2010/11/10 gwk g...@eyefi.nl

On 11/9/2010 7:32 PM, Geert-Jan Brits wrote: when you drag the sliders, an update of how many results would match is immediately shown. I really like this. How did you do this? Is this out-of-the-box available with the suggested Facet_by_range patch?

Hi, with the range facets you get the facet counts for every discrete step of the slider. These values are requested in the AJAX request whenever the search criteria change, and when someone uses the sliders we simply check the range that is selected and add up the discrete values of that range to get the expected amount of results. So yes, it is available, but as Solr is just the search backend, the frontend stuff you'll have to write yourself. Regards, gwk
Re: Facet showing MORE results than expected when its selected?
Another option: assuming themes_raw is of type 'string' (couldn't get that nugget of info for 100%), it could be that you're seeing a difference in nr of results between the 110 for fq on themes_raw and the 321 from your db because fieldtype 'string' (thus themes_raw) is case-sensitive, while (depending on your db setup) querying your db is case-insensitive — which could explain the larger nr of hits for your db as well. Cheers, Geert-Jan

2010/11/10 Jonathan Rochkind rochk...@jhu.edu

I've had that sort of thing happen from 'corrupting' my index, by changing my schema.xml without re-indexing. If you change field types or other things in schema.xml, you need to reindex all your data. (You can add brand new fields or types without having to re-index, but most other changes will require a re-index.) Could that be it?

PeterKerk wrote:

LOL, very clever indeed ;) The thing is: when I select the amount of records matching the theme 'Hotel en Restaurant' in my db, I end up with 321 records. So that is correct. I don't know where the 370 is coming from. Now when I change the query to this: fq=themes_raw:Hotel en Restaurant I end up with 110 records... (another number even :s) What I did notice is that this only happens on multi-word facets, Hotel en Restaurant being a 3-word facet. The facets work correctly on a facet named Cafe, so I suspect it has something to do with the tokenization. As you can see, I'm using text and string.
For completeness I'm posting the definition of those in my schema.xml as well:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- in this example, we will only use synonyms at query time
    <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
    -->
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_dutch.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_dutch.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

<fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>
Re: How to Facet on a price range
Just to add to this: if you want to allow the user more choice in his option to select ranges, perhaps by using a 2-sided javascript slider for the price range (a la kayak.com), it may be very worthwhile to discretize the allowed values for the slider (e.g. steps of 5 dollars). Most js-slider implementations allow for this easily. This has the advantages of:

- having far fewer possible facet queries and thus a far greater chance of these facet queries hitting the cache.
- a better user experience, although that's debatable.

just to be clear: for this the Solr side would still use facet=on&facet.query=price:[50 TO *]&facet.query=price:[* TO 100] and not the optimized pre-computed variant suggested above. Geert-Jan

2010/11/9 jayant jayan...@hotmail.com

That was very well thought of and a clever solution. Thanks. -- View this message in context: http://lucene.472066.n3.nabble.com/How-to-Facet-on-a-price-range-tp1846392p1869201.html Sent from the Solr - User mailing list archive at Nabble.com.
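A sketch of the discretization idea (Python, stdlib only): one facet.query per slider step, so every request reuses the same small set of query strings, which is what makes cache hits likely:

```python
from urllib.parse import urlencode

def price_facet_params(max_price=100, step=5):
    # one facet.query per discrete slider step; because the set of query
    # strings is identical across requests, Solr's caches get reused
    params = [("q", "*:*"), ("facet", "on")]
    for lo in range(0, max_price, step):
        params.append(("facet.query", "price:[%d TO %d]" % (lo, lo + step)))
    return urlencode(params)

print(price_facet_params(max_price=20, step=5))
```

The client then sums the counts of the steps inside the user's selected range, as gwk describes below.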
Re: How to Facet on a price range
@ http://www.mysecondhome.co.uk/search.html -- when you drag the sliders, an update of how many results would match is immediately shown. I really like this. How did you do this? Is this out-of-the-box available with the suggested Facet_by_range patch? Thanks, Geert-Jan

2010/11/9 gwk g...@eyefi.nl

Hi, instead of all the facet queries you can also make use of range facets (http://wiki.apache.org/solr/SimpleFacetParameters#Facet_by_Range), which is in trunk afaik; it should also be patchable into older versions of Solr, although that should not be necessary. We make use of it (http://www.mysecondhome.co.uk/search.html) to create the nice sliders Geert-Jan describes. We've also used it to add the sparklines above the sliders, which give a nice indication of how the current selection is spread out. Regards, gwk

On 11/9/2010 3:33 PM, Geert-Jan Brits wrote:

Just to add to this: if you want to allow the user more choice in his option to select ranges, perhaps by using a 2-sided javascript slider for the price range (a la kayak.com), it may be very worthwhile to discretize the allowed values for the slider (e.g. steps of 5 dollars). Most js-slider implementations allow for this easily. This has the advantages of: having far fewer possible facet queries and thus a far greater chance of these facet queries hitting the cache; and a better user experience, although that's debatable. Just to be clear: for this the Solr side would still use facet=on&facet.query=price:[50 TO *]&facet.query=price:[* TO 100] and not the optimized pre-computed variant suggested above. Geert-Jan

2010/11/9 jayant jayan...@hotmail.com

That was very well thought of and a clever solution. Thanks. -- View this message in context: http://lucene.472066.n3.nabble.com/How-to-Facet-on-a-price-range-tp1846392p1869201.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: dynamic stop words?
That might work, although depending on your use-case it might be hard to have a good controlled vocabulary of city names (hotel metropole bruxelles, hotel metropole brussels, hotel metropole brussel, etc.) Also 'hotel paris bruxelles' stinks... given your example:

Doc 1 name = Holiday Inn city = Denver
Doc 2 name = Holiday Inn, Denver city = Denver
q=name:(Holiday Inn, Denver)

turning it upside down, perhaps an alternative would be to query on q=name:Holiday Inn+city:Denver and configure field 'name' in such a way that doc1 and doc2 score the same. I believe that must be possible, just not sure how to config it exactly at the moment. Of course, it depends on your scenario whether you have enough knowledge on the client side to transform q=name:(Holiday Inn, Denver) into q=name:Holiday Inn+city:Denver.

Hth, Geert-Jan

2010/10/9 Otis Gospodnetic otis_gospodne...@yahoo.com

Matt, the first thing that came to my mind is that this might be interesting to try with a dictionary (of city names), if this example is not a made-up one. Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/

- Original Message - From: Matt Mitchell goodie...@gmail.com To: solr-user@lucene.apache.org Sent: Fri, October 8, 2010 11:22:36 AM Subject: dynamic stop words?

Is it possible to have certain query terms not affect score, if that same query term is present in a field? For example, I have an index of hotels. Each hotel has a name and city. If the name of a hotel has the name of the city in its name field, I want to completely ignore that and not have it influence score. Example:

Doc 1 name = Holiday Inn city = Denver
Doc 2 name = Holiday Inn, Denver city = Denver
q=name:(Holiday Inn, Denver)

I'd like those docs to have the same score in the response. I don't want Doc 2 to have a higher score just because it has all of the query terms. Is this possible without using stop words? I hope this makes sense! Thanks, Matt
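The client-side transform at the end is the fragile part. A hypothetical sketch of it (Python), assuming you do have a usable controlled vocabulary of city names — the function and its behavior are illustrative only, and as noted above the vocabulary itself is the hard bit:

```python
def split_city_from_name(query, cities):
    # pull a known city name out of the free-text hotel query and
    # query it against the city field instead of the name field
    tokens = [t.strip(",") for t in query.split()]
    name_tokens, city = [], None
    for t in tokens:
        if t.lower() in cities and city is None:
            city = t
        else:
            name_tokens.append(t)
    q = "name:(%s)" % " ".join(name_tokens)
    if city:
        q += " AND city:%s" % city
    return q

cities = {"denver", "brussels"}
print(split_city_from_name("Holiday Inn, Denver", cities))
# -> name:(Holiday Inn) AND city:Denver
```

With the city removed from the name clause, Doc 1 and Doc 2 match the name field on the same terms, which is the effect Matt is after.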
Re: Is there a way to fetch the complete list of data from a particular column in SOLR document?
You're right for the general case. I should have added that our setup is perhaps a little out of the ordinary, in that we send explicit commits to Solr as part of our indexing app. Once a commit has finished, we're sure all docs until then are present in Solr. For us it's much more difficult to do it the way you suggested, because we index into several embedded Solr shards, etc. It can be done, it's just not convenient. But for the general case I admit querying all ids as a post-process is probably the more elegant and robust way.

2010/9/9 Scott K s...@skister.com

But how do you know when the document actually makes it to Solr, especially if you are using commitWithin and not explicitly calling commit? One solution is to have a status field in the database, such as: 0 - unindexed, 1 - indexing, 2 - committed / verified. And have a separate process query Solr for documents in the indexing state and set them to committed if they are queryable in Solr.

On Tue, Sep 7, 2010 at 14:26, Geert-Jan Brits gbr...@gmail.com wrote:

Please let me know if there are any other ideas / suggestions to implement this. Your indexing program should really take care of this IMHO. Each time your indexer inserts a document into Solr, flag the corresponding entity in your RDBMS; each time you delete, remove the flag. You should implement this as a transaction to make sure all is still fine in the unlikely event of a crash midway.

2010/9/7 bbarani bbar...@gmail.com

Hi, I am trying to get the complete list of unique document IDs and compare it with the back end, to make sure that the back end and the SOLR documents are in sync. Is there a way to fetch the complete list of data from a particular column in a SOLR document? Once I get the list, I can easily compare it against the DB and delete the orphan documents. Please let me know if there are any other ideas / suggestions to implement this.
Thanks, Barani -- View this message in context: http://lucene.472066.n3.nabble.com/Is-there-a-way-to-fetch-the-complete-list-of-data-from-a-particular-column-in-SOLR-document-tp1435586p1435586.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Is there a way to fetch the complete list of data from a particular column in SOLR document?
Please let me know if there are any other ideas / suggestions to implement this. Your indexing program should really take care of this IMHO. Each time your indexer inserts a document into Solr, flag the corresponding entity in your RDBMS; each time you delete, remove the flag. You should implement this as a transaction to make sure all is still fine in the unlikely event of a crash midway.

2010/9/7 bbarani bbar...@gmail.com

Hi, I am trying to get the complete list of unique document IDs and compare it with the back end, to make sure that the back end and the SOLR documents are in sync. Is there a way to fetch the complete list of data from a particular column in a SOLR document? Once I get the list, I can easily compare it against the DB and delete the orphan documents. Please let me know if there are any other ideas / suggestions to implement this. Thanks, Barani -- View this message in context: http://lucene.472066.n3.nabble.com/Is-there-a-way-to-fetch-the-complete-list-of-data-from-a-particular-column-in-SOLR-document-tp1435586p1435586.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: High - Low field value?
The StatsComponent is exactly what you're looking for: http://wiki.apache.org/solr/StatsComponent Cheers, Geert-Jan

2010/9/1 kenf_nc ken.fos...@realestate.com

I want to do range facets on a couple of fields, a Price field in particular. But Price is relative to the product type. Books, Automobiles and Houses are vastly different price ranges, and within Houses there may be a regional difference (the price range in San Francisco is different than Columbus, OH, for example). If I do a Filter Query on type, so I'm not mixing books with houses, is there a quick way in a query to get the High and Low value for a given field? I would need those to build my range boundaries more efficiently. Ideally it would be a function of the query, so regionality could be taken into account. It's not a search score, or a facet, it's more a function. I know query functions exist, but I haven't had to use them yet, and the 'max' function doesn't look like what I need. Any suggestions? Thanks. -- View this message in context: http://lucene.472066.n3.nabble.com/High-Low-field-value-tp1402568p1402568.html Sent from the Solr - User mailing list archive at Nabble.com.
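For the example in the question, the request could look roughly like this (Python, stdlib only). stats, stats.field and stats.facet are the StatsComponent parameters from the linked wiki page; the field names (price, type, region) are taken from the question and are illustrative:

```python
from urllib.parse import urlencode

# min/max (and more) of the price field, broken down per region,
# restricted to one product type, all in a single query
params = urlencode([
    ("q", "*:*"),
    ("fq", "type:house"),       # don't mix books with houses
    ("stats", "true"),
    ("stats.field", "price"),
    ("stats.facet", "region"),  # per-region min/max for regional ranges
])
print(params)
```

The response's stats section then contains min and max per region, which can be used directly as the range boundaries.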
Re: questions about synonyms
Concerning "I got a very big text file of synonyms. How can I use it? Do I need to index this text file first?" — have you seen http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#SynonymFilter ? Cheers, Geert-Jan

2010/8/31 Ma, Xiaohui (NIH/NLM/LHC) [C] xiao...@mail.nlm.nih.gov

Hello, I have a couple of questions about synonyms. 1. I got a very big text file of synonyms. How can I use it? Do I need to index this text file first? 2. Is there a way to highlight synonyms in search results? 3. Does anyone use WordNet with Solr? Thanks so much in advance,
Re: solr working...
Check out Drew Farris' explanation of remote debugging Solr with Eclipse, posted a couple of days ago: http://lucene.472066.n3.nabble.com/How-to-Debug-Sol-Code-in-Eclipse-td1262050.html Geert-Jan

2010/8/26 Michael Griffiths mgriffi...@am-ind.com

Take a look at the code? It _is_ open source. Open it up in Eclipse and debug it.

-Original Message- From: satya swaroop [mailto:sswaro...@gmail.com] Sent: Thursday, August 26, 2010 8:24 AM To: solr-user@lucene.apache.org Subject: Re: solr working...

Hi Peter, I am already working with Solr and it is working well. But I want to understand the code and know where the actual work happens: how indexing is done, how the requests are parsed, how it responds, and so on. To understand the code, I asked how to start. Regards, satya
Re: Solr search speed very low
Have a look at http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters to see how that works.

2010/8/25 Marco Martinez mmarti...@paradigmatecnologico.com

You should use the tokenizer solr.WhitespaceTokenizerFactory in your field type to get your terms indexed. Once you have indexed the data, you don't need to use the * in your queries; that is a heavy query for Solr.

Marco Martínez Bautista http://www.paradigmatecnologico.com Avenida de Europa, 26. Ática 5. 3ª Planta 28224 Pozuelo de Alarcón Tel.: 91 352 59 42

2010/8/25 Andrey Sapegin andrey.sape...@unister-gmbh.de

Dear ladies and gentlemen, I'm a newbie with Solr and I didn't find an answer in the wiki, so I'm writing here. I'm analysing Solr performance and have 1 problem. Search time is about 7-10 seconds per query. I have a *.csv 5GB database with about 15 fields and 1 key field (record number). I uploaded it to Solr without any problem using curl. This database contains information about books and I'm interested in keyword search using one of the fields (not a key field). I mean that if I search, for example, for the word Hello, I expect a response with sentences containing Hello: "Hello all", "Hello World", "I say Hello to all", etc. I tested it from the console using the time command and curl:

/usr/bin/time -o test_results/time_solr -a curl "http://localhost:8983/solr/select/?q=itemname:*$query*&version=2.2&start=0&rows=10&indent=on" -6 21 test_results/response_solr

So my query is itemname:*$query*. 'Itemname' is the name of the field. $query is a bash variable containing only 1 word. All works fine. But unfortunately, search time is about 7-10 seconds per query. For example, Sphinx spent only about 0.3 seconds per query. If I use only $query, without stars (*), I receive an answer pretty fast, but only exact matches. And I want to see any sentence containing my $query in the response. That's why I'm using stars. NOW THE QUESTION: is my query syntax correct (field:*word*) for keyword search? Why is the response time so big?
Can I reduce search time? Thank you in advance. Kind regards, Andrey Sapegin, Software Developer, Unister GmbH Barfußgässchen 11 | 04109 Leipzig andrey.sape...@unister-gmbh.de www.unister.de
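The two query shapes under discussion, side by side (Python, stdlib only; the itemname field is the poster's). With the field tokenized as Marco suggests, the plain term query already matches any sentence containing the word, so the slow wildcard form is unnecessary:

```python
from urllib.parse import urlencode

# leading wildcard: Solr must scan the term dictionary linearly -> slow
slow = urlencode([("q", "itemname:*hello*")])

# with a tokenized field (e.g. WhitespaceTokenizer + LowerCaseFilter),
# a plain term query is a direct dictionary lookup -> fast, and it still
# matches "Hello World", "I say Hello to all", etc.
fast = urlencode([("q", "itemname:hello")])

print(slow)
print(fast)
```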
Re: How to Debug Sol-Code in Eclipse ?!
1. download the Solr libs and import them in your project.
2. download the Solr source code of the same version and attach it to the libraries. (I haven't got Eclipse open, but it is something like project - settings - jre/libraries?)
3. write a small program yourself which calls EmbeddedSolrServer and step through / debug the source code from there. It works just like it is your own source code.

HTH, Geert-Jan

2010/8/22 stockii st...@shopgate.com

thx for your reply. I don't want to test my own classes in unit tests. I am trying to understand how Solr works, because I'm writing a little text about Solr and Lucene. So I want to go through the code step by step and find out in which places Solr uses Lucene. When I can debug the code it's easier ;-) -- View this message in context: http://lucene.472066.n3.nabble.com/How-to-Debug-Sol-Code-in-Eclipse-tp1262050p1274285.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: how to support implicit trailing wildcards
you could satisfy this by making 2 fields:

1. exactmatch
2. wildcardmatch

use copyField in your schema to copy 1 --> 2.

q=exactmatch:mount+wildcardmatch:mount*&q.op=OR

this would score exact matches above (solely) wildcard matches. Geert-Jan

2010/8/10 yandong yao yydz...@gmail.com

Hi Bastian, sorry for not making it clear. I also want exact matches to have a higher score than wildcard matches, that is: if searching for 'mount', documents with 'mount' should score higher than documents with 'mountain', while 'mount*' seems to treat 'mount' and 'mountain' the same. Besides, I also want the query to be processed with an analyzer, while according to http://wiki.apache.org/lucene-java/LuceneFAQ#Are_Wildcard.2C_Prefix.2C_and_Fuzzy_queries_case_sensitive.3F Wildcard, Prefix, and Fuzzy queries are not passed through the Analyzer. The rationale is that if I search for 'mounted', I also want documents with 'mount' to match. So it seems the built-in wildcard search cannot satisfy my requirements, if I understand correctly. Thanks very much!

2010/8/9 Bastian Spitzer bspit...@magix.net

Wildcard search is already built in, just use: ?q=umoun* ?q=mounta*

-Ursprüngliche Nachricht- Von: yandong yao [mailto:yydz...@gmail.com] Gesendet: Montag, 9. August 2010 15:57 An: solr-user@lucene.apache.org Betreff: how to support implicit trailing wildcards

Hi everyone, how can I support an 'implicit trailing wildcard *' using Solr? E.g. like using Google to search 'umoun' and having 'umount' match, or searching 'mounta' and having 'mountain' match. From my point of view, there are several ways, each with disadvantages:

1) Using EdgeNGramFilterFactory, so 'umount' will be indexed as 'u', 'um', 'umo', 'umou', 'umoun', 'umount'. The disadvantages are: a) the index size increases dramatically, b) it will match even where there is no relationship, e.g. 'mount' will also match 'mountain'.
2) Using two-pass searching: the first pass searches the term dictionary through the TermsComponent using the given keyword, then uses the first matched term from the term dictionary to search again. E.g. when a user enters 'umoun', the TermsComponent will match 'umount', which is then used to search. The disadvantages are: a) you need to parse the query string to recognize meta keywords such as 'AND', 'OR', '+', '-' (this makes it more complex as I am using the PHP client), b) the returned hit count is not for the original search string, which will influence other components such as an auto-suggest component based on user search history and hit counts.

3) Write a custom SearchComponent, but I have no idea where/how to start.

Is there any other way in Solr to do this? Any feedback/suggestions are welcome! Thanks very much in advance!
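The copyField approach from the first reply can be sketched as a client-side query builder (Python, stdlib only; the exactmatch/wildcardmatch field names are the ones proposed in that reply, the helper function is illustrative):

```python
from urllib.parse import urlencode

def exact_or_prefix(term):
    # a hit on 'exactmatch' scores on top of the prefix hit on
    # 'wildcardmatch' (a copyField of it), so documents matching the
    # term exactly rank above documents that only match the prefix
    return urlencode([
        ("q", "exactmatch:%s wildcardmatch:%s*" % (term, term)),
        ("q.op", "OR"),
    ])

print(exact_or_prefix("mount"))
```

With q.op=OR, a document containing 'mountain' matches only the wildcard clause, while a document containing 'mount' matches both clauses and therefore scores higher.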
Re: How do i update some document when i use sharding indexs?
I'm not sure if Solr has some built-in support for sharding functions, but you should generally use some hashing algorithm to split the indices, and use the same hash algorithm to locate which shard contains a document. http://en.wikipedia.org/wiki/Hash_function

Without employing any domain knowledge (of documents you possibly want to group together on a single shard for performance) you could build a very simple (crude) hash function by md5-hashing the unique keys of your documents, taking the first 3 chars (should be precise enough, so load is pretty much balanced), calculating a nr from the chars (256 * first char + 16 * 2nd char + 3rd char), and taking that nr modulo 20. That gives you a nr in [0,20) which is the shard index. Use the same algorithm to determine which shard contains the document that you want to change. Geert-Jan

2010/8/9 lu.rongbin lu.rong...@goodhope.net

My index has 76 million documents. I split it into 20 indexes because the size of the index is 33GB. I deploy 20 shards, for search response performance, on EC2's 20 instances. But when I want to update some doc, it means I must traverse each index to find which shard the document is in, and then update the doc? It's crazy! What can I do? thanks. -- View this message in context: http://lucene.472066.n3.nabble.com/How-do-i-update-some-document-when-i-use-sharding-indexs-tp1053509p1053509.html Sent from the Solr - User mailing list archive at Nabble.com.
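The crude hash function described above, as a sketch in Python. Note that int(prefix, 16) over the 3-char hex prefix is exactly the 256 * first + 16 * second + third formula from the mail:

```python
import hashlib

NUM_SHARDS = 20

def shard_for(doc_id):
    # first 3 hex chars of the md5 of the unique key, read as a number,
    # modulo the shard count; the same function must be used both when
    # distributing documents and when locating one for an update
    prefix = hashlib.md5(doc_id.encode("utf-8")).hexdigest()[:3]
    return int(prefix, 16) % NUM_SHARDS

print(shard_for("doc-12345"))
```

Because md5 is deterministic, an update for a given unique key always resolves to the same shard, so no traversal of all 20 indexes is needed.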
Re: How do i update some document when i use sharding indexs?
Just to be completely clear: the program that splits your index into 20 shards should employ this algorithm as well. 2010/8/9 Geert-Jan Brits gbr...@gmail.com [snip: hash-function suggestion quoted above] 2010/8/9 lu.rongbin lu.rong...@goodhope.net [snip: original question quoted above] -- View this message in context: http://lucene.472066.n3.nabble.com/How-do-i-update-some-document-when-i-use-sharding-indexs-tp1053509p1053509.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: XML Format
At first glance I see no difference between the 2 documents. Perhaps you can illustrate which fields are not in the result set that you want to be there? Also, use the 'fl' param to describe which fields should be output in your results. Of course, you have to first make sure the fields you want output are stored to begin with. http://wiki.apache.org/solr/CommonQueryParameters#fl 2010/8/6 twojah e...@tokobagus.com can somebody help me please -- View this message in context: http://lucene.472066.n3.nabble.com/XML-Format-tp1024608p1028456.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: how to take a value from the query result
You should parse the XML and extract the value. Lots of libraries undoubtedly exist for PHP to help you with that (I don't know PHP). Moreover, if all you want from the result is AUC_CAT, you should consider using the fl param like: http://172.16.17.126:8983/search/select/?q=AUC_ID:607136&fl=AUC_CAT to return a document of the form: <doc> <int name="AUC_CAT">576</int> </doc> which is more efficient. Still, you have to parse the doc as XML though. 2010/8/5 twojah e...@tokobagus.com this is my query in browser navigation toolbar http://172.16.17.126:8983/search/select/?q=AUC_ID:607136 and this is the result in browser page: ... <doc> <int name="AP_AUC_PHOTO_AVAIL">1</int> <double name="AUC_AD_PRICE">1.0</double> <int name="AUC_CAT">576</int> <int name="AUC_CLIENT_ID">27017</int> <str name="AUC_DESCR_SHORT">Bracket Ceiling untuk semua merk projector, panjang 60-90 cm Bahan Besi Cat Hitam = 325rb Bahan Sta</str> <str name="AUC_HTML_DIR_NL">/aksesoris-batere-dan-tripod/update-bracket-projector-dan-lcd-plasma-tv-607136.html</str> <int name="AUC_ID">607136</int> <str name="AUC_ISNEGO">Nego</str> <int name="AUC_LOCATION">7</int> <str name="AUC_PHOTO">270/27017/bracket_lcd_plasma_3a-1274291780.JPG</str> <str name="AUC_START">2010-05-19 17:56:45</str> <str name="AUC_TITLE">[UPDATE] BRACKET Projector dan LCD/PLASMA TV</str> <int name="AUC_TYPE">21</int> <int name="PRO_BACKGROUND">0</int> <int name="PRO_BOLD">0</int> <int name="PRO_COLOR">0</int> <int name="PRO_GALLERY">0</int> <int name="PRO_LINK">0</int> <int name="PRO_SPONSOR">0</int> <int name="cat_id_sub">0</int> <int name="sectioncode">28</int> </doc> I want to get the AUC_CAT value (576) and use it in my PHP, how can I get that value? please help thanks before -- View this message in context: http://lucene.472066.n3.nabble.com/how-to-take-a-value-from-the-query-result-tp1025119p1025119.html Sent from the Solr - User mailing list archive at Nabble.com.
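Whatever the client language, getting AUC_CAT out of the response boils down to XML parsing; a minimal Python sketch against an abridged version of the response above:

```python
import xml.etree.ElementTree as ET

# Abridged Solr XML response, as shown in the original mail.
response = """
<response>
  <result name="response" numFound="1" start="0">
    <doc>
      <int name="AUC_CAT">576</int>
      <int name="AUC_ID">607136</int>
    </doc>
  </result>
</response>
"""

root = ET.fromstring(response)
# Locate the <int name="AUC_CAT"> element inside the first <doc>.
auc_cat = int(root.find(".//doc/int[@name='AUC_CAT']").text)
print(auc_cat)  # → 576
```

PHP's SimpleXML offers an equivalent attribute-predicate lookup, and combining this with fl=AUC_CAT keeps the response (and the parsing) small.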
Re: No group by? looking for an alternative.
If I understand correctly: 1. products have different product variants (in the case of shoes, a combination of color and size + some other fields). 2. Each product is shown once in the result set (so no multiple product variants of the same product are shown). This would solve that IMO: 1. create 1 document per product (so not a document per product variant) 2. create a multivalued field on which to facet, containing all combinations of: size - color - any other field - yet another field 3. make sure to include combinations in which the user is indifferent to a particular filter, i.e.: don't care about size (dc) + red -- dc-red 4. filtering on that combination would give you all the products that satisfy the product-variant constraints (size, color, etc.) + the extra product constraints ('converse') 5. on the detail page show all available product variants not filtered by the constraints specified. This would likely be something outside of solr (a simple sql select on a single product) hope that helps, Geert-Jan 2010/8/5 Mickael Magniez mickaelmagn...@gmail.com I've got only one document per shoe, whatever its size or color. My first try was to create one document per model/size/color, but when I search for 'converse' for example, the same shoe is retrieved several times, and I want to show only one record for each model. But I don't succeed in grouping results by shoe model. If you look at http://www.amazon.com/s/ref=nb_sb_noss?url=node%3D679255011&field-keywords=Converse+All+Star+Leather+Hi+Chuck+Taylor&x=0&y=0&ih=1_0_0_0_0_0_0_0_0_0.4136_1&fsc=-1 amazon for Converse All Star Leather Hi Chuck Taylor. They show the shoe only one time, but if you go to the product details, it exists in several colors and sizes. Now if you filter on color, there are fewer sizes available. -- View this message in context: http://lucene.472066.n3.nabble.com/No-group-by-looking-for-an-alternative-tp1022738p1026618.html Sent from the Solr - User mailing list archive at Nabble.com.
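Steps 2 and 3 (every size/color/... combination, including a "don't care" slot per dimension) could be generated along these lines; the field names and values are invented for illustration:

```python
from itertools import product

def variant_combinations(variant):
    # For each dimension, allow either its concrete value or the
    # "don't care" wildcard 'dc', then emit every combination.
    dims = [(variant["size"], "dc"), (variant["color"], "dc")]
    return ["-".join(combo) for combo in product(*dims)]

print(variant_combinations({"size": "42", "color": "red"}))
# → ['42-red', '42-dc', 'dc-red', 'dc-dc']
```

The union of these values over all of a product's variants becomes the multivalued facet field; a user who picked only "red" filters on 'dc-red', a user who picked both filters on '42-red'. Note the value count grows as 2^dimensions per variant, which is fine for a handful of dimensions.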
Re: Best solution to avoiding multiple query requests
Field Collapsing (currently as a patch) is exactly what you're looking for imo. http://wiki.apache.org/solr/FieldCollapsing Geert-Jan 2010/8/4 Ken Krugler kkrugler_li...@transpac.com Hi all, I've got a situation where the key result from an initial search request (let's say for dog) is the list of values from a faceted field, sorted by hit count. For the top 10 of these faceted field values, I need to get the top hit for the target request (dog) restricted to that value for the faceted field. Currently this is 11 total requests, of which the 10 requests following the initial query can be made in parallel. But that's still a lot of requests. So my questions are: 1. Is there any magic query to handle this with Solr as-is? 2. If not, is the best solution to create my own request handler? 3. And in that case, any input/tips on developing this type of custom request handler? Thanks, -- Ken Ken Krugler +1 530-210-6378 http://bixolabs.com e l a s t i c w e b m i n i n g
Re: Best solution to avoiding multiple query requests
If I understand correctly: you want to sort your collapsed results by 'nr of collapsed results' / hits. It seems this can't be done out of the box using this patch (I'm not entirely sure; at least it doesn't follow from the wiki page. Perhaps best is to check the jira issues to make sure this isn't already available now, but just not updated on the wiki). Also, I found a blog post (from the patch creator afaik) with, in the comments, someone with the same issue + some pointers. http://blog.jteam.nl/2009/10/20/result-grouping-field-collapsing-with-solr/ hope that helps, Geert-jan 2010/8/4 Ken Krugler kkrugler_li...@transpac.com Hi Geert-Jan, On Aug 4, 2010, at 5:30am, Geert-Jan Brits wrote: Field Collapsing (currently as patch) is exactly what you're looking for imo. http://wiki.apache.org/solr/FieldCollapsing Thanks for the ref, good stuff. I think it's close, but if I understand this correctly, then I could get (using just the top two, versus top 10, for simplicity) results that looked like dog training (faceted field value A) super dog (faceted field value B) but if the actual faceted field value/hit counts were: C (10) D (8) A (2) B (1) then what I'd want is the top hit for dog AND facet field:C, followed by dog AND facet field:D. Using field collapsing would improve the probability that if I asked for the top 100 hits, I'd find entries for each of my top N faceted field values. Thanks again, -- Ken I've got a situation where the key result from an initial search request (let's say for dog) is the list of values from a faceted field, sorted by hit count. For the top 10 of these faceted field values, I need to get the top hit for the target request (dog) restricted to that value for the faceted field. Currently this is 11 total requests, of which the 10 requests following the initial query can be made in parallel. But that's still a lot of requests. So my questions are: 1. Is there any magic query to handle this with Solr as-is? 2.
if not, is the best solution to create my own request handler? 3. And in that case, any input/tips on developing this type of custom request handler? Thanks, -- Ken Ken Krugler +1 530-210-6378 http://bixolabs.com e l a s t i c w e b m i n i n g
Re: Quering the database
No. Solr is really flexible and allows for a lot of complex querying out of the box. Really, the Wiki is your best friend here. http://wiki.apache.org/solr/ perhaps start with: 1. http://lucene.apache.org/solr/tutorial.html 2. http://wiki.apache.org/solr/SolrQuerySyntax 3. http://wiki.apache.org/solr/QueryParametersIndex (a list of some standard parameters with links to their function/use) -- especially look at the 'fq' param, which is another way to limit your result set. And just browse the wiki starting from the homepage for the rest. It should pretty quickly give you an overview of what's possible. cheers, Geert-Jan 2010/8/3 Hando420 hando...@gmail.com Thanks a lot to all, now it's clear the problem was in the schema. One more thing I would like to know is if the user queries for something, does it always have to be like q=field:monitor where field is defined in the schema and monitor is just text in a column. Hando -- View this message in context: http://lucene.472066.n3.nabble.com/Quering-the-database-tp1015636p1018268.html Sent from the Solr - User mailing list archive at Nabble.com.
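As a concrete illustration of the 'fq' param mentioned in point 3, a sketch of how such a request URL might be assembled (the base URL and field names are assumptions):

```python
from urllib.parse import urlencode

base = "http://localhost:8983/solr/select"  # assumed Solr endpoint
params = [
    ("q", "monitor"),            # the scored full-text query
    ("fq", "category:hardware"), # filter query: restricts results, cached separately
    ("rows", "10"),
]
url = base + "?" + urlencode(params)
print(url)
```

The fq clause limits the result set without contributing to relevance scoring, and Solr caches the matching document set in its filter cache for reuse across queries.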
Re: Quering the database
You should (as per the example) define the field as text in your Solr schema, not in your RDB. Something like: <field name="field_1" type="text" indexed="true" stored="true" required="true"/> then search like: q=field_1:monitors The example schema illustrates a lot of the possibilities of how to define fields and what it all means. Moreover, have a look at: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters Geert-Jan 2010/8/2 Hando420 hando...@gmail.com Thank you for your reply. Still the problem persists, even when I tested with a simple example by defining a column of type text as varchar in the database and in schema.xml used the default id which is set to string. The row is fetched and the document created, but searching doesn't give any results for the content in the column. Best Regards, Hando -- View this message in context: http://lucene.472066.n3.nabble.com/Quering-the-database-tp1015636p1015890.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: advice on creating a solr index when data source is from many unrelated db tables
I can interpret your question in 2 different ways: 1. Do you want to index several heterogeneous documents all coming from different tables? So documents of type tableA are created and indexed alongside documents of type tableB, tableC, etc. 2. Do you want to combine unrelated data from 15 tables to form some kind of logical solr document as your basis for indexing? I assume you mean nr 1. This can be done, and is done quite regularly. And you're right that this creates a lot of empty slots for fields that only exist for documents created from tableA and not tableB, etc. This in itself is not a problem. In this case I would advise you to create an extra field: 'type' (per the above example, with values: (table)A, (table)B, etc.) so you can distinguish the different types of documents that you have created (and filter on them). If you meant nr 2, which I believe you didn't: it's logically impossible to create/imagine a logical solr document comprised of combined unrelated data. You should really think about what you're trying to achieve (what is it that I want to index, what do I expect to do with it, etc.). If you did mean this, please show an example of what you want to achieve. HTH, Geert-Jan 2010/7/29 S Ahmed sahmed1...@gmail.com I understand (and it's straightforward) when you want to create an index for something simple like Products. But how do you go about creating a Solr index when you have data coming from 10-15 database tables, and the tables have unrelated data? The issue is then you would have many 'columns' in your index, and they will be NULL for much of the data since you are trying to shove 15 db tables into a single Solr/Lucene index. This must be a common problem, what are the potential solutions?
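A sketch of interpretation nr 1: each table contributes its own document shape, plus a shared 'type' field to distinguish and filter them (table and field names are invented):

```python
def to_solr_doc(table_name, row):
    # Copy the row's columns as document fields and add a discriminator.
    doc = dict(row)
    doc["type"] = table_name
    return doc

docs = [
    to_solr_doc("tableA", {"id": "a1", "title": "first"}),
    to_solr_doc("tableB", {"id": "b1", "price": 9.99}),
]
# Documents from different tables simply carry different fields; a query
# can stay within one shape by filtering with fq=type:tableA.
print(docs[0]["type"], docs[1]["type"])
```

Fields absent from a document are simply not indexed for it, so the "NULL columns" of the relational view cost nothing in the index.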
Re: 2 type of docs in same schema?
You can easily have different types of documents in 1 core: 1. define searchquery as a field (just as the others in your schema) 2. define type as a field (this allows you to decide which type of documents to search for, e.g: type_normal or type_search) now searching on regular docs becomes: q=title:some+title&fq=type:type_normal and searching for searchqueries becomes (I think this is what you want): q=searchquery:bmw+car&fq=type:type_search Geert-Jan 2010/7/26 scr...@asia.com I need your expertise on this one... We would like to index every search query that is passed to our solr engine (same core) Our docs format is like this (already in our schema): title content price category etc... Now how to add search queries as a field in our schema? Knowing that the search queries won't have all the fields above? For example: q=bmw car q=car wheels q=moto honda etc... Should we run another core that only indexes search queries? or is there a way to do this with the same instance and same core? Thanks for your help
Re: 2 type of docs in same schema?
I still assume that what you mean by search queries data is just some other form of document (in this case containing 1 search request per document). I'm not sure what you intend to do with that actually, but yes, indexing stays the same (you probably want to mark the field type as required so you don't forget to include it in your indexing program) 2010/7/26 scr...@asia.com Thanks for your answer! That's great. Now, to index search queries data, is there something special to do? or does it stay as usual? -Original Message- From: Geert-Jan Brits gbr...@gmail.com To: solr-user@lucene.apache.org Sent: Mon, Jul 26, 2010 4:57 pm Subject: Re: 2 type of docs in same schema? You can easily have different types of documents in 1 core: 1. define searchquery as a field (just as the others in your schema) 2. define type as a field (this allows you to decide which type of documents to search for, e.g: type_normal or type_search) now searching on regular docs becomes: q=title:some+title&fq=type:type_normal and searching for searchqueries becomes (I think this is what you want): q=searchquery:bmw+car&fq=type:type_search Geert-Jan 2010/7/26 scr...@asia.com I need your expertise on this one... We would like to index every search query that is passed to our solr engine (same core) Our docs format is like this (already in our schema): title content price category etc... Now how to add search queries as a field in our schema? Knowing that the search queries won't have all the fields above? For example: q=bmw car q=car wheels q=moto honda etc... Should we run another core that only indexes search queries? or is there a way to do this with the same instance and same core? Thanks for your help
Re: Which is a good XPath generator?
I am assuming (like Li, I think) that you want to induce a structure/schema from an html example so you can use that schema to extract data from similarly structured html pages. Another term often used in the literature for that is Wrapper Induction. Besides the DOM, CSS classes often give good distinction and they are often more stable under small redesigns. Besides Li's suggestions, have a look at this thread for an open source python implementation (I have never tested it) http://www.holovaty.com/writing/templatemaker/ also make sure to read all the comments for links to other products, etc. HTH, Geert-Jan 2010/7/25 Li Li fancye...@gmail.com it's not a topic related to solr. maybe you should read some papers about wrapper generation or automatic web data extraction. If you want to generate xpath, you could possibly read Bing Liu's papers such as Structured Data Extraction from the Web based on Partial Tree Alignment. Besides the dom tree, visual clues may also be used. But none of them will be a perfect solution because of the diversity of web pages. 2010/7/25 Savannah Beckett savannah_becket...@yahoo.com: Hi, I am looking for an XPath generator that can generate an xpath by picking a specific tag inside an html page. Do you know a good xpath generator? If possible, a free xpath generator would be great. Thanks.
Re: Tree Faceting in Solr 1.4
Perhaps completely unnecessary when you have a controlled domain, but I meant to use ids for places instead of names, because names will quickly become ambiguous, e.g.: there are numerous different places all over the world called Washington, etc. 2010/7/24 SR r.steve@gmail.com Hi Geert-Jan, What did you mean by this: Also, just a suggestion, consider using id's instead of names for filtering; Thanks, -S
Re: Tree Faceting in Solr 1.4
I believe we use an in-process WeakHashMap to store the id-name relationship. It's not as if we're talking billions of values here. For anything more memory-intensive we use NoSQL (Tokyo Tyrant through the memcached protocol at the moment) 2010/7/24 Jonathan Rochkind rochk...@jhu.edu Perhaps completely unnecessary when you have a controlled domain, but I meant to use ids for places instead of names, because names will quickly become ambiguous, e.g.: there are numerous different places over the world called Washington, etc. This is related to something I've been thinking about. Okay, say you use ID's instead of names. Now, you've got to translate those ID's to names before you display them, of course. One way to do that would be to keep the id-to-name lookup in some non-solr store (rdbms, or no-sql store) Is that what you'd do? Is there any non-crazy way to do that without an external store, just with solr? Any way to do it with term payloads? Anything else? Jonathan
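As a rough Python analogue of that in-process cache: a memoized lookup in front of the external store (stubbed here with a dict). Note that Java's WeakHashMap additionally lets unreferenced entries be garbage-collected, which this bounded-cache sketch does not reproduce:

```python
from functools import lru_cache

# Stand-in for the external store (RDBMS / key-value store).
PLACE_NAMES = {101: "Washington, DC", 102: "Washington (UK)"}

@lru_cache(maxsize=10_000)  # bounded in-process cache
def name_for(place_id):
    return PLACE_NAMES[place_id]

# First call hits the "store"; subsequent calls for the same id hit the cache.
print(name_for(101))
```

Documents keep the stable id, display code translates at render time, and renaming a place never requires reindexing.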
Re: Tree Faceting in Solr 1.4
If I am doing facet=on facet.field={!ex=State}State fq={!tag=State}State:Karnataka all it gives me is facets on state excluding only that filter query. But I was not able to do the same on the third level, like facet.field= give me the counts of cities also in state Karnataka. Let me know the solution for this... This looks like regular faceting to me. 1. showing city counts given a state: facet=on&fq=State:Karnataka&facet.field=city 2. showing state counts given a country (similar to 1): facet=on&fq=Country:India&facet.field=state 3. showing city and state counts given a country: facet=on&fq=Country:India&facet.field=state&facet.field=city 4. showing city counts given a state + counts for all other states not filtered by the current state (http://wiki.apache.org/solr/SimpleFacetParameters#Tagging_and_excluding_Filters): facet=on&fq={!tag=State}state:Karnataka&facet.field={!ex=State}state&facet.field=city 5. showing state + city counts given a country + counts for all other countries not filtered by the current country (similar to 4): facet=on&fq={!tag=country}country:India&facet.field={!ex=country}country&facet.field=city&facet.field=state etc. This has nothing to do with hierarchical faceting as described in SOLR-792 btw, although I understand the possible confusion as Country / state / city can obviously be seen as some sort of hierarchy. The first part of your question seemed to be more about hierarchical faceting as per SOLR-792, but I couldn't quite distill a question from that part. Also, just a suggestion, consider using id's instead of names for filtering; you will get burned sooner or later otherwise. HTH, Geert-Jan 2010/7/23 rajini maski rajinima...@gmail.com I am also looking out for the same feature in Solr and very keen to know whether it supports this feature of tree faceting... Or are we forced to index in tree faceting format like 1/2/3/4 1/2/3 1/2 1 In case of multilevel faceting it will give only a 2-level tree facet is what I found..
If I give a query as: country India and state Karnataka and city Bangalore... All I want is a facet count 1) for the condition above, 2) the number of states in that country, 3) the number of cities in that state... Like = Country: India, State: Karnataka, City: Bangalore 1 State: Karnataka Kerala Tamilnadu Andhra Pradesh... and so on City: Mysore Hubli Mangalore Coorg and so on... If I am doing facet=on facet.field={!ex=State}State fq={!tag=State}State:Karnataka all it gives me is facets on state excluding only that filter query. But I was not able to do the same on the third level, like facet.field= give me the counts of cities also in state Karnataka. Let me know the solution for this... Regards, Rajani Maski On Thu, Jul 22, 2010 at 10:13 PM, Eric Grobler impalah...@googlemail.com wrote: Thank you for the link. I was not aware of the multifaceting syntax - this will enable me to run 1 less query on the main page! However this is not a tree faceting feature. Thanks Eric On Thu, Jul 22, 2010 at 4:51 PM, SR r.steve@gmail.com wrote: Perhaps the following article can help: http://www.craftyfella.com/2010/01/faceting-and-multifaceting-syntax-in.html -S On Jul 22, 2010, at 5:39 PM, Eric Grobler wrote: Hi Solr Community If I have: COUNTRY CITY Germany Berlin Germany Hamburg Spain Madrid Can I do faceting like: Germany Berlin Hamburg Spain Madrid I tried to apply SOLR-792 to the current trunk but it does not seem to be compatible. Maybe there is a similar feature existing in the latest builds? Thanks Regards Eric
Re: help with a schema design problem
With the use case you specified, it should work to just index each row, as you described in your initial post, as a separate document. This way p_value and p_type both become single-valued and you get a correct combination of p_value and p_type. However, this may not go so well with other use cases you have in mind, e.g.: requiring that no multiple results are returned with the same document id. 2010/7/23 Pramod Goyal pramod.go...@gmail.com I want to do that. But if I understand correctly, in solr it would store the field like this: p_value: Pramod Raj p_type: Client Supplier When I search p_value:Pramod AND p_type:Supplier it would give me document 1 as a result. Which is incorrect, since in document 1 Pramod is a Client and not a Supplier. On Fri, Jul 23, 2010 at 11:52 PM, Nagelberg, Kallin knagelb...@globeandmail.com wrote: I think you just want something like: p_value:Pramod AND p_type:Supplier no? -Kallin Nagelberg -Original Message- From: Pramod Goyal [mailto:pramod.go...@gmail.com] Sent: Friday, July 23, 2010 2:17 PM To: solr-user@lucene.apache.org Subject: help with a schema design problem Hi, Let's say I have a table with 3 columns: document id, Party Value and Party Type. In this table I have 3 rows. 1st row Document id: 1 Party Value: Pramod Party Type: Client. 2nd row: Document id: 1 Party Value: Raj Party Type: Supplier. 3rd row Document id: 2 Party Value: Pramod Party Type: Supplier. Now in this table, if I use SQL it's easy for me to find all documents with Party Value as Pramod and Party Type as Client. I need to design a solr schema so that I can do the same in Solr. If I create 2 fields in the solr schema, Party Value and Party Type, both of them multivalued, and try to query +Pramod +Supplier, then solr will return me the first document, even though in the first document Pramod is a client and not a supplier Thanks, Pramod Goyal
Re: help with a schema design problem
Is there any way in solr to say p_value[someIndex]=pramod And p_type[someIndex]=client. No, I'm 99% sure there is not. One way would be to define a single field in the schema as p_value_type = client pramod, i.e. combine the values from both fields and store them in a single field. Yep, for the use case you mentioned that would definitely work. Multivalued of course, so it can contain Supplier Raj as well. 2010/7/23 Pramod Goyal pramod.go...@gmail.com In my case the document id is the unique key (each row is not a unique document). So a single document has multiple Party Values and Party Types. Hence I need to define both Party Value and Party Type as multi-valued. Is there any way in solr to say p_value[someIndex]=pramod And p_type[someIndex]=client. Is there any other way I can design my schema? I have some solutions but none seems to be a good solution. One way would be to define a single field in the schema as p_value_type = client pramod, i.e. combine the value from both fields and store it in a single field. On Sat, Jul 24, 2010 at 12:18 AM, Geert-Jan Brits gbr...@gmail.com wrote: With the usecase you specified it should work to just index each Row as you described in your initial post to be a seperate document. This way p_value and p_type all get singlevalued and you get a correct combination of p_value and p_type. However, this may not go so well with other use-cases you have in mind, e.g.: requiring that no multiple results are returned with the same document id. 2010/7/23 Pramod Goyal pramod.go...@gmail.com I want to do that. But if i understand correctly in solr it would store the field like this: p_value: Pramod Raj p_type: Client Supplier When i search p_value:Pramod AND p_type:Supplier it would give me result as document 1. Which is incorrect, since in document 1 Pramod is a Client and not a Supplier.
On Fri, Jul 23, 2010 at 11:52 PM, Nagelberg, Kallin knagelb...@globeandmail.com wrote: I think you just want something like: p_value:Pramod AND p_type:Supplier no? -Kallin Nagelberg -Original Message- From: Pramod Goyal [mailto:pramod.go...@gmail.com] Sent: Friday, July 23, 2010 2:17 PM To: solr-user@lucene.apache.org Subject: help with a schema design problem Hi, Lets say i have table with 3 columns document id Party Value and Party Type. In this table i have 3 rows. 1st row Document id: 1 Party Value: Pramod Party Type: Client. 2nd row: Document id: 1 Party Value: Raj Party Type: Supplier. 3rd row Document id:2 Party Value: Pramod Party Type: Supplier. Now in this table if i use SQL its easy for me find all document with Party Value as Pramod and Party Type as Client. I need to design solr schema so that i can do the same in Solr. If i create 2 fields in solr schema Party value and Party type both of them multi valued and try to query +Pramod +Supplier then solr will return me the first document, even though in the first document Pramod is a client and not a supplier Thanks, Pramod Goyal
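The combined-field workaround discussed in this thread can be sketched as follows: pair each party type with its value into one multivalued field, so the association survives indexing (names taken from the example):

```python
def combined_party_field(parties):
    # parties: list of (p_type, p_value) pairs for one document.
    # Keeping type and value glued together in a single token sequence
    # prevents cross-matching a value from one row with the type of another.
    return [f"{p_type} {p_value}" for p_type, p_value in parties]

doc1 = combined_party_field([("Client", "Pramod"), ("Supplier", "Raj")])
print(doc1)  # → ['Client Pramod', 'Supplier Raj']
```

A query for the exact pair ("Client Pramod", e.g. as a phrase) then matches document 1, while "Supplier Pramod" does not.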
Re: filter query on timestamp slowing query???
Just wanted to mention a possible other route, which might be entirely hypothetical :-) *If* you could query on the internal docid (I'm not sure it's available out of the box, or at all) your original problem, quoted below, could imo be simplified to asking for the last docid inserted (that matches the other criteria from your use case) and, in the next call, filtering from that docid forward. Every 30 minutes, i ask the index what are the documents that were added to it, since the last time i queried it, that match a certain criteria. From time to time, once a week or so, i ask the index for ALL the documents that match that criteria. (i also do this for not only one query, but several) This is why i need the timestamp filter. Again, I'm not entirely sure that querying / filtering on internal docids is possible (perhaps someone can comment), but if it is, it would perhaps be more performant. Big IF, I know. Geert-Jan 2010/7/23 Chris Hostetter hossman_luc...@fucit.org : On top of using trie dates, you might consider separating the timestamp : portion and the type portion of the fq into separate fq parameters -- : that will allow them to be stored in the filter cache separately. So : for instance, if you include type:x OR type:y in queries a lot, but : with different date ranges, then when you make a new query, the set for : type:x OR type:y can be pulled from the filter cache and intersected definitely ... that's the one big thing that jumped out at me once you showed us *how* you were constructing these queries. -Hoss
Re: help with a schema design problem
Multiple rows in the OP's example are combined to form 1 solr document (e.g.: rows 1 and 2 both have documentid=1). Because of this combining, it would match p_value from row 1 with p_type from row 2 (or vice versa) 2010/7/23 Nagelberg, Kallin knagelb...@globeandmail.com When i search p_value:Pramod AND p_type:Supplier it would give me result as document 1. Which is incorrect, since in document 1 Pramod is a Client and not a Supplier. Would it? I would expect it to give you nothing. -Kal -Original Message- From: Geert-Jan Brits [mailto:gbr...@gmail.com] Sent: Friday, July 23, 2010 5:05 PM To: solr-user@lucene.apache.org Subject: Re: help with a schema design problem Is there any way in solr to say p_value[someIndex]=pramod And p_type[someIndex]=client. No, I'm 99% sure there is not. One way would be to define a single field in the schema as p_value_type = client pramod i.e. combine the value from both the field and store it in a single field. yep, for the use-case you mentioned that would definitely work. Multivalued of course, so it can contain Supplier Raj as well. 2010/7/23 Pramod Goyal pramod.go...@gmail.com In my case the document id is the unique key( each row is not a unique document ) . So a single document has multiple Party Value and Party Type. Hence i need to define both Party value and Party type as mutli-valued. Is there any way in solr to say p_value[someIndex]=pramod And p_type[someIndex]=client. Is there any other way i can design my schema ? I have some solutions but none seems to be a good solution. One way would be to define a single field in the schema as p_value_type = client pramod i.e. combine the value from both the field and store it in a single field. On Sat, Jul 24, 2010 at 12:18 AM, Geert-Jan Brits gbr...@gmail.com wrote: With the usecase you specified it should work to just index each Row as you described in your initial post to be a seperate document.
This way p_value and p_type all get singlevalued and you get a correct combination of p_value and p_type. However, this may not go so well with other use-cases you have in mind, e.g.: requiring that no multiple results are returned with the same document id. 2010/7/23 Pramod Goyal pramod.go...@gmail.com I want to do that. But if i understand correctly in solr it would store the field like this: p_value: Pramod Raj p_type: Client Supplier When i search p_value:Pramod AND p_type:Supplier it would give me result as document 1. Which is incorrect, since in document 1 Pramod is a Client and not a Supplier. On Fri, Jul 23, 2010 at 11:52 PM, Nagelberg, Kallin knagelb...@globeandmail.com wrote: I think you just want something like: p_value:Pramod AND p_type:Supplier no? -Kallin Nagelberg -Original Message- From: Pramod Goyal [mailto:pramod.go...@gmail.com] Sent: Friday, July 23, 2010 2:17 PM To: solr-user@lucene.apache.org Subject: help with a schema design problem Hi, Lets say i have table with 3 columns document id Party Value and Party Type. In this table i have 3 rows. 1st row Document id: 1 Party Value: Pramod Party Type: Client. 2nd row: Document id: 1 Party Value: Raj Party Type: Supplier. 3rd row Document id:2 Party Value: Pramod Party Type: Supplier. Now in this table if i use SQL its easy for me find all document with Party Value as Pramod and Party Type as Client. I need to design solr schema so that i can do the same in Solr. If i create 2 fields in solr schema Party value and Party type both of them multi valued and try to query +Pramod +Supplier then solr will return me the first document, even though in the first document Pramod is a client and not a supplier Thanks, Pramod Goyal
Re: indexing best practices
Have you read: http://wiki.apache.org/lucene-java/ImproveIndexingSpeed http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Scaling-Lucene-and-Solr To be short: there are only guidelines (see links), no definitive answers. If you followed the guidelines for improving indexing speed on a single box and, after having tested various settings, indexing is still too slow, you may want to test this scenario: 1. index to several boxes/shards (using round robin or something). 2. copy all created indexes to one box. 3. use IndexWriter.addIndexes to merge the indexes. 1/2/3 done on SSDs is of course going to boost performance a lot as well (on large indexes, because small ones may fit in the disk cache entirely). Hope that helps a bit, Geert-Jan 2010/7/18 kenf_nc ken.fos...@realestate.com No one has done performance analysis? Or has a link to anywhere where it's been done? Basically: what is the fastest way to get documents into Solr? So many options available, what's the fastest: 1) file import (xml, csv) vs DIH vs POSTing 2) number of concurrent clients: 1 vs 10 vs 100 ... is there a diminishing-returns number? I have 16 million small (8 to 10 fields, no large text fields) docs that get updated monthly and 2.5 million largish (20 to 30 fields, a couple html text fields) that get updated monthly. It currently takes about 20 hours to do a full import. I would like to cut that down as much as possible. Thanks, Ken -- View this message in context: http://lucene.472066.n3.nabble.com/indexing-best-practices-tp973274p976313.html Sent from the Solr - User mailing list archive at Nabble.com.
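Step 1 of the scenario above (spreading documents over several boxes/shards) can be sketched as a plain round-robin split; the shard count and the document list here are made up for illustration:

```python
def round_robin(docs, n_shards):
    """Distribute documents over n_shards buckets, one doc at a time."""
    buckets = [[] for _ in range(n_shards)]
    for i, doc in enumerate(docs):
        buckets[i % n_shards].append(doc)
    return buckets

buckets = round_robin(list(range(10)), 3)
assert [len(b) for b in buckets] == [4, 3, 3]       # near-even split
assert sorted(sum(buckets, [])) == list(range(10))  # nothing lost
```

Steps 2 and 3 then copy the resulting shard indexes to one box and merge them there with IndexWriter.addIndexes.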
Re: Re: How to speed up solr search speed
My query string is always simple like design, principle of design, tom EG: URL: http://localhost:7550/solr/select/?q=design&version=2.2&start=0&rows=10&indent=on IMO, indeed, with these types of simple searches caching (and thus RAM usage) cannot be fully exploited, i.e: there isn't really anything to cache (no sort-ordering or faceting (Lucene fieldcache), no documentsets or faceting (Solr filtercache)). The only thing that helps you here would be a big Solr querycache, depending on how often queries are repeated. Just execute the same query twice; the second time you should see a fast response (say 20ms): that's the querycache (and thus RAM) working for you. Now the issue I found is that search with the fq argument looks to slow down the search. This doesn't align with your previous statement that you only use search with a q-param (e.g: http://localhost:7550/solr/select/?q=design&version=2.2&start=0&rows=10&indent=on ). For your own sake, explain what you're trying to do, otherwise we really are guessing in the dark. Anyway, the fq-param lets you cache (using the Solr filtercache) individual documentsets that can be used to efficiently intersect your resultset. Also, the first time, caches should be warmed (i.e: the fq-query should be executed and the results saved to cache, since there isn't anything there yet). Only on the second time would you start seeing improvements. For instance: http://localhost:7550/solr/select/?q=design&fq=doctype:pdf&version=2.2&start=0&rows=10&indent=on would only show documents containing design when doctype=pdf. (Again, this is just an example where I'm assuming that you have defined a field 'doctype'.) Since the nr of values of doctype would be pretty low and it would be used independently of other queries, this would be an excellent candidate for the fq-param.
http://wiki.apache.org/solr/CommonQueryParameters#fq This was a longer reply than I wanted it to be. Really think about your use-cases first, then present some real examples of what you want to achieve, and then we can help you in a more useful manner. Cheers, Geert-Jan 2010/7/17 marship mars...@126.com Hi Peter and All. I merged my indexes today. Now each index stores 10M documents, and I only have 10 solr cores. I used java -Xmx1g -jar -server start.jar to start the jetty server. At first I deployed them all on one server. The search speed was about 3s. Then I noticed from the cmd output that when a search starts, 4 of the 10 cores' QTime only cost about 10ms-500ms. The rest cost more, up to 2-3s. Then I put 6 on the web server, 4 on another (the DB server, high load most of the time). The search speed went down to about 1s; now most searches take about 1s. That's great. I watched the jetty output in the cmd windows on the web server: now when each search starts, 2 of 6 cost 60ms-80ms, the other 4 cost 170ms-700ms. I do believe the bottleneck is still the hard disk. But at least the search speed at the moment is acceptable. Maybe I should try MemDisk to see if that helps. And about -Xmx1g: actually I only see jetty consume about 150M memory, and considering the index is now 10x bigger, I don't think that works. I googled it: -Xmx enlarges the heap size. Not sure whether that can help search. I still have 3.5G memory free on the server. Now the issue I found is that search with the fq argument looks to slow down the search. Thanks All for your help and suggestions. Thanks. Regards. Scott On 2010-07-17 03:36:19, Peter Karich peat...@yahoo.de wrote: Each solr(jetty) instance consumes 40M-60M memory. java -Xmx1024M -jar start.jar That's a good suggestion! Please double check that you are using the -server version of the jvm and the latest 1.6.0_20 or so.
Additionally you can start jvisualvm (shipped with the jdk) and hook into jetty/tomcat easily to see the current CPU and memory load. But I have 70 solr cores If you ask me: I would reduce them to 10-15 or even less and increase the RAM. Try out tomcat too. Solr distributed search's speed is decided by the slowest shard, so try to reduce the cores. Regards, Peter. you mentioned that you have a lot of mem free, but your jetty containers are only using between 40-60M mem. probably stating the obvious, but have you increased the -Xmx param, like for instance: java -Xmx1024M -jar start.jar that way you're configuring the container to use a maximum of 1024 MB ram instead of the standard, which is much lower (I'm not sure what exactly, but it could well be 64MB for non -server, aligning with what you're seeing) Geert-Jan 2010/7/16 marship mars...@126.com Hi Tom Burton-West. Sorry, looks like my email ISP filtered out your replies. I checked the web version of the mailing list and saw your reply. My query string is always simple like
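As an aside on the URLs quoted in this thread: the q and fq parameters should be ampersand-separated and URL-escaped. A small Python sketch of building such a select URL (the base URL and the 'doctype' field are taken from the fq example earlier in the thread, and the helper name is made up):

```python
from urllib.parse import urlencode

def solr_select_url(base, q, fq=None, start=0, rows=10):
    """Build a /select URL; the fq clause is what Solr caches separately
    in its filterCache, independently of the main q."""
    params = [("q", q), ("version", "2.2"), ("start", start),
              ("rows", rows), ("indent", "on")]
    if fq:
        params.append(("fq", fq))
    return base + "/select/?" + urlencode(params)

url = solr_select_url("http://localhost:7550/solr", "design", fq="doctype:pdf")
assert "q=design" in url and "fq=doctype%3Apdf" in url
```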
Re: Re:Re: How to speed up solr search speed
you mentioned that you have a lot of mem free, but your jetty containers are only using between 40-60M mem. probably stating the obvious, but have you increased the -Xmx param, like for instance: java -Xmx1024M -jar start.jar that way you're configuring the container to use a maximum of 1024 MB ram instead of the standard, which is much lower (I'm not sure what exactly, but it could well be 64MB for non -server, aligning with what you're seeing) Geert-Jan 2010/7/16 marship mars...@126.com Hi Tom Burton-West. Sorry, looks like my email ISP filtered out your replies. I checked the web version of the mailing list and saw your reply. My query string is always simple like design, principle of design, tom EG: URL: http://localhost:7550/solr/select/?q=design&version=2.2&start=0&rows=10&indent=on Response:

<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">16</int>
    <lst name="params">
      <str name="indent">on</str>
      <str name="start">0</str>
      <str name="q">design</str>
      <str name="version">2.2</str>
      <str name="rows">10</str>
    </lst>
  </lst>
  <result name="response" numFound="5981" start="0">
    <doc>
      <str name="id">product_208619</str>
    </doc>
    ...
  </result>
</response>

EG: http://localhost:7550/solr/select/?q=Principle&version=2.2&start=0&rows=10&indent=on

<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">94</int>
    <lst name="params">
      <str name="indent">on</str>
      <str name="start">0</str>
      <str name="q">Principle</str>
      <str name="version">2.2</str>
      <str name="rows">10</str>
    </lst>
  </lst>
  <result name="response" numFound="104" start="0">
    <doc>
      <str name="id">product_56926</str>
    </doc>
    ...
  </result>
</response>

As I am querying over a single core and other cores are not querying at the same time, the QTime looks good. But when I query the distributed node: (For this case, 6422ms is still not a bad one.
Many cost ~20s.) URL: http://localhost:7499/solr/select/?q=the+first+world+war&version=2.2&start=0&rows=10&indent=on&debugQuery=true Response:

<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">6422</int>
    <lst name="params">
      <str name="debugQuery">true</str>
      <str name="indent">on</str>
      <str name="start">0</str>
      <str name="q">the first world war</str>
      <str name="version">2.2</str>
      <str name="rows">10</str>
    </lst>
  </lst>
  <result name="response" numFound="4231" start="0">
    ...
  </result>
</response>

Actually I am thinking about and testing a solution: as I believe the bottleneck is the hard disk and all our indexes add up to about 10-15G, what about I just add another 16G memory to my server, then use MemDisk to map a memory disk and put all my indexes into it? Then each time solr/jetty needs to load the index from the hard disk, it is loading from memory. This should give solr the most throughput and avoid the hard-disk access delay. I am testing. But if there is a way to make solr make better use of our limited resources and avoid adding new ones, that would be great.
Re: How I can use score value for my function
It's possible using functionqueries. See this link: http://wiki.apache.org/solr/FunctionQuery#query 2010/6/29 MitchK mitc...@web.de Ramzesua, this is not possible, because Solr does not know the resulting score at query-time (as far as I know). The score is computed when every hit from every field is combined by the scorer. Furthermore, I have shown you an alternative in the other threads. It does not do exactly what you are describing, but it works without a problem. Regards - Mitch -- View this message in context: http://lucene.472066.n3.nabble.com/How-I-can-use-score-value-for-my-function-tp899662p930646.html Sent from the Solr - User mailing list archive at Nabble.com.
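A hedged sketch of what wiring the query() function into a request might look like: the subquery is passed via parameter dereferencing ($qq) inside a boost-function parameter. The field name 'category' and the choice of the bf parameter are assumptions for illustration, not taken from the thread:

```python
from urllib.parse import urlencode

# Fold the score of a subquery into the ranking via a boost function.
# $qq is resolved from the extra 'qq' request parameter.
params = {
    "q": "design",
    "bf": "query($qq)",   # adds the subquery's score to each document's score
    "qq": "category:tv",  # assumed field/value, for illustration only
}
qs = urlencode(params)
assert "bf=query%28%24qq%29" in qs
```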
Re: Setting many properties for a multivalued field. Schema.xml ? External file?
You can treat dynamic fields like any other field, so you can facet, sort, filter, etc. on these fields (afaik). I believe the confusion arises because the usecase for dynamic fields is sometimes ill-understood, i.e: to be able to use them to do some kind of wildcard search, e.g: search for a value in any of the dynamic fields at once, like pic_url_*. This however is NOT possible. As far as your question goes: Now, I'm trying to make facets on pictures: display doc w/ pic vs. doc w/o pic To the best of my knowledge, everyone is saying that faceting cannot be done on dynamic fields (only on definitive field names). Thus, I tried the following and it's working: I assume that the stored pictures have a sequential number (_1, _2, etc.), i.e., if pic_url_1 exists in the index, it means that the underlying doc has at least one picture: ...&facet=on&facet.field=pic_url_1&facet.mincount=1&fq=pic_url_1:* While this is working fine, I'm wondering whether there's a cleaner way to do the same thing without assuming that pictures have a sequential number. If I understand your question correctly: faceting on docs with and without pics could of course be done like you mention; however, it would be more efficient to have an extra field defined, hasAtLeastOnePic with values (0 | 1), and use that to facet / filter on. You can extend this to NrOfPics [0,N) if you need to filter / facet on docs with a certain nr of pics. Also, I wondered what else you wanted to do with this pic-related info. Do you want to search on pic-description / pic-caption for instance? In that case the dynamic-fields approach may not be what you want: how would you know in which dynamic field to search for a particular term? Would it be pic_desc_1, or pic_desc_x? Of course you could OR over all dynamic fields, but you would need to know an upper bound for the nr of pics, and it really doesn't feel right, to me at least.
If you need search on pic_description for instance, but don't mind which pic matches, you could create a single field pic_description, put in the concatenation of all pic-descriptions, and search on that; or just make it a multi-valued field. If you don't need search at all on these fields, the best thing imo is to store all pic-related info of all pics together by concatenating them with some delimiter which you know how to separate at the client-side. That, or just store it in an external RDB, since solr is then just sitting on the data and not doing anything intelligent with it. I assume btw that you don't want to sort / facet on pic_desc / pic_caption / pic_url either (I have a hard time thinking of a useful usecase for that). HTH, Geert-Jan 2010/6/26 Saïd Radhouani r.steve@gmail.com Thanks so much Otis. This is working great. Now, I'm trying to make facets on pictures: display doc w/ pic vs. doc w/o pic To the best of my knowledge, everyone is saying that faceting cannot be done on dynamic fields (only on definitive field names). Thus, I tried the following and it's working: I assume that the stored pictures have a sequential number (_1, _2, etc.), i.e., if pic_url_1 exists in the index, it means that the underlying doc has at least one picture: ...&facet=on&facet.field=pic_url_1&facet.mincount=1&fq=pic_url_1:* While this is working fine, I'm wondering whether there's a cleaner way to do the same thing without assuming that pictures have a sequential number. Also, do you have any documentation about handling Dynamic Fields using SolrJ? So far, I found only issues about that on JIRA, but no documentation. Thanks a lot.
-Saïd On Jun 26, 2010, at 1:18 AM, Otis Gospodnetic wrote: Saïd, Dynamic fields could help here, for example imagine a doc with: id pic_url_* pic_caption_* pic_description_* See http://wiki.apache.org/solr/SchemaXml#Dynamic_fields So, for you: dynamicField name=pic_url_* type=string indexed=true stored=true/ dynamicField name=pic_caption_* type=text indexed=true stored=true/ dynamicField name=pic_description_* type=text indexed=true stored=true/ Then you can add docs with unlimited number of pic_(url|caption|description)_* fields, e.g. id pic_url_1 pic_caption_1 pic_description_1 id pic_url_2 pic_caption_2 pic_description_2 Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original Message From: Saïd Radhouani r.steve@gmail.com To: solr-user@lucene.apache.org Sent: Fri, June 25, 2010 6:01:13 PM Subject: Setting many properties for a multivalued field. Schema.xml ? External file? Hi, I'm trying to index data containing a multivalued field picture, that has three properties: url, caption and description: picture/ url/ caption/ description/ Thus, each indexed document might have many pictures, each of them has a url, a caption, and a description. I wonder wether it's
Re: Setting many properties for a multivalued field. Schema.xml ? External file?
If I understand your suggestion correctly, you said that there's NO need to have many Dynamic Fields; instead, we can have one definitive field name, which can store a long string (concatenation of information about tens of pictures), e.g., using - and % delimiters: pic_url_value1-pic_caption_value1-pic_description_value1%pic_url_value2-pic_caption_value2-pic_description_value2%... I don't clearly see the reason for doing this. Is there a gain in terms of performance? Or does this make programming on the client-side easier? Or something else? I think you should ask the exact opposite question. If you don't do anything with these fields which Solr is particularly good at (searching / filtering / faceting / sorting), why go through the trouble of creating dynamic fields? (more fields is more overhead cost / tracking cost no matter how you look at it) Moreover, indeed, from a client view it's easier the way I suggested, since otherwise you: - would have to ask (through SolrJ) to include all dynamic fields to be returned in the fl-field ( http://wiki.apache.org/solr/CommonQueryParameters#fl ). This is difficult, because a priori you don't know how many dynamic fields to query. So in other words, you can't just ask Solr (through SolrJ, like you asked) to return all dynamic fields beginning with pic_*. (afaik) - your client iteration code (looping over the pics) is a bit more involved. HTH, Cheers, Geert-Jan 2010/6/26 Saïd Radhouani r.steve@gmail.com Thanks Geert-Jan for the detailed answer. Actually, I don't search at all on these fields. I'm only filtering (w/ vs w/o pic) and sorting (based on the number of pictures). Thus, your suggestion of adding an extra field NrOfPics [0,N] would be the best solution.
That or just store it in an external RDB since solr is just sitting on the data and not doing anything intelligent with it. If I understand your suggestion correctly, you said that there's NO need to have many Dynamic Fields; instead, we can have one definitive field name, which can store a long string (concatenation of information about tens of pictures), e.g., using - and % delimiters: pic_url_value1-pic_caption_value1-pic_description_value1%pic_url_value2-pic_caption_value2-pic_description_value2%... I don't clearly see the reason of doing this. Is there a gain in terms of performance? Or does this make programming on the client-side easier? Or something else? My other question was: in case we use Dynamic Fields, is there a documentation about using SolrJ for this purpose? Thanks -Saïd On Jun 26, 2010, at 12:29 PM, Geert-Jan Brits wrote: You can treat dynamic fields like any other field, so you can facet, sort, filter, etc on these fields (afaik) I believe the confusion arises that sometimes the usecase for dynamic fields seems to be ill-understood, i.e: to be able to use them to do some kind of wildcard search, e.g: search for a value in any of the dynamic fields at once like pic_url_*. This however is NOT possible. As far as your question goes: Now, I'm trying to make facets on pictures: display doc w/ pic vs. doc w/o pic To the best of my knowledge, everyone is saying that faceting cannot be done on dynamic fields (only on definitive field names). Thus, I tried the following and it's working: I assume that the stored pictures have a sequential number (_1, _2, etc.), i.e., if pic_url_1 exists in the index, it means that the underlying doc has at least one picture: ...facet=onfacet.field=pic_url_1facet.mincount=1fq=pic_url_1:* While this is working fine, I'm wondering whether there's a cleaner way to do the same thing without assuming that pictures have a sequential number. 
If I understand your question correctly: faceting on docs with and without pics could ofcourse by done like you mention, however it would be more efficient to have an extra field defined: hasAtLestOnePic with values (0 | 1) use that to facet / filter on. you can extend this to NrOfPics [0,N) if you need to filter / facet on docs with a certain nr of pics. also I wondered what else you wanted to do with this pic-related info. Do you want to search on pic-description / pic-caption for instance? In that case the dynamic-fields approach may not be what you want: how would you know in which dynamic-field to search for a particular term? Would if be pic_desc_1 , or pic_desc_x? Of couse you could OR over all dynamic fields, but you need to know how many pics an upperbound for the nr of pics and it really doesn't feel right, to me at least. If you need search on pic_description for instance, but don't mind what pic matches, you could create a single field
Re: Setting many properties for a multivalued field. Schema.xml ? External file?
btw, be careful with you delimiters: pic_url may possibly contain a '-', etc. 2010/6/26 Geert-Jan Brits gbr...@gmail.com If I understand your suggestion correctly, you said that there's NO need to have many Dynamic Fields; instead, we can have one definitive field name, which can store a long string (concatenation of information about tens of pictures), e.g., using - and % delimiters: pic_url_value1-pic_caption_value1-pic_description_value1%pic_url_value2-pic_caption_value2-pic_description_value2%... I don't clearly see the reason of doing this. Is there a gain in terms of performance? Or does this make programming on the client-side easier? Or something else? I think you should ask the exact opposite question. If you don't do anything with these fields which Solr is particularly good at (searching / filtering / faceting/ sorting) why go through the trouble of creating dynamic fields? (more fields is more overhead cost/ tracking cost no matter how you look at it) Moreover, indeed from a client-view it's easier the way I suggested, since otherwise you: - would have to ask (through SolrJ) to include all dynamic fields to be returned in the Fl-field ( http://wiki.apache.org/solr/CommonQueryParameters#fl). This is difficult, because a-priori you don't know how many dynamic-fields to query. So in other words you can't just ask SOlr (though SolrJ lik you asked) to just return all dynamic fields beginning with pic_*. (afaik) - your client iterate code (looping the pics) is a bit more involved. HTH, Cheers, Geert-Jan 2010/6/26 Saïd Radhouani r.steve@gmail.com Thanks Geert-Jan for the detailed answer. Actually, I don't search at all on these fields. I'm only filtering (w/ vs w/ pic) and sorting (based on the number of pictures). Thus, your suggestion of adding an extra field NrOfPics [0,N] would be the best solution. 
Regarding the other suggestion: If you dont need search at all on these fields, the best thing imo is to store all pic-related info of all pics together by concatenating them with some delimiter which you know how to seperate at the client-side. That or just store it in an external RDB since solr is just sitting on the data and not doing anything intelligent with it. If I understand your suggestion correctly, you said that there's NO need to have many Dynamic Fields; instead, we can have one definitive field name, which can store a long string (concatenation of information about tens of pictures), e.g., using - and % delimiters: pic_url_value1-pic_caption_value1-pic_description_value1%pic_url_value2-pic_caption_value2-pic_description_value2%... I don't clearly see the reason of doing this. Is there a gain in terms of performance? Or does this make programming on the client-side easier? Or something else? My other question was: in case we use Dynamic Fields, is there a documentation about using SolrJ for this purpose? Thanks -Saïd On Jun 26, 2010, at 12:29 PM, Geert-Jan Brits wrote: You can treat dynamic fields like any other field, so you can facet, sort, filter, etc on these fields (afaik) I believe the confusion arises that sometimes the usecase for dynamic fields seems to be ill-understood, i.e: to be able to use them to do some kind of wildcard search, e.g: search for a value in any of the dynamic fields at once like pic_url_*. This however is NOT possible. As far as your question goes: Now, I'm trying to make facets on pictures: display doc w/ pic vs. doc w/o pic To the best of my knowledge, everyone is saying that faceting cannot be done on dynamic fields (only on definitive field names). 
Thus, I tried the following and it's working: I assume that the stored pictures have a sequential number (_1, _2, etc.), i.e., if pic_url_1 exists in the index, it means that the underlying doc has at least one picture: ...facet=onfacet.field=pic_url_1facet.mincount=1fq=pic_url_1:* While this is working fine, I'm wondering whether there's a cleaner way to do the same thing without assuming that pictures have a sequential number. If I understand your question correctly: faceting on docs with and without pics could ofcourse by done like you mention, however it would be more efficient to have an extra field defined: hasAtLestOnePic with values (0 | 1) use that to facet / filter on. you can extend this to NrOfPics [0,N) if you need to filter / facet on docs with a certain nr of pics. also I wondered what else you wanted to do with this pic-related info. Do you want to search on pic-description / pic-caption for instance? In that case the dynamic-fields approach may not be what you want: how would you know in which dynamic-field to search for a particular term? Would if be pic_desc_1 , or pic_desc_x? Of couse you could OR over all dynamic fields, but you need to know how many pics an upperbound for the nr of pics and it really doesn't feel right
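The delimiter caveat raised at the top of this thread (a '-' inside a pic_url would break the split) can be avoided by using a serialization format with proper escaping instead of ad-hoc delimiters. A sketch assuming JSON, with made-up picture data:

```python
import json

pics = [
    {"url": "http://example.com/a-b.jpg", "caption": "front", "description": "first pic"},
    {"url": "http://example.com/c.jpg", "caption": "side", "description": "second pic"},
]

stored = json.dumps(pics)     # one stored (not indexed) field value
decoded = json.loads(stored)  # client-side decode after retrieval
assert decoded == pics
assert "-" in decoded[0]["url"]  # a '-' in the URL causes no ambiguity
```

The design choice is the same as in the thread: Solr only stores the blob, and the client does the unpacking.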
Re: Searching across multiple repeating fields
Perhaps my answer is useless, because I don't have an answer to your direct question, but: You *might* want to consider whether your concept of a solr-document is on the correct granular level, i.e: your posted problem could be tackled (afaik) by defining a document as being a 'sub-event' with only 1 daterange. So each event-doc you have now is replaced by several sub-event docs in this proposed situation. Additionally, each sub-event doc gets an additional field 'parent-eventid' which maps to something like an event-id (which you're probably using). So several sub-event docs can point to the same event-id. Lastly, all sub-event docs belonging to a particular event carry all the other fields that you may have stored in that particular event-doc. Now you can query for events based on date-ranges like you envisioned, but instead of returning events you return sub-event docs. However, since all data of the original event (except the multiple dateranges) is available in the sub-event doc, this shouldn't really bother the client. If you need to display all dates of an event (the only info missing from the returned solr-doc) you could easily store them in an RDB and fetch them using the defined parent-eventid. The only caveat I see is that possibly multiple sub-events with the same 'parent-eventid' might get returned for a particular query. This however depends on the type of queries you envision, i.e: 1) If you always issue queries with date-filters, and *assuming* that sub-events of a particular event don't temporally overlap, you will never get multiple sub-events returned. 2) If 1) doesn't hold, and assuming you *do* mind multiple sub-events of the same actual event, you could try to use Field Collapsing on 'parent-eventid' to only return the first sub-event per parent-eventid that matches the rest of your query. (Note however, that Field Collapsing is a patch at the moment.
http://wiki.apache.org/solr/FieldCollapsing) Not sure if this helped you at all, but at the very least it was a nice conceptual exercise ;-) Cheers, Geert-Jan 2010/6/22 Mark Allan mark.al...@ed.ac.uk Hi all, Firstly, I apologise for the length of this email but I need to describe properly what I'm doing before I get to the problem! I'm working on a project just now which requires the ability to store and search on temporal coverage data - ie. a field which specifies a date range during which a certain event took place. I hunted around for a few days and couldn't find anything which seemed to fit, so I had a go at writing my own field type based on solr.PointType. It's used as follows:

schema.xml:
<fieldType name="temporal" class="solr.TemporalCoverage" dimension="2" subFieldSuffix="_i"/>
<field name="daterange" type="temporal" indexed="true" stored="true" multiValued="true"/>

data.xml:
<add>
  <doc>
    ...
    <field name="daterange">1940,1945</field>
  </doc>
</add>

Internally, this gets stored as:

<arr name="daterange"><str>1940,1945</str></arr>
<int name="daterange_0_i">1940</int>
<int name="daterange_1_i">1945</int>

In due course, I'll declare the subfields as a proper date type, but in the meantime, this works absolutely fine. I can search for an individual date and Solr will check (queryDate > daterange_0 AND queryDate < daterange_1) and the correct documents are returned. My code also allows the user to input a date range in the query but I won't complicate matters with that just now! The problem arises when a document has more than one daterange field (imagine a news broadcast which covers a variety of topics and hence time periods). A document with two daterange fields

<doc>
  ...
  <field name="daterange">19820402,19820614</field>
  <field name="daterange">1990,2000</field>
</doc>

gets stored internally as

<arr name="daterange"><str>19820402,19820614</str><str>1990,2000</str></arr>
<arr name="daterange_0_i"><int>19820402</int><int>1990</int></arr>
<arr name="daterange_1_i"><int>19820614</int><int>2000</int></arr>

In this situation, searching for 1985 should yield zero results as it is contained within neither daterange; however, the above document is returned in the result set. What Solr is doing is checking that the queryDate (1985) is greater than *any* of the values in daterange_0 AND queryDate is less than *any* of the values in daterange_1. How can I get Solr to respect the positions of each item in the daterange_0 and _1 arrays? Ideally I'd like the search to use the following logic, thus preventing the above document from being returned in a search for 1985: (queryDate > daterange_0[0] AND queryDate < daterange_1[0]) OR (queryDate > daterange_0[1] AND queryDate < daterange_1[1]) Someone else had a very similar problem recently on the mailing list with a multiValued PointType field but the thread went cold without a final solution. While I could filter the results when they get back to my application layer, it seems like it's not really the right
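The per-pair semantics Mark is after can be stated compactly: keep daterange_0[i] aligned with daterange_1[i] and require that some single pair contains the query date. A Python sketch of that logic, using the document from the example (inclusive bounds are an assumption; the thread uses strict comparisons):

```python
def in_any_range(query_date, starts, ends):
    """True if query_date falls inside any (starts[i], ends[i]) pair,
    keeping the two arrays aligned by position - not the any-vs-any
    comparison Solr performs on the flattened subfields."""
    return any(s <= query_date <= e for s, e in zip(starts, ends))

# Document with ranges (19820402, 19820614) and (1990, 2000).
starts, ends = [19820402, 1990], [19820614, 2000]
assert not in_any_range(1985, starts, ends)  # 1985 is in neither range
assert in_any_range(1995, starts, ends)      # 1995 is inside (1990, 2000)
```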
Re: Sort facet Field by name
facet.sort=false http://wiki.apache.org/solr/SimpleFacetParameters#facet.sort 2010/6/21 Ankit Bhatnagar abhatna...@vantage.com Hi All, I couldn't really figure out if we have an option for sorting the facet field by name in ascending/descending order. Any clues? Thanks Ankit
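facet.sort=false returns the facet constraints in index (roughly lexicographic) order; for a descending-by-name order you can also re-sort the returned (value, count) pairs client-side. A sketch with made-up facet data:

```python
facets = [("apple", 3), ("pear", 7), ("banana", 5)]  # (value, count) pairs

by_name = sorted(facets, key=lambda vc: vc[0])                 # ascending by name
by_name_desc = sorted(facets, key=lambda vc: vc[0], reverse=True)

assert [v for v, _ in by_name] == ["apple", "banana", "pear"]
assert [v for v, _ in by_name_desc] == ["pear", "banana", "apple"]
```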
Re: custom scorer in Solr
First of all, do you expect every query to return results for all 4 buckets? In other words: say you make a SortField that sorts for score 4 first, then 3, 2, 1. When displaying the first 10 results, is it ok that these documents potentially all have score 4, and thus only bucket 1 is filled? If so, I can think of the following out-of-the-box option (which I'm not sure performs well enough, but you can easily test it on your data). Following your example, create 4 fields:
1. categoryExact - configure analyzers so that only full matches score, others don't
2. categoryPartial - configure so that full and partial matches score (likely you have already configured this)
3. nameExact - like 1
4. namePartial - like 2
Configure copyfields: 1 -> 2 and 3 -> 4. This way your indexing client can stay the same as it likely is at the moment. Now you have 4 fields whose scores you have to combine at search-time so that the eventual scores are in [1,4]. Out-of-the-box you can do this with functionqueries. http://wiki.apache.org/solr/FunctionQuery I don't have time to write it down exactly, but for each field: - calc the score of each field (use the query functionquery, nr 16 in the wiki). If the score > 0, use the map function to map it to respectively 4, 3, 2, 1. Now for each document you potentially have multiple scores, for instance 4 and 2 if your doc matches exact and partial on category. - use the max functionquery to only return the highest score -> 4 in this case. You have to find out for yourself if this performs though. Hope that helps, Geert-Jan 2010/6/14 Fornoville, Tom tom.fornovi...@truvo.com I've been investigating this further and I might have found another path to consider. Would it be possible to create a custom implementation of a SortField, comparable to the RandomSortField, to tackle the problem? I know it is not your standard question but would really appreciate all feedback and suggestions on this because this is the issue that will make or break the acceptance of Solr for this client.
Thanks, Tom -Original Message- From: Fornoville, Tom Sent: woensdag 9 juni 2010 15:35 To: solr-user@lucene.apache.org Subject: custom scorer in Solr Hi all, We are currently working on a proof-of-concept for a client using Solr and have been able to configure all the features they want except the scoring. Problem is that they want scores that make results fall in buckets: * Bucket 1: exact match on category (score = 4) * Bucket 2: exact match on name (score = 3) * Bucket 3: partial match on category (score = 2) * Bucket 4: partial match on name (score = 1) First thing we did was develop a custom similarity class that would return the correct score depending on the field and an exact or partial match. The only problem now is that when a document matches on both the category and name the scores are added together. Example: searching for restaurant returns documents in the category restaurant that also have the word restaurant in their name and thus get a score of 5 (4+1) but they should only get 4. I assume for this to work we would need to develop a custom Scorer class but we have no clue on how to incorporate this in Solr. Maybe there is even a simpler solution that we don't know about. All suggestions welcome! Thanks, Tom
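The map/max combination described above can be checked numerically: map each raw field score to its bucket value, then take the max so that a document matching several fields still lands in exactly one bucket. A Python sketch with made-up sub-scores:

```python
def bucket_score(exact_cat, exact_name, partial_cat, partial_name):
    """Map each raw sub-score (> 0 means a match) to its bucket value,
    then keep only the highest bucket, so matches never add up."""
    mapped = [
        4 if exact_cat > 0 else 0,
        3 if exact_name > 0 else 0,
        2 if partial_cat > 0 else 0,
        1 if partial_name > 0 else 0,
    ]
    return max(mapped)

# 'restaurant' matches the category exactly AND the name partially:
assert bucket_score(1.2, 0.0, 0.9, 0.7) == 4   # not 4 + 1 = 5
assert bucket_score(0.0, 0.0, 0.0, 0.4) == 1
```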
Re: custom scorer in Solr
Just to be clear, this is for the use-case in which it is OK that potentially only 1 bucket gets filled. 2010/6/14 Geert-Jan Brits gbr...@gmail.com
Re: Tips on recursive xml-parsing in dataConfig
my bad, it looks like XPathEntityProcessor doesn't support relative XPaths. However, I quickly looked at the Slashdot example (which is pretty good actually) at http://wiki.apache.org/solr/DataImportHandler. From that I infer that you use only 1 entity per xml-doc, and within that entity use multiple field declarations with xpath-attributes to extract the values you want. So even though your xml-document is nested (like most XMLs are), your field-declarations are not. I think your best bet is to read the Slashdot example and go from there. For now, I'm not entirely sure what you want a solr-document to be in your example, i.e.: - 1 solr-document per 1 xml-document (as supplied) - or 1 solr-doc per CHAP, per PARA, or per SUB? Once you know that, perhaps coming up with a decent pointer is easier. HTH, Geert-Jan 2010/6/8 Tor Henning Ueland tor.henn...@gmail.com I have tried both to change the datasource per child node to use the parent node's name, and tried making the XPaths relative, both causing either exceptions saying that the XPath must start with /, or nullpointer exceptions (nsfgrantsdir document : null). Best regards On Mon, Jun 7, 2010 at 4:12 PM, Geert-Jan Brits gbr...@gmail.com wrote: I'm guessing (I'm not familiar with the xml dataimport handler, but I am pretty familiar with XPath) that your problem lies in having absolute xpath-queries, instead of relative xpath-queries to your parent node. e.g.: /DOK/TEKST/KAP is absolute (the prefixed '/' tells it to be). Try 'KAP' instead. The same for all xpaths deeper in the tree. Geert-Jan 2010/6/7 Tor Henning Ueland tor.henn...@gmail.com Hi, I am doing some testing of dataimport to Solr from XML-documents with many children in the children. Parsing the children some levels down using XPath goes fine, but the speed is very slow (~1 minute per document, on a quad Xeon server).
When I do the same using the format Solr wants, the parsing time is 0.02 seconds per document. I have published a quick example here: http://pastebin.com/adhcEvRx My question is: I hope that I have done something wrong in the child-parsing (as you can see, it goes down quite a few levels). Can anybody point me in the right direction so I can speed up the process? I have been looking around for some examples, but nobody gives examples of such deep data indexing. PS: I know there are some bugs in the XPath naming etc, but it is just a rough example :) -- Best regards Tor Henning Ueland -- Mvh Tor Henning Ueland
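For comparison with the Slashdot example: a minimal data-config sketch with one entity and flat per-field xpaths. The element names (/DOK/TEKST/KAP, PARA) are taken from the thread; the file name and column names are made up for illustration.

```
<dataConfig>
  <dataSource type="FileDataSource" encoding="UTF-8"/>
  <document>
    <!-- one solr-document per /DOK; every field is an absolute xpath -->
    <entity name="dok" processor="XPathEntityProcessor"
            url="example.xml" forEach="/DOK" stream="true">
      <field column="kap"  xpath="/DOK/TEKST/KAP"/>
      <field column="para" xpath="/DOK/TEKST/KAP/PARA"/>
    </entity>
  </document>
</dataConfig>
```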
Re: Tips on recursive xml-parsing in dataConfig
I'm guessing (I'm not familiar with the xml dataimport handler, but I am pretty familiar with XPath) that your problem lies in having absolute xpath-queries, instead of relative xpath-queries to your parent node. e.g.: /DOK/TEKST/KAP is absolute (the prefixed '/' tells it to be). Try 'KAP' instead. The same for all xpaths deeper in the tree. Geert-Jan 2010/6/7 Tor Henning Ueland tor.henn...@gmail.com Hi, I am doing some testing of dataimport to Solr from XML-documents with many children in the children. Parsing the children some levels down using XPath goes fine, but the speed is very slow (~1 minute per document, on a quad Xeon server). When I do the same using the format Solr wants, the parsing time is 0.02 seconds per document. I have published a quick example here: http://pastebin.com/adhcEvRx My question is: I hope that I have done something wrong in the child-parsing (as you can see, it goes down quite a few levels). Can anybody point me in the right direction so I can speed up the process? I have been looking around for some examples, but nobody gives examples of such deep data indexing. PS: I know there are some bugs in the XPath naming etc, but it is just a rough example :) -- Best regards Tor Henning Ueland
Re: exclude docs with null field
Additionally, I should have mentioned that you can instead do: fq=field_3:[* TO *], which uses the filterCache. The method presented by Chris will probably outperform the above method, but only on the first request; from then on the filterCache takes over. From a performance standpoint it's probably not worth going the 'default value for null'-approach imho. It IS useful however if you want to be able to query on docs with a null-value (instead of excluding them) 2010/6/4 bluestar sea...@butterflycluster.net nice one! thanks. I could be wrong but it seems this way has a performance hit? or am I missing something? Did you read Chris's message in http://search-lucene.com/m/1o5mEk8DjX1/ ? He proposes an alternative (more efficient) way other than [* TO *]
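For reference, both filter forms written out (the field name field_3 is taken from the thread; whether the pure-negative form needs the *:* guard depends on your Solr version):

```
# keep only docs where field_3 has a value (cached in the filterCache):
fq=field_3:[* TO *]

# keep only docs where field_3 is missing (older versions may need the
# *:* guard shown here in front of the negation):
fq=*:* -field_3:[* TO *]
```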
Re: MultiValue Exclusion
I guess the following works. A. similar to your option 2, but using the filterCache: fq=-item_id:001 -item_id:002 B. similar to your option 3, but using the filterCache: fq=-users_excluded_field:userid the advantage being that the filter is cached independently from the rest of the query so it can be reused efficiently. Advantage of A over B: the 'muted news items' can be queried dynamically, i.e.: they aren't set in stone at index time. B will probably perform a little bit better the first time (when not cached), but I'm not sure. hope that helps, Geert-Jan 2010/6/4 homerlex homerlex.nab...@gmail.com How would you model this? We have a table of news items that people can view in their news stream and comment on. Users have the ability to mute items so they never see them in their feed or search results. From what I can see there are a couple of ways to accomplish this. 1 - Post process the results and do not render any muted news items. The downside is that pagination becomes problematic. It's possible we may forgo pagination because of this but for now assume that pagination is a requirement. 2 - Whenever we query for a given user we append a clause that excludes all muted items. I assume in Solr we'd need to do something like -item_id(1 AND 2 AND 3). Obviously this doesn't scale very well. 3 - Have a multi-valued property in the index that contains all ids of users who have muted the item. Being new to Solr I don't even know how (or if it's possible) to run a query that says user id not in this multivalued property. Can this even be done (sample query please)? Again, I know this doesn't scale very well. Any other suggestions? Thanks in advance for the help. -- View this message in context: http://lucene.472066.n3.nabble.com/MultiValue-Exclusion-tp870173p870173.html Sent from the Solr - User mailing list archive at Nabble.com.
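The two filter variants from the reply, written out as request parameters (the ids and the users_excluded_field name are illustrative, following the thread):

```
# A: dynamically exclude the current user's muted items by id
#    (cached independently of q in the filterCache):
q=news&fq=-item_id:001 -item_id:002

# B: exclude via a multi-valued field on each item listing the users
#    who muted it (set at index time):
q=news&fq=-users_excluded_field:user42
```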
Re: Regarding Facet Date query using SolrJ -- Not getting any examples to start with.
Hi Ninad, SolrQuery q = new SolrQuery(); q.setQuery("*:*"); q.setFacet(true); q.set("facet.date", "pub"); q.set("facet.date.start", "2000-01-01T00:00:00Z"); ... etc. Basically you can completely build your entire query with the 'raw' set (and add) methods. The specific methods are just helpers. So this is the same as above: SolrQuery q = new SolrQuery(); q.set("q", "*:*"); q.set("facet", true); q.set("facet.date", "pub"); q.set("facet.date.start", "2000-01-01T00:00:00Z"); ... etc. Geert-Jan 2010/6/2 Ninad Raut hbase.user.ni...@gmail.com Hi, I want to hit the query given below: ?q=*:*&facet=true&facet.date=pub&facet.date.start=2000-01-01T00:00:00Z&facet.date.end=2010-01-01T00:00:00Z&facet.date.gap=%2B1YEAR using SolrJ. I am browsing the net but not getting any clues about how I should approach it. How can the SolrJ API be used to create the above mentioned query? Regards, Ninad R
Re: Interleaving the results
Indeed, it's just a matter of ordering the results on the client-side, IFF I infer correctly from your description that you are guaranteed to get results from enough different customers from Solr in the first place to do the interleaving that you describe. (In general this is a pretty big IF). So assuming that's the case, you just make sure to return the customerid as part of the solr-result (make sure the customerid is stored) (or get the customerid through other means, e.g.: look it up in a db based on the id of the doc returned). Finally, simply code the interleaving (for example: throw the results in something like Map<customerid, List<docid>> and iterate the map, so you get the first element of each list, then the 2nd, etc.) 2010/6/1 NarasimhaRaju rajux...@yahoo.com Can somebody throw some ideas on how to achieve (interleaving) from within the application, especially in a distributed setup? “ There are only 10 types of people in this world:- Those who understand binary and those who don’t “ Regards, P.N.Raju, From: Lance Norskog goks...@gmail.com To: solr-user@lucene.apache.org Sent: Sat, May 29, 2010 3:04:46 AM Subject: Re: Interleaving the results There is no interleaving tool. There is a random number tool. You will have to achieve this in your application. On Fri, May 28, 2010 at 8:23 AM, NarasimhaRaju rajux...@yahoo.com wrote: Hi, how to achieve custom ordering of the documents when there is a general query? Usecase: Interleave documents from different customers one after the other. Example: Say I have 10 documents in the index belonging to 3 customers (customer_id field in the index) and using query *:*, so all the documents in the results score the same. But I want the results to be interleaved: one document from each customer should appear before a document from the same customer repeats. Is there a way to achieve this? Thanks in advance R. -- Lance Norskog goks...@gmail.com
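A minimal sketch of the client-side interleaving described above (plain Java, no Solr dependency; the map-of-lists shape follows the Map<customerid, List<docid>> suggestion in the reply):

```java
import java.util.*;

public class Interleave {
    // Round-robin over the per-customer doc lists: one doc per customer
    // before any customer repeats, preserving each customer's own order.
    // A LinkedHashMap keeps the customer iteration order stable.
    static List<String> interleave(Map<String, List<String>> docsByCustomer) {
        List<String> out = new ArrayList<>();
        boolean tookOne = true;
        for (int i = 0; tookOne; i++) {       // i = position within each list
            tookOne = false;
            for (List<String> docs : docsByCustomer.values()) {
                if (i < docs.size()) {
                    out.add(docs.get(i));
                    tookOne = true;
                }
            }
        }
        return out;
    }
}
```

With customers c1:[a,b], c2:[c], c3:[d,e] this yields [a, c, d, b, e] — one doc from each customer before any customer repeats.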
Re: Sites with Innovative Presentation of Tags and Facets
NP ;-). Just to explain: with tooltips I meant JS-tooltips (not the native webbrowser tooltips); since sliders require JS anyway, presenting additional info in a JS-tooltip on drag doesn't limit the nr of people able to view it. I think this is OK from a usability standpoint since I don't consider the 'nr of items left' info 100% essential (after all, lots of sites do well without it at the moment). Call it graceful degradation ;-) As for mobile, I never realized that 'hover' is an issue on mobile, but 'on drag' is supported on mobile touch displays... Moreover, having a navigationally complex site like kayak.com / tripadvisor.com work well on mobile (from a usability perspective) is pretty much a utopia anyway. For these types of sites, specialized mobile sites (or apps, as is the case for the above brands) are the way to go in my opinion. Geert-Jan 2010/5/28 Mark Bennett mbenn...@ideaeng.com Haha! Important tooltips are now deprecated in Web Applications. This is nothing official, of course. But it's being advised to avoid important UI tasks that require cursor tracking, mouse-over, hovering, etc. in web applications. Why? Many touch-centric mobile devices don't support hover. For me, I'm used to my laptop where the touch pad or stylus *is* able to measure the pressure. But the finger-based touch devices generally can't differentiate it, I guess. They *can* tell one gesture from another, but only by looking at the timing and shape. And hapless hover ain't one of them. With that said, I'm still a fan of Tool Tips in desktop IDEs like Eclipse, or even in Web applications when I'm on a desktop. I guess the point is that, if it's a really important thing, then you need to expose it in another way on mobile. Just passing this on, please don't shoot the messenger. ;-) Mark -- Mark Bennett / New Idea Engineering, Inc.
/ mbenn...@ideaeng.com Direct: 408-733-0387 / Main: 866-IDEA-ENG / Cell: 408-829-6513 On Thu, May 27, 2010 at 2:55 PM, Geert-Jan Brits gbr...@gmail.com wrote: Perhaps you could show the 'nr of items left' as a tooltip of sorts when the user actually drags the slider. If the user doesn't drag (or hover over) the slider, 'nr of items left' isn't shown. Moreover, initially a slider doesn't limit the results, so 'nr of items left' shown for the slider would be the same as the overall number of items left (thereby being redundant). I must say I haven't seen this implemented, but it would be rather easy to adapt a slider implementation to show the nr on drag/hover. (they exist for jQuery, script.aculo.us and a bunch of other libs) Geert-Jan 2010/5/27 Lukas Kahwe Smith m...@pooteeweet.org On 27.05.2010, at 23:32, Geert-Jan Brits wrote: Something like sliders perhaps? Of course only numerical ranges can be put into sliders (or a concept that may be logically presented as some sort of ordering, such as bad, hmm, good, great). Use Solr's StatsComponent to show the min and max values. Have a look at tripadvisor.com for good uses/implementation of sliders (price and reviewscore are presented as sliders). my 2c: try to make the possible input values discrete (like at tripadvisor), which gives a better user experience and limits the potential nr of queries (cache-wise advantage) yeah I have been pondering something similar. but I now realized that this way the user doesn't get an overview of the distribution without actually applying the filter. that being said, it would be nice to display 3 numbers with the sliders: the count of items that were filtered out on the lower and upper boundaries, as well as the number of items still left (*). aside from this I just put a little tweak to my faceting online: http://search.un-informed.org/search?q=malaria&tm=any&s=Search if you deselect any of the checkboxes, it updates the counts.
however I display both the count without and with those additional checkbox filters applied (actually I only display two numbers if they are not the same): http://screencast.com/t/MWUzYWZkY2Yt regards, Lukas Kahwe Smith m...@pooteeweet.org (*) if anyone has a slider that can do the above I would love to integrate that and replace the adoption-year checkboxes with it
Re: Sites with Innovative Presentation of Tags and Facets
Interesting.. say you have a double slider with a discrete range (like tripadvisor et al.); perhaps it would be a good guideline to use these discrete points for the quantum interval of the sparkline as well? Of course it then becomes the question which discrete values to use for the slider. I tend to follow what tripadvisor does for its price-slider: set a cap for the max price, and set a fixed interval ($25) for the discrete steps. (Of course there are edge cases, like when no product hits the maximum capped price.) I have also seen non-linear steps implemented, but I guess this doesn't go well with the notion of sparklines. Anyway, from an implementation standpoint it would be enough for Solr to return the 'nr of items' per interval. From that, it would be easy to calculate on the application-side the 'nr of items' for each possible slider-combination. Getting these values from Solr would require (staying with the price-example): - a new discretised price field, and doing a facet.field - the (continuous) price field already present, and doing 50 facet queries (if you have 50 steps) - another more elegant way ;-). Perhaps an addition to StatsComponent that returns all counts within a discrete (to be specified) step? Would this slow the StatsComponent code down a lot, or is the info already (almost) present in StatsComponent for doing things like calculating stddev / means, etc? - something I'm completely missing... 2010/5/28 Chris Hostetter hossman_luc...@fucit.org : Perhaps you could show the 'nr of items left' as a tooltip of sorts when the : user actually drags the slider. Years ago, when we were first working on building Solr, a coworker of mine suggested using double bar sliders (ie: pick a range using a min and a max) for all numeric facets and putting sparklines above them to give the user a visual indication of the spread of documents across the numeric spectrum.
it was a little more complicated than anything we needed -- and seemed like a real pain in the ass to implement. I still don't know of anyone doing anything like that, but it's definitely an interesting idea. The hard part is really just deciding what quantum interval you want to use along the x-axis to decide how to count the docs for the y-axis. http://en.wikipedia.org/wiki/Sparkline http://www.edwardtufte.com/bboard/q-and-a-fetch-msg?msg_id=0001OR -Hoss
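The discretisation described in this thread (capped max price, fixed $25 steps) can be sketched in plain Java. Given the per-interval counts, the count for any slider combination is just a partial sum — no further Solr round-trips needed. Field and step values here are illustrative, not from the thread.

```java
public class PriceHistogram {
    // Count docs per fixed-size price bucket, with everything above
    // maxPrice collapsed into one capped overflow bucket (tripadvisor-style).
    static int[] histogram(double[] prices, double step, double maxPrice) {
        int buckets = (int) (maxPrice / step);
        int[] counts = new int[buckets + 1];   // last slot = capped overflow
        for (double p : prices) {
            counts[Math.min((int) (p / step), buckets)]++;
        }
        return counts;
    }

    // 'nr of items' for a slider range [lo, hi) expressed in bucket indices:
    // a simple partial sum over the precomputed histogram.
    static int rangeCount(int[] counts, int lo, int hi) {
        int n = 0;
        for (int i = lo; i < hi; i++) n += counts[i];
        return n;
    }
}
```

The histogram itself is what you would fetch from Solr (one count per interval, e.g. via facet queries); rangeCount then runs entirely on the application side while the user drags the slider.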
Re: Sites with Innovative Presentation of Tags and Facets
May I ask how you implemented getting the facet counts for each interval? Do you use a facet-query per interval? And perhaps, for inspiration, a link to the site where you implemented this? Thanks, Geert-Jan I love the idea of a sparkline at range-sliders. I think if I have time, I might add them to the range sliders on our site. I already have all the data, since I show the count for a range while the user is dragging by storing the facet counts for each interval in javascript.
Re: Sites with Innovative Presentation of Tags and Facets
Something like sliders perhaps? Of course only numerical ranges can be put into sliders (or a concept that may be logically presented as some sort of ordering, such as bad, hmm, good, great). Use Solr's StatsComponent to show the min and max values. Have a look at tripadvisor.com for good uses/implementation of sliders (price and reviewscore are presented as sliders). my 2c: try to make the possible input values discrete (like at tripadvisor), which gives a better user experience and limits the potential nr of queries (cache-wise advantage) Cheers, Geert-Jan 2010/5/27 Mark Bennett mbenn...@ideaeng.com I'm a big fan of plain old text facets (or tags), displayed in some logical order, perhaps with a bit of indenting to help convey context. But as you may have noticed, I don't rule the world. :-) Suppose you took the opposite approach, rendering facets in non-traditional ways that were still functional, and not ugly. Are there any public sites that come to mind that are displaying facets, tags, clusters, taxonomies or other navigators in really innovative ways? And what you liked / didn't like? Right now I'm just looking for examples of what's been tried. I suppose even bad examples might be educational. My future ideal wish list: * Stays out of the way (of casual users) * Looks clean and cool (to the power users) I'm thinking for example a light gray chevron that casual users don't notice, but when you click on it, cool things come up? * Probably that does not require Flash or Silverlight (just to avoid the whole platform wars) I guess that means Ajax or HTML5 * And since I'm doing pie in the sky, can be made to look good on desktops and mobile Some examples to get the ball rolling: StackOverflow, Flickr and YouTube, Clusty (now Yippy) are all nice, but a bit pedestrian for my mission today.
(grokker was cool too) Lucid has done a nice job with Facets and Solr: http://www.lucidimagination.com/search/ And although I really like it, it's not a flashy enough specimen for what I'm hunting today. (and they should thread the actual results list) I did some mockups of 2.0 style search navigators a couple years back: http://www.ideaeng.com/tabId/98/itemId/115/Search-20-in-the-Enterprise-Moving-Beyond-Singl.aspx Though these were intentionally NOT derived from specific web sites. Digg has done some cool stuff, for example: http://labs.digg.com/365/ http://labs.digg.com/arc/ http://labs.digg.com/stack/ But for what I'm after, these are a bit too far off of the searching for something in particular track. Google Image Swirl and Similar Images are interesting, but for images. Lots of other cool stuff at labs.google.com Amazon, NewEgg, etc are all fine, but again text based. TouchGraph has some cool stuff, though very non-linear (many others on this theme) http://www.touchgraph.com/TGGoogleBrowser.html http://www.touchgraph.com/navigator.html Cool articles on the subject: (some examples now offline) http://www.cs.umd.edu/class/spring2005/cmsc838s/viz4all/viz4all_a.html -- Mark Bennett / New Idea Engineering, Inc. / mbenn...@ideaeng.com Direct: 408-733-0387 / Main: 866-IDEA-ENG / Cell: 408-829-6513
Re: Sites with Innovative Presentation of Tags and Facets
Perhaps you could show the 'nr of items left' as a tooltip of sorts when the user actually drags the slider. If the user doesn't drag (or hover over) the slider, 'nr of items left' isn't shown. Moreover, initially a slider doesn't limit the results, so 'nr of items left' shown for the slider would be the same as the overall number of items left (thereby being redundant). I must say I haven't seen this implemented, but it would be rather easy to adapt a slider implementation to show the nr on drag/hover. (they exist for jQuery, script.aculo.us and a bunch of other libs) Geert-Jan 2010/5/27 Lukas Kahwe Smith m...@pooteeweet.org On 27.05.2010, at 23:32, Geert-Jan Brits wrote: Something like sliders perhaps? Of course only numerical ranges can be put into sliders (or a concept that may be logically presented as some sort of ordering, such as bad, hmm, good, great). Use Solr's StatsComponent to show the min and max values. Have a look at tripadvisor.com for good uses/implementation of sliders (price and reviewscore are presented as sliders). my 2c: try to make the possible input values discrete (like at tripadvisor), which gives a better user experience and limits the potential nr of queries (cache-wise advantage) yeah I have been pondering something similar. but I now realized that this way the user doesn't get an overview of the distribution without actually applying the filter. that being said, it would be nice to display 3 numbers with the sliders: the count of items that were filtered out on the lower and upper boundaries, as well as the number of items still left (*). aside from this I just put a little tweak to my faceting online: http://search.un-informed.org/search?q=malaria&tm=any&s=Search if you deselect any of the checkboxes, it updates the counts.
however I display both the count without and with those additional checkbox filters applied (actually I only display two numbers if they are not the same): http://screencast.com/t/MWUzYWZkY2Yt regards, Lukas Kahwe Smith m...@pooteeweet.org (*) if anyone has a slider that can do the above I would love to integrate that and replace the adoption-year checkboxes with it
Re: Personalized Search
Just want to throw this in: if you're worried about scaling, etc., you could take a look at item-based collaborative filtering instead of user-based. i.e.: DO NIGHTLY / BATCH: - calculate the similarity between items based on their properties DO ON EACH REQUEST: - have a user store/update its interest as a vector of item-properties. How to update this based on click / browse behavior is the interesting thing and depends a lot on your environment. - Next is to recommend 'neighboring' items that are close to the defined 'interest-vector'. The code is similar to user-based collab. filtering, but scaling is invariant to the nr of users. Other advantages: - new items/products can be recommended as soon as they are added to the catalog (no need for users to express interest in them before the item can be suggested) Disadvantage: - top-N results tend to be less dynamic than when using user-based collab. filtering. Of course, this doesn't touch on how to integrate this with Solr. Perhaps some combination with Mahout is indeed the best solution. I haven't given this much thought yet, I must say. For info on Mahout Taste (+ an explanation of item-based filtering vs. user-based filtering) see: http://lucene.apache.org/mahout/taste.html Cheers, Geert-Jan 2010/5/21 Rih tanrihae...@gmail.com - keep the SOLR index independent of bought/like - have a db table with user prefs on a per item basis I have the same idea this far. at query time, specify boosts for 'my items' items I believe this works if you want to sort results by faved/not faved. But how does it scale if users have already favorited/liked hundreds of items? The query can be quite long. Looking forward to your idea.
On Thu, May 20, 2010 at 6:37 PM, dc tech dctech1...@gmail.com wrote: Another approach would be to do query time boosts of 'my' items under the assumption that the count is limited: - keep the SOLR index independent of bought/like - have a db table with user prefs on a per item basis - at query time, specify boosts for 'my items' items We are planning to do this in the context of document management, where documents in 'my (used/favorited) folders' provide a boost factor to the results. On 5/20/10, findbestopensource findbestopensou...@gmail.com wrote: Hi Rih, Are you going to include either of the two fields bought or like per member/visitor OR a unique field per member/visitor? If one or two common fields are included then there will not be any impact on performance. If you want to include a unique field then you need to consider a multi-valued field, otherwise you will certainly hit the wall. Regards Aditya www.findbestopensource.com On Thu, May 20, 2010 at 12:13 PM, Rih tanrihae...@gmail.com wrote: Has anybody done personalized search with Solr? I'm thinking of including fields such as bought or like per member/visitor via dynamic fields to a product search schema. Another option is to have a multi-value field that can contain user IDs. What are the possible performance issues with this setup? Looking forward to your ideas. Rih -- Sent from my mobile device
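A toy sketch of the item-based step described above, in plain Java. Property vectors plus cosine similarity are one common choice for the item-item/item-interest distance, an assumption here, not something the thread prescribes; in practice Mahout Taste provides this machinery.

```java
import java.util.*;

public class ItemBasedRec {
    // Cosine similarity between two property vectors (0 if either is all-zero).
    static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na  += a[i] * a[i];
            nb  += b[i] * b[i];
        }
        return (na == 0 || nb == 0) ? 0 : dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    // Rank catalog items by similarity to the user's interest vector and
    // return the indices of the topN closest items.
    static List<Integer> recommend(double[][] items, double[] interest, int topN) {
        Integer[] idx = new Integer[items.length];
        for (int i = 0; i < items.length; i++) idx[i] = i;
        Arrays.sort(idx, (x, y) ->
            Double.compare(cosine(items[y], interest), cosine(items[x], interest)));
        return Arrays.asList(idx).subList(0, topN);
    }
}
```

Note the scaling property claimed in the reply: the item-item (or item-interest) math never touches the number of users, and a brand-new catalog item is recommendable as soon as its property vector exists.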
Re: seemingly impossible query
Would each Id need to return a different doc? If not: you could probably use FieldCollapsing: http://wiki.apache.org/solr/FieldCollapsing i.e.: - collapse on listOfIds (see wiki entry for syntax) - constrain the field to only return the ids you want, e.g.: q=listOfIds:10 OR listOfIds:5, ..., OR listOfIds:56 Geert-Jan 2010/5/20 Nagelberg, Kallin knagelb...@globeandmail.com Thanks Darren, The problem with that is that it may not return one document per id, which is what I need. IE, I could give 100 ids in that OR query and retrieve 100 documents, all containing just 1 of the IDs. -Kallin Nagelberg -Original Message- From: dar...@ontrenet.com [mailto:dar...@ontrenet.com] Sent: Thursday, May 20, 2010 12:21 PM To: solr-user@lucene.apache.org Subject: Re: seemingly impossible query Ok. I think I understand. What's impossible about this? If you have a single field named id that is multivalued then you can retrieve the documents with something like: id:1 OR id:2 OR id:56 ... id:100 then add limit 100. There's probably a more succinct way to do this, but I'll leave that to the experts. If you also only want the documents within a certain time, then you also create a time field and use a conjunction (id:0 ...) AND time:NOW-1H or something similar to this. Check the query syntax wiki for specifics. Darren Hey everyone, I've recently been given a requirement that is giving me some trouble. I need to retrieve up to 100 documents, but I can't see a way to do it without making 100 different queries. My schema has a multi-valued field like 'listOfIds'. Each document has between 0 and N of these ids associated to them. My input is up to 100 of these ids at random, and I need to retrieve the most recent document for each id (N Ids as input, N docs returned). I'm currently planning on doing a single query for each id, requesting 1 row, and caching the result. This could work OK since some of these ids should repeat quite often.
Of course I would prefer to find a way to do this in Solr, but I'm not sure it's capable. Any ideas? Thanks, -Kallin Nagelberg
Re: seemingly impossible query
Hi Kallin, again please look at FieldCollapsing (http://wiki.apache.org/solr/FieldCollapsing), that should do the trick. basically: first you constrain the field 'listOfIds' to only contain docs that contain any of the (up to) 100 random ids, as you know how to do. Next, in the same query, specify to collapse on field 'listOfIds'. basically: q=listOfIds:1 OR listOfIds:10 OR listOfIds:24&collapse.threshold=1&collapse.field=listOfIds&collapse.type=normal this would return the top-matching doc for each id left in listOfIds. Since you constrained this field by the ids specified, you are left with 1 matching doc for each id. Again, it is not guaranteed that all docs returned are different. Since you didn't specify this as a requirement, I think this will suffice. Cheers, Geert-Jan 2010/5/20 Nagelberg, Kallin knagelb...@globeandmail.com Yeah I need something like: (id:1 and maxhits:1) OR (id:2 and maxhits:1).. something crazy like that.. I'm not sure how I can hit solr once. If I do try and do them all in one big OR query then I'm probably not going to get a hit for each ID. I would need to request probably 1000 documents to find all 100 and even then there's no guarantee and no way of knowing how deep to go. -Kallin Nagelberg -Original Message- From: dar...@ontrenet.com [mailto:dar...@ontrenet.com] Sent: Thursday, May 20, 2010 12:27 PM To: solr-user@lucene.apache.org Subject: RE: seemingly impossible query I see. Well, now you're asking Solr to ignore its prime directive of returning hits that match a query. Hehe. I'm not sure if Solr has a unique attribute. But this sounds, to me, like you will have to filter the results yourself. But at least you hit Solr only once before doing so. Good luck! Thanks Darren, The problem with that is that it may not return one document per id, which is what I need. IE, I could give 100 ids in that OR query and retrieve 100 documents, all containing just 1 of the IDs.
-Kallin Nagelberg

-Original Message- From: dar...@ontrenet.com [mailto:dar...@ontrenet.com] Sent: Thursday, May 20, 2010 12:21 PM To: solr-user@lucene.apache.org Subject: Re: seemingly impossible query

Ok, I think I understand. What's impossible about this? If you have a single field named id that is multivalued, then you can retrieve the documents with something like: id:1 OR id:2 OR id:56 ... id:100, then add limit 100. There's probably a more succinct way to do this, but I'll leave that to the experts. If you also only want the documents within a certain time, then you also create a time field and use a conjunction, (id:0 ...) AND time:NOW-1H, or something similar to this. Check the query syntax wiki for specifics. Darren

Hey everyone, I've recently been given a requirement that is giving me some trouble. I need to retrieve up to 100 documents, but I can't see a way to do it without making 100 different queries. My schema has a multi-valued field like 'listOfIds'. Each document has between 0 and N of these ids associated with it. My input is up to 100 of these ids at random, and I need to retrieve the most recent document for each id (N ids as input, N docs returned). I'm currently planning on doing a single query for each id, requesting 1 row, and caching the result. This could work OK, since some of these ids should repeat quite often. Of course I would prefer to find a way to do this in Solr, but I'm not sure it's capable. Any ideas? Thanks, -Kallin Nagelberg
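The collapse query suggested above can be sketched as a small snippet that builds the request's query string. This is a hypothetical illustration: the `collapse.*` parameter names follow the patch-era FieldCollapsing wiki page and may differ in your Solr version.

```python
from urllib.parse import urlencode

# The (up to) 100 random input ids; three shown for brevity.
ids = [1, 10, 24]

# Constrain listOfIds to the input ids, then collapse on that same field
# so at most one document per id comes back.
params = [
    ("q", " OR ".join("listOfIds:%d" % i for i in ids)),
    ("collapse.threshold", "1"),
    ("collapse.field", "listOfIds"),
    ("collapse.type", "normal"),
]
query_string = urlencode(params)
print(query_string)
```

Appending `query_string` to your Solr select URL yields the single request described in the reply; as noted there, distinct documents per id are still not guaranteed.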
Re: limit rows by field
I believe you're talking about FieldCollapsing. It's available as a patch, although I'm not sure how well it applies to the current trunk. For more info check out: http://wiki.apache.org/solr/FieldCollapsing Geert-Jan

2010/4/13 Felix Zimmermann feliz...@gmx.de

Hi, for a preview of results, I need to display up to 3 documents per category. Is it possible to limit the number of rows of the Solr response by field values? What I mean is:

rows: 9
- (sub)rows of field:cat1 : 3
- (sub)rows of field:cat2 : 3
- (sub)rows of field:cat3 : 3

If not, is there a workaround, or do I have to send three queries? Thanks! felix
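For readers on newer Solr versions: result grouping (added to Solr itself in 3.3+) covers this use case without the patch. A sketch of the request, assuming a hypothetical `cat` field holding the category:

```python
from urllib.parse import urlencode

# Group results by category and return at most 3 documents per group,
# which matches the "3 rows per field value" requirement above.
params = [
    ("q", "*:*"),
    ("group", "true"),
    ("group.field", "cat"),
    ("group.limit", "3"),
    ("rows", "9"),
]
group_query = urlencode(params)
print(group_query)
```

Here `rows` caps the number of groups returned, while `group.limit` caps documents within each group.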
Re: Impossible Boost Query?
Have a look at function queries: http://wiki.apache.org/solr/FunctionQuery You could for instance use your regular score and multiply it with a random ValueSource bound between 1.0 and 1.1, for example. This would at least break ties in a possibly natural-looking manner. (btw: this would still influence all documents, however) //Geert-Jan

2010/3/26 Blargy zman...@hotmail.com

Ok, so this is basically just a random sort. Is there any way I can get this to randomly sort documents that are closely related and not the rest of the results? -- View this message in context: http://n3.nabble.com/Impossible-Boost-Query-tp472080p580214.html Sent from the Solr - User mailing list archive at Nabble.com.
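A concrete way to get a random tie-break (not the score multiplication described above, but a related technique) is Solr's `RandomSortField` used as a secondary sort. This sketch assumes the example schema's `random_*` dynamic field; documents with equal scores are then ordered pseudo-randomly by the seed in the field name:

```python
from urllib.parse import urlencode

# Primary sort by relevance; ties broken by a RandomSortField.
# Changing the seed (42) reshuffles the tie-broken order.
seed = 42
params = [
    ("q", "camaro"),
    ("sort", "score desc, random_%d asc" % seed),
]
tiebreak_query = urlencode(params)
print(tiebreak_query)
```

Unlike a pure random sort, this only reorders documents whose scores are already equal, which is closer to what the original poster asked for.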
Re: Multi Select Facets through Java API
Something like this?

q=mainquery&fq={!tag=carfq}cars:corvette OR cars:camaro&facet=on&facet.field={!ex=carfq key=carfacet}cars

- The facet 'carfacet' is independent of the filter query that filters on cars.
- You construct the filter query (fq={!tag=carfq}cars:corvette OR cars:camaro) yourself in your application layer.

Perhaps a disadvantage is that you get a lot of different filter queries which are all independently cached... I don't see any other way at the moment, though. Geert-Jan

2010/3/22 homerlex nab...@mlecza.newnetco.com bump - anyone? -- View this message in context: http://old.nabble.com/Multi-Select-Facets-through-Java-API-tp27951014p27986301.html Sent from the Solr - User mailing list archive at Nabble.com.
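The tag/exclude request above can be assembled programmatically; the local-params values (`{!tag=...}`, `{!ex=...}`) just need URL encoding like any other parameter value. The field and tag names here are the ones from the example above:

```python
from urllib.parse import urlencode

# Tag the filter query, then exclude that tag when computing the cars
# facet, so the facet counts ignore the user's own cars selection
# (classic multi-select faceting).
params = [
    ("q", "mainquery"),
    ("fq", "{!tag=carfq}cars:corvette OR cars:camaro"),
    ("facet", "on"),
    ("facet.field", "{!ex=carfq key=carfacet}cars"),
]
multiselect_query = urlencode(params)
print(multiselect_query)
```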
Re: Will Solr fit our needs?
If you don't plan on filtering, sorting and/or faceting on fast-changing fields, it would be better to store them outside of Solr/Lucene, in my opinion. If you must: for indexing-performance reasons you will probably end up maintaining separate indices (1 for slow-changing/static fields and 1 for fast-changing fields). You frequently commit the fast-changing index to incorporate the changes in current_price. Afterwards you have 2 options, I believe:

1. Use ParallelReader to query the separate indices directly. Afaik, this is not (completely) integrated in Solr... I wouldn't recommend it.
2. After you commit the fast-changing index, merge it with the static index. You're left with 1 fresh index, which you can push to your slave servers (all this in regular intervals).

Disadvantages:
- In any case, you must be very careful when maintaining multiple parallel indices with the purpose of treating them as one. For instance, document inserts must be done in exactly the same order, otherwise the indices go 'out of sync' and are unusable.
- Higher maintenance.
- There is always a time window in which the current_price values are stale. If that's within your requirements, that's ok.

The other path, which I recommend, would be to store the current_price outside of Solr (like you're currently doing), but instead of using a relational db, try looking into persistent key-value stores. Many of them exist, and a lot of progress has been made in the last couple of years. For simple key lookups (what you need, as far as I can tell) they really blow every relational db out of the water (considering the same hardware, of course). We're currently using Tokyo Cabinet with the server frontend Tokyo Tyrant and seeing almost a 5x increase in lookup performance compared to our previous kv-store, MemcacheDB, which is based on BerkeleyDB. MemcacheDB was already several times faster than our mysql setup (although not optimally tuned). To sum things up: use the best tools for what they were meant to do.
- index/search -- Solr/Lucene, without a doubt.
- kv-lookup -- consensus is still forming, and there are a lot of players (with a lot of different types of functionality), but if all you need is simple key-value lookup, I would go for Tokyo Cabinet (TC) / Tyrant at the moment.

Please note that TC and its competitors aren't just some code/hobby projects, but are usually born out of a real need at huge websites / social networks, such as TC, which was born from mixi (a big social network in Japan). So at least you're in good company. For kv-stores I would suggest beginning your research at:

http://www.metabrew.com/article/anti-rdbms-a-list-of-distributed-key-value-stores/ (beginning 2009)
http://randomfoo.net/2009/04/20/some-notes-on-distributed-key-stores (half 2009)

and get a feel of the kv playing field. Hope this (pretty long) post helps, Geert-Jan

2010/3/17 Krzysztof Grodzicki krzysztof.grodzi...@iterate.pl

Hi Moritz, you can take a look at the project ZOIE - http://code.google.com/p/zoie/. I think it's what you are looking for. br Krzysztof

On Wed, Mar 17, 2010 at 9:49 AM, Moritz Mädler m...@moritz-maedler.de wrote:

Hi List, we are running a marketplace which has roughly comparable functionality to eBay (auctions, fixed-price items, etc). The items are placed on the market by users who want to sell their goods. Currently we are using Sphinx as an indexing engine, but, as Sphinx returns only document ids, we have to make a database query to fetch the data to display. This massively decreases performance, as we have to do two requests to display data. I heard that Solr is able to return a complete dataset, and we hope a switch to Solr can boost performance. A critical question is left, and I was not able to find a solution for it in the docs: is it possible to update attributes directly in the index? An example for better illustration: we have an index which holds all the auctions (containing auctionid, auction title) with their current prices (field: current_price).
When a user places a new bid, is it possible to update the attribute 'current_price' directly in the index, so that we can fetch the current_price from Solr and not from the database? I hope you understood my problem. It would be kind if someone could point me in the right direction. Thanks a lot! Moritz
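The recommended split above (static fields in Solr, volatile current_price in a key-value store) amounts to one extra key lookup per result at render time. A minimal sketch, with the kv store stubbed as a dict; in production this would be, e.g., a Tokyo Tyrant or MemcacheDB client, and the document ids and titles here are invented:

```python
# Stand-in for the persistent key-value store keyed by auction id.
price_store = {"auction-1": 17.50, "auction-2": 42.00}

# Stand-in for documents returned by a Solr query (static fields only).
solr_results = [
    {"id": "auction-1", "title": "Vintage lamp"},
    {"id": "auction-2", "title": "Road bike"},
]

# Enrich each hit with the fresh price: one kv lookup per document,
# so Solr never has to be re-committed on every bid.
for doc in solr_results:
    doc["current_price"] = price_store.get(doc["id"])

print(solr_results)
```

The trade-off, as the post notes, is that you cannot filter or sort on current_price inside Solr with this design.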
Re: Implementing hierarchical facet
You could always define 1 dynamicField and encode the hierarchy level in the field name:

<dynamicField name="_loc_hier_*" type="string" stored="false" indexed="true" omitNorms="true"/>

using: facet=on&facet.field={!key=Location}_loc_hier_city&fq=_loc_hier_country:somecountryid ... Adding cityarea later, for instance, would be as simple as: facet=on&facet.field={!key=Location}_loc_hier_cityarea&fq=_loc_hier_city:somecityid Cheers, Geert-Jan

2010/3/3 Andy angelf...@yahoo.com

Thanks. I didn't know about the {!key=Location} trick. Thanks everyone for your help. From what I could gather, there are 3 approaches:

1) SOLR-64. Pros: can have arbitrary levels of hierarchy without modifying the schema. Cons: each combination of all the levels in the hierarchy will result in a separate filter cache entry. This number could be huge, which would lead to poor performance.
2) SOLR-792. Pros: each level of the hierarchy separately results in a filter cache entry. Much smaller number of cache entries; better performance. Cons: only 2 levels are supported.
3) Separate fields for each hierarchy level. Pros: same as SOLR-792; good performance. Cons: can only handle a fixed number of levels in the hierarchy. Adding any levels beyond that requires schema modification.

Does that sound right? Option 3 is probably the best match for my use case. Is there any trick to make it able to deal with an arbitrary number of levels? Thanks.

--- On Tue, 3/2/10, Geert-Jan Brits gbr...@gmail.com wrote: From: Geert-Jan Brits gbr...@gmail.com Subject: Re: Implementing hierarchical facet To: solr-user@lucene.apache.org Date: Tuesday, March 2, 2010, 8:02 PM

Using Solr 1.4: even fewer changes to the frontend: facet=on&facet.field={!key=Location}countryid ... facet=on&facet.field={!key=Location}cityid&fq=countryid:somecountryid etc. will consistently render the resulting facet under the name Location.
2010/3/3 Geert-Jan Brits gbr...@gmail.com

If it's a requirement to let Solr handle the facet hierarchy, please disregard this post, but an alternative would be to have your app control when to ask for which 'facet level' (e.g.: country, state, city) in the hierarchy. As follows, each doc has 3 separate fields (indexed=true, stored=false):

- countryid
- stateid
- cityid

facet on country: facet=on&facet.field=countryid

facet on state (country selected; functionally you probably don't want to show states without the user having selected a country anyway): facet=on&facet.field=stateid&fq=countryid:somecountryid

facet on city (state selected, same functional analogy as above): facet=on&facet.field=cityid&fq=stateid:somestateid

or facet on city (country selected, same functional analogy as above): facet=on&facet.field=cityid&fq=countryid:somecountryid

Grab the resulting facet and drop it under Location.

Pros:
- reusing fq's (good performance; I've never used hierarchical facets, but would be surprised if they have a (major) speed advantage over this method)
- flexible (you get multiple hierarchies: country -- state -- city and country -- city)

Cons:
- a little more application logic

Hope that helps, Geert-Jan

2010/3/2 Andy angelf...@yahoo.com

I read that a simple way to implement a hierarchical facet is to concatenate the level values into a single string with a separator character. A problem with this approach is that the number of facet values will greatly increase. For example, I have a facet Location with the hierarchy country -- state -- city. Using the above approach, every single city will lead to a separate facet value. With tens of thousands of cities in the world, the response from Solr will be huge. And then on the client side I'd have to loop through all the facet values and combine those with the same country into a single value. Ideally, Solr would be aware of the hierarchy structure and send back responses accordingly.
So at level 1, Solr will send back facet values based on country (100 or so values). At level 2, the facet values will be based on the states within the selected country (a few dozen values). The next level will be cities within that state, and so on. Is it possible to implement a hierarchical facet this way using Solr?
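The per-level faceting recipe from this thread (one facet request per hierarchy level, filtered by the parent selection, always keyed as "Location") can be wrapped in a small helper. A sketch, using the field names from the posts; the helper name is invented:

```python
from urllib.parse import urlencode

def location_facet_params(level_field, parent_fq=None):
    """Build the facet request for one level of the
    country -> state -> city hierarchy. The {!key=Location} local
    param keeps the frontend's facet name stable across levels."""
    params = [
        ("facet", "on"),
        ("facet.field", "{!key=Location}" + level_field),
    ]
    if parent_fq:
        # Constrain to the level selected one step up the hierarchy.
        params.append(("fq", parent_fq))
    return urlencode(params)

print(location_facet_params("countryid"))
print(location_facet_params("stateid", "countryid:somecountryid"))
print(location_facet_params("cityid", "stateid:somestateid"))
```

The application decides which level to request next after each user selection, which is exactly the "a little more application logic" trade-off the reply mentions.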
Re: Implementing hierarchical facet
If it's a requirement to let Solr handle the facet hierarchy, please disregard this post, but an alternative would be to have your app control when to ask for which 'facet level' (e.g.: country, state, city) in the hierarchy. As follows, each doc has 3 separate fields (indexed=true, stored=false):

- countryid
- stateid
- cityid

facet on country: facet=on&facet.field=countryid

facet on state (country selected; functionally you probably don't want to show states without the user having selected a country anyway): facet=on&facet.field=stateid&fq=countryid:somecountryid

facet on city (state selected, same functional analogy as above): facet=on&facet.field=cityid&fq=stateid:somestateid

or facet on city (country selected, same functional analogy as above): facet=on&facet.field=cityid&fq=countryid:somecountryid

Grab the resulting facet and drop it under Location.

Pros:
- reusing fq's (good performance; I've never used hierarchical facets, but would be surprised if they have a (major) speed advantage over this method)
- flexible (you get multiple hierarchies: country -- state -- city and country -- city)

Cons:
- a little more application logic

Hope that helps, Geert-Jan

2010/3/2 Andy angelf...@yahoo.com

I read that a simple way to implement a hierarchical facet is to concatenate the level values into a single string with a separator character. A problem with this approach is that the number of facet values will greatly increase. For example, I have a facet Location with the hierarchy country -- state -- city. Using the above approach, every single city will lead to a separate facet value. With tens of thousands of cities in the world, the response from Solr will be huge. And then on the client side I'd have to loop through all the facet values and combine those with the same country into a single value. Ideally, Solr would be aware of the hierarchy structure and send back responses accordingly. So at level 1, Solr will send back facet values based on country (100 or so values).
At level 2, the facet values will be based on the states within the selected country (a few dozen values). The next level will be cities within that state, and so on. Is it possible to implement a hierarchical facet this way using Solr?
Re: Implementing hierarchical facet
Using Solr 1.4: even fewer changes to the frontend: facet=on&facet.field={!key=Location}countryid ... facet=on&facet.field={!key=Location}cityid&fq=countryid:somecountryid etc. will consistently render the resulting facet under the name Location.

2010/3/3 Geert-Jan Brits gbr...@gmail.com

If it's a requirement to let Solr handle the facet hierarchy, please disregard this post, but an alternative would be to have your app control when to ask for which 'facet level' (e.g.: country, state, city) in the hierarchy. As follows, each doc has 3 separate fields (indexed=true, stored=false):

- countryid
- stateid
- cityid

facet on country: facet=on&facet.field=countryid

facet on state (country selected; functionally you probably don't want to show states without the user having selected a country anyway): facet=on&facet.field=stateid&fq=countryid:somecountryid

facet on city (state selected, same functional analogy as above): facet=on&facet.field=cityid&fq=stateid:somestateid

or facet on city (country selected, same functional analogy as above): facet=on&facet.field=cityid&fq=countryid:somecountryid

Grab the resulting facet and drop it under Location.

Pros:
- reusing fq's (good performance; I've never used hierarchical facets, but would be surprised if they have a (major) speed advantage over this method)
- flexible (you get multiple hierarchies: country -- state -- city and country -- city)

Cons:
- a little more application logic

Hope that helps, Geert-Jan

2010/3/2 Andy angelf...@yahoo.com

I read that a simple way to implement a hierarchical facet is to concatenate the level values into a single string with a separator character. A problem with this approach is that the number of facet values will greatly increase. For example, I have a facet Location with the hierarchy country -- state -- city. Using the above approach, every single city will lead to a separate facet value. With tens of thousands of cities in the world, the response from Solr will be huge.
And then on the client side I'd have to loop through all the facet values and combine those with the same country into a single value. Ideally, Solr would be aware of the hierarchy structure and send back responses accordingly. So at level 1, Solr will send back facet values based on country (100 or so values). At level 2, the facet values will be based on the states within the selected country (a few dozen values). The next level will be cities within that state, and so on. Is it possible to implement a hierarchical facet this way using Solr?