RE: How to 'filter' facet results
ManBearPig is still a threat.

-Kallin Nagelberg

-----Original Message-----
From: Jonathan Rochkind [mailto:rochk...@jhu.edu]
Sent: Tuesday, July 27, 2010 7:44 PM
To: solr-user@lucene.apache.org
Subject: RE: How to 'filter' facet results

Is there a way to tell Solr to only return a specific set of facet values? I feel like the facet query must be able to do this, but I'm not really understanding the facet query. In my specific case, I'd like to only see facet values for the same values I pass in as query filters, i.e. if I run this query:

fq=keyword:man OR keyword:bear OR keyword:pig
facet=on
facet.field=keyword

then I only want it to return the facet counts for man, bear, and pig. The resulting docs might have a number of different values for keyword, in addition.

For the general case of filtering facet values, I've wanted to do that too in more complex situations, and there is no good way I've found. For your very specific use case though, yeah, you can do it with facet.query. Leave out the facet.field, but instead:

facet.query=keyword:man
facet.query=keyword:bear
facet.query=keyword:pig

You'll get three facet.query results in the response, one each for man, bear, and pig. Solr behind the scenes will kind of do three separate 'sub-queries', one for each facet.query, but since the query itself should be cached, you shouldn't notice much difference. Especially if you have a warming query that facets on the keyword field (I'm never entirely sure when caches created by warming queries will be used by a facet.query, or if it depends on the facet method in use, but it can't hurt).

Jonathan
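For reference, a minimal SolrJ sketch of the facet.query approach Jonathan describes; the server URL and class wrapper are assumptions for illustration:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class FacetQueryExample {
        public static void main(String[] args) throws Exception {
            // Hypothetical Solr instance; adjust the URL for your setup.
            SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

            SolrQuery query = new SolrQuery("*:*");
            query.addFilterQuery("keyword:man OR keyword:bear OR keyword:pig");
            // One facet.query per value of interest, instead of facet.field=keyword.
            query.addFacetQuery("keyword:man");
            query.addFacetQuery("keyword:bear");
            query.addFacetQuery("keyword:pig");

            QueryResponse response = server.query(query);
            // Maps each facet.query string to its count, e.g. {keyword:man=42, ...}
            System.out.println(response.getFacetQuery());
        }
    }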
solrj occasional timeout on commit
Hey,

I recently moved a Solr app from a testing environment into a production environment, and I'm seeing a brand new error which never occurred during testing. I'm seeing this in the SolrJ-based app logs:

org.apache.solr.common.SolrException: com.caucho.vfs.SocketTimeoutException: client timeout
com.caucho.vfs.SocketTimeoutException: client timeout
request: http://somehost:8080/solr/live/update?wt=javabin&version=1
at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:424)
at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:243)
at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:105)

This occurs in a service that periodically adds new documents to Solr. There are 4 boxes that could be doing updates in parallel; in testing there were 2. We're running on a new Resin 4 based install in production, whereas we were using Resin 3 in testing. Does anyone have any ideas? Help would be greatly appreciated!

Thanks,
-Kallin Nagelberg
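One knob worth checking on the client side is the SolrJ socket timeout; a sketch of raising it (the URL and values are illustrative assumptions, and the Resin-side timeout may need tuning separately):

    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

    public class TimeoutSetup {
        public static void main(String[] args) throws Exception {
            // Hypothetical setup; adjust the URL and values for your environment.
            CommonsHttpSolrServer server =
                    new CommonsHttpSolrServer("http://somehost:8080/solr/live");
            server.setConnectionTimeout(5000); // ms to establish the connection
            server.setSoTimeout(60000);        // ms to wait on the socket, e.g. a slow commit
        }
    }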
RE: help with a schema design problem
I think you just want something like: p_value:Pramod AND p_type:Supplier, no?

-Kallin Nagelberg

-----Original Message-----
From: Pramod Goyal [mailto:pramod.go...@gmail.com]
Sent: Friday, July 23, 2010 2:17 PM
To: solr-user@lucene.apache.org
Subject: help with a schema design problem

Hi,

Let's say I have a table with 3 columns: Document id, Party Value and Party Type. In this table I have 3 rows.
1st row: Document id: 1, Party Value: Pramod, Party Type: Client.
2nd row: Document id: 1, Party Value: Raj, Party Type: Supplier.
3rd row: Document id: 2, Party Value: Pramod, Party Type: Supplier.

Now in this table, if I use SQL it is easy for me to find all documents with Party Value "Pramod" and Party Type "Client". I need to design a Solr schema so that I can do the same in Solr. If I create 2 fields in the Solr schema, Party Value and Party Type, both of them multi-valued, and query +Pramod +Supplier, then Solr will return me the first document, even though in the first document Pramod is a Client and not a Supplier.

Thanks,
Pramod Goyal
RE: help with a schema design problem
> When I search p_value:Pramod AND p_type:Supplier it would give me result as document 1. Which is incorrect, since in document 1 Pramod is a Client and not a Supplier.

Would it? I would expect it to give you nothing.

-Kal

-----Original Message-----
From: Geert-Jan Brits [mailto:gbr...@gmail.com]
Sent: Friday, July 23, 2010 5:05 PM
To: solr-user@lucene.apache.org
Subject: Re: help with a schema design problem

> Is there any way in solr to say p_value[someIndex]=pramod And p_type[someIndex]=client?

No, I'm 99% sure there is not.

> One way would be to define a single field in the schema as p_value_type = "client pramod", i.e. combine the value from both the fields and store it in a single field.

Yep, for the use-case you mentioned that would definitely work. Multi-valued of course, so it can contain "Supplier Raj" as well.

2010/7/23 Pramod Goyal pramod.go...@gmail.com

In my case the document id is the unique key (each row is not a unique document). So a single document has multiple Party Value and Party Type values. Hence I need to define both Party Value and Party Type as multi-valued. Is there any way in solr to say p_value[someIndex]=pramod And p_type[someIndex]=client? Is there any other way I can design my schema? I have some solutions, but none seems to be a good solution. One way would be to define a single field in the schema as p_value_type = "client pramod", i.e. combine the value from both the fields and store it in a single field.

On Sat, Jul 24, 2010 at 12:18 AM, Geert-Jan Brits gbr...@gmail.com wrote:

With the use-case you specified it should work to just index each row, as you described in your initial post, as a separate document. This way p_value and p_type each stay single-valued and you get a correct combination of p_value and p_type. However, this may not go so well with other use-cases you have in mind, e.g. requiring that no multiple results are returned with the same document id.

2010/7/23 Pramod Goyal pramod.go...@gmail.com

I want to do that. But if I understand correctly, in Solr it would store the fields like this:

p_value: Pramod Raj
p_type: Client Supplier

When I search p_value:Pramod AND p_type:Supplier it would give me result as document 1. Which is incorrect, since in document 1 Pramod is a Client and not a Supplier.

On Fri, Jul 23, 2010 at 11:52 PM, Nagelberg, Kallin knagelb...@globeandmail.com wrote:

I think you just want something like: p_value:Pramod AND p_type:Supplier, no?

-Kallin Nagelberg

-----Original Message-----
From: Pramod Goyal [mailto:pramod.go...@gmail.com]
Sent: Friday, July 23, 2010 2:17 PM
To: solr-user@lucene.apache.org
Subject: help with a schema design problem

Hi,

Let's say I have a table with 3 columns: Document id, Party Value and Party Type. In this table I have 3 rows.
1st row: Document id: 1, Party Value: Pramod, Party Type: Client.
2nd row: Document id: 1, Party Value: Raj, Party Type: Supplier.
3rd row: Document id: 2, Party Value: Pramod, Party Type: Supplier.

Now in this table, if I use SQL it is easy for me to find all documents with Party Value "Pramod" and Party Type "Client". I need to design a Solr schema so that I can do the same in Solr. If I create 2 fields in the Solr schema, Party Value and Party Type, both of them multi-valued, and query +Pramod +Supplier, then Solr will return me the first document, even though in the first document Pramod is a Client and not a Supplier.

Thanks,
Pramod Goyal
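A sketch of the combined-field idea from this thread in SolrJ; the doc_id field name and server URL are assumptions, and it relies on the field's analysis keeping each type/value pair adjacent so a phrase query can match it:

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class CombinedFieldExample {
        public static void main(String[] args) throws Exception {
            SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

            // One document per document id; each party is one "type value" pair.
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("doc_id", "1");
            doc.addField("p_value_type", "Client Pramod");
            doc.addField("p_value_type", "Supplier Raj");
            server.add(doc);
            server.commit();

            // A phrase query keeps the pairing intact, so document 1 matches
            // p_value_type:"Client Pramod" but not p_value_type:"Supplier Pramod".
        }
    }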
RE: faceted search with job title
Yeah, you should definitely just set up a custom parser for each site. It should be easy to extract the title using Groovy's XML parsing along with TagSoup for sloppy HTML. If you can't find the pattern on each site leading to the job title, how can you expect Solr to? Humans have the advantage here :P

-Kallin Nagelberg

-----Original Message-----
From: Savannah Beckett [mailto:savannah_becket...@yahoo.com]
Sent: Wednesday, July 21, 2010 12:20 PM
To: solr-user@lucene.apache.org
Cc: dave.sea...@magicalia.com
Subject: Re: faceted search with job title

mmm... there must be a better way... each job board has a different format. If there are constantly new job boards being crawled, I don't think I can manually look for the specific sequence of tags that leads to the job title. Most of them don't even have a class or id. There is no guarantee that the job title will be in the title tag, or a header tag. Something else can be in the title. Should I do this in a class that extends IndexFilter in Nutch? Thanks.

From: Dave Searle dave.sea...@magicalia.com
To: solr-user@lucene.apache.org solr-user@lucene.apache.org
Sent: Wed, July 21, 2010 8:42:55 AM
Subject: RE: faceted search with job title

You'd probably need to do some post processing on the pages and set up rules for each website to grab that specific bit of data. You could load the html into an xml parser, then use xpath to grab content from a particular tag with a class or id, based on the particular website.

-----Original Message-----
From: Savannah Beckett [mailto:savannah_becket...@yahoo.com]
Sent: 21 July 2010 16:38
To: solr-user@lucene.apache.org
Subject: faceted search with job title

Hi,

I am currently using nutch to crawl some job pages from job boards. They are in my solr index now. I want to do faceted search with the job titles. How? The job titles can be in any location on the page, e.g. title, header, content... If I use an indexfilter in Nutch to search the content for job titles, there are hundreds of thousands of job titles; I can't hard code them all. Do you have a better idea? I think I need the job title in a separate field in the index to make it work with solr faceted search, am I right? Thanks.
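A rough Java sketch of that per-site post-processing, using TagSoup plus XPath (the h1 rule is a per-site assumption you would tune by hand):

    import javax.xml.transform.Transformer;
    import javax.xml.transform.TransformerFactory;
    import javax.xml.transform.dom.DOMResult;
    import javax.xml.transform.sax.SAXSource;
    import javax.xml.xpath.XPath;
    import javax.xml.xpath.XPathFactory;
    import org.xml.sax.InputSource;

    public class JobTitleExtractor {
        public static String extractTitle(java.io.Reader html) throws Exception {
            // TagSoup turns messy real-world HTML into well-formed SAX events.
            org.ccil.cowan.tagsoup.Parser parser = new org.ccil.cowan.tagsoup.Parser();
            DOMResult dom = new DOMResult();
            Transformer identity = TransformerFactory.newInstance().newTransformer();
            identity.transform(new SAXSource(parser, new InputSource(html)), dom);

            // Per-site rule: here we simply assume the title sits in the first h1.
            // local-name() sidesteps the XHTML namespace TagSoup assigns.
            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate("//*[local-name()='h1']", dom.getNode());
        }
    }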
RE: how to eliminating scoring from a query?
How about:

1. Create a date field to indicate index time.
2. Use a date filter to restrict articles to today and yesterday, such as myindexdate:[NOW/DAY-1DAY TO NOW/DAY+1DAY].
3. Sort on that field.

-Kallin Nagelberg

-----Original Message-----
From: oferiko [mailto:ofer...@gmail.com]
Sent: Thursday, July 15, 2010 1:38 PM
To: solr-user@lucene.apache.org
Subject: Re: how to eliminating scoring from a query?

Thanks. I want it to be the indexing order, but with a limit: something like "everything that matches my query, and was indexed since yesterday, in ascending order".

Ofer

On Thu, Jul 15, 2010 at 8:25 PM, Erick Erickson [via Lucene] ml-node+970139-889457701-316...@n3.nabble.com wrote:

By specifying a sort that doesn't include score. I think it's just automatic then. It wouldn't make sense to eliminate scoring *without* sorting by some other field; you'd essentially get a random ordering.

Best
Erick

On Thu, Jul 15, 2010 at 1:43 AM, oferiko [hidden email] wrote:

In http://www.lucidimagination.com/files/file/LIWP_WhatsNew_Solr1.4.pdf under the performance section it mentions: "Queries that don't sort by score can eliminate scoring, which speeds up queries". How exactly can I do that? If I don't mention which sort I want, it automatically sorts by score desc.

thanks

--
View this message in context: http://lucene.472066.n3.nabble.com/how-to-eliminating-scoring-from-a-query-tp968581p970180.html
Sent from the Solr - User mailing list archive at Nabble.com.
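A SolrJ sketch of that recipe; the field name myindexdate mirrors the suggestion above, everything else is assumed:

    import org.apache.solr.client.solrj.SolrQuery;

    public class RecentFirstQuery {
        public static SolrQuery build(String userQuery) {
            SolrQuery query = new SolrQuery(userQuery);
            // Keep only documents indexed since the start of yesterday...
            query.addFilterQuery("myindexdate:[NOW/DAY-1DAY TO NOW]");
            // ...and order by index time, which also lets Solr skip scoring.
            query.setSortField("myindexdate", SolrQuery.ORDER.asc);
            return query;
        }
    }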
RE: limiting the total number of documents matched
So you want to take the top 1000 sorted by score, then sort those by another field. It's a strange case, and I can't think of a clean way to accomplish it. You could do it in two queries, where the first is by score and you only request your IDs to keep it snappy, then do a second query against the IDs and sort by your other field. 1000 seems like a lot for that approach, but who knows until you try it on your data.

-Kallin Nagelberg

-----Original Message-----
From: Paul [mailto:p...@nines.org]
Sent: Wednesday, July 14, 2010 4:16 PM
To: solr-user
Subject: limiting the total number of documents matched

I'd like to limit the total number of documents that are returned for a search, particularly when the sort order is not based on relevancy.

In other words, if the user searches for a very common term, they might get tens of thousands of hits, and if they sort by title, then very high relevancy documents will be interspersed with very low relevancy documents. I'd like to set a limit to the 1000 most relevant documents, then sort those by title. Is there a way to do this?

I guess I could always retrieve the top 1000 documents and sort them in the client, but that seems particularly inefficient. I can't find any other way to do this, though.

Thanks,
Paul
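A sketch of the two-query approach (the id and title field names are assumptions):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrDocument;

    public class TopThousandByTitle {
        public static QueryResponse query(SolrServer server, String q) throws Exception {
            // Query 1: top 1000 by relevance, fetching only the unique key.
            SolrQuery byScore = new SolrQuery(q);
            byScore.setRows(1000);
            byScore.setFields("id");
            QueryResponse top = server.query(byScore);

            // Query 2: restrict to those ids and re-sort by title.
            StringBuilder ids = new StringBuilder("id:(");
            for (SolrDocument doc : top.getResults()) {
                ids.append(doc.getFieldValue("id")).append(' ');
            }
            SolrQuery byTitle = new SolrQuery(ids.append(')').toString());
            byTitle.setRows(1000);
            byTitle.setSortField("title", SolrQuery.ORDER.asc);
            return server.query(byTitle);
        }
    }

Note that an id:(...) query with ~1000 terms brushes up against Solr's default maxBooleanClauses limit of 1024, which may need raising.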
RE: Help patching Solr
I'm pretty sure you need to be running the patch against a checkout of the trunk sources, not a generated .war file. Once you've done that you can use the build scripts to make a new war.

-Kallin Nagelberg

-----Original Message-----
From: Moazzam Khan [mailto:moazz...@gmail.com]
Sent: Tuesday, June 15, 2010 1:53 PM
To: solr-user@lucene.apache.org
Subject: Help patching Solr

Hey guys,

Does anyone know how to patch stuff in Windows? I am trying to patch Solr with patch SOLR-236 but it keeps erroring out with this message:

C:\solr\example\webapps>patch solr.war ..\..\SOLR-236-trunk.patch
patching file solr.war
Assertion failed: hunk, file ../patch-2.5.9-src/patch.c, line 354
This application has requested the Runtime to terminate it in an unusual way. Please contact the application's support team for more information.

Thanks in advance
Moazzam
RE: index growing with updates
OK, so I think that Solr (Lucene) will only remove deleted/updated documents from the disk after an optimize or after an 'expungeDeletes' request. Is there a way to trigger the expunsion (new word) across the entire index? I tried:

final UpdateRequest request = new UpdateRequest();
request.setParam("expungeDeletes", "true");
request.add(someOfMyDocs);
request.process(server);

But that didn't seem to do the trick, as I know I have about 7 gigs of documents that should be removed from the disk and the index size hasn't really budged. Any ideas?

Thanks,
Kallin Nagelberg

-----Original Message-----
From: Nagelberg, Kallin
Sent: Thursday, June 03, 2010 1:36 PM
To: 'solr-user@lucene.apache.org'
Subject: RE: index growing with updates

Is there a way to trigger a purge, or under what conditions does it occur?

-Kallin Nagelberg

-----Original Message-----
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Thursday, June 03, 2010 12:40 PM
To: solr-user@lucene.apache.org
Subject: Re: index growing with updates

Assuming your config is set up to replace unique keys, you're really doing a delete and an add (under the covers). It could very well be that the deleted version of the document is still in your index taking up space and will be until it is purged.

HTH
Erick

On Thu, Jun 3, 2010 at 10:22 AM, Nagelberg, Kallin knagelb...@globeandmail.com wrote:

Hey,

If I add a document to the index that already exists (same uniquekey) what is the expected behavior? I would imagine that if the document is the same then the index should not grow, but mine appears to be growing. Any ideas?

Thanks,
-Kallin Nagelberg
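For what it's worth, expungeDeletes is a flag on the commit command rather than on an add, so a sketch that attaches it to an explicit commit might look like the following. Whether the request parameter is honored this way depends on your Solr version (treat it as untested); an optimize is the sure way to reclaim the space:

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.request.AbstractUpdateRequest;
    import org.apache.solr.client.solrj.request.UpdateRequest;

    public class ExpungeCommit {
        public static void main(String[] args) throws Exception {
            SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
            UpdateRequest commit = new UpdateRequest();
            // COMMIT with waitFlush=true, waitSearcher=true
            commit.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
            commit.setParam("expungeDeletes", "true");
            commit.process(server);
        }
    }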
RE: index growing with updates
Is there a way to trigger a purge, or under what conditions does it occur?

-Kallin Nagelberg

-----Original Message-----
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Thursday, June 03, 2010 12:40 PM
To: solr-user@lucene.apache.org
Subject: Re: index growing with updates

Assuming your config is set up to replace unique keys, you're really doing a delete and an add (under the covers). It could very well be that the deleted version of the document is still in your index taking up space and will be until it is purged.

HTH
Erick

On Thu, Jun 3, 2010 at 10:22 AM, Nagelberg, Kallin knagelb...@globeandmail.com wrote:

Hey,

If I add a document to the index that already exists (same uniquekey) what is the expected behavior? I would imagine that if the document is the same then the index should not grow, but mine appears to be growing. Any ideas?

Thanks,
-Kallin Nagelberg
RE: general debugging techniques?
How much memory have you given Tomcat? The default is 64M, which is going to be really small for 5MB documents.

-----Original Message-----
From: jim.bl...@pbwiki.com [mailto:jim.bl...@pbwiki.com] On Behalf Of Jim Blomo
Sent: Thursday, June 03, 2010 2:05 PM
To: solr-user@lucene.apache.org
Subject: general debugging techniques?

I am new to debugging Java services, so I'm wondering what the best practices are for debugging solr on tomcat. I'm running into a few issues while building up my index, using the ExtractingRequestHandler to format the data from my sources. I can read through the catalina log, but this seems to just log requests; not much info is given about errors or when the service hangs. Here are some examples:

Some zip or Office formats uploaded to the extract requestHandler simply hang, with the jsvc process spinning at 100% CPU. I'm unclear where in the process the request is hanging. Did it make it through Tika? Is it attempting to index? The problem is often not reproducible after restarting tomcat and starting with the last failed document.

Although I am keeping document size under 5MB, I regularly see "SEVERE: java.lang.OutOfMemoryError: Java heap space" errors. How can I find what component had this problem?

After the above error, I often see this followup error on the next document: SEVERE: org.apache.lucene.store.LockObtainFailedException: Lock obtain timed out: NativeFSLock@/var/lib/solr/data/index/lucene-d6f7b3bf6fe64f362b4d45bfd4924f54-write.lock. This has a backtrace, so I could dive directly into the code. Is this the best way to track down the problem, or are there debugging settings that could help show why the lock is being held elsewhere?

I attempted to turn on index logging with the line <infoStream file="INFOSTREAM.txt">true</infoStream> but I can't seem to find this file in either the tomcat or the index directory.

I am using solr 3.1 with the patch to work with Tika 0.7.

Thanks for any tips,
Jim
RE: general debugging techniques?
That is still really small for 5MB documents. I think the default Solr document cache is 512 items, so you would need at least 3 GB of memory if you didn't change that and the cache filled up. Try disabling the document cache by removing the

<documentCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>

block from your solrconfig, or at least turn it down to like 5 documents.

-Kal

-----Original Message-----
From: jim.bl...@pbwiki.com [mailto:jim.bl...@pbwiki.com] On Behalf Of Jim Blomo
Sent: Thursday, June 03, 2010 2:29 PM
To: solr-user@lucene.apache.org
Subject: Re: general debugging techniques?

On Thu, Jun 3, 2010 at 11:17 AM, Nagelberg, Kallin knagelb...@globeandmail.com wrote:

> How much memory have you given tomcat? The default is 64M which is going to be really small for 5MB documents.

-Xmx128M - my understanding is that this bumps heap size to 128M. What is a reasonable size? Are there other memory flags I should specify?

Jim
RE: Storing different entities in Solr
Good read here: http://mysolr.com/tips/denormalized-data-structure/

Are consultation requests unique to each consultant? In that case you could represent the request as a JSON string and store it in a multi-valued string field for each consultant, though that makes querying against requests trickier. If you need to search against specific fields in the consultation requests, then you could try a schema where the consultant is your primary entity, with fields like consultantrequests-field1, consultantrequests-field2, consultantrequests-field3 and then one consultantrequests-fulljson, all multi-valued. You could query against the specific fields, then associate back to the whole request by searching the JSON object. It's an approach I've used with success.

-Kallin Nagelberg

-----Original Message-----
From: Moazzam Khan [mailto:moazz...@gmail.com]
Sent: Friday, May 28, 2010 12:17 PM
To: solr-user@lucene.apache.org
Subject: Storing different entities in Solr

Hi Guys,

Is there a way to store 2 types of things in Solr? We have a list of consultants and a list of consultation requests, and I want to store them as separate documents. Can I do this with one instance of Solr or do I have to have two instances?

Thanks,
Moazzam
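A sketch of the denormalized consultant document described above; all field values and the id scheme are invented for illustration:

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class ConsultantDocExample {
        public static void main(String[] args) throws Exception {
            SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

            SolrInputDocument consultant = new SolrInputDocument();
            consultant.addField("id", "consultant-42");
            consultant.addField("name", "Jane Doe");
            // Per-request searchable copies of the interesting fields...
            consultant.addField("consultantrequests-field1", "2010-05-28");
            consultant.addField("consultantrequests-field2", "phone");
            // ...plus the whole request as one JSON string for client-side reassembly.
            consultant.addField("consultantrequests-fulljson",
                    "{\"date\":\"2010-05-28\",\"channel\":\"phone\"}");
            server.add(consultant);
            server.commit();
        }
    }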
RE: Storing different entities in Solr
Multi-core is an option, but keep in mind if you go that route you will need to do two searches to correlate data between the two.

-Kallin Nagelberg

-----Original Message-----
From: Robert Zotter [mailto:robertzot...@gmail.com]
Sent: Friday, May 28, 2010 12:26 PM
To: solr-user@lucene.apache.org
Subject: Re: Storing different entities in Solr

Sounds like you'll want to use a multiple core setup. One core for each type of document.

http://wiki.apache.org/solr/CoreAdmin

--
View this message in context: http://lucene.472066.n3.nabble.com/Storing-different-entities-in-Solr-tp852299p852346.html
Sent from the Solr - User mailing list archive at Nabble.com.
RE: Storing different entities in Solr
I agree with Erick. Could you show us what these two entities look like, and the total count of each? That might shed some light on the appropriate approach.

-Kallin Nagelberg

-----Original Message-----
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Friday, May 28, 2010 2:36 PM
To: solr-user@lucene.apache.org
Subject: Re: Storing different entities in Solr

You most certainly *can* store the many-many relationship, you are just denormalizing your data. I know it goes against the grain of any good database admin, but it's very often a good solution for a search application. You've gotta forget almost everything you learned about how data *should* be stored in databases when working with a search app. Well, perhaps I'm overstating a bit, but you get the idea.

When I see messages about primary keys and foreign keys etc, I break out in hives. It's almost always a mistake to try to force Lucene/Solr to behave like a database. Whenever you find yourself trying, stop, take a deep breath, and think about searching <g>...

A lot depends on how much data we're talking about here. If fully denormalizing things would cost you 10M, who cares? If it would cost you 100G, it's a different story.

Best
Erick

On Fri, May 28, 2010 at 1:12 PM, Moazzam Khan moazz...@gmail.com wrote:

Thanks for all your answers guys. Requests and consultants have a many to many relationship, so I can't store request info in a document with advisorID as the primary key. Bill's solution and the multicore solution might be what I am looking for. Bill, will I be able to have 2 primary keys (so I can update and delete documents)? If yes, can you please give me a link or something where I can get more info on this?

Thanks,
Moazzam

On Fri, May 28, 2010 at 11:50 AM, Bill Au bill.w...@gmail.com wrote:

You can keep different types of documents in the same index if each document has a type field. You can restrict your searches to specific type(s) of document by using a filter query, which is very fast and efficient.

Bill

On Fri, May 28, 2010 at 12:28 PM, Nagelberg, Kallin knagelb...@globeandmail.com wrote:

Multi-core is an option, but keep in mind if you go that route you will need to do two searches to correlate data between the two.

-Kallin Nagelberg

-----Original Message-----
From: Robert Zotter [mailto:robertzot...@gmail.com]
Sent: Friday, May 28, 2010 12:26 PM
To: solr-user@lucene.apache.org
Subject: Re: Storing different entities in Solr

Sounds like you'll want to use a multiple core setup. One core for each type of document.

http://wiki.apache.org/solr/CoreAdmin

--
View this message in context: http://lucene.472066.n3.nabble.com/Storing-different-entities-in-Solr-tp852299p852346.html
Sent from the Solr - User mailing list archive at Nabble.com.
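A sketch of Bill's single-index approach with a type discriminator field (names assumed):

    import org.apache.solr.client.solrj.SolrQuery;

    public class TypeFilterExample {
        public static SolrQuery consultantsNamed(String name) {
            SolrQuery query = new SolrQuery("name:" + name);
            // The discriminator filter is cached separately, so it stays cheap
            // across queries that reuse it.
            query.addFilterQuery("type:consultant");
            return query;
        }
    }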
RE: Any realtime indexing plugin available for SOLR
I'm afraid nothing is completely 'real-time'. Even when doing your inserts on the database there is time taken for those operations to complete. Right now I have my Solr server autocommitting every 30 seconds, which is 'real-time' enough for me. You need to figure out what your threshold is, and then tune your index, hardware and caching to achieve it. If you don't want the results to show up in the database before the search, you could store an 'indexed' value in the DB which you flip after you've indexed the new data.

-Kallin Nagelberg

-----Original Message-----
From: bbarani [mailto:bbar...@gmail.com]
Sent: Wednesday, May 26, 2010 10:39 AM
To: solr-user@lucene.apache.org
Subject: Any realtime indexing plugin available for SOLR

Hi,

Sorry if I am asking this question again in this forum. Is there any plugin which I can use to do realtime indexing? I have a requirement where we have an application which sits on top of a SQL Server DB, and updates happen on a day to day basis. Users would like to see the changes made to the DB immediately in the search results. I am thinking of using a JMS queue for achieving this, but before that I just want to check if anyone has implemented a similar kind of requirement before.

Any help / suggestions would be greatly appreciated.

Thanks,
bb

--
View this message in context: http://lucene.472066.n3.nabble.com/Any-realtime-indexing-plugin-available-for-SOLR-tp845026p845026.html
Sent from the Solr - User mailing list archive at Nabble.com.
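For reference, a 30-second autocommit like the one mentioned is configured in solrconfig.xml along these lines (a sketch; it sits inside the updateHandler section):

    <updateHandler class="solr.DirectUpdateHandler2">
      <autoCommit>
        <!-- commit pending documents at most every 30 seconds -->
        <maxTime>30000</maxTime>
      </autoCommit>
    </updateHandler>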
RE: How real-time are Solr/Lucene queries?
Searching is very fast with Solr, but no way as fast as keying into a map. There is possibly disk I/O if your document isn't cached. Your situation sounds unique enough that I think you're going to need to prototype to see if it meets your demands. Figure out how 'fast' is 'fast' for your application, and then see if you can hit your targets. Once you have some real numbers and queries you'll be able to get more meaningful feedback from the community, I imagine.

-Kallin Nagelberg

-----Original Message-----
From: Thomas J. Buhr [mailto:t...@superstringmedia.com]
Sent: Wednesday, May 26, 2010 11:30 AM
To: solr-user@lucene.apache.org
Subject: Re: How real-time are Solr/Lucene queries?

What about my situation? My renderers need to query the index for fast access to layout and style info, as I already described about 3 messages ago on this thread. Another scenario is having automatic queries triggered as my MIDI player iterates through the model. As the player encounters trigger tags it needs to make a query quickly so that the next notes played will have the context they are meant to have.

Basically, I need to know that issuing searches to a local index will not be slower than searching a hashmap or array. How different or similar will the performance be?

Thom

On 2010-05-26, at 9:41 AM, Walter Underwood wrote:

On May 25, 2010, at 11:24 PM, Amit Nithian wrote:

> 2) What are typical/accepted definitions of Real Time vs Near Real Time?

Real time means that an update is available in the next query after it commits. Near real time means that the delay is small, but not zero. This is within a single server. In a cluster, there will be some communication delay.

> 3) I could understand POSTing a document to a server and then turning around and searching for it on the same server, but what about a replicated environment, and how do you prevent caches from being blown and constantly re-warmed (hence performance degradation)?

You need a different caching design, with transaction-aware caches that are at a lower level, closer to the indexes.

wunder
--
Walter Underwood
Lead Engineer
MarkLogic
RE: seemingly impossible query
I developed a solution to this problem and I thought I should share it in case someone encounters a similar problem.

Recap: My problem was that for every document in my index I needed to know if it was the most recent that contained an ID in a multi-valued field. Doing this for one ID is simple (id:${myId}, sort by date asc, rows=1). It is much more difficult to do this for a set of ids at the same time, in my case up to 100. If I try 'id:id1 OR id:id2 OR id:id3... sort=date asc rows=11' I may not get a match for every ID in my query. I.e., with a query of 100 unique IDs and 100 rows, I might only find 75 of those unique IDs in the response.

My solution is to pre-calculate this information. I created a new multi-valued field, mostRecentForIds, and store in that field all of the IDs for which this document is the most recent. Each ID will only appear once in the index in this field, allowing me to obtain my 100-unique-ID response when querying with 100 unique IDs. I also created a boolean field, 'isPostProcessed', which is set to false when a new doc is added. Then, on a cron, I select all documents with isPostProcessed:false, perform the precalculation logic on all the ids stored in the resultset, and set isPostProcessed to true.

The downside to this approach is that every document must be indexed twice. I could not perform the logic before the initial index, since there could be other unindexed documents in a forthcoming commit that would conflict. Hopefully someone finds this useful eventually!

-Kallin Nagelberg

-----Original Message-----
From: Nagelberg, Kallin [mailto:knagelb...@globeandmail.com]
Sent: Friday, May 21, 2010 4:44 PM
To: 'solr-user@lucene.apache.org'
Subject: RE: seemingly impossible query

I just realized something that may make the fieldcollapsing strategy insufficient. My 'ids' field is multi-valued. From what I've read you cannot field collapse on a multi-valued field. Any other ideas?

Thanks,
-Kallin Nagelberg

-----Original Message-----
From: Geert-Jan Brits [mailto:gbr...@gmail.com]
Sent: Thursday, May 20, 2010 1:03 PM
To: solr-user@lucene.apache.org
Subject: Re: seemingly impossible query

Hi Kallin,

Again, please look at FieldCollapsing (http://wiki.apache.org/solr/FieldCollapsing); that should do the trick. Basically: first you constrain the field 'listOfIds' to only contain docs that contain any of the (up to) 100 random ids, as you know how to do. Next, in the same query, specify to collapse on field 'listOfIds'. Basically:

q=listOfIds:1 OR listOfIds:10 OR listOfIds:24&collapse.threshold=1&collapse.field=listOfIds&collapse.type=normal

This would return the top-matching doc for each id left in listOfIds. Since you constrained this field by the ids specified, you are left with 1 matching doc for each id. Again, it is not guaranteed that all docs returned are different. Since you didn't specify this as a requirement I think this will suffice.

Cheers,
Geert-Jan

2010/5/20 Nagelberg, Kallin knagelb...@globeandmail.com

Yeah, I need something like: (id:1 and maxhits:1) OR (id:2 and maxhits:1)... something crazy like that. I'm not sure how I can hit Solr once. If I do try and do them all in one big OR query then I'm probably not going to get a hit for each ID. I would need to request probably 1000 documents to find all 100, and even then there's no guarantee and no way of knowing how deep to go.

-Kallin Nagelberg

-----Original Message-----
From: dar...@ontrenet.com [mailto:dar...@ontrenet.com]
Sent: Thursday, May 20, 2010 12:27 PM
To: solr-user@lucene.apache.org
Subject: RE: seemingly impossible query

I see. Well, now you're asking Solr to ignore its prime directive of returning hits that match a query. Hehe. I'm not sure if Solr has a unique attribute. But this sounds, to me, like you will have to filter the results yourself. But at least you hit Solr only once before doing so. Good luck!

> Thanks Darren, The problem with that is that it may not return one document per id, which is what I need. IE, I could give 100 ids in that OR query and retrieve 100 documents, all containing just 1 of the IDs.
>
> -Kallin Nagelberg

-----Original Message-----
From: dar...@ontrenet.com [mailto:dar...@ontrenet.com]
Sent: Thursday, May 20, 2010 12:21 PM
To: solr-user@lucene.apache.org
Subject: Re: seemingly impossible query

Ok. I think I understand. What's impossible about this? If you have a single field name called id that is multivalued, then you can retrieve the documents with something like: id:1 OR id:2 OR id:56 ... id:100, then add limit 100. There's probably a more succinct way to do this, but I'll leave that to the experts. If you also only want the documents within a certain time, then you also create a time field and use a conjunction (id:0 ...) AND time:NOW-1H or something similar to this. Check the query syntax wiki for specifics.

Darren

Hey everyone, I've recently been
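A sketch of what the cron's pre-calculation pass could look like in SolrJ; the batching, field copying and helper structure are assumptions, only the field names come from the post:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrDocument;
    import org.apache.solr.common.SolrInputDocument;

    public class PostProcessPass {
        public static void run(SolrServer server) throws Exception {
            // Cron entry point: pick up docs that haven't been post-processed yet.
            SolrQuery pending = new SolrQuery("isPostProcessed:false");
            pending.setRows(500);
            QueryResponse response = server.query(pending);

            for (SolrDocument doc : response.getResults()) {
                SolrInputDocument updated = new SolrInputDocument();
                // Re-adding replaces the old version (Solr 1.4 has no partial update).
                updated.addField("id", doc.getFieldValue("id"));
                // ... copy the other stored fields and add the computed
                //     mostRecentForIds values here ...
                updated.addField("isPostProcessed", true);
                server.add(updated);
            }
            server.commit();
        }
    }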
RE: seemingly impossible query
I just realized something that may make the fieldcollapsing strategy insufficient. My 'ids' field is multi-valued. From what I've read you cannot field collapse on a multi-valued field. Any other ideas?

Thanks,
-Kallin Nagelberg

-----Original Message-----
From: Geert-Jan Brits [mailto:gbr...@gmail.com]
Sent: Thursday, May 20, 2010 1:03 PM
To: solr-user@lucene.apache.org
Subject: Re: seemingly impossible query

Hi Kallin,

Again, please look at FieldCollapsing (http://wiki.apache.org/solr/FieldCollapsing); that should do the trick. Basically: first you constrain the field 'listOfIds' to only contain docs that contain any of the (up to) 100 random ids, as you know how to do. Next, in the same query, specify to collapse on field 'listOfIds'. Basically:

q=listOfIds:1 OR listOfIds:10 OR listOfIds:24&collapse.threshold=1&collapse.field=listOfIds&collapse.type=normal

This would return the top-matching doc for each id left in listOfIds. Since you constrained this field by the ids specified, you are left with 1 matching doc for each id. Again, it is not guaranteed that all docs returned are different. Since you didn't specify this as a requirement I think this will suffice.

Cheers,
Geert-Jan

2010/5/20 Nagelberg, Kallin knagelb...@globeandmail.com

Yeah, I need something like: (id:1 and maxhits:1) OR (id:2 and maxhits:1)... something crazy like that. I'm not sure how I can hit Solr once. If I do try and do them all in one big OR query then I'm probably not going to get a hit for each ID. I would need to request probably 1000 documents to find all 100, and even then there's no guarantee and no way of knowing how deep to go.

-Kallin Nagelberg

-----Original Message-----
From: dar...@ontrenet.com [mailto:dar...@ontrenet.com]
Sent: Thursday, May 20, 2010 12:27 PM
To: solr-user@lucene.apache.org
Subject: RE: seemingly impossible query

I see. Well, now you're asking Solr to ignore its prime directive of returning hits that match a query. Hehe. I'm not sure if Solr has a unique attribute. But this sounds, to me, like you will have to filter the results yourself. But at least you hit Solr only once before doing so. Good luck!

> Thanks Darren, The problem with that is that it may not return one document per id, which is what I need. IE, I could give 100 ids in that OR query and retrieve 100 documents, all containing just 1 of the IDs.
>
> -Kallin Nagelberg

-----Original Message-----
From: dar...@ontrenet.com [mailto:dar...@ontrenet.com]
Sent: Thursday, May 20, 2010 12:21 PM
To: solr-user@lucene.apache.org
Subject: Re: seemingly impossible query

Ok. I think I understand. What's impossible about this? If you have a single field name called id that is multivalued, then you can retrieve the documents with something like: id:1 OR id:2 OR id:56 ... id:100, then add limit 100. There's probably a more succinct way to do this, but I'll leave that to the experts. If you also only want the documents within a certain time, then you also create a time field and use a conjunction (id:0 ...) AND time:NOW-1H or something similar to this. Check the query syntax wiki for specifics.

Darren

Hey everyone,

I've recently been given a requirement that is giving me some trouble. I need to retrieve up to 100 documents, but I can't see a way to do it without making 100 different queries. My schema has a multi-valued field like 'listOfIds'. Each document has between 0 and N of these ids associated to them. My input is up to 100 of these ids at random, and I need to retrieve the most recent document for each id (N ids as input, N docs returned).

I'm currently planning on doing a single query for each id, requesting 1 row, and caching the result. This could work OK since some of these ids should repeat quite often. Of course I would prefer to find a way to do this in Solr, but I'm not sure it's capable. Any ideas?

Thanks,
-Kallin Nagelberg
field collapsing on multi-valued field
As I understand from looking at https://issues.apache.org/jira/login.jsp?os_destination=/browse/SOLR-236, field collapsing has been disabled on multi-valued fields. Is this really necessary?

Let's say I have a multi-valued field, 'my-mv-field'. I have a query like (my-mv-field:1 OR my-mv-field:5) that returns docs with the following values for 'my-mv-field':

Doc1: 1, 2, 3
Doc2: 1, 3
Doc3: 2, 4, 5, 6
Doc4: 1

If I collapse on that field with that query, I imagine it should mean 'collect the docs, starting from the top, so that I find 1 and 5'. In this case if it returned Doc1 and Doc3 I would be happy. There must be some ambiguity or implementation detail I am unaware of that is preventing this. This may be a critical piece of functionality for an application I'm working on, so I'm curious if there is a point in pursuing development of this functionality or if I am missing something.

Thanks,
Kallin Nagelberg
seemingly impossible query
Hey everyone,

I've recently been given a requirement that is giving me some trouble. I need to retrieve up to 100 documents, but I can't see a way to do it without making 100 different queries.

My schema has a multi-valued field like 'listOfIds'. Each document has between 0 and N of these ids associated to them. My input is up to 100 of these ids at random, and I need to retrieve the most recent document for each id (N ids as input, N docs returned).

I'm currently planning on doing a single query for each id, requesting 1 row, and caching the result. This could work OK since some of these ids should repeat quite often. Of course I would prefer to find a way to do this in Solr, but I'm not sure it's capable. Any ideas?

Thanks,
-Kallin Nagelberg
RE: Machine utilization while indexing
How about throwing a BlockingQueue (http://java.sun.com/j2se/1.5.0/docs/api/java/util/concurrent/BlockingQueue.html) between your document-creator and the Solr server? Give it a size of 10,000 or something, with one thread trying to feed it, and one thread waiting for it to get near full and then draining it. Take the drained results and add them to the server (maybe try not using StreamingUpdateSolrServer). Something like that worked well for me with about 5,000,000 documents, each ~5k, taking about 8 hours.

-Kallin Nagelberg

-----Original Message-----
From: Thijs [mailto:vonk.th...@gmail.com]
Sent: Thursday, May 20, 2010 11:02 AM
To: solr-user@lucene.apache.org
Subject: Machine utilization while indexing

Hi.

I have a question about how I can get Solr to index quicker than it does at the moment.

I have to index (and re-index) some 3-5 million documents. These documents are preprocessed by a Java application that effectively combines multiple database tables with each other to form the SolrInputDocument. What I'm seeing however is that the queue of documents that are ready to be sent to the Solr server exceeds my preset limit, telling me that Solr somehow can't process the documents fast enough. (I have created my own queue in front of Solrj.StreamingUpdateSolrServer as it would not process the documents fast enough, causing OutOfMemoryExceptions due to the large number of documents building up in its queue.)

I have an index that for 95% consists of IDs (Long). We don't do any analysis on the fields that are being indexed. The schema is rather straightforward. Most fields look like:

<fieldType name="long" class="solr.LongField" omitNorms="true"/>
<field name="objectId" type="long" stored="true" indexed="true" required="true"/>
<field name="listId" type="long" stored="false" indexed="true" multiValued="true"/>

The relevant solrconfig.xml:

<indexDefaults>
  <useCompoundFile>false</useCompoundFile>
  <mergeFactor>100</mergeFactor>
  <RAMBufferSizeMB>256</RAMBufferSizeMB>
  <maxMergeDocs>2147483647</maxMergeDocs>
  <maxFieldLength>1</maxFieldLength>
  <writeLockTimeout>1000</writeLockTimeout>
  <commitLockTimeout>1</commitLockTimeout>
  <lockType>single</lockType>
</indexDefaults>

The machines I'm testing on have an Intel(R) Core(TM)2 Quad CPU Q9550 @ 2.83GHz with 4GB of RAM, running on Linux, Java version 1.6.0_17, Tomcat 6 and Solr version 1.4.

What I'm seeing is that the network almost never reaches more than 10% of the 1GB/s connection, that the CPU utilization is always below 25% (1 core is used, not the others), and I don't see heavy disk I/O. Also, while indexing, the memory consumption is:

Free memory: 212.15 MB
Total memory: 509.12 MB
Max memory: 2730.68 MB

And in the beginning (with an empty index) I get 2ms per insert, but this slows to 18-19ms per insert.

Are there any tips/tricks I can use to speed up my indexing? Because I have a feeling that my machine is capable of doing more (using more CPUs). I just can't figure out how.

Thijs
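A rough sketch of the suggested queue arrangement (the batch size, capacity and helper names are arbitrary assumptions):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class IndexingPipeline {
        public static void main(String[] args) throws Exception {
            final SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
            final BlockingQueue<SolrInputDocument> queue =
                    new ArrayBlockingQueue<SolrInputDocument>(10000);

            // Consumer: wait for documents, then drain and send a batch in one add() call.
            new Thread(new Runnable() {
                public void run() {
                    try {
                        List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
                        while (true) {
                            batch.add(queue.take());   // block until at least one doc arrives
                            queue.drainTo(batch, 999); // then grab up to 999 more
                            server.add(batch);
                            batch.clear();
                        }
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                }
            }).start();

            // Producer side: put() blocks while the queue is full, throttling creation.
            // queue.put(buildNextDocument()); // buildNextDocument() is hypothetical
        }
    }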
RE: Machine utilization while indexing
Well, to be fair, I'm indexing on a modest virtualized machine with only 2 gigs of RAM, and a doc size of 5-10k, maybe substantially larger than what you have. They could be substantially smaller too. As another point of reference, my index ends up being about 20 gigs with the 5 million docs.

I should also point out I only need to do this once; I'm not constantly reindexing everything. My indexed documents rarely change, and when they do we have a process that selectively updates those few that need it. Combine that with a constant trickle of new documents, and indexing performance isn't much of a concern. You should be able to experiment with a small subset of your documents to speedily test new schemas, etc. In my case I selected a representative sample and store them in my project for unit testing.

-Kallin Nagelberg

-----Original Message-----
From: Dennis Gearon [mailto:gear...@sbcglobal.net]
Sent: Thursday, May 20, 2010 11:25 AM
To: solr-user@lucene.apache.org
Subject: RE: Machine utilization while indexing

It takes that long to do indexing? I'm HOPING to have a site that has low 10's of millions of documents to billions. Sounds to me like I will DEFINITELY need a cloud account at indexing time. For the original author of this thread, that's what I'd recommend:

1. Optimize as best as you can on one machine.
2. Set up an Amazon EC (Elastic Cloud) account. Spawn/shard the indexing over 5-10 machines during indexing. Combine the index, shut down the EC instances.

Probably could get it down to 1/2 hour, without impacting your current queries.

Dennis Gearon

Signature Warning: EARTH has a Right To Life, otherwise we all die. Read 'Hot, Flat, and Crowded'. Laugh at http://www.yert.com/film.php

--- On Thu, 5/20/10, Nagelberg, Kallin knagelb...@globeandmail.com wrote:

From: Nagelberg, Kallin knagelb...@globeandmail.com
Subject: RE: Machine utilization while indexing
To: 'solr-user@lucene.apache.org' solr-user@lucene.apache.org
Date: Thursday, May 20, 2010, 8:16 AM

How about throwing a BlockingQueue (http://java.sun.com/j2se/1.5.0/docs/api/java/util/concurrent/BlockingQueue.html) between your document-creator and the Solr server? Give it a size of 10,000 or something, with one thread trying to feed it, and one thread waiting for it to get near full and then draining it. Take the drained results and add them to the server (maybe try not using StreamingUpdateSolrServer). Something like that worked well for me with about 5,000,000 documents, each ~5k, taking about 8 hours.

-Kallin Nagelberg

-----Original Message-----
From: Thijs [mailto:vonk.th...@gmail.com]
Sent: Thursday, May 20, 2010 11:02 AM
To: solr-user@lucene.apache.org
Subject: Machine utilization while indexing

Hi.

I have a question about how I can get Solr to index quicker than it does at the moment.

I have to index (and re-index) some 3-5 million documents. These documents are preprocessed by a Java application that effectively combines multiple database tables with each other to form the SolrInputDocument. What I'm seeing however is that the queue of documents that are ready to be sent to the Solr server exceeds my preset limit, telling me that Solr somehow can't process the documents fast enough. (I have created my own queue in front of Solrj.StreamingUpdateSolrServer as it would not process the documents fast enough, causing OutOfMemoryExceptions due to the large number of documents building up in its queue.)

I have an index that for 95% consists of IDs (Long). We don't do any analysis on the fields that are being indexed. The schema is rather straightforward. Most fields look like:

<fieldType name="long" class="solr.LongField" omitNorms="true"/>
<field name="objectId" type="long" stored="true" indexed="true" required="true"/>
<field name="listId" type="long" stored="false" indexed="true" multiValued="true"/>

The relevant solrconfig.xml:

<indexDefaults>
  <useCompoundFile>false</useCompoundFile>
  <mergeFactor>100</mergeFactor>
  <RAMBufferSizeMB>256</RAMBufferSizeMB>
  <maxMergeDocs>2147483647</maxMergeDocs>
  <maxFieldLength>1</maxFieldLength>
  <writeLockTimeout>1000</writeLockTimeout>
  <commitLockTimeout>1</commitLockTimeout>
  <lockType>single</lockType>
</indexDefaults>

The machines I'm testing on have an Intel(R) Core(TM)2 Quad CPU Q9550 @ 2.83GHz with 4GB of RAM, running on Linux, Java version 1.6.0_17, Tomcat 6 and Solr version 1.4.

What I'm seeing is that the network almost never reaches more than 10% of the 1GB/s connection, that the CPU utilization is always below 25% (1 core is used, not the others), and I don't see heavy disk I/O. Also, while indexing, the memory consumption is:

Free memory: 212.15 MB
Total memory: 509.12 MB
Max memory: 2730.68 MB

And in the beginning (with an empty index) I get 2ms per insert, but this slows to 18-19ms per insert.

Are there any tips/tricks I can use to speed up my indexing?
RE: Machine utilization while indexing
You're sure it's not blocking on indexing I/O? If not, then I guess it must be a thread waiting unnecessarily in Solr or your loading program. To get my loader running at full speed I hooked it up to JProfiler's thread views to see where the stalls were, and optimized from there.

-Kallin Nagelberg

-----Original Message-----
From: Thijs [mailto:vonk.th...@gmail.com]
Sent: Thursday, May 20, 2010 11:25 AM
To: solr-user@lucene.apache.org
Subject: Re: Machine utilization while indexing

I already have a blockingqueue in place (that's my custom queue), and luckily I'm indexing faster than what you're doing. Currently it takes about 2 hours to index the 5M documents I'm talking about. But I still feel as if my machine is under-utilized.

Thijs

On 20-5-2010 17:16, Nagelberg, Kallin wrote:

How about throwing a BlockingQueue (http://java.sun.com/j2se/1.5.0/docs/api/java/util/concurrent/BlockingQueue.html) between your document-creator and the Solr server? Give it a size of 10,000 or something, with one thread trying to feed it, and one thread waiting for it to get near full and then draining it. Take the drained results and add them to the server (maybe try not using StreamingUpdateSolrServer). Something like that worked well for me with about 5,000,000 documents, each ~5k, taking about 8 hours.

-Kallin Nagelberg

-----Original Message-----
From: Thijs [mailto:vonk.th...@gmail.com]
Sent: Thursday, May 20, 2010 11:02 AM
To: solr-user@lucene.apache.org
Subject: Machine utilization while indexing

Hi.

I have a question about how I can get Solr to index quicker than it does at the moment.

I have to index (and re-index) some 3-5 million documents. These documents are preprocessed by a Java application that effectively combines multiple database tables with each other to form the SolrInputDocument. What I'm seeing however is that the queue of documents that are ready to be sent to the Solr server exceeds my preset limit, telling me that Solr somehow can't process the documents fast enough. (I have created my own queue in front of Solrj.StreamingUpdateSolrServer as it would not process the documents fast enough, causing OutOfMemoryExceptions due to the large number of documents building up in its queue.)

I have an index that for 95% consists of IDs (Long). We don't do any analysis on the fields that are being indexed. The schema is rather straightforward. Most fields look like:

<fieldType name="long" class="solr.LongField" omitNorms="true"/>
<field name="objectId" type="long" stored="true" indexed="true" required="true"/>
<field name="listId" type="long" stored="false" indexed="true" multiValued="true"/>

The relevant solrconfig.xml:

<indexDefaults>
  <useCompoundFile>false</useCompoundFile>
  <mergeFactor>100</mergeFactor>
  <RAMBufferSizeMB>256</RAMBufferSizeMB>
  <maxMergeDocs>2147483647</maxMergeDocs>
  <maxFieldLength>1</maxFieldLength>
  <writeLockTimeout>1000</writeLockTimeout>
  <commitLockTimeout>1</commitLockTimeout>
  <lockType>single</lockType>
</indexDefaults>

The machines I'm testing on have an Intel(R) Core(TM)2 Quad CPU Q9550 @ 2.83GHz with 4GB of RAM, running on Linux, Java version 1.6.0_17, Tomcat 6 and Solr version 1.4.

What I'm seeing is that the network almost never reaches more than 10% of the 1GB/s connection, that the CPU utilization is always below 25% (1 core is used, not the others), and I don't see heavy disk I/O. Also, while indexing, the memory consumption is:

Free memory: 212.15 MB
Total memory: 509.12 MB
Max memory: 2730.68 MB

And in the beginning (with an empty index) I get 2ms per insert, but this slows to 18-19ms per insert.

Are there any tips/tricks I can use to speed up my indexing? Because I have a feeling that my machine is capable of doing more (using more CPUs). I just can't figure out how.

Thijs
RE: seemingly impossible query
Thanks Darren. The problem with that is that it may not return one document per id, which is what I need. IE, I could give 100 ids in that OR query and retrieve 100 documents, all containing just 1 of the IDs.

-Kallin Nagelberg

-----Original Message-----
From: dar...@ontrenet.com [mailto:dar...@ontrenet.com]
Sent: Thursday, May 20, 2010 12:21 PM
To: solr-user@lucene.apache.org
Subject: Re: seemingly impossible query

Ok. I think I understand. What's impossible about this? If you have a single field name called id that is multivalued, then you can retrieve the documents with something like: id:1 OR id:2 OR id:56 ... id:100, then add limit 100. There's probably a more succinct way to do this, but I'll leave that to the experts. If you also only want the documents within a certain time, then you also create a time field and use a conjunction (id:0 ...) AND time:NOW-1H or something similar to this. Check the query syntax wiki for specifics.

Darren

> Hey everyone,
>
> I've recently been given a requirement that is giving me some trouble. I need to retrieve up to 100 documents, but I can't see a way to do it without making 100 different queries. My schema has a multi-valued field like 'listOfIds'. Each document has between 0 and N of these ids associated to them. My input is up to 100 of these ids at random, and I need to retrieve the most recent document for each id (N ids as input, N docs returned). I'm currently planning on doing a single query for each id, requesting 1 row, and caching the result. This could work OK since some of these ids should repeat quite often. Of course I would prefer to find a way to do this in Solr, but I'm not sure it's capable. Any ideas?
>
> Thanks,
> -Kallin Nagelberg
RE: seemingly impossible query
Yeah, I need something like: (id:1 and maxhits:1) OR (id:2 and maxhits:1)... something crazy like that. I'm not sure how I can hit Solr once. If I do try and do them all in one big OR query then I'm probably not going to get a hit for each ID. I would need to request probably 1000 documents to find all 100, and even then there's no guarantee and no way of knowing how deep to go.

-Kallin Nagelberg

-----Original Message-----
From: dar...@ontrenet.com [mailto:dar...@ontrenet.com]
Sent: Thursday, May 20, 2010 12:27 PM
To: solr-user@lucene.apache.org
Subject: RE: seemingly impossible query

I see. Well, now you're asking Solr to ignore its prime directive of returning hits that match a query. Hehe. I'm not sure if Solr has a unique attribute. But this sounds, to me, like you will have to filter the results yourself. But at least you hit Solr only once before doing so. Good luck!

> Thanks Darren, The problem with that is that it may not return one document per id, which is what I need. IE, I could give 100 ids in that OR query and retrieve 100 documents, all containing just 1 of the IDs.
>
> -Kallin Nagelberg

-----Original Message-----
From: dar...@ontrenet.com [mailto:dar...@ontrenet.com]
Sent: Thursday, May 20, 2010 12:21 PM
To: solr-user@lucene.apache.org
Subject: Re: seemingly impossible query

Ok. I think I understand. What's impossible about this? If you have a single field name called id that is multivalued, then you can retrieve the documents with something like: id:1 OR id:2 OR id:56 ... id:100, then add limit 100. There's probably a more succinct way to do this, but I'll leave that to the experts. If you also only want the documents within a certain time, then you also create a time field and use a conjunction (id:0 ...) AND time:NOW-1H or something similar to this. Check the query syntax wiki for specifics.

Darren

> Hey everyone,
>
> I've recently been given a requirement that is giving me some trouble. I need to retrieve up to 100 documents, but I can't see a way to do it without making 100 different queries. My schema has a multi-valued field like 'listOfIds'. Each document has between 0 and N of these ids associated to them. My input is up to 100 of these ids at random, and I need to retrieve the most recent document for each id (N ids as input, N docs returned). I'm currently planning on doing a single query for each id, requesting 1 row, and caching the result. This could work OK since some of these ids should repeat quite often. Of course I would prefer to find a way to do this in Solr, but I'm not sure it's capable. Any ideas?
>
> Thanks,
> -Kallin Nagelberg
RE: seemingly impossible query
Thanks, I'm going to take a look at fieldcollapsingquery as it seems like it should do the trick! -Kallin Nagelberg -Original Message- From: Geert-Jan Brits [mailto:gbr...@gmail.com] Sent: Thursday, May 20, 2010 1:03 PM To: solr-user@lucene.apache.org Subject: Re: seemingly impossible query Hi Kallin, again please look at FieldCollapsinghttp://wiki.apache.org/solr/FieldCollapsing , that should do the trick. basically: first you constrain the field: 'listOfIds' to only contain docs that contain any of the (up to) 100 random ids as you know how to do Next, in the same query, specify to collapse on field 'listOfIds ' basically: q=listOfIds:1 OR listOfIds:10 OR listOfIds:24 collapse.threshold=1collapse.field=listOfIdscollapse.type=normal this would return the top-matching doc for each id left in listOfIds. Since you constrained this field by the ids specified you are left with 1 matching doc for each id. Again it is not guarenteed that all docs returned are different. Since you didn't specify this as a requirement I think this will suffics. Cheers, Geert-Jan 2010/5/20 Nagelberg, Kallin knagelb...@globeandmail.com Yeah I need something like: (id:1 and maxhits:1) OR (id:2 and maxits:1).. something crazy like that.. I'm not sure how I can hit solr once. If I do try and do them all in one big OR query then I'm probably not going to get a hit for each ID. I would need to request probably 1000 documents to find all 100 and even then there's no guarantee and no way of knowing how deep to go. -Kallin Nagelberg -Original Message- From: dar...@ontrenet.com [mailto:dar...@ontrenet.com] Sent: Thursday, May 20, 2010 12:27 PM To: solr-user@lucene.apache.org Subject: RE: seemingly impossible query I see. Well, now you're asking Solr to ignore its prime directive of returning hits that match a query. Hehe. I'm not sure if Solr has a unique attribute. But this sounds, to me, like you will have to filter the results yourself. But at least you hit Solr only once before doing so. Good luck! Thanks Darren, The problem with that is that it may not return one document per id, which is what I need. IE, I could give 100 ids in that OR query and retrieve 100 documents, all containing just 1 of the IDs. -Kallin Nagelberg -Original Message- From: dar...@ontrenet.com [mailto:dar...@ontrenet.com] Sent: Thursday, May 20, 2010 12:21 PM To: solr-user@lucene.apache.org Subject: Re: seemingly impossible query Ok. I think I understand. What's impossible about this? If you have a single field name called id that is multivalued then you can retrieved the documents with something like: id:1 OR id:2 OR id:56 ... id:100 then add limit 100. There's probably a more succinct way to do this, but I'll leave that to the experts. If you also only want the documents within a certain time, then you also create a time field and use a conjunction (id:0 ...) AND time:NOW-1H or something similar to this. Check the query syntax wiki for specifics. Darren Hey everyone, I've recently been given a requirement that is giving me some trouble. I need to retrieve up to 100 documents, but I can't see a way to do it without making 100 different queries. My schema has a multi-valued field like 'listOfIds'. Each document has between 0 and N of these ids associated to them. My input is up to 100 of these ids at random, and I need to retrieve the most recent document for each id (N Ids as input, N docs returned). I'm currently planning on doing a single query for each id, requesting 1 row, and caching the result. 
This could work OK since some of these ids should repeat quite often. Of course I would prefer to find a way to do this in Solr, but I'm not sure it's capable. Any ideas? Thanks, -Kallin Nagelberg
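For illustration, here is what that collapsing query might look like from SolrJ once the SOLR-236 field-collapsing patch is applied; the collapse.* parameter names come from the wiki page above, while the three IDs and the 'server' variable are placeholders:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.response.QueryResponse;

// collapse.* are patch-provided parameters, not part of stock Solr 1.4,
// so they are set as raw request params rather than typed setters
// (exception handling omitted)
SolrQuery query = new SolrQuery("listOfIds:1 OR listOfIds:10 OR listOfIds:24");
query.set("collapse.field", "listOfIds");
query.set("collapse.threshold", "1");
query.set("collapse.type", "normal");
QueryResponse rsp = server.query(query); // at most one top-ranked doc per collapsed value

This is only a sketch under those assumptions; check the wiki page for the exact parameter set shipped with whichever version of the patch you apply.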
RE: Machine utilization while indexing
StreamingUpdateSolrServer already has multiple threads and uses multiple connections under the covers. At least the API says it 'Uses an internal MultiThreadedHttpConnectionManager to manage http connections'. The constructor allows you to specify the number of threads used, http://lucene.apache.org/solr/api/org/apache/solr/client/solrj/impl/StreamingUpdateSolrServer.html#StreamingUpdateSolrServer(java.lang.String, int, int) . -Kallin Nagelberg -Original Message- From: Chris Hostetter [mailto:hossman_luc...@fucit.org] Sent: Thursday, May 20, 2010 3:14 PM To: solr-user@lucene.apache.org Subject: Re: Machine utilization while indexing I'm really only guessing here, but based on your description of what you are doing, it sounds like you only have one thread streaming documents to Solr (via a single StreamingUpdateSolrServer instance which creates a single HTTP connection). Have you at all attempted to have parallel threads in your client initiate parallel connections to Solr via multiple instances of StreamingUpdateSolrServer objects? -Hoss
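To make that concrete, a minimal SolrJ sketch using the constructor linked above; the URL, queue size, and thread count are illustrative, and exception handling is omitted:

import org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer;
import org.apache.solr.common.SolrInputDocument;

// queueSize=1000 buffered documents, threadCount=4 background threads,
// each draining the queue over its own HTTP connection
StreamingUpdateSolrServer server =
    new StreamingUpdateSolrServer("http://localhost:8080/solr/live", 1000, 4);

SolrInputDocument doc = new SolrInputDocument();
doc.addField("id", "doc-1");
server.add(doc);    // returns almost immediately; the worker threads do the writes
server.commit();

Whether one such instance saturates the server, or several parallel instances are needed as Hoss suggests, is something only load testing will settle.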
RE: seemingly impossible query
Yeah this looks perfect. Too bad it's not in 1.4; I guess I can build from trunk and patch it. This is probably a stupid question, but is there any feeling as to when 1.5 might come out? Thanks, -Kallin Nagelberg -Original Message- From: Geert-Jan Brits [mailto:gbr...@gmail.com] Sent: Thursday, May 20, 2010 1:03 PM To: solr-user@lucene.apache.org Subject: Re: seemingly impossible query Hi Kallin, again please look at FieldCollapsing (http://wiki.apache.org/solr/FieldCollapsing), that should do the trick. Basically: first you constrain the field 'listOfIds' to only contain docs that contain any of the (up to) 100 random ids, as you know how to do. Next, in the same query, specify to collapse on field 'listOfIds'. Basically: q=listOfIds:1 OR listOfIds:10 OR listOfIds:24&collapse.threshold=1&collapse.field=listOfIds&collapse.type=normal This would return the top-matching doc for each id left in listOfIds. Since you constrained this field by the ids specified, you are left with 1 matching doc for each id. Again, it is not guaranteed that all docs returned are different. Since you didn't specify this as a requirement, I think this will suffice. Cheers, Geert-Jan 2010/5/20 Nagelberg, Kallin knagelb...@globeandmail.com Yeah I need something like: (id:1 and maxhits:1) OR (id:2 and maxhits:1).. something crazy like that.. I'm not sure how I can hit Solr just once. If I try to do them all in one big OR query then I'm probably not going to get a hit for each ID. I would need to request probably 1000 documents to find all 100, and even then there's no guarantee and no way of knowing how deep to go. -Kallin Nagelberg -Original Message- From: dar...@ontrenet.com [mailto:dar...@ontrenet.com] Sent: Thursday, May 20, 2010 12:27 PM To: solr-user@lucene.apache.org Subject: RE: seemingly impossible query I see. Well, now you're asking Solr to ignore its prime directive of returning hits that match a query. Hehe. I'm not sure if Solr has a unique attribute. But this sounds, to me, like you will have to filter the results yourself. But at least you hit Solr only once before doing so. Good luck! Thanks Darren, The problem with that is that it may not return one document per id, which is what I need. I.e., I could give 100 ids in that OR query and retrieve 100 documents, all containing just 1 of the IDs. -Kallin Nagelberg -Original Message- From: dar...@ontrenet.com [mailto:dar...@ontrenet.com] Sent: Thursday, May 20, 2010 12:21 PM To: solr-user@lucene.apache.org Subject: Re: seemingly impossible query Ok. I think I understand. What's impossible about this? If you have a single field named id that is multivalued, then you can retrieve the documents with something like: id:1 OR id:2 OR id:56 ... id:100 then add limit 100. There's probably a more succinct way to do this, but I'll leave that to the experts. If you also only want the documents within a certain time, then you also create a time field and use a conjunction (id:0 ...) AND time:NOW-1H or something similar to this. Check the query syntax wiki for specifics. Darren Hey everyone, I've recently been given a requirement that is giving me some trouble. I need to retrieve up to 100 documents, but I can't see a way to do it without making 100 different queries. My schema has a multi-valued field like 'listOfIds'. Each document has between 0 and N of these ids associated with it. My input is up to 100 of these ids at random, and I need to retrieve the most recent document for each id (N ids as input, N docs returned).
I'm currently planning on doing a single query for each id, requesting 1 row, and caching the result. This could work OK since some of these ids should repeat quite often. Of course I would prefer to find a way to do this in Solr, but I'm not sure it's capable. Any ideas? Thanks, -Kallin Nagelberg
RE: Challenge: Searching for variant products and get basic products in result set
I agree that pulling all attributes into the parent sku during indexing could work well. Define a Boolean field like 'isVirtual' to identify the non-leaf skus, and use a multi-valued field for each of the attributes. For now you can do a search like (isVirtual:true AND doorType:screen). If at a later date you want the actual variants, just search for isVirtual:false. Does that work? -Kallin Nagelberg -Original Message- From: Leonardo Menezes [mailto:leonardo.menez...@googlemail.com] Sent: Wednesday, May 19, 2010 11:13 AM To: solr-user@lucene.apache.org Subject: Re: Challenge: Searching for variant products and get basic products in result set If that is so, and you have, for example, two variants of cars with automatic, what would define on which one was the hit? Or do the fields not share common information across variants? If they do share, you wouldn't be able to define in which one was the hit (because it was on both of them) and would either have to pick one randomly, or retrieve both. If they don't share that info, you would have that covered, since only one would match any given query. On Wed, May 19, 2010 at 5:04 PM, hkmortensen ko...@yahoo.com wrote: thanks. Currently not, but requirements change all the time as always ;-) If we get a requirement that a facet shall be material of doors, we will need to know which variant was the hit. I would like to be prepared for that. Leonardo Menezes wrote: would you then need to know in which variant your match was produced? Because if not, you can just index the whole thing as one single document... On Wed, May 19, 2010 at 4:23 PM, hkmortensen ko...@yahoo.com wrote: I do searching for products. Each base product exists in variants as well. One variant has a glass door, another a steel door, etc. The variants can have different prices. The base product does not really exist, only the variants exist IRL. The case corresponds to cars: the car model is the base product, with color variants or with automatic/manual etc. I want to search for variants, but I only want to have base products in the result. I.e. when one or more variants from the same base product are found, only the base product shall be in the search result. Does somebody have an idea how this could be done? Best regards Henning -- View this message in context: http://lucene.472066.n3.nabble.com/Challenge-Searching-for-variant-products-and-get-basic-products-in-result-set-tp829218p829218.html Sent from the Solr - User mailing list archive at Nabble.com. -- View this message in context: http://lucene.472066.n3.nabble.com/Challenge-Searching-for-variant-products-and-get-basic-products-in-result-set-tp829218p829319.html Sent from the Solr - User mailing list archive at Nabble.com.
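For illustration, indexing one collapsed base product with SolrJ might look like this; isVirtual and doorType are the field names from this thread, while the id value and the 'server' variable are assumptions:

import org.apache.solr.common.SolrInputDocument;

SolrInputDocument base = new SolrInputDocument();
base.addField("id", "cabinet-42");     // hypothetical base-product id
base.addField("isVirtual", true);      // marks the non-leaf sku
base.addField("doorType", "glass");    // the multi-valued attribute field
base.addField("doorType", "steel");    // collects every variant's value
server.add(base);

A search like (isVirtual:true AND doorType:glass) then returns only base products, as described above.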
RE: Challenge: Searching for variant products and get basic products in result set
Sorry, in North America 'sku' (stock keeping unit) is the common business term for identifying a specific product, http://lmgtfy.com/?q=sku. And yes, I think you understand me. I am imagining you can structure your products in a hierarchy. For each node in the tree you traverse all children, collecting their attributes into the current node. -Kallin Nagelberg -Original Message- From: hkmortensen [mailto:ko...@yahoo.com] Sent: Wednesday, May 19, 2010 11:39 AM To: solr-user@lucene.apache.org Subject: RE: Challenge: Searching for variant products and get basic products in result set sorry, what does sku mean? I understand you like this: indexing base and variants, and including all attributes (for one base and its variants) in each document. I think that would work. Thanks. Nagelberg, Kallin wrote: I agree that pulling all attributes into the parent sku during indexing could work well. Define a Boolean field like 'isVirtual' to identify the non-leaf skus, and use a multi-valued field for each of the attributes. For now you can do a search like (isVirtual:true AND doorType:screen). If at a later date you want the actual variants, just search for isVirtual:false. Does that work? -Kallin Nagelberg -Original Message- From: Leonardo Menezes [mailto:leonardo.menez...@googlemail.com] Sent: Wednesday, May 19, 2010 11:13 AM To: solr-user@lucene.apache.org Subject: Re: Challenge: Searching for variant products and get basic products in result set If that is so, and you have, for example, two variants of cars with automatic, what would define on which one was the hit? Or do the fields not share common information across variants? If they do share, you wouldn't be able to define in which one was the hit (because it was on both of them) and would either have to pick one randomly, or retrieve both. If they don't share that info, you would have that covered, since only one would match any given query. On Wed, May 19, 2010 at 5:04 PM, hkmortensen ko...@yahoo.com wrote: thanks. Currently not, but requirements change all the time as always ;-) If we get a requirement that a facet shall be material of doors, we will need to know which variant was the hit. I would like to be prepared for that. Leonardo Menezes wrote: would you then need to know in which variant your match was produced? Because if not, you can just index the whole thing as one single document... On Wed, May 19, 2010 at 4:23 PM, hkmortensen ko...@yahoo.com wrote: I do searching for products. Each base product exists in variants as well. One variant has a glass door, another a steel door, etc. The variants can have different prices. The base product does not really exist, only the variants exist IRL. The case corresponds to cars: the car model is the base product, with color variants or with automatic/manual etc. I want to search for variants, but I only want to have base products in the result. I.e. when one or more variants from the same base product are found, only the base product shall be in the search result. Does somebody have an idea how this could be done? Best regards Henning -- View this message in context: http://lucene.472066.n3.nabble.com/Challenge-Searching-for-variant-products-and-get-basic-products-in-result-set-tp829218p829218.html Sent from the Solr - User mailing list archive at Nabble.com. -- View this message in context: http://lucene.472066.n3.nabble.com/Challenge-Searching-for-variant-products-and-get-basic-products-in-result-set-tp829218p829319.html Sent from the Solr - User mailing list archive at Nabble.com.
-- View this message in context: http://lucene.472066.n3.nabble.com/Challenge-Searching-for-variant-products-and-get-basic-products-in-result-set-tp829218p829435.html Sent from the Solr - User mailing list archive at Nabble.com.
RE: disable caches in real time
I suppose you are still losing some performance on the replicated box, since it needs to use some resources to warm the cache. It would be nice if a warmed cache could be replicated from the master, though perhaps that's not practical. Chris is right though: the newly updated index created by a commit is not seen by users until it has been warmed, at which point it is atomically swapped in. -Kallin Nagelberg -Original Message- From: Chris Hostetter [mailto:hossman_luc...@fucit.org] Sent: Wednesday, May 19, 2010 2:38 PM To: solr-user@lucene.apache.org Subject: Re: disable caches in real time : I've always understood that if you do a commit (replication does one), a new : searcher is opened, and you lose performance (queries per second) while the : caches are regenerated. I think I didn't explain my situation correctly. Not if you configure your caches with autowarming -- then Solr will warm up the new caches (on the new index) while the old index still serves requests -- this is all managed for you by the SolrCore, no need for core swapping. -Hoss
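For reference, autowarming is configured per cache in solrconfig.xml; a typical stanza looks something like this, with the sizes being purely illustrative:

<filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="128"/>
<queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="64"/>

With autowarmCount greater than 0, the new searcher replays the most recently used keys from the old cache before it is swapped in, which is the behavior Hoss describes.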
confused by simple OR
I must be missing something very obvious here. I have a filter query like so: (-rootdir:somevalue) I get results for that filter However, when I OR it with another term like so I get nothing: ((-rootdir:somevalue) OR (rootdir:somevalue AND someboolean:true)) How is this possible? Have I gone mad? Thanks, Kallin Nagelberg
RE: confused by simple OR
Awesome, that works, thanks Ahmet. -Kallin Nagelberg -Original Message- From: Ahmet Arslan [mailto:iori...@yahoo.com] Sent: Thursday, May 13, 2010 12:24 PM To: solr-user@lucene.apache.org Subject: Re: confused by simple OR I must be missing something very obvious here. I have a filter query like so: (-rootdir:somevalue) I get results for that filter However, when I OR it with another term like so I get nothing: ((-rootdir:somevalue) OR (rootdir:somevalue AND someboolean:true)) You simply cannot combine NOT and OR clauses like you did: a purely negative clause matches nothing on its own once it is nested inside an OR, so it needs an explicit match-all anchor. It should be something like: ((+*:* -rootdir:somevalue) OR (rootdir:somevalue AND someboolean:true))
maximum recommended document cache size
I am trying to tune my Solr setup so that the caches are well warmed after the index is updated. My documents are quite small, usually under 10k. I currently have a document cache size of about 15,000, and am warming up 5,000 with a query after each indexing. Autocommit is set at 30 seconds, and my caches are warming up easily in just a couple of seconds. I've read of concerns regarding garbage collection when your cache is too large. Does anyone have experience with this? Ideally I would like to get 90% of all documents from the last month in memory after each index, which would be around 25,000. I'm doing extensive load testing, but if someone has recommendations I'd love to hear them. Thanks, -Kallin Nagelberg
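For reference, the relevant solrconfig.xml stanza might look like this, mirroring the numbers above; note that the documentCache cannot be autowarmed (internal Lucene document ids change on commit), which is why the newSearcher warming-query approach described here is the right one:

<documentCache class="solr.LRUCache" size="15000" initialSize="15000" autowarmCount="0"/>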
RE: strange behaviour when sorting, fields are missing in result
I'm not sure I understand how your results are truncated. They both find 21502 documents. The fact that you are sorting on '_erstelldatum' ascending and not seeing any results for that field on the first page leads me to think that you have 'sortMissingLast=false' on that field's fieldType. In that case it would put all the documents missing '_erstelldatum' first. -Kallin Nagelberg -Original Message- From: markus.rietz...@rzf.fin-nrw.de [mailto:markus.rietz...@rzf.fin-nrw.de] Sent: Wednesday, May 12, 2010 9:00 AM To: solr-user@lucene.apache.org Subject: strange behaviour when sorting, fields are missing in result When I do a search, e.g. http://xxx:8983/solr/select?q=steuer&fl=score,id,__intern,title,__source,_dienststelle,_erstelldatum,__cyear,_stelle I get a normal result, like result name=response numFound=21502 start=0 maxScore=1.3633566 doc float name=score1.3633566/float int name=__cyear2009/int str name=__intern0/str str name=__sourcezzz/str str name=_dienststellexyz/str long name=_erstelldatum2009020200/long str name=_stellePresse- u. Informationsreferat/str str name=id34931684/str str name=titleMerkblatt Vereine und Steuern/str /doc When I do a search with the sort param, my result is suddenly truncated: http://xxx:8983/solr/select?q=steuer&fl=score,id,__intern,title,__source,_dienststelle,_erstelldatum,__cyear,_stelle&sort=_erstelldatum+asc result name=response numFound=21502 start=0 maxScore=1.3633566 doc float name=score0.14290115/float str name=__intern0/str str name=__sourceisys/str str name=id18205520/str str name=titleAmtsübersicht /str /doc So not all of the fields from the fl param are displayed. This is what the admin schema browser says about _erstelldatum: Field Type: long Properties: Indexed, Tokenized, Stored, Omit Norms, undefined Schema: Indexed, Tokenized, Stored, Omit Norms, undefined Index: Stored, Omit Norms, Binary Index Analyzer: org.apache.solr.analysis.TokenizerChain Details Tokenizer Class: org.apache.solr.analysis.TrieTokenizerFactory Query Analyzer: org.apache.solr.analysis.TokenizerChain Details Tokenizer Class: org.apache.solr.analysis.TrieTokenizerFactory Docs: 118496 Distinct: 11477 Top Terms (term frequency) 2009072100 655 2003111000 500 2006110800 428 2010012900 412 2006032000 356 2003062000 354 2010041500 313 2010043000 310 2010030100 296 2008110112 260 and this for validFrom: Field Type: long Properties: Indexed, Tokenized, Stored, Omit Norms, undefined Schema: Indexed, Tokenized, Stored, Omit Norms, undefined Index: Stored, Omit Norms, Binary Index Analyzer: org.apache.solr.analysis.TokenizerChain Details Tokenizer Class: org.apache.solr.analysis.TrieTokenizerFactory Query Analyzer: org.apache.solr.analysis.TokenizerChain Details Tokenizer Class: org.apache.solr.analysis.TrieTokenizerFactory Docs: 111762 Distinct: 66649 Top Terms (term frequency) 2002101700 315 2003111000 309 2002102100 293 2009042312 258 20060320152000 229 2005060700 227 2010041500 215 2007010100 207 2005061000 205 2010012900 200 I have checked all of our fields; with some of them sort works and with some of them it doesn't.
This is our finding (+ means sorting works, - means it doesn't):
_aktenzeichen +
_autor -
_dienststelle +
_dokumententyp +
_erstelldatum -
_hauptthema -
_kurzbeschreibung -
kurzinfoGruppe -
lastChanged +
objClass -
objType -
publicationHinweis -
publicationNavigationstitel -
publicationStichwort -
_stelle +
_stichwort -
_unterthema -
title +
validFrom +
validUntil -
_verteiler -
_vertraulich -
_zielgruppen +
__dst + (all fields but not _stelle)
__intern -
__lokal + (all fields but not _stelle)
__cdate -
__cyear -
__source -
__doctype +
__mikronav -
What can lead to this problem? We have the following fields defined in our schema.xml:
!-- RZF isys --
field name=_aktenzeichen type=string indexed=true stored=true /
field name=_anlagedoc type=string indexed=false stored=false /
field name=_autor type=textgen indexed=true stored=true /
field name=_dienststelle type=string indexed=true stored=true /
field name=_dokumententyp type=string indexed=true stored=true /
field name=_erstelldatum type=long indexed=true stored=true /
field name=_hauptthema type=text_de indexed=true stored=true /
field name=_kurzbeschreibung type=text_de indexed=true stored=true /
field name=kurzinfoGruppe type=long indexed=true stored=true mulitValued=true/
field name=lastChanged type=long indexed=true stored=true /
field name=objClass type=string indexed=true stored=true /
field name=objType type=string indexed=true
caching repeated OR'd terms
Hey everyone, I'm having some difficulty figuring out the best way to optimize for a certain query situation. My documents have a many-valued field that stores lists of IDs. All in all there are probably about 10,000 distinct IDs throughout my index. I need to be able to query and find all documents that contain a given set of IDs. I.e., I want to find all documents that contain IDs 3, 202, 3030 or 505. Currently I'm implementing this like so: q=(myfield:3) OR (myfield:202) OR (myfield:3030) OR (myfield:505). It's possible that there could be upwards of hundreds of terms, although 90% of the time it will be under 10. Ideally I would like to do this with a filter query, but I have read that it is impossible to cache OR'd terms in an fq, though this feature may come soon. The problem is that the combinations of OR'd terms will almost always be unique, so the query cache will have a very low hit rate. It would be great if the individual terms could be cached individually, but I'm not sure how to accomplish that. Any suggestions would be welcome! -Kallin Nagelberg
cache control per-request
Hey everyone, Does anyone know if it is possible to control cache behavior on a per-request basis? I would like to be able to use the queryResultCache for certain queries, but have it bypassed for others. I.e., I know at query time if there is 0 chance of a hit and would like to avoid the cache on those. If I can do that, it leaves space in the cache for those that may actually hit. Thanks, -Kallin Nagelberg
nstein and 3S
Hey everyone, I'm curious if anyone has experience working with the company Nstein and their Solr-based search solution 3S. Any comments on performance, usability, support, etc. would be really appreciated. Thanks, -Kallin Nagelberg
RE: benefits of float vs. string
When using numerical types you can do range queries like 3 < myfield <= 10, as well as a lot of other interesting mathematical functions that would not be possible with a string type. Thanks for the info Yonik, -Kallin Nagelberg -Original Message- From: Dennis Gearon [mailto:gear...@sbcglobal.net] Sent: Friday, April 30, 2010 1:27 AM To: solr-user@lucene.apache.org; yo...@lucidimagination.com Subject: Re: benefits of float vs. string Please explain a range query? tia :-) Dennis Gearon Signature Warning EARTH has a Right To Life, otherwise we all die. Read 'Hot, Flat, and Crowded' Laugh at http://www.yert.com/film.php --- On Thu, 4/29/10, Yonik Seeley yo...@lucidimagination.com wrote: From: Yonik Seeley yo...@lucidimagination.com Subject: Re: benefits of float vs. string To: solr-user@lucene.apache.org Date: Thursday, April 29, 2010, 1:01 PM On Wed, Apr 28, 2010 at 11:22 AM, Nagelberg, Kallin knagelb...@globeandmail.com wrote: Does anyone have an idea about the performance benefits of searching across floats compared to strings? I have one multi-valued field that contains about 3000 distinct IDs across 5 million documents. I am going to be running a lot of queries like q=id:102 OR id:303 OR id:305, etc. Right now it is a String but I am going to switch to a float as intuitively it ought to be easier to filter a number than a string. There won't be any difference in search speed for term queries as you show above. If you don't need to do sorting or range queries on that field, I'd leave it as a String. -Yonik Apache Lucene Eurocon 2010 18-21 May 2010 | Prague
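For reference, the range itself is written the same way for either field type:

id:[3 TO 10]   (square brackets: endpoints included)
id:{3 TO 10}   (curly braces: endpoints excluded)

The catch with a string field is that the comparison is lexicographic, so "10" sorts before "3" and a numeric-looking range silently gives wrong answers; that correctness issue, more than raw speed, is the reason to move to a numeric type if ranges or sorting are ever needed.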
prefixing with dismax
Hey, I've been using the dismax query parser so that I can pass a user-created search string directly to Solr. Now I'm getting the requirement that something like 'Bo' must match 'Bob', or 'Bob Jo' must match 'Bob Jones'. I can't think of a way to make this happen with dismax, though it's pretty simple with standard syntax: I guess I would just split on space and create ANDed terms like 'myfield:token*'. This doesn't feel like a great approach though, since I'm losing all of the escaping magic of dismax. Does anyone have any cleaner solutions to this sort of problem? I imagine it's quite common. Thanks, Kallin Nagelberg
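A minimal client-side sketch of that split-and-prefix approach, assuming SolrJ; ClientUtils.escapeQueryChars wins back some of the escaping dismax normally provides, and 'myfield' plus the userInput variable are placeholders:

import org.apache.solr.client.solrj.util.ClientUtils;

// turns "Bob Jo" into: myfield:bob* AND myfield:jo*
StringBuilder q = new StringBuilder();
for (String token : userInput.trim().split("\\s+")) {
    if (q.length() > 0) q.append(" AND ");
    // wildcard terms are not analyzed, so lowercase client-side
    // if the index-time analyzer lowercases the field
    q.append("myfield:")
     .append(ClientUtils.escapeQueryChars(token.toLowerCase()))
     .append("*");
}

Whether this beats an EdgeNGram-based field is a separate design question: ngrams trade index size for avoiding wildcard expansion at query time, and they keep working through dismax.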
RE: Slow Date-Range Queries
You might want to look at DateMath, http://lucene.apache.org/solr/api/org/apache/solr/util/DateMathParser.html. I believe the default precision is to the millisecond, so if you can afford to round to the nearest second or even minute you might see some performance gains. -Kallin Nagelberg -Original Message- From: Jan Simon Winkelmann [mailto:winkelm...@newsfactory.de] Sent: Thursday, April 29, 2010 4:36 AM To: solr-user@lucene.apache.org Subject: Slow Date-Range Queries Hi, I am currently having serious performance problems with date range queries. What I am doing is validating a dataset's published status by a valid_from and a valid_till date field. I did get a performance boost of ~ 100% by switching from a normal solr.DateField to a solr.TrieDateField with precisionStep=8, however my query still takes about 1.3 seconds. My field definition looks like this: fieldType name=date class=solr.TrieDateField precisionStep=8 sortMissingLast=true omitNorms=true/ field name=valid_from type=date indexed=true stored=false required=false / field name=valid_till type=date indexed=true stored=false required=false / And the query looks like this: (((valid_from:[* TO 2010-04-29T10:34:12Z]) AND (valid_till:[2010-04-29T10:34:12Z TO *])) OR ((*:* -valid_from:[* TO *]) AND (*:* -valid_till:[* TO *]))) I use the empty checks for datasets which do not have a valid from/till range. Is there any way to get this any faster? Would it be faster using unix timestamps with int fields? I would appreciate any insight and help on this. regards, Jan-Simon
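For illustration, rounding both ends of the range to the minute with DateMath would look like this; identical rounded queries can then be answered from the filter and query-result caches instead of being recomputed on every request:

((valid_from:[* TO NOW/MINUTE]) AND (valid_till:[NOW/MINUTE TO *]))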
RE: Evangelism
I had a very hard time selling Solr to business folks. Most are of the mind that if you're not paying for something it can't be any good. That might also be why they refrain from posting 'powered by Solr' on their website, as if it might show them to be cheap. They are also fearful of a lack of support should you get hit by a bus. This might be remedied by recommending professional services from a company such as Lucid Imagination. I think your best bet is to create a working demo with your data and show them the performance. Cheers, -Kallin Nagelberg -Original Message- From: Israel Ekpo [mailto:israele...@gmail.com] Sent: Thursday, April 29, 2010 2:19 PM To: solr-user@lucene.apache.org Subject: Re: Evangelism Their main search page has the Powered by Solr logo http://www.lucidimagination.com/search/ On Thu, Apr 29, 2010 at 2:18 PM, Israel Ekpo israele...@gmail.com wrote: Check out Lucid Imagination http://www.lucidimagination.com/About-Search This should convince you. On Thu, Apr 29, 2010 at 2:10 PM, Daniel Baughman da...@hostworks.com wrote: Hi, I'm new to the list here. I'd like to steer someone in the direction of Solr, and I see the list of companies using Solr, but none have a 'powered by Solr' logo or anything. Does anyone have any great links with evidence of majorly successful Solr projects? Thanks in advance, Dan B. -- Good Enough is not good enough. To give anything less than your best is to sacrifice the gift. Quality First. Measure Twice. Cut Once. http://www.israelekpo.com/ -- Good Enough is not good enough. To give anything less than your best is to sacrifice the gift. Quality First. Measure Twice. Cut Once. http://www.israelekpo.com/
RE: nfs vs sas in production
Thanks all. Tom, your results are interesting. We both have about 5 million documents, but my index is 20 gigs vs. your 2 TB. I imagine we'll have a much easier time getting quick responses against these small documents compared to your multi-second queries. As for index/search disk contention, we're planning to have independent indexing and searching machines, probably following some of the guidelines in this great article: http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Scaling-Lucene-and-Solr#resources.haproxy . -Kallin Nagelberg -Original Message- From: Burton-West, Tom [mailto:tburt...@umich.edu] Sent: Tuesday, April 27, 2010 6:03 PM To: solr-user@lucene.apache.org Subject: RE: nfs vs sas in production Hi Kallin, Given the previous postings on the list about terrible NFS performance, we were pleasantly surprised when we did some tests against a well-tuned NFS RAID array on a private network. We got reasonably good results (given our large index sizes.) See http://www.hathitrust.org/blogs/large-scale-search/current-hardware-used-testing and http://www.hathitrust.org/blogs/large-scale-search/tuning-search-performance. Just prior to going into production we moved from direct attached storage to a very high performance NAS for a number of reasons, including ease of management as we scale out. One of the reasons was to reduce contention between indexing/optimizing and search instances for disk I/O. See http://www.hathitrust.org/blogs/large-scale-search/scaling-large-scale-search-50-volumes-5-million-volumes-and-beyond for details. Tom -Original Message- From: Nagelberg, Kallin [mailto:knagelb...@globeandmail.com] Sent: Tuesday, April 27, 2010 4:13 PM To: 'solr-user@lucene.apache.org' Subject: nfs vs sas in production Hey, A question was raised during a meeting about our new Solr based search projects. We're getting 4 cutting-edge servers, each with something like 24 gigs of RAM dedicated to search. However there is some problem with the amount of SAS based storage each machine can handle, and people wonder if we might have to use an NFS based drive instead. Does anyone have any experience using SAS vs. NFS drives for Solr? Any feedback would be appreciated! Thanks, -Kallin Nagelberg
benefits of float vs. string
Hi, Does anyone have an idea about the performance benefits of searching across floats compared to strings? I have one multi-valued field that contains about 3000 distinct IDs across 5 million documents. I am going to be running a lot of queries like q=id:102 OR id:303 OR id:305, etc. Right now it is a String, but I am going to switch to a float as intuitively it ought to be easier to filter a number than a string. I'm just curious if this should in fact bring a benefit, and more generally what the benefits/penalties of using numerical over string field types are. Thanks, Kallin Nagelberg
nfs vs sas in production
Hey, A question was raised during a meeting about our new Solr based search projects. We're getting 4 cutting-edge servers, each with something like 24 gigs of RAM dedicated to search. However there is some problem with the amount of SAS based storage each machine can handle, and people wonder if we might have to use an NFS based drive instead. Does anyone have any experience using SAS vs. NFS drives for Solr? Any feedback would be appreciated! Thanks, -Kallin Nagelberg
RE: Benchmarking Solr
I have been using JMeter to perform some load testing. In your case you might like to take a look at http://jakarta.apache.org/jmeter/usermanual/component_reference.html#CSV_Data_Set_Config . This will let you feed a different query from your list into each request. Regards, Kallin Nagelberg -Original Message- From: Blargy [mailto:zman...@hotmail.com] Sent: Friday, April 09, 2010 9:47 PM To: solr-user@lucene.apache.org Subject: Benchmarking Solr I am about to deploy Solr into our production environment and I would like to do some benchmarking to determine how many slaves I will need to set up. Currently the only way I know how to benchmark is to use Apache Benchmark, but I would like to be able to send random requests to Solr... not just one request over and over. I have a sample data set of 5000 user-entered queries and I would like to be able to use AB to benchmark against all these random queries. Is this possible? FYI our current index is ~1.5 gigs with ~5m documents and we will be using faceting quite extensively. Our average request volume is ~2m requests/day. We will be running RHEL with about 8-12G of RAM. Any idea how many slaves might be required to handle our load? Thanks -- View this message in context: http://n3.nabble.com/Benchmarking-Solr-tp709561p709561.html Sent from the Solr - User mailing list archive at Nabble.com.
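For illustration, the CSV Data Set Config approach comes down to three pieces; the file and variable names here are arbitrary:

queries.csv (one user-entered query per line):
    ipod nano
    canon powershot

CSV Data Set Config: Filename = queries.csv, Variable Names = QUERY
HTTP Request sampler path: /solr/select?q=${QUERY}&rows=10

Each sampler iteration consumes the next line, so 5000 lines yield 5000 distinct requests. Since user queries will contain spaces, it is safer to add q through the sampler's parameter table with "Encode?" checked so JMeter URL-encodes it.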
RE: index corruption / deployment strategy
Thanks Erik, I forwarded your thoughts to management and put in a good word for Lucid Imagination. Regards, Kallin Nagelberg -Original Message- From: Erik Hatcher [mailto:erik.hatc...@gmail.com] Sent: Thursday, April 08, 2010 2:18 PM To: solr-user@lucene.apache.org Subject: Re: index corruption / deployment strategy Kallin, It's a very rare report, and practically impossible I'm told, to corrupt the index these days thanks to Lucene's improvements over the last several releases (ignoring hardware malfunctions). A single index is the best way to go, in my opinion - though at your scale you're probably looking at sharding it and using distributed search. So you'll have multiple physical indexes, one for each shard, and a single virtual index in the eyes of your searching clients. Backups, of course, are sensible, and Solr's replication capabilities can help here by requesting them periodically. You'll be using replication anyway to scale to your query volume. As for hardware scaling considerations, there are variables to consider, like how faceting, sorting, and querying speed compare across a single large index versus sharded ones. I'm guessing you'll be best with at least two shards, though possibly more considering these variables. Erik @ Lucid Imagination p.s. have your higher-ups give us a call if they'd like to discuss their concerns and consider commercial support for your mission-critical, big-scale use of Solr :) On Apr 8, 2010, at 1:33 PM, Nagelberg, Kallin wrote: I've been doing work evaluating Solr for use on a high-traffic website for some time and things are looking positive. I have some concerns from my higher-ups that I need to address. I have suggested that we use a single index in order to keep things simple, but there are suggestions to split our documents amongst different indexes. The primary motivation for this split is a worry about potential index corruption. I.e., if we only have one index and it becomes corrupt, what do we do? I never considered this to be an issue since we would have backups etc., but I think they have had issues with other search technology in the past where one big index resulted in frequent and difficult-to-recover-from corruption. Do you think this is a concern with Solr? If so, what would you suggest to mitigate the risk? My second question involves general deployment strategy. We will expect about 50 million documents, each on average a few paragraphs, and our website receives maybe 10 million hits a day. Can anyone provide an idea of # of servers, clustering/replication setup etc. that might be appropriate for this scenario? I'm interested to hear what others' experience is with similar situations. Thanks, -Kallin Nagelberg
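For reference, the distributed-search half of that answer is driven by the shards request parameter; a query spanning two shards looks like this, with the host names being placeholders:

http://host1:8983/solr/select?q=dog&shards=host1:8983/solr,host2:8983/solr

The node receiving the request queries every listed shard and merges the results, so clients still see the single virtual index Erik describes.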
RE: multicore embedded swap / reload etc.
Thanks everyone, I was following the SolrJ wiki, which says: If you want to use MultiCore features, then you should use this: File home = new File( "/path/to/solr/home" ); File f = new File( home, "solr.xml" ); CoreContainer container = new CoreContainer(); container.load( "/path/to/solr/home", f ); EmbeddedSolrServer server = new EmbeddedSolrServer( container, "core name as defined in solr.xml" ); ... I'm just a little confused with the disconnect between that and what I see about managing multiple cores here: http://wiki.apache.org/solr/CoreAdmin . If someone could provide some high-level directions it would be greatly appreciated. Thanks, -Kallin Nagelberg -Original Message- From: Mark Miller [mailto:markrmil...@gmail.com] Sent: Friday, March 26, 2010 7:54 AM To: solr-user@lucene.apache.org Subject: Re: multicore embedded swap / reload etc. Embedded supports MultiCore - it's the direct core connection thing that supports only one. - Mark http://www.lucidimagination.com (mobile) On Mar 26, 2010, at 7:38 AM, Erik Hatcher erik.hatc...@gmail.com wrote: But wait... embedded Solr doesn't support multicore, does it? Just off memory, I think it's fixed to a single core. Erik On Mar 25, 2010, at 10:31 PM, Lance Norskog wrote: All operations through SolrJ work exactly the same against the Solr web app and embedded Solr. You code the calls to update cores with the same SolrJ APIs either way. On Wed, Mar 24, 2010 at 2:19 PM, Nagelberg, Kallin knagelb...@globeandmail.com wrote: Hi, I've got a situation where I need to reindex a core once a day. To do this I was thinking of having two cores, one 'live' and one 'staging'. The app is always serving 'live', but when the daily index happens it goes into 'staging', then staging is swapped into 'live'. I can see how to do this sort of thing over HTTP, but I'm using an embedded Solr setup via SolrJ. Any suggestions on how to proceed? I could just have two SolrServers built from different CoreContainers, and then swap the references when I'm ready, but I wonder if there is a better approach. Maybe grab a hold of the CoreAdminHandler? Thanks, Kallin Nagelberg -- Lance Norskog goks...@gmail.com
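As a rough sketch of those high-level directions: with an embedded setup, the swap is a direct method call on the CoreContainer rather than an HTTP request to the CoreAdmin handler. Assuming a container loaded as in the wiki snippet above, and the 'live'/'staging' names from the original question:

// reindex into "staging" and commit, then:
container.swap("live", "staging");  // exchanges the two core names

// a server handle is bound to a core name, so one obtained under the
// name "live" after the swap serves the freshly built index
EmbeddedSolrServer live = new EmbeddedSolrServer(container, "live");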
multicore embedded swap / reload etc.
Hi, I've got a situation where I need to reindex a core once a day. To do this I was thinking of having two cores, one 'live' and one 'staging'. The app is always serving 'live', but when the daily index happens it goes into 'staging', then staging is swapped into 'live'. I can see how to do this sort of thing over http, but I'm using an embedded solr setup via solrJ. Any suggestions on how to proceed? I could just have two solrServer's built from different coreContainers, and then swap the references when I'm ready, but I wonder if there is a better approach. Maybe grab a hold of the CoreAdminHandler? Thanks, Kallin Nagelberg
RE: lowercasing for sorting
Thanks, and my cover is apparently blown :P We're looking at Solr for a number of applications, from taking load off the database to user searching, etc. I don't think I'll get fired for saying that :P Thanks, Kallin Nagelberg -Original Message- From: Binkley, Peter [mailto:peter.bink...@ualberta.ca] Sent: Tuesday, March 23, 2010 2:09 PM To: solr-user@lucene.apache.org Subject: RE: lowercasing for sorting Solr makes this easy: tokenizer class=solr.KeywordTokenizerFactory/ filter class=solr.LowerCaseFilterFactory/ You can populate this field from another field using copyField, if you also need to be able to search or display the original values. Just out of curiosity, can you tell us anything about what the Globe and Mail is using Solr for? (assuming the question is work-related) Peter -Original Message- From: Nagelberg, Kallin [mailto:knagelb...@globeandmail.com] Sent: Tuesday, March 23, 2010 11:07 AM To: 'solr-user@lucene.apache.org' Subject: lowercasing for sorting I'm trying to perform a case-insensitive sort on a field in my index that contains values like: aaa, bbb, AA, BB. I get them sorted like: AA, BB, aaa, bbb (uppercase sorts first), when I would like them: AA, aaa, BB, bbb. To do this I'm trying to set up a fieldType whose sole purpose is to lowercase a value on query and index. I don't want to tokenize the value, just lowercase it. Any ideas? Thanks, Kallin Nagelberg
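Putting Peter's two pieces together, the schema additions might look like this; the title/title_sort names are placeholders, and the fieldType mirrors the alphaOnlySort example that ships in the Solr 1.4 sample schema:

<fieldType name="alphaOnlySort" class="solr.TextField" sortMissingLast="true" omitNorms="true">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
<field name="title_sort" type="alphaOnlySort" indexed="true" stored="false"/>
<copyField source="title" dest="title_sort"/>

Sorting with sort=title_sort asc is then case-insensitive, while searching and display keep using the original title field.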
RE: How to use dismax and boosting properly?
Try setting the boost to 0 for the fields you don't want to contribute to the score. Kallin Nagelberg -Original Message- From: Jason Chaffee [mailto:jchaf...@ebates.com] Sent: Thursday, February 25, 2010 4:03 PM To: solr-user@lucene.apache.org Subject: How to use dismax and boosting properly? I am using dismax and I have configured it to search 3 different fields, with one field getting an extra boost so that the results of that field are at the top of the result set. Then, I sort the results by another field to get the ordering. My problem is that the scores are being skewed by adding the scores from the different fields. What I really want is to have all matches in the boost field have an equal score and take precedence over matches from other fields. I want them to have the same score so that the sorting will sort them alphabetically. Therefore, the scores must be the same. Because the query is found in all three fields with different numbers of occurrences, some scores are being skewed in the boosted matches, putting them at the top of my results when alphabetically they should be near the bottom. Here is an example, in case my explanation isn't clear: I have dismax with the following config: str name=qfField1^3.0 Field2^0.1 Field3^0.1/str str name=sortscore desc, sortField asc/str Where sortField is the original keyword token, without any processing except for lowercase. Field1 (the boosted field) a at at att a ab abe abeb abebo abeboo abebook abebooks Field2 a at at att a at att a at at at at at t att att att at t att att at t att att at t att att at t att att at t att att at t abebooks a ab abe abeb abebo abeboo abebook abebooks Field3 a at at att a at att a at at at at at t att att att at t att att at t att att at t att att at t att att at t att att at t abebooks a ab abe abeb abebo abeboo abebook abebooks The user types in the query 'a': Here is the debugQuery: str name=ATT 5.4186125 = (MATCH) sum of: 2.7147598 = (MATCH) max plus 0.1 times others of: 0.10907243 = (MATCH) weight(Field2:a^0.1 in 80), product of: 0.01970195 = queryWeight(Field2:a^0.1), product of: 0.1 = boost 3.1962826 = idf(docFreq=117, maxDocs=1061) 0.0616402 = queryNorm 5.5361238 = (MATCH) fieldWeight(Field2:a in 80), product of: 1.7320508 = tf(termFreq(Field2:a)=3) 3.1962826 = idf(docFreq=117, maxDocs=1061) 1.0 = fieldNorm(field=Field2, doc=80) 2.7038527 = (MATCH) weight(Field1:a^3.0 in 80), product of: 0.7071054 = queryWeight(Field1:a^3.0), product of: 3.0 = boost 3.8238325 = idf(docFreq=62, maxDocs=1061) 0.0616402 = queryNorm 3.8238325 = (MATCH) fieldWeight(Field1:a in 80), product of: 1.0 = tf(termFreq(Field1:a)=1) 3.8238325 = idf(docFreq=62, maxDocs=1061) 1.0 = fieldNorm(field=Field1, doc=80) 2.7038527 = (MATCH) weight(Field1:a^3.0 in 80), product of: 0.7071054 = queryWeight(Field1:a^3.0), product of: 3.0 = boost 3.8238325 = idf(docFreq=62, maxDocs=1061) 0.0616402 = queryNorm 3.8238325 = (MATCH) fieldWeight(Field1:a in 80), product of: 1.0 = tf(termFreq(Field1:a)=1) 3.8238325 = idf(docFreq=62, maxDocs=1061) 1.0 = fieldNorm(field=Field1, doc=80) /str str name=Abebooks 5.4140024 = (MATCH) sum of: 2.71015 = (MATCH) max plus 0.1 times others of: 0.062973 = (MATCH) weight(edgeNGramStandardField:a^0.1 in 138), product of: 0.01970195 = queryWeight(edgeNGramStandardField:a^0.1), product of: 0.1 = boost 3.1962826 = idf(docFreq=117, maxDocs=1061) 0.0616402 = queryNorm 3.1962826 = (MATCH) fieldWeight(edgeNGramStandardField:a in 138), product of: 1.0 =
tf(termFreq(edgeNGramStandardField:a)=1) 3.1962826 = idf(docFreq=117, maxDocs=1061) 1.0 = fieldNorm(field=edgeNGramStandardField, doc=138) 2.7038527 = (MATCH) weight(edgeNGramKeywordField:a^3.0 in 138), product of: 0.7071054 = queryWeight(edgeNGramKeywordField:a^3.0), product of: 3.0 = boost 3.8238325 = idf(docFreq=62, maxDocs=1061) 0.0616402 = queryNorm 3.8238325 = (MATCH) fieldWeight(edgeNGramKeywordField:a in 138), product of: 1.0 = tf(termFreq(edgeNGramKeywordField:a)=1) 3.8238325 = idf(docFreq=62, maxDocs=1061) 1.0 = fieldNorm(field=edgeNGramKeywordField, doc=138) 2.7038527 = (MATCH) weight(edgeNGramKeywordField:a^3.0 in 138), product of: 0.7071054 = queryWeight(edgeNGramKeywordField:a^3.0), product of: 3.0 = boost 3.8238325 = idf(docFreq=62, maxDocs=1061) 0.0616402 = queryNorm 3.8238325 = (MATCH)
stop words make dismax fail
I'm having a problem when users enter stopwords in their query. I'm using a dismax request handler against a field setup like: fieldType name=simpleText class=solr.TextField positionIncrementGap=100 analyzer type=index tokenizer class=solr.StandardTokenizerFactory/ filter class=solr.StopFilterFactory ignoreCase=true words=stopwords.txt enablePositionIncrements=true / filter class=solr.LowerCaseFilterFactory/ filter class=solr.LengthFilterFactory min=2 max=20 / /analyzer analyzer type=query tokenizer class=solr.WhitespaceTokenizerFactory/ filter class=solr.StopFilterFactory ignoreCase=true words=stopwords.txt enablePositionIncrements=true / filter class=solr.SynonymFilterFactory synonyms=synonyms.txt ignoreCase=true expand=true/ filter class=solr.LowerCaseFilterFactory/ filter class=solr.LengthFilterFactory min=2 max=20 / /analyzer /fieldType The problem is that when a user enters a query like 'meet the president', zero results are returned. I imagine it has something to do with 'the' being stripped out, then only 2 of the 3 terms matching. As a temporary workaround I set minshouldmatch to 1 so I do get results. That causes other problems though, such as 'the' never being highlighted in the results. Am I doing something totally wrong? Thanks, Kallin Nagelberg
including 'the' dismax query kills results
I've noticed some peculiar behavior with the dismax SearchHandler. In my case I'm making the search 'The British Open', and am getting 0 results. When I change it to 'British Open' I get many hits. I looked at the query analyzer and it should be broken down into 'british' and 'open' tokens ('the' is a stopword). I imagine it is doing an 'and' type search, and by setting the 'mm' parameter to 1 I once again get results for 'the british open'. I would like mm to be 100% however, and just not care about stopwords. Is there a way to do this? Thanks, -Kal
filter queries not fully filtering
Hi everyone, I am attempting to implement a faceted drill-down feature with Solr, and I am having problems explaining some results of the fq parameter. Let's say I have two fields, 'people' and 'category'. I do a search for 'dog' and ask to facet on the people and category fields. I am told that there are 200 documents with people='bob' and 100 with category='news'. I would expect that when I make the query q=dog, fq=category:news, the new faceting results should never show more than 100 entries. However, this is not what I see. Instead I see facet counts exceeding 100. How can that be when I just told it to filter down to the 100 articles that contained category:news? Thanks, Kallin Nagelberg.
RE: filter queries not fully filtering
Problem solved: I wasn't quoting the value. Since I was using names such as 'Gary Bettman', Solr must have been matching all the Garys. -Original Message- From: Nagelberg, Kallin [mailto:knagelb...@globeandmail.com] Sent: Tuesday, February 16, 2010 3:22 PM To: 'solr-user@lucene.apache.org' Subject: filter queries not fully filtering Hi everyone, I am attempting to implement a faceted drill-down feature with Solr, and I am having problems explaining some results of the fq parameter. Let's say I have two fields, 'people' and 'category'. I do a search for 'dog' and ask to facet on the people and category fields. I am told that there are 200 documents with people='bob' and 100 with category='news'. I would expect that when I make the query q=dog, fq=category:news, the new faceting results should never show more than 100 entries. However, this is not what I see. Instead I see facet counts exceeding 100. How can that be when I just told it to filter down to the 100 articles that contained category:news? Thanks, Kallin Nagelberg.
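For reference, the difference between the two filters, with people being the field from the original message:

fq=people:"Gary Bettman"    (a single phrase filter on the exact value)
fq=people:Gary Bettman      (parses as people:Gary plus Bettman against the default search field)

Hence the inflated counts: every document mentioning a Gary anywhere in that field, or Bettman in the default field, survived the filter.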
parabolic type function centered on a date
Hi everyone, I'm trying to enhance a 'more like this' search I'm conducting by boosting the documents that have a date close to the original. I would like to do something like a parabolic function centered on the date (it would make tuning a little more effective), though a linear function would probably suffice. Has anyone attempted this? If so I'd love to hear your strategy and results! Thanks, Kallin Nagelberg
ord on TrieDateField always returning max
Hi everyone, I've been trying to add a date based boost to my queries. I have a field like: fieldType name=tdate class=solr.TrieDateField omitNorms=true precisionStep=6 positionIncrementGap=0/ field name=datetime type=tdate indexed=true stored=true required=true / When I look at the datetime field in the solr schema browser I can see that there are 9051 distinct dates. When I try to add the parameter to my query like: bf=ord(datetime) (on a dismax query) I always get 9051 as the result of the function. I see this in the debug data: 1698.6041 = (MATCH) FunctionQuery(top(ord(datetime))), product of: 9051.0 = 9051 1.0 = boost 0.18767032 = queryNorm It is exactly the same for every result, even though each result has a different value for datetime. Does anyone have any suggestions as to why this could be happening? I have done extensive googling with no luck. Thanks, Kallin Nagelberg.
RE: ord on TrieDateField always returning max
Thanks Yonik, I was just looking at that actually. Trying something like recip(ms(NOW,datetime),3.16e-11,1,1)^10 now. My 'inspiration' for the ord method was actually the Solr 1.4 Enterprise Search server book. Page 126 has a section 'using reciprocals and rord with dates'. You should let those guys know what's up! Thanks, Kallin. -Original Message- From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik Seeley Sent: Wednesday, January 06, 2010 11:23 AM To: solr-user@lucene.apache.org Subject: Re: ord on TrieDateField always returning max Besides using up a lot more memory, ord() isn't even going to work for a field with multiple tokens indexed per value (like tdate). I'd recommend using a function on the date value itself. http://wiki.apache.org/solr/FunctionQuery#ms -Yonik http://www.lucidimagination.com On Wed, Jan 6, 2010 at 10:52 AM, Nagelberg, Kallin knagelb...@globeandmail.com wrote: Hi everyone, I've been trying to add a date based boost to my queries. I have a field like: fieldType name=tdate class=solr.TrieDateField omitNorms=true precisionStep=6 positionIncrementGap=0/ field name=datetime type=tdate indexed=true stored=true required=true / When I look at the datetime field in the solr schema browser I can see that there are 9051 distinct dates. When I try to add the parameter to my query like: bf=ord(datetime) (on a dismax query) I always get 9051 as the result of the function. I see this in the debug data: 1698.6041 = (MATCH) FunctionQuery(top(ord(datetime))), product of: 9051.0 = 9051 1.0 = boost 0.18767032 = queryNorm It is exactly the same for every result, even though each result has a different value for datetime. Does anyone have any suggestions as to why this could be happening? I have done extensive googling with no luck. Thanks, Kallin Nagelberg.
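For reference, recip(x,m,a,b) computes a/(m*x+b), so the function above decays smoothly with document age:

recip(ms(NOW,datetime), 3.16e-11, 1, 1) = 1 / (3.16e-11 * age_in_ms + 1)
age = 0        -> 1.0
age = 1 year   -> ~0.5   (3.16e-11 * ~3.16e10 ms/year is roughly 1)
age = 3 years  -> ~0.25

The constant 3.16e-11 is approximately the reciprocal of the number of milliseconds in a year, which is what puts the half-boost point at one year; scale it to move that point, and adjust the ^10 weight to control how much the date term counts against the relevance score.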