commit, concurrency, full text search
Hi, 1) How does commit work with multiple requests? 2) Does Solr handle concurrency during updates? 3) Does Solr support anything like enclosing keywords within quotes so that we search for exactly those keywords together, the way Google does? For example, if I enclose a phrase like "java programming" in quotes, it should be searched as a whole instead of breaking the phrase apart. Any help would be highly appreciated. Thanks in advance. Regards, Dilip
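For what it's worth on the third question, the Lucene query syntax that Solr's default parser uses does support this: wrapping terms in double quotes produces a phrase query. A quick sketch (the field name text is illustrative):

  q=text:"java programming"    matches the two words adjacent and in order
  q=text:(java programming)    matches the words anywhere, per the default operator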
largish test data set?
Hi, I'm in the process of evaluating Solr and Sphinx, and have come to realize that actually having a large data set to run them against would be handy. However, I'm pretty new to both systems, so I thought that perhaps asking around might produce something useful. What *I* mean by largish is something that won't fit into memory - say 5 or 6 gigs, which is probably puny for some and huge for others. BTW, I would also welcome any input from others who have done the above comparison, although what we'll be using it for is specific enough that of course I'll need to do my own testing. Thanks! -- David N. Welton http://www.welton.it/davidw/
solr locked itself out
Hello everyone. I've been reading some posts on this forum and I thought it best to start my own post, as our situation is different from everyone else's - isn't it always :-) We've got a Django-powered website that has Solr as its search engine. We're using the example Solr application and starting Java at boot time with java -jar start.jar in the example directory. We had no problem at all until this morning, when I started getting an error saying that Solr was locked. I checked the /tmp directory and in there was a file called lucene-75248553b96c7f175a8217320c9b8471-write.lock It's not a very busy website at all and doesn't have a lot of data in it. Can someone get me started on how to make sure this doesn't happen again? Some more information: ulimit is unlimited, and cat /proc/sys/fs/file-max gives 11769. In the /tmp directory are 18 directories all called Jetty_8983__solr, 17 of which have numbers at the end of the directory name. Sorry I'm such a newbie at this, but any help will be greatly appreciated.
Re: solr locked itself out
vanderkerkoff wrote: I found another post that suggested editing the unlockOnStartup value in solrconfig.xml. Is that a wise idea? If you only have a single Solr instance at a time, it should be totally fine.
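For reference, that setting lives in the mainIndex section of solrconfig.xml; shown here with the value that clears stale locks at startup (use with care, and only when one Solr instance owns the index):

  <mainIndex>
    ...
    <unlockOnStartup>true</unlockOnStartup>
  </mainIndex>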
Re: Can we build complex filter queries in SOLR
Yeah, that is possible - I just tried it on one of my Solr instances. Let's say you have an index of player names: (first-name:Tim AND last-name:Anderson) OR (first-name:Anwar AND last-name:Johnson) OR (conference:"Mountain West") will give you the results that logically match this query. HTH. Alessandro Ferrucci :) On 9/17/07, Dilip.TS [EMAIL PROTECTED] wrote: Hi, I would like to know if we can build a complex filter queryString in SOLR using the following condition: (Field1 = abc AND Field2 = def) OR (Field3 = abcd AND Field4 = defgh AND (...)) and so on... Thanks in advance. Regards, Dilip TS
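For illustration, such an expression can be sent either as the main query (q) or as a filter query (fq) over HTTP; a sketch with the same hypothetical field names, URL-encoding omitted for readability:

  http://localhost:8983/solr/select?q=*:*&fq=(first-name:Tim AND last-name:Anderson) OR (first-name:Anwar AND last-name:Johnson)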
Re: largish test data set?
You might be interested in the Lucene Java contrib/benchmark task, which provides an indexing implementation for a download of Wikipedia (available at http://people.apache.org/~gsingers/wikipedia/). It is pretty trivial to convert the indexing code to send add commands to Solr. HTH, Grant On Sep 17, 2007, at 6:06 AM, David Welton wrote: Hi, I'm in the process of evaluating Solr and Sphinx, and have come to realize that actually having a large data set to run them against would be handy. However, I'm pretty new to both systems, so I thought that perhaps asking around might produce something useful. What *I* mean by largish is something that won't fit into memory - say 5 or 6 gigs, which is probably puny for some and huge for others. BTW, I would also welcome any input from others who have done the above comparison, although what we'll be using it for is specific enough that of course I'll need to do my own testing. Thanks! -- David N. Welton http://www.welton.it/davidw/
Re: largish test data set?
Hi Yonik. Do you have any performance statistics about those changes? Is it possible to upgrade to this new Lucene version using the Solr 1.2 stable version? Regards, Daniel On 17/9/07 17:37, Yonik Seeley [EMAIL PROTECTED] wrote: If you want to see what performance will be like on the next release, you could try upgrading Solr's internal version of Lucene to trunk (the current dev version)... there have been some fantastic improvements in indexing speed. For query speed/throughput, Solr 1.2 or trunk should do fine. -Yonik On 9/17/07, David Welton [EMAIL PROTECTED] wrote: Hi, I'm in the process of evaluating Solr and Sphinx, and have come to realize that actually having a large data set to run them against would be handy. However, I'm pretty new to both systems, so I thought that perhaps asking around might produce something useful. What *I* mean by largish is something that won't fit into memory - say 5 or 6 gigs, which is probably puny for some and huge for others. BTW, I would also welcome any input from others who have done the above comparison, although what we'll be using it for is specific enough that of course I'll need to do my own testing. Thanks! -- David N. Welton http://www.welton.it/davidw/
Re: 'suggest' query sorting
Hello! Were you able to find out anything? I'd be interested to know what you found out. ++ | Matthew Runo | Zappos Development | [EMAIL PROTECTED] | 702-943-7833 ++ On Sep 15, 2007, at 11:48 PM, Ryan McKinley wrote: Hello- I'm building an interface where I need to display matching options as a user types into a search box. Something like Google Suggest, but it needs to be a little more flexible in its matches. At first glance, I thought I just needed to write a filter that chunks each token into a set of prefixes. Check SOLR-357 -- as Hoss points out, I may just be able to use the EdgeNGramFilterFactory. I have the basics working, but need some help getting the details to behave properly. Consider the strings: Canon PowerShot / iPod Cable / Canon EX PIXMA / Video Card. If I query for 'ca' I expect to get all of these back. This works fine, but what I need help with is ordering. How can I boost words where the whole value (not just the token) is closer to the front of the value? That is, I want 'ca' to return: 1. Canon PowerShot 2. Canon EX PIXMA 3. iPod Cable 4. Video Card (actually 1 and 2 could be swapped) After that works, how do I boost tokens that are closer together? If I search for 'canon p', how can I make sure the results are returned as: 1. Canon PowerShot 2. Canon EX PIXMA thanks ryan
Re: largish test data set?
If you want to see what performance will be like on the next release, you could try upgrading Solr's internal version of Lucene to trunk (the current dev version)... there have been some fantastic improvements in indexing speed. For query speed/throughput, Solr 1.2 or trunk should do fine. -Yonik On 9/17/07, David Welton [EMAIL PROTECTED] wrote: Hi, I'm in the process of evaluating Solr and Sphinx, and have come to realize that actually having a large data set to run them against would be handy. However, I'm pretty new to both systems, so I thought that perhaps asking around might produce something useful. What *I* mean by largish is something that won't fit into memory - say 5 or 6 gigs, which is probably puny for some and huge for others. BTW, I would also welcome any input from others who have done the above comparison, although what we'll be using it for is specific enough that of course I'll need to do my own testing. Thanks! -- David N. Welton http://www.welton.it/davidw/
Re: largish test data set?
On 17 Sep 2007, at 12:06, David Welton wrote: I'm in the process of evaluating Solr and Sphinx, and have come to realize that actually having a large data set to run them against would be handy. However, I'm pretty new to both systems, so I thought that perhaps asking around might produce something useful. What *I* mean by largish is something that won't fit into memory - say 5 or 6 gigs, which is probably puny for some and huge for others. IMDB is about 1.2GB of data: http://www.imdb.com/interfaces#plain You can extract real queries from the TPB data collection; it should contain about 1M queries in the movie category: http://torrents.thepiratebay.org/3783572/db_dump_and_query_log_from_piratebay.org__summer_of_2006.3783572.TPB.torrent -- karl
Re: Re[2]: multiple indices
Jack, the JNDI-enabling jarfiles now ship as part of the main .zip distribution. There is no need for a separate JettyPlus download as of Jetty 6. I used Jetty 6.1.3 (http://dist.codehaus.org/jetty/jetty-6.1.x/jetty-6.1.3.zip) at the time, and I am using only these jarfiles from the main distribution. I stripped out everything else that seemed unnecessary for running Solr. lib/jetty-6.1.3.jar lib/jetty-util-6.1.3.jar lib/jsp-2.1/ant-1.6.5.jar lib/jsp-2.1/core-3.1.1.jar lib/jsp-2.1/jsp-2.1.jar lib/jsp-2.1/jsp-api-2.1.jar lib/naming/jetty-naming-6.1.3.jar lib/plus/jetty-plus-6.1.3.jar lib/servlet-api-2.5-6.1.3.jar --Matt On Sep 13, 2007, at 11:44 AM, Jack L wrote: Thanks Matt, I'll give it a try! So this requires JettyPlus? -- Best regards, Jack Wednesday, September 12, 2007, 5:14:32 AM, you wrote: Jack, I've posted a complete recipe for running two Solr indices within one Jetty 6 container: http://wiki.apache.org/solr/SolrJetty Scroll down to the part that says: (7/2007 MattKangas) The recipe above didn't work for me with Jetty 6.1.3. ... I'm glossing over a lot of details, so attached is a tarball with a known-good configuration that runs two Solr instances inside one Jetty container. I'm using Solr 1.2.0 and Jetty 6.1.3. Hope this helps, --matt On Sep 11, 2007, at 11:52 AM, Jack L wrote: I was going through some old emails on this topic. Rafael Rossini figured out how to run multiple indices on a single instance of Jetty, but it has to be JettyPlus. I guess plain Jetty doesn't allow this? I suppose I can add additional jars and make it work, but I haven't tried that. It'll always be much safer/simpler/less playing around if a feature is available out of the box. I'm mentioning this again because I really think it's a desirable feature, especially because each JVM uses a lot of memory and sometimes it's not possible to start a new Jetty for each index due to memory limitations. I understand I can use a type field and mix doc types, but this is not ideal for two reasons: 1. It's easier to maintain separate indices. I can just wipe out all the files and re-post an individual index - much less posting work to do as opposed to re-posting all docs. Or I can move one index to another partition, or even to another server to run separately in order to scale up. It'll be a problem (although solvable by deleting and re-posting) with a mixed index. 2. My understanding is that a mixed index means larger index files and slower performance. JettyPlus's download links seem to be broken, so I wasn't able to check its download size. If not too big, maybe JettyPlus is an option? If not, there should be a way to have this feature implemented on the Solr side - maybe by prefixing the REST URLs with index names... -- Thanks, Jack -- Matt Kangas / [EMAIL PROTECTED]
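For context, the core of the wiki recipe is one context file per Solr instance that sets the solr/home JNDI entry; a rough sketch from memory of the SolrJetty page (paths are illustrative, and the class name should be double-checked against your Jetty 6 version):

  <!-- contexts/solr1.xml - one such file per Solr instance -->
  <Configure class="org.mortbay.jetty.webapp.WebAppContext">
    <Set name="contextPath">/solr1</Set>
    <Set name="war"><SystemProperty name="jetty.home"/>/webapps/solr.war</Set>
    <New class="org.mortbay.jetty.plus.naming.EnvEntry">
      <Arg>solr/home</Arg>
      <Arg type="java.lang.String">/var/solr/instance1</Arg>
    </New>
  </Configure>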
Re: Indexing Speed
On 16-Sep-07, at 8:01 PM, erolagnab wrote: Hi, Just an FYI. I've seen some posts mention that Solr can index 100-150 docs/s, and the comparison between embedded Solr and HTTP. I've tried to do the indexing with 1.7+ million docs; each doc has 30 fields, among which 10 fields are indexed/stored and the rest are only stored. The result was pretty impressive: it took approx 1.4 hours to finish. Note that the docs were sent synchronously, one after the other. The Solr server and client were both running on a Pentium Dual Core 3.2, 2G RAM, Ubuntu Feisty. The only issue I noticed is that Solr does occupy some amount of memory. In the first run, after indexing around 500 thousand docs, it threw an OutOfMemory exception. In the second trial, I set -Xms and -Xmx for the JVM to run on 1G memory, and Solr performed through to the finish. You can tune memory usage by setting maxBufferedDocs to a lower value. Also, watch out for large individual docs. Some questions: 1) Is it good practice to allow Solr to index docs in real time (millions of docs per day)? What I'm worried about is that Solr may eat up memory as it goes. You can tune max memory usage (see above). 2) If docs are sent asynchronously, how well can Solr index? As long as you don't send 1.7 million docs at once, you should see a performance improvement. -Mike
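For reference, maxBufferedDocs is set in solrconfig.xml; a sketch (the value is illustrative - lower values flush buffered docs to disk more often and use less heap):

  <indexDefaults>
    ...
    <maxBufferedDocs>1000</maxBufferedDocs>
  </indexDefaults>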
RE: Triggering snapshooter through web admin interface
There is no way to trigger snapshot taking through Solr's admin interface now. Taking a snapshot is a very light-weight operation. It uses hard links, so each snapshot doesn't take up much additional disk space. [Wu, Daniel] It is not a concern about snapshot performance; rather, it is the data consistency requirement. I was also suggesting a new feature to allow sending messages to Solr through the http interface and a mechanism for handling the message on the Solr server; in this case, a message to trigger the snapshooter script. It seems to me a very useful feature to help simplify operational issues. If you don't want to replicate your index while the big batch job is still running, you can disable snappuller on the slave while the batch job is running and enable it after the batch job has completed. [Wu, Daniel] Yes, it can be done that way, but it will not be as elegant. The Solr master can be well-known among the applications; however, the slaves could be anywhere. Turning off snappulling requires knowledge of all the Solr slaves as well as the timing of the indexing. It gets ugly when there are multiple environments (e.g. dev, qa, stage, production) and multiple indexes.
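For what it's worth, the replication scripts that ship with Solr include enable/disable helpers, so the manual workaround looks roughly like this (paths are illustrative):

  # on each slave, before the master's batch job starts
  solr/bin/snappuller-disable
  # ... batch indexing and snapshooting happen on the master ...
  solr/bin/snappuller-enable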
Faceting Vs using lucene filters ?
Hi, I have a collection of blogs. Each Solr document holds one blog with 3 fields - blogger (id), title, and blog text. The search is performed over all 3 fields. When doing the search I need to show 2 things: 1. A bloggers block with all the matching bloggers (so if a title, blog, or blogger contains the search term, I show the blogger's id). 2. A blogs block that shows the blog titles for the matching blogs. The first block is my problem, since it shows multiple instances of the same blogger if that blogger has multiple matching blogs. I can use faceting to show the bloggers, but is there a better or more efficient way to do so? I was thinking of creating a Lucene filter to do this; is it feasible? Basically, I need the unique bloggers from the index whose blogs match a given search term. Thanks!
RE: Triggering snapshooter through web admin interface
: I was also suggesting a new feature to allow sending messages to Solr : through http interface and a mechanism to handling the message on the : Solr server; in this case, a message to trigger snapshooter script. It : seems to me, a very useful feature to help simplify operational issues. It's been a while since I looked at the SolrEventListener stuff, but I think that would be pretty easy to develop as a plugin. The existing postCommit/postOptimize/firstSearcher/newSearcher event listener tracking is part of the SolrCore because it needs to know about them when managing the index ... but if you just wanted a way to trigger arbitrary events by name, the utility functions used in SolrCore could be reused by a custom plugin ... then you could reuse things like the RunExecutableListener from your own RequestHandler with the same solrconfig.xml syntax. That would be a pretty cool addition to Solr ... an EventRequestHandler that takes in a single event param and triggers all of the Listeners configured for that event in the solrconfig.xml -Hoss
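For reference, the stock solrconfig.xml already wires RunExecutableListener to an event by name, so a named-event handler could reuse exactly this syntax (the paths shown are the usual example-app defaults):

  <listener event="postCommit" class="solr.RunExecutableListener">
    <str name="exe">snapshooter</str>
    <str name="dir">solr/bin</str>
    <bool name="wait">true</bool>
  </listener>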
Re: Faceting Vs using lucene filters ?
: 1. Bloggers block with all the matching bloggers (so if a title, blog or : blogger contains the search term, I show the blogger's id) : The first block is my problem since it shows multiple instances of the same : blogger if that blogger has multiple matching blogs. I can use faceting to : show the bloggers but is there a better or more efficient way to do so? You've basically described the exact use case for faceting ... I'd be hard pressed to think of a more efficient way of listing every blogger who has at least one blog matching the query. -Hoss
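For illustration, a request along these lines returns the matching docs plus a deduplicated list of bloggers with per-blogger match counts (assuming the blogger id is indexed in a field named blogger):

  http://localhost:8983/solr/select?q=search+term&facet=true&facet.field=blogger&facet.mincount=1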
Re: Combining Proximity Range search
: My document will have a multivalued compound field like : : revision_01012007 : review_02012007 : : i am thinking of a query like comp:type:review date:[02012007 TO : 02282007]~0 Your best bet is to change that so revision and review are the names of fields, and do a range search on them as needed. -Hoss
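A sketch of what that might look like; note that string range queries are lexicographic, so the dates would need to be stored in a sortable form like yyyymmdd (field name hypothetical):

  review:[20070201 TO 20070228]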
Re: 'suggest' query sorting
: How can I boost words where the whole value (not just the token) is closer to : the front of the value? That is, I want 'ca' to return: : 1. Canon PowerShot : 2. Canon EX PIXMA : 3. iPod Cable : 4. Video Card : (actually 1 and 2 could be swapped) I would argue that you don't want #3 and #4 at all if you are doing query suggestion; instead, make the field you query use a KeywordTokenizer with the EdgeNGramFilter so 'ca' only matches #1 and #2. If you really want #3 and #4 to show up, then have two fields: one using the whitespace tokenizer, one using the keyword tokenizer, both using EdgeNGramFilter ... boost the query on the keyword-tokenized field higher than the whitespace-tokenized field (or just rely on the coord factor and the fact that 'ca' will match on both fields for Canon PowerShot but only on the whitespace-tokenized field for iPod Cable). : After that works, how do I boost tokens that are closer together? If I search : for 'canon p', how can I make sure the results are returned as: : 1. Canon PowerShot : 2. Canon EX PIXMA I think the two fields I described above will solve that problem as well. -Hoss
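A sketch of the keyword-tokenized side in schema.xml; the filter parameters are illustrative (check the SOLR-357 patch for the factory's exact attribute names), and the whitespace-tokenized twin would just swap in WhitespaceTokenizerFactory:

  <fieldType name="prefix_full" class="solr.TextField">
    <analyzer type="index">
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>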
Re: EdgeNGramTokenFilter, term position?
: Should the EdgeNGramFilter use the same term position for the ngrams within a : single token? I can see the argument going both ways ... imagine a hypothetical CharSplitterTokenFilter that replaces each token in the stream with one token per character in the original token (i.e. hello becomes h,e,l,l,o) ... should those tokens all have the same position? They have a logical ordered flow to them, so in theory they are sequential ... but they did occupy the same space in the original token stream. When in doubt: make it an option. -Hoss
Re: EdgeNGramTokenFilter, term position?
On 9/16/07, Ryan McKinley [EMAIL PROTECTED] wrote: Should the EdgeNGramFilter use the same term position for the ngrams within a single token? It feels like that is the right approach. I don't see value in having them sequential, and I can think of uses for having them overlap. -Yonik
Re: Control index/store at document level
: nope, the field options are created on startup -- you can't change them : dynamically (i don't know all the details, but I think it is a file format : issue, not just a configuration issue) In the underlying Lucene library most of these options can be controlled per document, but Solr simplifies this away into configuration options instead of run-time input ... it's a feature, not a bug :) : I'm not sure how your app is structured, from what you describe, it sounds : like you need two fields, one that is indexed and not stored and another that : is stored and not indexed. For each revision, put text into the indexed : field. for the primary document, put it in both. I concur. copyField can make this easy ... make the source field stored and the destination field indexed. For the primary doc, add to the source and let the copyField do its magic ... for all other revisions, add directly to the destination field yourself. -Hoss
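A sketch of that arrangement in schema.xml (field names are hypothetical):

  <field name="body_stored" type="text" indexed="false" stored="true"/>
  <field name="body_indexed" type="text" indexed="true" stored="false"/>
  <copyField source="body_stored" dest="body_indexed"/>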
Re: Solr - rudimentary problems
: The corresponding entry for this field in schema.xml is : : <field name="id" type="text" indexed="true" stored="true" multiValued="false" required="true"/> I'm guessing text is from the example schema.xml ... this is not a good type to use for a uniqueKey field ... that alone might be causing some of your problems with replacing docs ... try string. : 2) Also, at the time of deleting a document, by providing its ID (exactly : similar to the deleteById proc in the Embedded Solr example), we find that : the document is not getting deleted (and we also do not get any errors). Sounds like the same problem ... I'm guessing you are using a method that assumes the id has already been transformed into the internal representation ... with text, that might be lowercased, or stemmed, etc. -Hoss
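For reference, the fix amounts to this in schema.xml, using the string type from the stock example schema:

  <field name="id" type="string" indexed="true" stored="true" multiValued="false" required="true"/>
  <uniqueKey>id</uniqueKey>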
Re: solr locked itself out
ulimit is unlimited and cat /proc/sys/fs/file-max 11769 I just went through the same kind of mistake - ulimit doesn't report what you think it does; what you should check is ulimit -n (the -n isn't just the option for setting the value). If you're using bash as your shell, that will almost certainly be 1024, which I've found isn't enough to search and write at the same time. The commit winds up throwing an exception and the lock file(s) get left around, causing further problems. The first thing I'd try is upping the ulimit -n to 2 and seeing if that resolves the issue - it did for me. Regards, Adrian Sutton http://www.symphonious.net
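For what it's worth, checking and raising the limit looks roughly like this (the 4096 is an arbitrary example; persistent changes usually go in /etc/security/limits.conf):

  ulimit -n        # reports the per-process open-file limit (commonly 1024 under bash)
  ulimit -n 4096   # raises it for the current shell and its children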
UserTagDesign
I've been looking at http://wiki.apache.org/solr/UserTagDesign on and off for a while and think all the use cases could be explained with simple UML class diagram semantics: [Taggable](tag:Tag)-- {0..*} |--- {0..*} --(tag:Tag)[Tagger] | [Tagging] Rendered: http://ginandtonique.org/~kalle/tagging.pdf This is of course a design that might not fit everybody; it could be represented using an n-ary association or what not. But I find the text on the wiki much easier to follow with this in my head. How (or even if) one would represent this in an index is a completely different story. Translated to Java, the diagram would look something like this: /** the user */ class Tagger { Map<Tag, Set<Tagging>> taggingsByTag; } /** the content */ class Taggable { Map<Tag, Set<Tagging>> taggingsByTag; } /** content tagging */ class Tagging { Tagger tagger; Taggable tagged; Date created; } class Tag { String text; } Thought it was better to let you people decide whether or not this fits in the wiki. -- karl
Re: 'suggest' query sorting
if you really want #3 and #4 to show up, then have two fields: one using the whitespace tokenizer, one using the keyword tokenizer, both using EdgeNGramFilter ... boost the query on the keyword-tokenized field higher than the whitespace-tokenized field (or just rely on the coord factor and the fact that 'ca' will match on both fields for Canon PowerShot but only on the whitespace-tokenized field for iPod Cable) I'm working with person names that are sometimes reversed... it needs to treat the last name (that may be the first name) with the same weight. Yes, this scheme works great. Thanks. I added the config I'm using to SOLR-357 and closed the issue. Hopefully the next person searching for how to do this will know to look at the EdgeNGramFilter. ryan
RE: Triggering snapshooter through web admin interface
-----Original Message----- From: Chris Hostetter [mailto:[EMAIL PROTECTED]] Sent: Monday, September 17, 2007 1:28 PM To: solr-user@lucene.apache.org Subject: RE: Triggering snapshooter through web admin interface : I was also suggesting a new feature to allow sending messages to Solr : through http interface and a mechanism to handling the message on the : Solr server; in this case, a message to trigger snapshooter script. It : seems to me, a very useful feature to help simplify operational issues. It's been a while since I looked at the SolrEventListener stuff, but I think that would be pretty easy to develop as a plugin. The existing postCommit/postOptimize/firstSearcher/newSearcher event listener tracking is part of the SolrCore because it needs to know about them when managing the index ... but if you just wanted a way to trigger arbitrary events by name, the utility functions used in SolrCore could be reused by a custom plugin ... then you could reuse things like the RunExecutableListener from your own RequestHandler with the same solrconfig.xml syntax. That would be a pretty cool addition to Solr ... an EventRequestHandler that takes in a single event param and triggers all of the Listeners configured for that event in the solrconfig.xml -Hoss [Wu, Daniel] That sounds great. Do I need to create a JIRA ticket?
Re: Solr - rudimentary problems
Perfect! .. yes - that was the problem. Thanks a lot. I am compiling a complete list of FAQs - will update the wiki soon. -vEnKAt On 9/18/07, Chris Hostetter [EMAIL PROTECTED] wrote: : The corresponding entry for this field in schema.xml is : : <field name="id" type="text" indexed="true" stored="true" multiValued="false" required="true"/> I'm guessing text is from the example schema.xml ... this is not a good type to use for a uniqueKey field ... that alone might be causing some of your problems with replacing docs ... try string. : 2) Also, at the time of deleting a document, by providing its ID (exactly : similar to the deleteById proc in the Embedded Solr example), we find that : the document is not getting deleted (and we also do not get any errors). Sounds like the same problem ... I'm guessing you are using a method that assumes the id has already been transformed into the internal representation ... with text, that might be lowercased, or stemmed, etc. -Hoss