Re: Where to find the Log file
On Jun 9, 2011, at 5:45 PM, Ruixiang Zhang wrote:

> Where can I find the log file of solr? (I use Jetty)

By default, it's in yourapp/solr/logs/solr.log

> Is it turned on by default?

Yes. Oh, yes. Very much so. Uh-huh, you betcha.

-==-
Jack Repenning
Technologist, Codesion Business Unit
CollabNet, Inc.
8000 Marina Boulevard, Suite 600
Brisbane, California 94005
office: +1 650.228.2562
twitter: http://twitter.com/jrep
Re: Strategy -- Frequent updates in our application
On Jun 2, 2011, at 8:29 PM, Naveen Gupta wrote:

> and what about NRT, is it fine to apply in this case of scenario

Is NRT really what's wanted here? I'm asking the experts, as I have a situation not too different from the original post's. It appears to me (from the dox) that NRT reduces the lag between a document being added and its becoming available in searches. But the original post really sounds to me like a concern over documents-added-per-second. Does the RankingAlgorithm form of NRT improve docs-added-per-second performance?

My add-to-view limits aren't really threatened by Solr performance today; something like 30 seconds is just fine. But I am feeling close enough to the documents-per-second boundary that I'm pondering measures like master/slave. If NRT only improves add-to-view lag, I'm not overly interested, but if it can improve add throughput, I'm all over it ;-)
Using multiple CPUs for a single document base?
Is there a way to allow Solr to use multiple CPUs of a single, multi-core box, to increase the scale (number of documents, number of searches) of the searchbase?

The CoreAdmin wiki page talks about "Multiple Cores" as essentially independent document bases with independent indexes, but with some unification of administration at the grosser levels. That's not quite what I'm looking for, though. I want a single URL for add and search access, and a single logical searchbase, but I want to be able to use more of the resources of the physical box where the searchbase runs.

I guess I thought I would get this for free, it being Java and all, but I don't seem to: even with hundreds of clients adding and searching, I only seem to use one hardware core, and a bit of a second (which I interpret to mean one Java thread for Solr, one Java thread for Java I/O).
Re: Using multiple CPUs for a single document base?
On May 31, 2011, at 11:16 AM, Markus Jelsma wrote:

> Are you using a 1.4 version of Solr?

Yeah, about those version numbers ... The tarball I installed claimed its version was apache-solr-3.1.0, which sounds comfortably later than 1.4. But the examples/solr/schema.xml that comes with it claims version 1.3. I'm confused.
Re: Using multiple CPUs for a single document base?
On May 31, 2011, at 11:29 AM, Jonathan Rochkind wrote:

> I kind of think you should get multi-CPU use 'for free' as a Java app too.

Ah, probably experimental error? If I apply a stress load consisting only of queries, I get automatic multi-core use as expected. I could see where indexing new dox could tend toward synchronization and uniprocessing. Perhaps my original test load was too add-centric; does that make sense?
Re: Using multiple CPUs for a single document base?
On May 31, 2011, at 12:24 PM, Jonathan Rochkind wrote:

> I do all my 'adds' to a seperate Solr index, and then replicate to a slave that actually serves queries.

Yes, that's a step I'm holding in reserve. I'll probably get there some day, as I expect always to have a very high add-to-query ratio. But for the moment, I don't think I need it.

> My 'master' that I do my adds to is actually on the very same server -- but I run it in an entirely different java container,

Now THAT was an interesting data point, thanks very much! I hadn't thought of running the master on the same box!
Re: Using multiple CPUs for a single document base?
On May 31, 2011, at 12:44 PM, Markus Jelsma wrote:

> I haven't given it a try but perhaps opening multiple HTTP connections to the update handler will end up in multiple threads thus better CPU utilization.

My original test case had hundreds of HTTP connections (all to the same URL) doing adds, but seemed to use only one CPU core for adding, or to serialize the adds somehow; at any rate, I couldn't drive CPU use above ~120% with that configuration.

This is quite different from queries. For queries (or a rich query-to-add mix), I can easily drive CPU use into multiple hundreds of percent with just a few dozen concurrent query connections (running flat out). But adds resist that trick. I don't know whether this means that adds really are using a single thread, or whether they're using multiple threads but synchronizing on some monitor. Actually, I can't say I care much: the bottom line seems to be that I only use one CPU core (plus a negligible marginal bit) for adds.

Since I've confirmed that queries spread neatly, I can live with the single-threaded adds. In production, it seems likely that I'll be more or less continuously spending one CPU core on adds, and the rest on queries.
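For reference, the kind of multi-connection add client described above can be sketched as follows. This is a hypothetical sketch, not the actual test harness: the update URL, batch shape, and field names are assumptions, and note that if the server serializes index writes internally, adding client threads like this won't raise CPU use past one core no matter how many connections are open.

```python
import concurrent.futures
import urllib.request
from xml.sax.saxutils import escape

# Hypothetical endpoint; adjust for your container and core layout.
SOLR_UPDATE_URL = "http://localhost:8983/solr/update"

def build_add_xml(docs):
    """Render a list of {field: value} dicts as one Solr <add> message."""
    parts = ["<add>"]
    for doc in docs:
        parts.append("<doc>")
        for name, value in doc.items():
            parts.append('<field name="%s">%s</field>' % (escape(name), escape(str(value))))
        parts.append("</doc>")
    parts.append("</add>")
    return "".join(parts)

def post_batch(docs):
    """POST one batch of documents to the update handler; returns HTTP status."""
    req = urllib.request.Request(
        SOLR_UPDATE_URL,
        data=build_add_xml(docs).encode("utf-8"),
        headers={"Content-Type": "text/xml"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status

def parallel_add(batches, workers=8):
    # Each worker holds its own HTTP connection. Whether this spreads across
    # CPU cores depends on whether the server serializes index writes.
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(post_batch, batches))
```

The batching matters independently of threading: one `<add>` with many `<doc>`s amortizes HTTP and parsing overhead, so it's worth trying larger batches before adding more connections.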
Re: What's your query result cache's stats?
On May 31, 2011, at 2:02 PM, Markus Jelsma wrote:

> the cumulative hit ratio of the query result cache, it's almost never higher than 50%. What are your stats? How do you deal with it?

warmupTime: 0
cumulative_lookups: 394867
cumulative_hits: 394780
cumulative_hitratio: 0.99
cumulative_inserts: 87
cumulative_evictions: 0

Of course, that's shortly after I ran a query-intensive, not very creative load test (thousands of identical queries against a not very changeable data set). As a matter of fact, the numbers say I had exactly one miss after each insert, and everything else was a cache hit. Which makes perfect sense, for my (really dumb) test case.

> In some cases i have to disable it because of the high warming penalty i get in a frequently changing index. This penalty is worse than the very little performance gain i get. Different users accidentally using the same query or a single user that's actually browsing the result set only happens very occasionally. And if i wanted the hit ratio to climb i'd have to increase the cache size and warming size to absurd values, only then i might just reach about 60% hit ratio.

If you have humans randomizing the query stream, I'm sure you're right. If you're convinced your queries are unrelated and variable, why would you expect a query cache to help at all?

On the other hand, I actually plan to use my Solr base to drive a UI, where the query parameters never change, and the data underneath changes mostly in bursts (generally near the end of the work day), so I suspect I'll only see misses after a document add, while lookups tend to cluster early in the day. So I actually am hoping for a high hit ratio.
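The "one miss per insert" reading can be checked directly from the stats above: misses are lookups minus hits, and that difference exactly equals the insert count, while the true hit ratio is closer to 0.9998 than the rounded 0.99 shown.

```python
# Stats pasted from the cache page above.
lookups = 394867
hits = 394780
inserts = 87

misses = lookups - hits        # 87: each new query misses once, is inserted,
hit_ratio = hits / lookups     # and every repeat afterward is a hit.

assert misses == inserts
print(f"hit ratio = {hit_ratio:.4f}")  # hit ratio = 0.9998
```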
Re: copyField of dates unworking?
On May 27, 2011, at 1:04 AM, Ahmet Arslan wrote:

> The letter f should be capital

Hah! Well-spotted! Thanks.
copyField of dates unworking?
Are there some sort of rules about what sort of fields can be copyFielded into other fields? My schema has (among other things):

  <field name="date" type="tdate" indexed="true" stored="true" required="true" />
  <field name="user" type="string" indexed="true" stored="true" required="true" />
  <field name="text" type="textgen" indexed="true" stored="true" required="false" multiValued="true" />
  ...
  <copyField source="user" dest="text"/>
  <copyfield source="date" dest="text"/>

The user field gets copied into text just fine, but the date field does not. In case they're handy, I've attached:

- schema.xml - the complete schema
- solr-usr-question.xml - a sample doc
- solr-usr-answer.xml - the result in the searchbase
Re: copyField of dates unworking?
On May 26, 2011, at 1:55 PM, anass talby wrote:

> it seems like reserved key words can't be used as field names did you try to changes your date field name?

Interesting thought, but it didn't seem to help. I changed the schema so it has both a date and an eventDate field (so as not to invalidate my current data), and changed the copyField statement to source="eventDate". Then I added an eventDate field to the test document mentioned earlier, with a one-second difference so I could be sure which was which. I added that doc, but the text field still doesn't have either date field. Any other thoughts why I can't copyField a date into a textgen?

{
  "responseHeader": {
    "status": 0,
    "QTime": 5,
    "params": {
      "indent": "on",
      "start": "0",
      "q": "text:\"example for list question\"",
      "version": "2.2",
      "rows": "10"}},
  "response": {"numFound": 1, "start": 0, "docs": [
    {
      "id": "jackrepenningdev-p1-svn-solr-user-question-1",
      "item": "r10",
      "itemNumber": 10,
      "user": "jackrepenning",
      "date": "2011-05-26T20:34:19Z",
      "eventDate": "2011-05-26T20:34:20Z",
      "log": "example for list question",
      "organization": "jackrepenningdev",
      "project": "p1",
      "system": "versioncontrol",
      "subsystem": "svn",
      "class": "operation",
      "className": "commit",
      "text": [
        "r10",
        "jackrepenning",
        "M /trunk/cvsdude/solr/conf/schema.xml",
        "example for list question"],
      "paths": ["/trunk/cvsdude/solr/conf/schema.xml"],
      "changes": ["M /trunk/cvsdude/solr/conf/schema.xml"]}]
  }}
Structured fields and termVectors
How does MoreLikeThis use termVectors? My documents (full sample at the bottom) frequently include lines more or less like this:

  M /trunk/home/.Aquamacs/Preferences.el

I want to MoreLikeThis based on the full path, but not the "M". But what I actually display as a search result should include "M" (should look pretty much like the sample, below). If I define a field to include that whole line, I can certainly search in ways that skip the "M", but how do I control the termVector and MoreLikeThis?

I think the answer is not to termVector the line as shown, but rather to index these lines twice: once whole (which is also copyFielded into the display text), and a second time with just the path (and termVectors="true"). Which is OK, but since these lines will represent most of my data, double-indexing seems to double my storage, which is ... oh, well ... not entirely optimal.

So is there some way I can index the full line, once, with "M" and path, and tell the termVector to include the whole path and nothing but the path?

Sample:

  r3580 | jack | 2011-04-26 13:55:46 -0700 (Tue, 26 Apr 2011) | 1 line
  Changed paths:
     M /trunk/home/.Aquamacs
     M /trunk/home/.Aquamacs/Preferences.el
     M /trunk/www/wynton-start-page.html

  simplify the hijack of Aquamacs prefs storage, aufl
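For the record, the double-index approach described above can be sketched in schema.xml roughly as follows. All the field and type names here are hypothetical, and this assumes a pattern char filter is an acceptable way to drop the status letter; note that marking the second field stored="false" avoids doubling the *stored* data, though the index itself still grows.

```xml
<!-- Hypothetical names: "changes" holds the raw "M /path" lines for display;
     "changePaths" holds only the paths, with term vectors for MoreLikeThis. -->
<field name="changes"     type="textgen"  indexed="true" stored="true"  multiValued="true"/>
<field name="changePaths" type="pathOnly" indexed="true" stored="false" multiValued="true"
       termVectors="true"/>
<copyField source="changes" dest="changePaths"/>

<!-- "pathOnly" strips a leading status letter ("M ", "A ", "D ", ...) before
     tokenizing, so only the path reaches the index and the term vectors. -->
<fieldType name="pathOnly" class="solr.TextField">
  <analyzer>
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="^[A-Z]\s+" replacement=""/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>
```

With something like this in place, MoreLikeThis would be pointed at changePaths (e.g. mlt.fl=changePaths) while display uses changes; whether that fully answers the single-index question, it at least trims the storage cost of indexing twice.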
Re: Support for huge data set?
On May 13, 2011, at 7:59 AM, Shawn Heisey wrote:

> The entire archive is about 80 terabytes, but we only index a subset of the metadata, stored in a MySQL database, which is about 100GB or so in size. The Solr index (version 1.4.1) consists of six large shards, each about 16GB in size,

This is really useful data, Shawn, thanks! It's particularly interesting because the numbers are in the same ballpark as a project I'm considering.

Can you clarify one thing? What's the relationship you're describing between MySQL and Solr? I think you're saying that there's an 80TB MySQL database, with a 100GB Solr system in front; is that right? Or is the entire 80TB accessible through Solr directly?
Testing the limits of non-Java Solr
What's the probability that I can build a non-trivial Solr app without writing any Java? I've been planning to use Solr, Lucene, and existing plug-ins, and sort of hoping not to write any Java (the app itself is Ruby / Rails). The dox (such as http://wiki.apache.org/solr/FAQ) seem encouraging. [I *can* write Java, but my planning's all been no-Java.]

I'm just beginning the design work in earnest, and I suddenly notice that it seems every mail thread, blog, or example starts out Java-free, but somehow ends up involving Java code. I'm not sure I yet understand all these snippets; conceivably some of the Java I see could just as easily be written in another language, but it makes me wonder. Is it realistic to plan a sizable Solr application without some Java programming?

I know, I know, I know: everything depends on the details. I'd be interested even in anecdotes: has anyone ever achieved this before? Also, what are the clues I should look for that I need to step into the Java realm? I understand, for example, that it's possible to write filters and tokenizers to do stuff not available in any standard one; in this case, the clue would be "I can't find what I want in the standard list," I guess. Are there other things I should look for?