Re: Where to find the Log file

2011-06-09 Thread Jack Repenning

On Jun 9, 2011, at 5:45 PM, Ruixiang Zhang wrote:

 Where can I find the log file of Solr? (I use Jetty)

By default, it's in yourapp/solr/logs/solr.log

 Is it turned on by default?

Yes. Oh, yes. Very much so. Uh-huh, you betcha.

-==-
Jack Repenning
Technologist
Codesion Business Unit
CollabNet, Inc.
8000 Marina Boulevard, Suite 600
Brisbane, California 94005
office: +1 650.228.2562
twitter: http://twitter.com/jrep

PGP.sig
Description: This is a digitally signed message part


Re: Strategy -- Frequent updates in our application

2011-06-03 Thread Jack Repenning
On Jun 2, 2011, at 8:29 PM, Naveen Gupta wrote:

 and what about NRT, is it fine to apply in this kind of scenario?

Is NRT really what's wanted here? I'm asking the experts, as I have a situation 
 not too different from the b.p.

It appears to me (from the dox) that NRT makes a difference in the lag between 
a document being added and it being available in searches. But the BP really 
sounds to me like a concern over documents-added-per-second. Does the 
RankingAlgorithm form of NRT improve the docs-added-per-second performance?

My add-to-view limits aren't really threatened by Solr performance today; 
something like 30 seconds is just fine. But I am feeling close enough to the 
documents-per-second boundary that I'm pondering measures like master/slave. If 
NRT only improves add-to-view lag, I'm not overly interested, but if it can 
improve add throughput, I'm all over it ;-)



Using multiple CPUs for a single document base?

2011-05-31 Thread Jack Repenning
Is there a way to allow Solr to use multiple CPUs of a single, multi-core box, 
to increase scale (number of documents, number of searches) of the searchbase?

The CoreAdmin wiki page talks about Multiple Cores as essentially independent 
document bases with independent indexes, but with some unification of 
administration at the grosser levels. That's not quite what I'm looking for, 
though. I want a single URL for add and search access, and a single logical 
searchbase, but I want to be able to use more of the resources of the physical 
box where the searchbase runs.

I guess I thought I would get this for free, it being Java and all, but I don't 
seem to: even with hundreds of clients adding and searching, I only seem to use 
one hardware core, and a bit of a second (which I interpret to mean one Java 
thread for Solr, one Java thread for Java I/O).



Re: Using multiple CPUs for a single document base?

2011-05-31 Thread Jack Repenning
On May 31, 2011, at 11:16 AM, Markus Jelsma wrote:

 Are you using a  1.4 version of Solr?

Yeah, about those version numbers ... The tarball I installed claimed its 
version was

  apache-solr-3.1.0

Which sounds comfortably later than 1.4.

But the examples/solr/schema.xml that comes with it claims version 1.3.

I'm confused.



Re: Using multiple CPUs for a single document base?

2011-05-31 Thread Jack Repenning

On May 31, 2011, at 11:29 AM, Jonathan Rochkind wrote:

 I kind of think you should get multi-CPU use 'for free' as a Java app too.

Ah, probably experimental error? If I apply a stress load consisting only of 
queries, I get automatic multi-core use as expected. I could see where indexing 
new dox could tend toward synchronization and uniprocessing. Perhaps my 
original test load was too add-centric; does that make sense?



Re: Using multiple CPUs for a single document base?

2011-05-31 Thread Jack Repenning
On May 31, 2011, at 12:24 PM, Jonathan Rochkind wrote:

 I do all my 'adds' to a separate Solr index, and then replicate to a slave 
 that actually serves queries.

Yes, that's a step I'm holding in reserve. Probably get there some day, as I 
expect always to have a very high add-to-query ratio. But for the moment, I 
don't think I need it.

 My 'master' that I do my adds to is actually on the very same server -- but I 
 run it in an entirely different java container,

Now THAT was an interesting data point, thanks very much! I hadn't thought of 
running the master on the same box!
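
For anyone following along: the same-box master/slave arrangement Jonathan 
describes is wired up through the ReplicationHandler in each core's 
solrconfig.xml (available since Solr 1.4). A rough sketch — the port number and 
poll interval here are invented for illustration:

```xml
<!-- master solrconfig.xml: the core that receives adds -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <str name="replicateAfter">commit</str>
    <str name="confFiles">schema.xml</str>
  </lst>
</requestHandler>

<!-- slave solrconfig.xml: the core that serves queries,
     running in a separate container on the same box -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <str name="masterUrl">http://localhost:8984/solr/replication</str>
    <str name="pollInterval">00:00:30</str>
  </lst>
</requestHandler>
```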



Re: Using multiple CPUs for a single document base?

2011-05-31 Thread Jack Repenning
On May 31, 2011, at 12:44 PM, Markus Jelsma wrote:

 I haven't given it a try but perhaps opening multiple HTTP connections to the 
 update handler will end up in multiple threads thus better CPU utilization. 

My original test case had hundreds of HTTP connections (all to the same URL) 
doing adds, but seemed to use only one CPU core for adding, or to serialize the 
adds somehow, something like that ... at any rate, I couldn't drive CPU use 
above ~120% with that configuration.

This is quite different from queries. For queries (or a rich query-to-add mix), 
I can easily drive CPU use into multiple-hundreds of % CPU, with just a few 
dozen concurrent query connections (running flat out). But adds resist that 
trick. I don't know whether this means that adds really are using a single 
thread, or if they're using multiple threads but synchronizing on some monitor. 
Actually, I can't say I care much: bottom line seems to be I only use one CPU 
core (plus a negligible marginal bit) for adds.
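
Whether the update handler really funnels everything through a single writer 
lock is my speculation, but the shape of the symptom is easy to reproduce: if 
every add has to take one shared lock, total wall time stays at the serialized 
sum no matter how many client threads (or CPU cores) you throw at it. A toy 
sketch of that pattern:

```python
import threading
import time

def add_docs(lock, n_ops, per_op):
    """Emulate a client doing adds that all contend on one writer lock."""
    for _ in range(n_ops):
        with lock:              # the hypothesized single point of serialization
            time.sleep(per_op)  # stand-in for the actual index write

writer_lock = threading.Lock()
clients = [threading.Thread(target=add_docs, args=(writer_lock, 5, 0.01))
           for _ in range(8)]

start = time.time()
for c in clients:
    c.start()
for c in clients:
    c.join()
elapsed = time.time() - start

# 8 clients x 5 ops x 10 ms, all serialized: wall time is at least 0.4 s,
# exactly as if a single client had done all 40 adds itself.
print("elapsed: %.2fs" % elapsed)
```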

Since I've confirmed that queries spread neatly, I can live with the 
single-thready adds. In production, it seems likely that I'll be more or less 
continuously spending one CPU core on adds, and the rest on queries.



Re: What's your query result cache's stats?

2011-05-31 Thread Jack Repenning

On May 31, 2011, at 2:02 PM, Markus Jelsma wrote:

 the cumulative hit ratio of the query result cache, it's almost never higher 
 than 50%.
 
 What are your stats? How do you deal with it?

warmupTime : 0 
cumulative_lookups : 394867 
cumulative_hits : 394780 
cumulative_hitratio : 0.99 
cumulative_inserts : 87 
cumulative_evictions : 0 

Of course, that's shortly after I ran a query-intensive, not very creative load 
test (thousands of identical queries of a not very changeable data set). As a 
matter of fact, the numbers say I had exactly one miss after each insert, and 
everything else was a cache hit. Which makes perfect sense, for my (really 
dumb) test case.
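
The arithmetic behind that reading, for anyone checking the numbers:

```python
# Sanity-check the cache stats quoted above.
lookups = 394867
hits = 394780
inserts = 87

misses = lookups - hits
hit_ratio = hits / lookups

print(misses)              # 87: exactly one miss per insert
print("%.4f" % hit_ratio)  # 0.9998, which the stats page shows as 0.99
```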

 In some cases i have to disable it because of the high warming penalty i get 
 in a frequently changing index. This penalty is worse than the very little 
 performance gain i get. Different users accidentally using the same query or a 
 single user that's actually browsing the result set only happens very 
 occasionally. And if i wanted the hit ratio to climb i'd have to increase the 
 cache size and warming size to absurd values, only then i might just reach 
 about 60% hit ratio.

If you have humans randomizing the query stream, I'm sure you're right. If 
you're convinced your queries are unrelated and variable, why would you expect 
a query cache to help at all?

On the other hand, I actually plan to use my Solr base to drive a UI, where the 
query parameters never change, and the data underneath changes mostly in bursts 
(generally near the end of the work day), so I suspect I'll only see misses 
after a document add, while lookups tend to cluster early in the day. So I 
actually am hoping for a high hit ratio.



Re: copyField of dates unworking?

2011-05-27 Thread Jack Repenning

On May 27, 2011, at 1:04 AM, Ahmet Arslan wrote:

 The letter f should be capital

Hah! Well-spotted! Thanks.
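
For the archives, the fix is exactly that one capital letter; the lowercase 
copyfield element in the original schema becomes:

```xml
<copyField source="date" dest="text"/>
```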



copyField of dates unworking?

2011-05-26 Thread Jack Repenning
Are there some sort of rules about what sort of fields can be copyFielded into 
other fields?

My schema has (among other things):

  <field name="date" type="tdate"   indexed="true" stored="true" required="true" />
  <field name="user" type="string"  indexed="true" stored="true" required="true" />
  <field name="text" type="textgen" indexed="true" stored="true" required="false"
         multiValued="true" />
  ...
  <copyField source="user" dest="text"/>
  <copyfield source="date" dest="text"/>
 

The "user" field gets copied into "text" just fine, but the "date" field does 
not.

In case they're handy, I've attached:
 - schema.xml - the complete schema
 - solr-usr-question.xml - a sample doc
 - solr-usr-answer.xml - the result in the searchbase






schema.xml
Description: XML document


solr-usr-question.xml
Description: XML document


solr-usr-answer.xml
Description: XML document

Re: copyField of dates unworking?

2011-05-26 Thread Jack Repenning
On May 26, 2011, at 1:55 PM, anass talby wrote:

 it seems like reserved keywords can't be used as field names; did you try 
 to change your date field name?

Interesting thought, but it didn't seem to help.

I changed the schema so it has both a "date" and an "eventDate" field (so as 
not to invalidate my current data), and changed the copyField statement to 
source="eventDate". Then I added an "eventDate" field to the test document 
mentioned earlier, with a one-second difference so I could be sure which was 
which. I added that doc, but the "text" field still doesn't have either date 
field.

Any other thoughts on why I can't copyField a "date" into a "textgen"?

{
  "responseHeader":{
    "status":0,
    "QTime":5,
    "params":{
      "indent":"on",
      "start":"0",
      "q":"text:\"example for list question\"",
      "version":"2.2",
      "rows":"10"}},
  "response":{"numFound":1,"start":0,"docs":[
      {
        "id":"jackrepenningdev-p1-svn-solr-user-question-1",
        "item":"r10",
        "itemNumber":10,
        "user":"jackrepenning",
        "date":"2011-05-26T20:34:19Z",
        "eventDate":"2011-05-26T20:34:20Z",
        "log":"example for list question",
        "organization":"jackrepenningdev",
        "project":"p1",
        "system":"versioncontrol",
        "subsystem":"svn",
        "class":"operation",
        "className":"commit",
        "text":[
          "r10",
          "jackrepenning",
          "M /trunk/cvsdude/solr/conf/schema.xml",
          "example for list question"],
        "paths":["/trunk/cvsdude/solr/conf/schema.xml"],
        "changes":["M /trunk/cvsdude/solr/conf/schema.xml"]}]
  }}



Structured fields and termVectors

2011-05-16 Thread Jack Repenning
How does MoreLikeThis use termVectors?

My documents (full sample at the bottom) frequently include lines more or less 
like this:

   M /trunk/home/.Aquamacs/Preferences.el

I want to MoreLikeThis based on the full path, but not the "M". But what I 
actually display as a search result should include "M" (should look pretty 
much like the sample, below).

If I define a field to include that whole line, I can certainly search in ways 
that skip the "M", but how do I control the termVector and MoreLikeThis? I 
think the answer is not to termVector the line as shown, but rather to index 
these lines twice, once whole (which is also copyFielded into the display 
text), and a second time with just the path (and termVectors="true"). Which is 
OK, but since these lines will represent most of my data, double-indexing seems 
to double my storage, which is ... oh, well ... not entirely optimal.

So is there some way I can index the full line, once, with "M" and path, and 
tell the termVector to include the whole path and nothing but the path?
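
One way to get the second, path-only index without storing the line twice might 
be a copyField into an indexed-but-not-stored field whose analyzer strips the 
leading status letter: copyField copies the raw value, so the two fields can 
analyze it differently. A sketch — the field and type names are my own 
invention, and I haven't tested whether MoreLikeThis is happy with it:

```xml
<fieldType name="pathOnly" class="solr.TextField">
  <analyzer>
    <!-- drop the leading status letter ("M ", "A ", "D ") before tokenizing -->
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="^[A-Z]\s+" replacement=""/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>

<field name="change"     type="textgen"  indexed="true" stored="true"
       multiValued="true"/>
<field name="changePath" type="pathOnly" indexed="true" stored="false"
       multiValued="true" termVectors="true"/>
<copyField source="change" dest="changePath"/>
```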







r3580 | jack | 2011-04-26 13:55:46 -0700 (Tue, 26 Apr 2011) | 1 line
Changed paths:
   M /trunk/home/.Aquamacs
   M /trunk/home/.Aquamacs/Preferences.el
   M /trunk/www/wynton-start-page.html

simplify the hijack of Aquamacs prefs storage, aufl






Re: Support for huge data set?

2011-05-13 Thread Jack Repenning
On May 13, 2011, at 7:59 AM, Shawn Heisey wrote:

 The entire archive is about 80 terabytes, but we only index a subset of the 
 metadata, stored in a MySQL database, which is about 100GB or so in size.
 
 The Solr index (version 1.4.1) consists of six large shards, each about 16GB 
 in size,

This is really useful data, Shawn, thanks! It's particularly interesting 
because the numbers are in the same ball-park as a project I'm considering.

Can you clarify one thing? What's the relationship you're describing between 
MySQL and Solr? I think you're saying that there's an 80TB MySQL database, with 
a 100GB Solr system in front, is that right? Or is the entire 80TB accessible 
through Solr directly?



Testing the limits of non-Java Solr

2011-05-05 Thread Jack Repenning
What's the probability that I can build a non-trivial Solr app without writing 
any Java?

I've been planning to use Solr, Lucene, and existing plug-ins, and sort of 
hoping not to write any Java (the app itself is Ruby / Rails). The dox (such as 
http://wiki.apache.org/solr/FAQ) seem encouraging. [I *can* write Java, but my 
planning's all been "no Java."]

I'm just beginning the design work in earnest, and I suddenly notice that it 
seems every mail thread, blog, or example starts out Java-free, but somehow 
ends up involving Java code. I'm not sure I yet understand all these snippets; 
conceivably some of the Java I see could just as easily be written in another 
language, but it makes me wonder. Is it realistic to plan a sizable Solr 
application without some Java programming?

I know, I know, I know: everything depends on the details. I'd be interested 
even in anecdotes: has anyone ever achieved this before? Also, what are the 
clues I should look for that I need to step into the Java realm? I understand, 
for example, that it's possible to write filters and tokenizers to do stuff not 
available in any standard one; in this case, the clue would be "I can't find 
what I want in the standard list," I guess. Are there other things I should 
look for?
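
One encouraging observation: everything Solr-facing in such an app is just HTTP 
plus XML or JSON, so the client side is plain string-building in whatever 
language you like. A minimal sketch (the update URL and field names are made up 
for illustration) of building the payload a no-Java client would POST to 
/solr/update:

```python
from xml.sax.saxutils import escape

def add_doc(fields):
    """Build the <add><doc>...</doc></add> payload for Solr's update handler."""
    body = "".join(
        '<field name="%s">%s</field>' % (escape(str(name)), escape(str(value)))
        for name, value in fields.items()
    )
    return "<add><doc>%s</doc></add>" % body

# e.g. POST this to http://localhost:8983/solr/update -- no Java involved
payload = add_doc({"id": "doc-1", "text": "example for list question"})
print(payload)
```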
