How to 'filter' facet results

2010-07-27 Thread David Thompson
Is there a way to tell Solr to return only a specific set of facet values?  I feel like facet.query must be able to do this, but I'm not really understanding how it works.  In my specific case, I'd like to see facet values only for the same values I pass in as query filters, i.e. if I run this query:
fq=keyword:man OR keyword:bear OR keyword:pig
facet=on
facet.field=keyword

then I only want it to return the facet counts for man, bear, and pig.  The resulting docs might have a number of different values for keyword, in addition to those specified in the filter, because keyword is a multiValued field.  How can I tell it to return the facet values only for man, bear, and pig?  On the client side I could programmatically remove the facets I don't care about, except that the resulting docs could return hundreds of different values.  If I were faceting on a single value, I could say facet.prefix=man and that would work, but generally I need this to work for more than one filter value.  Is there a way to set multiple facet.prefix values?  Any ideas?
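
[A sketch of one possible approach, not taken from this thread: drop facet.field and instead pass one facet.query parameter per filter term.  Each facet.query comes back with its own count under facet_queries in the response, so only the requested values are returned:

fq=keyword:man OR keyword:bear OR keyword:pig
facet=on
facet.query=keyword:man
facet.query=keyword:bear
facet.query=keyword:pig
]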

-dKt



  

Re: PDF remote streaming extract with lots of multiValues

2010-07-09 Thread David Thompson
POSTing the individual parameters (literal.id and the repeated literal.mycategory values) as name-value pairs to 1.4's /update/extract does work.  I just realized the POST's content type hadn't been set to 'application/x-www-form-urlencoded'.  Once it was set to that, it accepted all the parameters.
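
[A minimal sketch of such a request, assuming curl and the same placeholder host, literals, and stream.url as in the original post below:

curl http://host/solr/update/extract \
  --data-urlencode 'literal.id=abc' \
  --data-urlencode 'literal.mycategory=blah' \
  --data-urlencode 'literal.mycategory=foo' \
  --data-urlencode 'literal.mycategory=bar' \
  --data-urlencode 'stream.url=http://otherhost/some/file.pdf'

Each additional mycategory value is just one more literal.mycategory pair in the form-encoded body, so the request size is no longer constrained by GET URL length limits.]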

 -dKt





PDF remote streaming extract with lots of multiValues

2010-07-09 Thread David Thompson
How would I go about setting a large number of literal values in a call to index a remote PDF?  I'm currently calling:

http://host/solr/update/extract?literal.id=abc&literal.mycategory=blah&stream.url=http://otherhost/some/file.pdf


And that works great, except now I'm coming across use cases where I need to send in hundreds, up to thousands, of different values for 'mycategory'.  So with mycategory defined as a multiValued string, I can call:

 
http://host/solr/update/extract?literal.id=abc&literal.mycategory=blah&literal.mycategory=foo&literal.mycategory=bar&stream.url=http://otherhost/some/file.pdf


and that works as expected.  But when I try to embed thousands of 
literal.mycategory parameters in the call, eventually my container says 'look, 
I've been forgiving about letting you GET URLs far longer than 1500 characters, 
but this is ridiculous' and barfs on it.  


I've tried POSTing a ... command, but it only pays 
attention to parameters in the URL query string, ignoring everything in the 
document.  I've seen some other threads that seem related, but now I'm just 
confused.  


What's the best way to tackle this?

-dKt



  

Multiple Solr servers and a shared index vs master+slaves

2010-06-30 Thread David Thompson
I'm a newbie looking at setting up an intranet search service using Solr, so I'm having a hard time understanding why I should forgo the high-availability and clustering mechanisms we already have available and use Solr's implementations instead.  I'm hoping some experienced Solr architects could take the time to comment.

Our corporate standard is for any Java web app to be deployed as an EAR file targeted to a 4-server WebLogic 10.3 cluster on virtual Solaris boxes, operating behind a cluster of Apache web servers.  All servers have NFS mounts to high-availability SANs.  So my Solr proof of concept tries to make use of those tools.  I've deployed Solr to the cluster, and all of the instances use the same solr.home on the NFS mount.  This seems to be just fine for searching: query requests are evenly distributed across the cluster, and search performance seems fine with the index living on the NFS mount.

The problems, of course, start when add/update requests come in.  This setup is the equivalent of having 4 standalone Solr servers using the same index.  If I use the "simple" lock file mechanism, in my testing so far it seems to keep them all separate just fine, except that when the first update comes in to serverA, it grabs the write lock, and if any other server receives an update around the same time, it has to wait for the write lock to be removed by serverA after it commits.  I think I can mitigate this pretty well by directing all updates through a single server (via a virtual IP address), but then I need the other servers to realize the index has changed after each commit.  It looks like I can make a call like http://serverB/solr/update/extract?commit=true and that's good enough to get it to open a new reader, but that seems a little clunky.  I've read in another thread about "commit hooks" that can trigger user-defined events, I think, so I'm looking into that now.
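
[A slightly less clunky alternative (a sketch, assuming the stock XML update handler is still mapped at /update) is to send an empty commit to each searching instance; like the commit=true call above, it simply forces that instance to open a new reader against the shared index:

curl http://serverB/solr/update --data-binary '<commit/>' -H 'Content-Type: text/xml'
]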

Now when I look at using Solr's master+slaves architecture, I feel like it duplicates the trusted (and expensive) services we already have at our disposal.  WebLogic+Apache clusters do a good job of distributing load, monitoring health, failing over, restarting, etc.  And if we used slaves that pulled index snapshots, they'd be using (by policy) the same NFS mount to store those snapshots, so we'd be pulling the index over the wire only to write it right next to the original.  If we didn't have these HA clustering mechanisms available already, I'm sure I'd be much more willing to look at a Solr master+slave architecture.  But since we do, it seems like I'd be a little hamstrung trying to use Solr's mechanisms anyway.  So that's my scenario; comments welcome.  :)

 -dKt