Hi,
I'm new to Solr (and Lucene) and I'm trying to work out just how I
could fit this technology into my app (I'm moving over from using
MySQL fulltext indexes). Things are actually going really well - the
facet functionality fits in just perfectly, and the basic full-text
searching is working very well for me as well, especially considering
that I'm trying to index several languages at once. It's really much,
much faster than MySQL. Somehow, I thought that would be the hard
part! Unfortunately, I'm getting tripped up on something that seems
far more complicated...
So, there are two kinds of searches you can do in this application.
There's an "Advanced Search" and a basic "Text Search". For the
Advanced Search, users pick out one or more sets of documents which
they are allowed to see, and some set of tags to filter by, and they
get a list of documents. This part is easy, I can do all of this with
the functionality I picked up reading the docs and tutorials, and
since my application handles which sets of documents my users
can choose, Solr doesn't need to know anything about the permissions
model.
The text search is where I'm running into trouble. Right now, the
application automatically filters the documents to search through with
a join in MySQL. In order to do this through Solr, I need to figure
out a good way to tell Solr which sets of documents it should
search.
Here's what I have so far:
1) Each document has a field folder_id, which contains one value,
which is the ID of the folder to which the document belongs. There
are right now about 6000 different folders altogether.
2) Each user is permitted to see documents from a particular subset
of folders. Some users can see only 100-200 folders, some users can
see 4000-5000 folders (it all depends on what they have subscribed to).
In the advanced search, in order to restrict the available documents,
I use a filter query: fq=folder_id:1 OR folder_id:2 etc... In the
advanced search, the user is only ever searching through a max of 80
or 90 folders (and usually more like 1 or 2), so this seems quite
workable.
However, in the plain text search, the user automatically searches
through *all* of the folders to which they have subscribed. This
means, for (good!) users who have subscribed to a large (1000+) number
of folders, the filter query would be quite long, and would exceed the
default number of boolean clauses allowed. Of course, I could just
increase the limit, but the fact that a limit is there in the first
place leads me to believe this is probably not the most scalable
solution.
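To make the problem concrete, here is a minimal Java sketch of how a filter query of this shape could be assembled, and how quickly it passes the clause limit. The class and method names (FolderFilter, build, exceedsDefaultLimit) are made up for illustration. The 1024 figure is Lucene's default maximum clause count for a boolean query; a particular Solr install may be configured differently.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class FolderFilter {
    // Lucene's default cap on clauses in one boolean query (assumed not overridden).
    static final int DEFAULT_MAX_BOOLEAN_CLAUSES = 1024;

    // Builds "folder_id:1 OR folder_id:2 ..." from the folders a user may see.
    static String build(List<Integer> folderIds) {
        return folderIds.stream()
                .map(id -> "folder_id:" + id)
                .collect(Collectors.joining(" OR "));
    }

    // Each folder contributes one clause, so the clause count is the list size.
    static boolean exceedsDefaultLimit(List<Integer> folderIds) {
        return folderIds.size() > DEFAULT_MAX_BOOLEAN_CLAUSES;
    }

    public static void main(String[] args) {
        List<Integer> folders = Arrays.asList(1, 2, 3);
        System.out.println("fq=" + build(folders));
        System.out.println("over limit: " + exceedsDefaultLimit(folders));
    }
}
```

A user subscribed to 1500 folders would produce 1500 clauses, which is why the long filter query fails unless the limit is raised.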
Now, I'm reading on this tutorial page for Lucene: http://www.lucenetutorial.com/techniques/permission-filtering.html
that the best way to do this would involve some combination of
HitCollector & FieldCache. From what the author is saying, this
sounds like exactly what I need. Unfortunately, I am almost
completely Java-illiterate, and on top of that, I'm not really
finding any explanation of:
a) What exactly I would do with the HitCollector & FieldCache objects
that would help me achieve this goal - even just at the level of
Lucene, there's no real explanation in the tutorial
or
b) Where exactly these classes fit in to Solr (if they do at all)
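For reference, this is the general idea the tutorial seems to describe, as far as it can be sketched in plain Java without any Lucene or Solr classes. It is only an illustration under assumptions. The int array stands in for a per-document folder_id lookup of the kind a field cache would provide, and the collect(doc, score) method stands in for a hit-collector callback invoked once per matching document. PermissionCollector and its members are hypothetical names, not part of either library.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.BitSet;
import java.util.List;

class PermissionCollector {
    // Stand-in for a field-cache style array: index = document number, value = folder_id.
    private final int[] folderIdByDoc;
    // One bit per folder_id; a set bit means the user may see that folder.
    private final BitSet allowedFolders;
    private final List<Integer> acceptedDocs = new ArrayList<>();

    PermissionCollector(int[] folderIdByDoc, BitSet allowedFolders) {
        this.folderIdByDoc = folderIdByDoc;
        this.allowedFolders = allowedFolders;
    }

    // Called once per hit; keeps the document only if its folder is allowed.
    void collect(int doc, float score) {
        if (allowedFolders.get(folderIdByDoc[doc])) {
            acceptedDocs.add(doc);
        }
    }

    List<Integer> acceptedDocs() {
        return acceptedDocs;
    }

    public static void main(String[] args) {
        int[] folderIdByDoc = {10, 20, 10, 30, 20}; // five documents in three folders
        BitSet allowed = new BitSet();
        allowed.set(10);
        allowed.set(30);
        PermissionCollector collector = new PermissionCollector(folderIdByDoc, allowed);
        for (int doc : Arrays.asList(0, 1, 2, 3)) { // pretend these documents matched the text query
            collector.collect(doc, 1.0f);
        }
        System.out.println(collector.acceptedDocs());
    }
}
```

The permission check here is one array read and one bit test per hit, so its cost grows with the number of hits rather than with the number of folders the user has subscribed to.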
So far I have already written my own (tiny, tiny) Tokenizer and
TokenizerFactory for correctly parsing the tags that come in from the
database, and that works great, so I'm thinking, if there's something
I can sub-class or modify somewhere to get this working, even with my
meager Java knowledge I could do it... But I have no clue even where
to start with this. Do I need my own custom version of
SolrIndexSearcher, or SolrRequestHandler... or some other class I
haven't even gotten to yet?
If it helps, I am using version 1.2, and trying to integrate this with
a LAMP-based application. I already have hooks set up to allow PHP to
index documents, query Solr, and parse responses. Since everything
else is already working so well, and it's just a matter of getting
permissions working, I would really, really like to stick with Solr.
Has anyone done anything like this or can point me in the right
direction? I can figure out the mechanics of getting the list of
allowed folder_ids to Solr; all I really need to know is what kind of
modifications I would need to make, and where, to get Solr to limit the
search to a particular subset of documents without using a gigantic
filter query.
Many thanks for any advice. My apologies if this has been asked a
million times before. I am new to the list, but I did read and
search through the archives and didn't really find anything on this
subject.
Best regards,
Steve