Powered by Solr

2009-05-14 Thread Terence Gannon
I was intending to make an entry to the 'Powered by Solr' page, so I
created a Wiki account and logged in.  When I go to that page, it
shows it as being 'immutable', which I take as meaning I can't edit
it.  Is there someone I can send the information to who can do the
edit?  Or perhaps there is some sort of trick to editing that page?
Thanks for your help, and apologies in advance if this is a silly
question...

Terence


Re: Powered by Solr

2009-05-14 Thread Terence Gannon
 Did you try hitting refresh on your browser after you logged in?

Wow, I really should have known that...thank you for your patient reply, Yonik.

Regards...Terence


RE: Selective Searches Based on User Identity

2009-05-13 Thread Terence Gannon
Yes, the ownerUid will likely be assigned once and never changed.  But
you still need it, in order to keep track of who has contributed which
document.

I've been going over some of the simpler query scenarios, and Solr is
capable of handling them without having to resort to an external
RDBMS.  In order to limit documents to those which a given user owns,
or those to which he has been granted access, the syntax fragment
would be something like:

ownerUid:ab2734 OR grantedUid:ab2734

where ab2734 is the uid for the user doing the query.  However, I'm
less comfortable with more complex query scenarios, particularly if
the concept of groups is eventually introduced, which is likely in my
scenario.  In the latter case, it may be necessary to use an external
RDBMS.
I'll plead ignorance of the 'ineluctable filter query' and will have
to read up on that one.
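
For what it's worth, here is a rough sketch of how the application
might attach that fragment to every search as a filter query (Solr's
fq parameter) rather than mixing it into the user's own query text.
The function names are made up for illustration, and the field names
are just the ones proposed in this thread; treat it as a sketch under
those assumptions, not a tested integration.

```python
from urllib.parse import urlencode


def access_filter(uid):
    """Build the access fragment for one user (illustrative helper)."""
    # Quote the uid so characters special to the query parser are harmless.
    safe = '"' + uid.replace("\\", "\\\\").replace('"', '\\"') + '"'
    return f"ownerUid:{safe} OR grantedUid:{safe}"


def search_params(user_query, uid):
    """Return URL-encoded request parameters for a restricted search."""
    # The user's text goes in q; the access restriction goes in fq,
    # so the application, not the end user, controls it.
    return urlencode({"q": user_query, "fq": access_filter(uid)})


print(search_params("tour sheet", "ab2734"))
```

Keeping the restriction in fq means the user's query text never has to
be rewritten, and the application alone decides which uid is used.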

With respect to updates to rights, they are not likely to be that
frequent, but when they are, the entire document will have to be
reindexed rather than simply updating the grantedUid and/or deniedUid
fields.  I don't believe Solr supports the updating of individual
fields, at least not yet.  This may be another reason to eventually go
to an external RDBMS.

Thanks very much for your help!

Terence

-----Original Message-----
From: Michael Ludwig
Sent: May 13, 2009 05:27
To: solr-user@lucene.apache.org
Subject: Re: Selective Searches Based on User Identity

Terence Gannon wrote:
 Paul -- thanks for the reply, I appreciate it.  That's a very
 practical approach, and is worth taking a closer look at.  Actually,
 taking your idea one step further, perhaps three fields; 1) ownerUid
 (uid of the document's owner) 2) grantedUid (uid of users who have
 been granted access), and 3) deniedUid (uid of users specifically
 denied access to the document).

Grants might change quite a bit; the owner will likely remain the same.

Wouldn't it be better to include only the owner in the document and
store grants someplace else, like in an RDBMS or - if you don't want
one - a lightweight embedded database like BDB?

That way you could have your application tag an ineluctable filter query
onto each and every user query, which would ensure that the results
include only those documents whose owner has granted the user access.

Considering that I'm a Solr/Lucene newbie, this approach might have a
disadvantage that escapes me, which is why other people haven't made
this particular suggestion. If so, I'd be happy to learn why this isn't
preferable.

Michael Ludwig


RE: Selective Searches Based on User Identity

2009-05-12 Thread Terence Gannon
Paul -- thanks for the reply, I appreciate it.  That's a very practical
approach, and is worth taking a closer look at.  Actually, taking your idea
one step further, perhaps three fields; 1) ownerUid (uid of the document's
owner) 2) grantedUid (uid of users who have been granted access), and 3)
deniedUid (uid of users specifically denied access to the document).  These
fields, coupled with some business rules around how they are populated,
should cover off all possibilities, I think.
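
To make the idea concrete, here is a minimal sketch of what such
business rules might look like on the application side, before a
document is sent for indexing.  The rules themselves (the owner can
never be denied, a denial overrides a grant, a grant clears an earlier
denial) are assumptions for illustration only, as are the function
names.

```python
def new_document(doc_id, owner_uid, text):
    # The owner is assigned once, when the document is contributed.
    return {"id": doc_id, "ownerUid": owner_uid,
            "grantedUid": [], "deniedUid": [], "text": text}


def grant(doc, uid):
    # Assumed rule: granting access clears any earlier denial.
    if uid == doc["ownerUid"]:
        return  # the owner already has access
    if uid in doc["deniedUid"]:
        doc["deniedUid"].remove(uid)
    if uid not in doc["grantedUid"]:
        doc["grantedUid"].append(uid)


def deny(doc, uid):
    # Assumed rule: the owner can never be denied; a denial removes a grant.
    if uid == doc["ownerUid"]:
        raise ValueError("the owner cannot be denied access")
    if uid in doc["grantedUid"]:
        doc["grantedUid"].remove(uid)
    if uid not in doc["deniedUid"]:
        doc["deniedUid"].append(uid)


def can_see(doc, uid):
    # Assumed rule: an explicit denial wins over everything else.
    if uid in doc["deniedUid"]:
        return False
    return uid == doc["ownerUid"] or uid in doc["grantedUid"]
```

With rules like these, a uid never appears in both grantedUid and
deniedUid at the same time, which keeps the query side simple.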

Access to the Solr instance would have to be tightly controlled, but that's
something that should be done anyway.  You sure wouldn't want end users
preparing their own XML and throwing it at Solr -- it would be pretty easy
to figure out how to get around the access/denied fields and get at stuff
the owner didn't intend.

This approach mimics to some degree what is being done in the operating
system, but it's still elegant and provides the level of control required.
 Anybody else have any thoughts in this regard?  Has anybody implemented
anything similar, and if so, how did it work?  Thanks, and best regards...

Terence


RE: Selective Searches Based on User Identity

2009-05-12 Thread Terence Gannon
Thanks for the tip.  I went to their website (www.fastsearch.com), and got
as far as the second line, top left 'A Microsoft Subsidiary'...at which
point, hopes of it being another open source solution quickly faded. ;-)
Seriously, though, it looks like an interesting product, but open source is
a mandatory requirement for my particular application.  But the fact they
implemented this functionality would seem to support that it's a valid
requirement, and I'll keep plugging away on it.  Thank you very much for
bringing FAST to my attention...I appreciate it!  Best regards...

Terence



-----Original Message-----
From: Matt Weber [mailto:m...@mattweber.org]
Sent: May 12, 2009 14:06
To: solr-user@lucene.apache.org
Subject: Re: Selective Searches Based on User Identity



I also work with the FAST Enterprise Search engine and this is exactly
how their Security Access Module works.  They actually use a modified
base-32 encoded value for indexing, but that is because they don't
have the luxury of untokenized/un-processed String fields like Solr.

Thanks,

Matt Weber
eSr Technologies
http://www.esr-technologies.com


RE: Selective Searches Based on User Identity

2009-05-12 Thread Terence Gannon
In reply to both Matt and Jay's comments, the particular situation I'm
dealing with is one where rights will change relatively little once
they are established.  Typically a document will be loaded and
indexed, and a decision will be made on sharing that more-or-less
immediately.  It might change a couple of times after that, but that
will be it.  So early-binding seems like the better option.  Thanks to
both of you for your suggestions and help.

Terence

PS. I wish I had known about that conference...looks like it would
have been very helpful to me right now!

-----Original Message-----
From: Matt Weber [mailto:m...@mattweber.org]
Sent: May 12, 2009 14:41
To: solr-user@lucene.apache.org
Subject: Re: Selective Searches Based on User Identity



Here is a good presentation on search security from the Infonortics
Search Conference that was held a few weeks ago.

http://www.infonortics.com/searchengines/sh09/slides/kehoe.pdf

The approach you are using is called early-binding.  As Jay mentioned,
one of the downsides is updating the documents each time you have an
ACL change.  You could use the late-binding approach that checks each
result after the query but before you display to the user.  I don't
recommend this approach: it will strain your security infrastructure,
since you will need to check whether the user can access each result.
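
As a rough illustration of late-binding, and of where the cost comes
from, here is a minimal sketch.  The is_allowed callback stands in for
whatever security system would answer the access question; it, the ACL
dict, and the result format are all hypothetical.

```python
def late_binding_filter(results, uid, is_allowed):
    """Drop results the user may not see, checking each one after the query."""
    visible = []
    for doc in results:
        # One call to the security system per result: this per-result
        # check is what strains the security infrastructure.
        if is_allowed(uid, doc["id"]):
            visible.append(doc)
    return visible


# Hypothetical stand-in for the security infrastructure:
# document id -> set of uids allowed to see it.
ACL = {"d1": {"A", "B"}, "d2": {"A"}}


def acl_lookup(uid, doc_id):
    return uid in ACL.get(doc_id, set())


hits = [{"id": "d1"}, {"id": "d2"}, {"id": "d3"}]
print(late_binding_filter(hits, "B", acl_lookup))
```

With early-binding the same restriction is expressed in the query, so
no per-result check is needed, at the price of reindexing on ACL
changes.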



Good luck.

Thanks,

Matt Weber
eSr Technologies
http://www.esr-technologies.com


Selective Searches Based on User Identity

2009-05-11 Thread Terence Gannon
Can anybody point me in the direction of resources and/or projects regarding
the following scenario: I have a community of users contributing content to
a Solr index.  By default, the user (A) who contributes a document owns it,
and can see the document in their search results.  The owner can then grant
selective access to that document to other users.  If another user (B) is
granted access by A, then the document shows up in B's search results, along
with whatever B has contributed and any other documents to which B has been
granted access.  Conversely, if B is not granted access to the document, it
does not show up in their search results.

I'm comfortable building this logic myself, so long as I'm not repeating the
work of others in this area.  Thanks, in advance, for any advice or
information.

Terence


Improving Readability of Hit Highlighting

2009-01-12 Thread Terence Gannon
I'm indexing text from an OCR of an old document.  Many words get read
perfectly, but they're typically embedded in a lot of junk.  I would
like the hit highlighting to show only the 'good' words, in the order
in which they appeared in the original document.  Is it possible to
use output of the filter classes as the text used in hit highlighting?
Or do you have to do all the text cleanup outside of Solr and present it
with two fields to index, one with the original text, and one with the
cleaned-up text?  The objective of the hit highlighting is to give the
user a *sense* of the original context, even if it's not provided
verbatim from the original document.  Thanks in advance.

TerryG


Re: Improving Readability of Hit Highlighting

2009-01-12 Thread Terence Gannon
To answer your questions specifically, here is an example of the raw OCR output:

CONTRACTORINMPRIMENTAYIVE : mom Ale ACCEPT INFORMATIONON TOUR SHEET TO ea

for which I would like to see:

mom ale access tour sheet to

in the hit highlight.  My schema for this field is pretty much
standard, as follows:

<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ... />
<filter class="solr.WordDelimiterFilterFactory" ... />
<filter class="solr.LowerCaseFilterFactory" ... />
<filter class="solr.EnglishPorterFilterFactory" ... />
<filter class="solr.RemoveDuplicatesTokenFilterFactory" ... />

When I examine the effect of each of these with the Analyzer, it seems
like if I could use the output after LowerCaseFilterFactory in the hit
highlight, I would come close to achieving what I want.

I'm not averse to doing the text cleanup external to Solr before the
indexing, but only if it's *not* redundant to what the filter
factories are going to do anyway.  Thanks for your help!
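
In case it helps the discussion, here is a minimal sketch of what the
external cleanup could look like if it comes to that.  It assumes a
known-word list is available to separate 'good' words from junk, and
the field names ocr_raw and ocr_clean are invented for illustration;
nothing here is tied to the Solr filter chain.

```python
import re


def clean_ocr(raw_text, vocabulary):
    """Keep only tokens found in the word list, lowercased, in original order."""
    tokens = re.findall(r"[A-Za-z]+", raw_text)
    return " ".join(t.lower() for t in tokens if t.lower() in vocabulary)


def build_document(doc_id, raw_text, vocabulary):
    # Two fields: the raw OCR text for searching, and the cleaned text
    # as the source for hit highlighting (field names are hypothetical).
    return {"id": doc_id,
            "ocr_raw": raw_text,
            "ocr_clean": clean_ocr(raw_text, vocabulary)}


words = {"mom", "ale", "accept", "tour", "sheet", "to"}
raw = "CONTRACTORINMPRIMENTAYIVE : mom Ale ACCEPT INFORMATIONON TOUR SHEET TO ea"
print(build_document("page-1", raw, words)["ocr_clean"])
```

The quality of the result depends entirely on the word list, so this
only sketches the shape of the two-field approach.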

TerryG