On 23-Nov-07, at 7:28 AM, David Thibault wrote:
Hello all,
I'm new to Solr. From what little I have seen, Solr has made great strides
in open source search, but is lacking some significant features that would
really allow it to become a viable alternative to things like FAST and
Autonomy for enterprise search. I am sure these issues have been discussed
on the list before, but I would like to help push these issues forward if I
can:
It sounds to me like you are describing an application that can be
built with Solr rather than what Solr aims to provide. That said, I
see no reason that there couldn't exist some add-on modules providing
this functionality.
1) Crawling--ShareHound does windows shares, but it ignores document-level
permissions. A modular approach to crawling file systems, websites,
intranet sites, etc, would be huge. Also, I realize Nutch has a crawler but
Solr looks much more full-featured in terms of things like faceted search,
etc, so I'd rather help push Solr forward.
It seems to me that every domain would require a different schema and
have different requirements. I'm not sure that the solution to this
problem belongs in Solr.
2) ACLs and document-level security--The lack of doc-level security is a
real deal-breaker in terms of indexing enterprise fileshares. I could
envision this type of functionality to be embedded in the various crawlers
above, on an OS-dependent or web app-dependent basis. For example, when
indexing a file from a share, the ACL should be indexed as well, that way a
results list can be brought back and the permissions would not need to be
re-checked against the original file server. Also, this implies that ACL
changes need to be monitored and updated as well as file content changes.
Again, I don't see this as within the purview of Solr. Solr provides
lots of functionality to help implement access control (namely, rich
filtering and faceting support), and may provide more once updateable
documents are implemented. However, it has no concept of users,
files, permissions, monitoring OS-level changes, etc. Growing such
awareness seems somewhat outside what Solr should provide.
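
To make the filtering idea concrete, here is a minimal sketch in Python of how an application sitting in front of Solr might restrict results by ACL. It assumes the crawler has indexed each document with a multi-valued field holding the users and groups allowed to read it. The field name `acl` and the helper names below are hypothetical, not anything Solr defines. The application resolves the current user's principals itself and passes them as a filter query (`fq`).

```python
from urllib.parse import urlencode

# Characters with special meaning in Lucene query syntax, plus space.
_SPECIAL = set('+-&|!(){}[]^"~*?:\\/ ')


def escape_term(value):
    """Backslash-escape query-syntax characters in a single term."""
    return "".join("\\" + ch if ch in _SPECIAL else ch for ch in value)


def acl_filter(principals, field="acl"):
    """Build a filter clause matching documents readable by any principal.

    `field` is a hypothetical schema field populated at index time.
    """
    terms = sorted({escape_term(p) for p in principals})
    if not terms:
        # Refuse to build a filter that would silently match everything.
        raise ValueError("at least one principal is required")
    return "{}:({})".format(field, " OR ".join(terms))


def build_select_params(user_query, principals):
    """Return a URL-encoded query string for a Solr select request."""
    return urlencode({
        "q": user_query,
        "fq": acl_filter(principals),
        "wt": "json",
    })


if __name__ == "__main__":
    print(build_select_params("budget report", ["alice", "staff"]))
```

The user-to-group lookup and the job of keeping the `acl` field in sync with the file server stay in the application, which is where I would expect them to live.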
There are other differences, obviously, between the leading commercial
products and Solr, but those two features alone would make a huge difference
in the power of Solr, in my opinion. I have little Java experience, but I
could easily prototype this functionality in other languages and work with
others to integrate them into the code base in Java. Also, I headed up an
enterprise search request for information for a large pharmaceutical company
in the past, so I am familiar with the feature sets of FAST and Autonomy,
and I could help manage the project in terms of competing feature sets.
Again, this feels more like an application to me. I could see
someone putting together a solution to these problems in one package,
perhaps by distributing a separate webapp along with Solr, complete
with a pre-defined schema, a nicer admin console, and automatic
crawling/indexing tools. In fact, I suspect such a product would be
very cool and garner lots of attention. I don't see Solr becoming
that product, though. Besides being outside the scope of the
project, I think there might be a lack of interest among the core
devs to develop and maintain that direction. Mightn't it be better
to start a separate project, where a different set of people (with
different priorities and interests) could have full control?
This situation is analogous to the Solr/Lucene relationship: they are tightly
integrated, and several people contribute to both, but they are
different layers, and can proceed somewhat independently. And that
is a Good Thing.
-Mike