Re: Update Plugins (was Re: Handling disparate data sources in Solr)

2007-01-20 Thread Alan Burlison

Chris Hostetter wrote:


: 1) I think it should be a ServletFilter applied to all requests that
: will only process requests with a registered handler.

I'm not sure what "it" is in the above sentence ... i believe from the
context of the rest of the message that you are referring to
using a ServletFilter instead of a Servlet -- i honestly have no opinion
about that either way.


I thought a filter required you to open up the WAR file and change 
web.xml, or am I misunderstanding?
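
For the record, this is the kind of dispatching filter I understood was
being proposed -- a rough sketch only, where the handler map and the
dispatch are entirely my guesses.  (A filter does also need <filter> and
<filter-mapping> entries in web.xml, which is what prompted the question.)

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import javax.servlet.*;
import javax.servlet.http.HttpServletRequest;

public class SolrDispatchFilter implements Filter {
    // Hypothetical registry: path -> handler name; in reality this
    // would presumably be populated from solrconfig.xml.
    private final Map<String, String> handlers = new HashMap<String, String>();

    public void init(FilterConfig config) throws ServletException {
        handlers.put("/update", "XmlUpdateHandler");  // example registration
    }

    public void doFilter(ServletRequest req, ServletResponse res,
                         FilterChain chain) throws IOException, ServletException {
        String path = ((HttpServletRequest) req).getServletPath();
        if (handlers.containsKey(path)) {
            // dispatch to the registered handler here
        } else {
            chain.doFilter(req, res);  // not ours -- pass straight through
        }
    }

    public void destroy() {}
}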


--
Alan Burlison
--


Re: Update Plugins (was Re: Handling disparate data sources in Solr)

2007-01-17 Thread Alan Burlison

Chris Hostetter wrote:


i'm totally on board now ... the RequestParser decides where the streams
come from if any (post body, file upload, local file, remote url, etc...);
the RequestHandler decides what it wants to do with those streams, and has
a library of DocumentProcessors it can pick from to help it parse them if
it wants to, then it takes whatever actions it wants, and puts the
response information in the existing Solr(Query)Response class, which the
core hands off to any of the various OutputWriters to format according to
the user's wishes.


+1
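
To check my own understanding of the division of labour, roughly this --
a sketch where the names come from your description but every signature
is a guess, not Solr's actual API:

import java.io.InputStream;
import java.util.List;
import javax.servlet.http.HttpServletRequest;

class SolrQueryResponse {}      // stand-in for the existing class

interface RequestParser {
    // decides where the streams come from (post body, file upload,
    // local file, remote URL), if any
    List<InputStream> getStreams(HttpServletRequest req);
}

interface DocumentProcessor {
    // helps a handler parse one raw stream into indexable documents
    void process(InputStream in, SolrQueryResponse rsp);
}

interface RequestHandler {
    // does whatever it wants with the streams, optionally using a
    // library of DocumentProcessors, then fills in the response
    void handleRequest(List<InputStream> streams, SolrQueryResponse rsp);
}

interface OutputWriter {
    // formats the response according to the user's wishes
    void write(java.io.Writer out, SolrQueryResponse rsp);
}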

--
Alan Burlison
--


Re: Update Plugins (was Re: Handling disparate data sources in Solr)

2007-01-17 Thread Alan Burlison

Ryan McKinley wrote:


In addition, consider the case where you want to index an SVN
repository.  Yes, this could be done in SolrRequestParser that logs in
and returns the files as a stream iterator.  But this seems like more
'work' than the RequestParser is supposed to do.  Not to mention you
would need to augment the Document with svn specific attributes.


This is indeed one of the things I'd like to do - use Solr as a back-end
for OpenGrok (http://www.opensolaris.org/os/project/opengrok/)

--
Alan Burlison
--



Re: Update Plugins (was Re: Handling disparate data sources in Solr)

2007-01-16 Thread Alan Burlison

Bertrand Delacretaz wrote:


With all this talk about plugins, registries etc., /me can't help
thinking that this would be a good time to introduce the Spring IoC
container to manage this stuff.

More info at http://www.springframework.org/docs/reference/beans.html
for people who are not familiar with it. It's very easy to use for
simple cases like the ones we're talking about.


Please, no.  I work on a big webapp that uses Spring - it's a complete 
nightmare to figure out what's going on.


--
Alan Burlison
--


Re: To Spring or not to Spring? (was: Update Plugins)

2007-01-16 Thread Alan Burlison

Bertrand Delacretaz wrote:


Using just the IoC container? I'm not talking about full-blown Spring
magic, *just* IoC to assemble plugins.

Spring's IoC is not complicated, and logging statements and debuggers
are there to find out exactly what's happening if needed.

I don't think it'd be more complicated than using our homegrown plugin
system. Only more tested, documented and well-known.


It just seems like a big hammer to crack a small nut.  I've had *bad* 
experiences with apps where people pulled in just about every framework, 
component and widget you can think of - to understand what the hell is 
going on you end up having to be an expert in all of them.


Yes, I'm probably just paranoid ;-)

--
Alan Burlison
--


Re: Handling disparate data sources in Solr

2007-01-10 Thread Alan Burlison

Chris Hostetter wrote:


: 1. The document is already decomposed into fields before the
: insert/update, but one or more of the fields requires special handling.

: 2. The document contains both metadata and content.  PDF is a good
: example of such a document type.

there's a third big example: multiple documents are composed into a single
stream of raw data, and you want Solr to extract the individual documents.
the simplest example of this case being that you want to point Solr at a
CSV file where each record is a document.


Or a tar file, or a zip file...  Yes, that definitely seems like 
something that should be covered as well.



: And for both of these you'd need to be able to specify the mapping
: between the data/metadata in the source document and the corresponding
: Solr schema fields.  I'm not sure if you'd want this in the
: solrconfig.xml file or in the indexing request itself.  Doing it in
: solrconfig.xml means you could change the disposition of the indexed
: data without changing the clients submitting the content.

right ... i think that's something that could be controlled on a per
parser basis, much the way RequestHandlers can currently take in a lot
of options at request time, but can also have default values (or
invariant values) specified for those options in the solrconfig when they
are registered.


Agreed.
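
If I follow, the per-option precedence is: invariant beats request
parameter beats default -- roughly this (a sketch; the class and method
are made up for illustration):

import java.util.Map;

class ParamResolver {
    // Resolution order as I understand the description above.
    static String resolve(String name, Map<String, String> request,
                          Map<String, String> defaults,
                          Map<String, String> invariants) {
        if (invariants.containsKey(name)) return invariants.get(name); // always wins
        if (request.containsKey(name))    return request.get(name);    // per-request
        return defaults.get(name);                                     // fallback
    }
}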


: That was the reasoning behind my initial suggestion:
:
: | Extend the doc and field element with the following attributes:

Right, i was suggesting we take it to the next level, and allow for
plugins to handle updates that didn't have to have any XML encapsulation
at all -- the options and the raw data stream could be expressed entirely
in the HttpServletRequest for the update .. which would still allow us to
add the type of syntax you are describing to some new XmlUpdateSource
containing the refactored code which currently parses updates in SolrCore.


Hmm.  Any idea of how much work this involves?  As I said I can put time 
towards this, but I don't know the innards of Solr as well as you and 
the other folks on this list.
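
From the description, I imagine the servlet-level plumbing for the
no-XML-encapsulation case looks something like this -- a sketch using
only the standard servlet API, where the "handler" option and the
dispatch comment are my guesses:

import java.io.IOException;
import java.io.InputStream;
import javax.servlet.http.HttpServletRequest;

class RawUpdateSketch {
    // Options come from the query string; the raw data comes straight
    // from the post body -- no XML wrapper anywhere.
    void update(HttpServletRequest req) throws IOException {
        String mimeType = req.getContentType();        // e.g. application/pdf
        String handler  = req.getParameter("handler"); // hypothetical option
        InputStream raw = req.getInputStream();        // the document itself
        // ... hand mimeType + raw off to whichever plugin is registered ...
    }
}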


--
Alan Burlison
--


Re: Handling disparate data sources in Solr

2007-01-08 Thread Alan Burlison

Chris Hostetter wrote:


what do you guys think?


I'm going to spend some time today looking at the Solr source and 
matching your suggestions to it; hopefully I'll be able to give a 
slightly more considered opinion after that ;-)


I'm in the process of evaluating what we are going to do with the search 
functionality for http://opensolaris.org, and at the moment Solr is my 
first choice to replace what we already have - *if* it can be made to 
handle disparate data sources.


If I do decide that we are going to use Solr, I'll be happy to help add 
whatever extra functionality is needed to satisfy our requirements.  We 
need this fairly quickly, so I should be able to put a significant 
amount of time towards getting it done, once a design is fleshed out. 
I'm not a Solr expert (yet! ;-) so I'm grateful for whatever guidance 
the Solr community can give on how best to go about fulfilling our 
requirements.


I'm also wondering if we could use Solr to back-end the OpenGrok 
(http://www.opensolaris.org/os/project/opengrok/) source code search 
engine that we use on opensolaris.org - having a single search index for 
both site content and code might be useful, not least because we get the 
benefits of Solr's index distribution stuff.  OpenGrok already uses 
Lucene as its back-end, so it should be possible to do this, although I 
haven't dug through the OG codebase yet.


--
Alan Burlison
--


Re: Handling disparate data sources in Solr

2007-01-08 Thread Alan Burlison

Erik Hatcher wrote:

There really is no question of if Solr can be made to handle it. :)  


The "if" was a tuits "if", not a technical "if" ;-)

POSTing an encoded binary document in XML will work, and it certainly 
will work to have Solr unencode it and parse it.


Yes, but the bits aren't there to do this (yet).  And I didn't want to 
do a one-off hack just for our purposes.


The Lucene in Action codebase has a DocumentHandler interface that could 
be used for this, which has implementations for Word, PDF, HTML, RTF, 
and some others.  It's simplistic, so it might not be of value 
specifically.


Do you have a pointer to the code?

Thanks,

--
Alan Burlison
--


Re: Handling disparate data sources in Solr

2007-01-08 Thread Alan Burlison

Chris Hostetter wrote:


: The design issue for this is to be clear about the schema and how
: documents are mapped into the schema. If all document types are
: mapped into the same schema, then one type of query will work
: for all. If the documents have different schemas (in the search
: index), then the query needs an expansion specific to each
: document type.

Right, the only way to provide a general purpose solution is to make sure
any out of the box UpdateParsers (using the interface names from my
previous email) can be configured in the solrconfig.xml to map the native
concepts in the document format to user defined schema fields.



(people writing their own custom UpdateParsers could always hardcode
their schema fields)

I don't know anything about PDF structure


http://en.wikipedia.org/wiki/Extensible_Metadata_Platform
http://partners.adobe.com/public/developer/en/xmp/sdk/XMPspecification.pdf


but using your RFC-2822 email
as an example, the configuration for an Rfc2822UpdateParser would need to
be able to specify which Headers map to which fields, and what to do with
body text -- in theory, it could also be configured with references to
other UpdateParser instances for dealing with multi-part MIME messages.
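
Presumably something along these lines -- a sketch in which the
Rfc2822UpdateParser's innards and the header-to-field map are entirely
hypothetical (the real mapping would come from solrconfig.xml):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.HashMap;
import java.util.Map;

class Rfc2822Sketch {
    // Hardcoded here for illustration only.
    static Map<String, String> headerToField = new HashMap<String, String>();
    static {
        headerToField.put("From", "author");
        headerToField.put("Subject", "title");
        headerToField.put("Date", "date");
    }

    static Map<String, String> parseHeaders(Reader in) throws IOException {
        Map<String, String> fields = new HashMap<String, String>();
        BufferedReader r = new BufferedReader(in);
        String line;
        while ((line = r.readLine()) != null && line.length() > 0) {
            int colon = line.indexOf(':');
            if (colon < 0) continue;          // ignores folded headers
            String field = headerToField.get(line.substring(0, colon));
            if (field != null) {
                fields.put(field, line.substring(colon + 1).trim());
            }
        }
        return fields;                        // body text would follow
    }
}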


There are two cases I can think of:

1. The document is already decomposed into fields before the 
insert/update, but one or more of the fields requires special handling. 
For example when indexing source code you could get the author, date, 
revision etc from the SCMS, but you might want to process the code 
itself just to extract identifiers and ignore keywords.  You might want 
different handlers for different languages, but for the resulting tokens 
all to be stored in the same field, irrespective of language.


2. The document contains both metadata and content.  PDF is a good 
example of such a document type.


You therefore need to be able to specify two types of preprocessing - 
either at the whole-document level, or at the individual field level. 
And for both of these you'd need to be able to specify the mapping 
between the data/metadata in the source document and the corresponding 
Solr schema fields.  I'm not sure if you'd want this in the 
solrconfig.xml file or in the indexing request itself.  Doing it in 
solrconfig.xml means you could change the disposition of the indexed 
data without changing the clients submitting the content.


That was the reasoning behind my initial suggestion:

| Extend the doc and field element with the following attributes:
|
| mime-type Mime type of the document, e.g. application/pdf, text/html
| and so on.
|
| encoding Encoding of the document, with base64 being the standard
| implementation.
|
| href The URL of any documents that can be accessed over HTTP, instead
| of embedding them in the indexing request.  The indexer would fetch
| the document using the specified URL.
|
| There would then be entries in the configuration file that map each
| MIME type to a handler that is capable of dealing with that document
| type.

So for case 1 where the source is locally accessible you might have 
something like this:


<add>
  <doc>
    <field name="author">Alan Burlison</field>
    <field name="revision">1.2</field>
    <field name="date">08-Jan-2007</field>
    <field name="source" mime-type="text/java"
           href="file:///source/org/apache/foo/bar.java"/>
  </doc>
</add>

And for case 2 where the file can't be directly accessed you might have 
something like this:


<add>
  <doc encoding="base64" mime-type="application/pdf">
    [base64-encoded version of the PDF file]
  </doc>
</add>

--
Alan Burlison
--


Re: Handling disparate data sources in Solr

2007-01-04 Thread Alan Burlison

Chris Hostetter wrote:


For your purposes, if you've got a system that works and does the Document
conversion for you, then you are probably right: Solr may not be a useful
addition to your architecture.  Solr doesn't really attempt to solve the
problem of parsing different kinds of data streams into a unified Document
model -- it just tries to expose all of the Lucene goodness through an
easy to use, easy to configure, HTTP interface.  Besides the
configuration, Solr's other means of being a value add is in its
IndexReader management, its caching, and its plugin support for mixing
and matching request handlers, output writers, and field types as easily
as you can mix and match Analyzers.

There has been some discussion about adding plugin support for the
update side of things as well -- at a very simple level this could allow
for messages to be sent via JSON or CSV instead of just XML -- but
there's no reason a more complex update plugin couldn't read in a binary PDF
file and parse it into its appropriate fields ... but we aren't
quite there yet.  Feel free to bring this up on solr-dev if you'd be
interested in working on it.


I'm interested in discussing this further.  I've moved the discussion 
onto solr-dev, as suggested.


--
Alan Burlison
--


Re: Handling disparate data sources in Solr

2007-01-04 Thread Alan Burlison

Original problem statement:

--
I'm considering using Solr to replace an existing bare-metal Lucene 
deployment - the current Lucene setup is embedded inside an existing 
monolithic webapp, and I want to factor out the search functionality 
into a separate webapp so it can be reused more easily.


At present the content of the Lucene index comes from many different 
sources (web pages, documents, blog posts etc) and can be different 
formats (plaintext, HTML, PDF etc).  All the various content types are 
rendered to plaintext before being inserted into the Lucene index.


The net result is that the data in one field in the index (say 
"content") may have come from one of a number of source document types. 
I'm having difficulty understanding how I might map this functionality 
onto Solr.  I understand how (for example) I could use 
HTMLStripStandardTokenizer to insert the contents of an HTML document 
into a field called "content", but (assuming I'd written a PDF analyser) 
how would I insert the content of a PDF document into the same "content" 
field?


I know I could do this by preprocessing the various document types to 
plaintext in the various Solr clients before inserting the data into the 
index, but that means that each client would need to know how to do the 
document transformation.  As well as centralising the index, I also want 
to centralise the handling of the different document types.

--

My initial suggestion, to get the discussion started, is to extend the 
doc and field element with the following attributes:


mime-type
Mime type of the document, e.g. application/pdf, text/html and so on.

encoding
Encoding of the document, with base64 being the standard implementation.

href
The URL of any documents that can be accessed over HTTP, instead of 
embedding them in the indexing request.  The indexer would fetch the 
document using the specified URL.


There would then be entries in the configuration file that map each MIME 
type to a handler that is capable of dealing with that document type.
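
The dispatch itself ought to be simple enough -- a sketch of what I have
in mind, where the DocumentHandler interface and the registrations are
hypothetical stand-ins for whatever the configuration file defines:

import java.io.InputStream;
import java.util.HashMap;
import java.util.Map;

interface DocumentHandler {
    String extractText(InputStream in);   // render the document to plaintext
}

class HandlerRegistry {
    private final Map<String, DocumentHandler> handlers =
        new HashMap<String, DocumentHandler>();

    // In practice these registrations would be read from the config file.
    void register(String mimeType, DocumentHandler h) {
        handlers.put(mimeType, h);
    }

    DocumentHandler lookup(String mimeType) {
        return handlers.get(mimeType);    // e.g. "application/pdf" -> PDF handler
    }
}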


Thoughts?

--
Alan Burlison
--