Re: Update Plugins (was Re: Handling disparate data sources in Solr)
Chris Hostetter wrote: : 1) I think it should be a ServletFilter applied to all requests that : will only process requests with a registered handler. I'm not sure what "it" is in the above sentence ... I believe from the context of the rest of the message that you are referring to using a ServletFilter instead of a Servlet -- I honestly have no opinion about that either way. I thought a filter required you to open up the WAR file and change web.xml, or am I misunderstanding? -- Alan Burlison --
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
Chris Hostetter wrote: I'm totally on board now ... the RequestParser decides where the streams come from, if any (post body, file upload, local file, remote URL, etc...); the RequestHandler decides what it wants to do with those streams, and has a library of DocumentProcessors it can pick from to help it parse them if it wants to; then it takes whatever actions it wants, and puts the response information in the existing Solr(Query)Response class, which the core hands off to any of the various OutputWriters to format according to the user's wishes. +1 -- Alan Burlison --
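That division of labour could be sketched roughly as follows. To be clear, none of these interface names are real Solr APIs - they are a hypothetical illustration of who owns which step in the pipeline being proposed:

```java
// Hypothetical sketch of the update-plugin pipeline discussed above.
// All names (RequestParser, DocumentProcessor, etc.) are illustrative
// only, not existing or proposed Solr interfaces.
import java.io.*;
import java.util.*;

public class UpdatePipelineSketch {

    /** Decides where the raw streams come from
     *  (post body, file upload, local file, remote URL, ...). */
    public interface RequestParser {
        List<InputStream> streams() throws IOException;
    }

    /** Helps a RequestHandler turn one raw stream into
     *  field-name -> value pairs. */
    public interface DocumentProcessor {
        Map<String, String> process(InputStream in) throws IOException;
    }

    /** Decides what to do with the streams, picking processors
     *  from a library as needed. */
    public interface RequestHandler {
        Map<String, Object> handle(RequestParser parser) throws IOException;
    }

    /** Formats whatever the handler put into the response. */
    public interface OutputWriter {
        String write(Map<String, Object> response);
    }

    /** A trivial processor, just to show the shape of the contract:
     *  the whole stream becomes a single "content" field. */
    public static class PlainTextProcessor implements DocumentProcessor {
        public Map<String, String> process(InputStream in) throws IOException {
            String text = new String(in.readAllBytes());
            return Map.of("content", text.trim());
        }
    }
}
```

The point of the sketch is only that each concern is swappable independently: where bytes come from, how they become fields, and how results are rendered.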
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
Ryan McKinley wrote: In addition, consider the case where you want to index an SVN repository. Yes, this could be done in a SolrRequestParser that logs in and returns the files as a stream iterator. But this seems like more 'work' than the RequestParser is supposed to do. Not to mention you would need to augment the Document with SVN-specific attributes. This is indeed one of the things I'd like to do - use Solr as a back-end for OpenGrok (http://www.opensolaris.org/os/project/opengrok/) -- Alan Burlison --
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
Bertrand Delacretaz wrote: With all this talk about plugins, registries etc., /me can't help thinking that this would be a good time to introduce the Spring IoC container to manage this stuff. More info at http://www.springframework.org/docs/reference/beans.html for people who are not familiar with it. It's very easy to use for simple cases like the ones we're talking about. Please, no. I work on a big webapp that uses spring - it's a complete nightmare to figure out what's going on. -- Alan Burlison --
Re: To Spring or not to Spring? (was: Update Plugins)
Bertrand Delacretaz wrote: Using just the IoC container? I'm not talking about full-blown Spring magic, *just* IoC to assemble plugins. Spring's IoC is not complicated, and logging statements and debuggers are here to find out exactly what's happening if needed. I don't think it'd be more complicated than using our homegrown plugin system. Only more tested, documented and well-known. It just seems like a big hammer to crack a small nut. I've had *bad* experiences with apps where people pulled in just about every framework, component and widget you can think of - to understand what the hell is going on you end up having to be an expert in all of them. Yes, I'm probably just paranoid ;-) -- Alan Burlison --
Re: Handling disparate data sources in Solr
Chris Hostetter wrote: : 1. The document is already decomposed into fields before the : insert/update, but one or more of the fields requires special handling. : 2. The document contains both metadata and content. PDF is a good : example of such a document type. There's a third big example: multiple documents are composed into a single stream of raw data, and you want Solr to extract the individual documents. The simplest example of this case is that you want to point Solr at a CSV file where each record is a document. Or a tar file, or a zip file... Yes, that definitely seems like something that should be covered as well. : And for both of these you'd need to be able to specify the mapping : between the data/metadata in the source document and the corresponding : Solr schema fields. I'm not sure if you'd want this in the : solrconfig.xml file or in the indexing request itself. Doing it in : solrconfig.xml means you could change the disposition of the indexed : data without changing the clients submitting the content. Right ... I think that's something that could be controlled on a per-parser basis, much the way RequestHandlers can currently take in a lot of options at request time, but can also have default values (or invariant values) specified for those options in the solrconfig when they are registered. Agreed. : That was the reasoning behind my initial suggestion: : : | Extend the <doc> and <field> elements with the following attributes: Right, I was suggesting we take it to the next level, and allow for plugins to handle updates that don't have to have any XML encapsulation at all -- the options and the raw data stream could be expressed entirely in the HttpServletRequest for the update ... which would still allow us to add the type of syntax you are describing to some new XmlUpdateSource containing the refactored code which currently parses updates in SolrCore. Hmm. Any idea of how much work this involves?
As I said I can put time towards this, but I don't know the innards of Solr as well as you and the other folks on this list. -- Alan Burlison --
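The "multiple documents in one stream" case mentioned above (e.g. pointing Solr at a CSV file where each record is a document) could be handled by something like the following sketch. The class name and the field-map representation are made up purely for illustration:

```java
// Illustrative only: splitting one CSV stream into one
// field-name -> value map per record. Not a proposed Solr API.
import java.io.*;
import java.util.*;

public class CsvDocumentStream {
    /** The first line is a header naming the schema fields; every
     *  following non-empty line becomes one document. */
    public static List<Map<String, String>> parse(Reader in) throws IOException {
        BufferedReader r = new BufferedReader(in);
        String headerLine = r.readLine();
        if (headerLine == null) return Collections.emptyList();
        String[] fields = headerLine.split(",");
        List<Map<String, String>> docs = new ArrayList<>();
        for (String line; (line = r.readLine()) != null; ) {
            if (line.isEmpty()) continue;
            String[] values = line.split(",", -1);  // keep trailing empty cells
            Map<String, String> doc = new LinkedHashMap<>();
            for (int i = 0; i < fields.length && i < values.length; i++)
                doc.put(fields[i].trim(), values[i].trim());
            docs.add(doc);
        }
        return docs;
    }
}
```

A tar or zip stream would follow the same pattern, just with an archive-entry iterator instead of a line reader.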
Re: Handling disparate data sources in Solr
Chris Hostetter wrote: what do you guys think? I'm going to spend some time today looking at the Solr source and matching your suggestions to it, hopefully I'll be better able to give a slightly more considered opinion after that ;-) I'm in the process of evaluating what we are going to do with the search functionality for http://opensolaris.org, and at the moment Solr is my first choice to replace what we already have - *if* it can be made to handle disparate data sources. If I do decide that we are going to use Solr, I'll be happy to help add whatever extra functionality is needed to satisfy our requirements. We need this fairly quickly, so I should be able to put a significant amount of time towards getting it done, once a design is fleshed out. I'm not a Solr expert (yet! ;-) so I'm grateful for whatever guidance the Solr community can give on how best to go about fulfilling our requirements. I'm also wondering if we could use Solr to back-end the OpenGrok (http://www.opensolaris.org/os/project/opengrok/) source code search engine that we use on opensolaris.org - having a single search index for both site content and code might be useful, not least because we get the benefit of Solr's index distribution stuff. OpenGrok already uses Lucene as its back-end, so it should be possible to do this, although I haven't dug through the OG codebase yet. -- Alan Burlison --
Re: Handling disparate data sources in Solr
Erik Hatcher wrote: There really is no question of if Solr can be made to handle it. :) The "if" was a tuits if, not a technical if ;-) POSTing an encoded binary document in XML will work, and it certainly will work to have Solr decode it and parse it. Yes, but the bits aren't there to do this (yet). And I didn't want to do a one-off hack just for our purposes. The Lucene in Action codebase has a DocumentHandler interface that could be used for this, which has implementations for Word, PDF, HTML, RTF, and some others. It's simplistic, so it might not be of value specifically. Do you have a pointer to the code? Thanks, -- Alan Burlison --
Re: Handling disparate data sources in Solr
Chris Hostetter wrote: : The design issue for this is to be clear about the schema and how : documents are mapped into the schema. If all document types are : mapped into the same schema, then one type of query will work : for all. If the documents have different schemas (in the search : index), then the query needs an expansion specific to each : document type. Right, the only way to provide a general-purpose solution is to make sure any out-of-the-box UpdateParsers (using the interface names from my previous email) can be configured in the solrconfig.xml to map the native concepts in the document format to user-defined schema fields. (People writing their own custom UpdateParsers could always hardcode their schema fields.) I don't know anything about PDF structure http://en.wikipedia.org/wiki/Extensible_Metadata_Platform http://partners.adobe.com/public/developer/en/xmp/sdk/XMPspecification.pdf but using your RFC-2822 email as an example, the configuration for an Rfc2822UpdateParser would need to be able to specify which headers map to which fields, and what to do with body text -- in theory, it could also be configured with references to other UpdateParser instances for dealing with multi-part MIME messages. There are two cases I can think of: 1. The document is already decomposed into fields before the insert/update, but one or more of the fields requires special handling. For example, when indexing source code you could get the author, date, revision etc. from the SCMS, but you might want to process the code itself just to extract identifiers and ignore keywords. You might want different handlers for different languages, but for the resulting tokens all to be stored in the same field, irrespective of language. 2. The document contains both metadata and content. PDF is a good example of such a document type. You therefore need to be able to specify two types of preprocessing - either at the whole-document level, or at the individual field level.
And for both of these you'd need to be able to specify the mapping between the data/metadata in the source document and the corresponding Solr schema fields. I'm not sure if you'd want this in the solrconfig.xml file or in the indexing request itself. Doing it in solrconfig.xml means you could change the disposition of the indexed data without changing the clients submitting the content. That was the reasoning behind my initial suggestion:

| Extend the <doc> and <field> elements with the following attributes:
|
| mime-type  MIME type of the document, e.g. application/pdf, text/html
|            and so on.
|
| encoding   Encoding of the document, with base64 being the standard
|            implementation.
|
| href       The URL of any documents that can be accessed over HTTP,
|            instead of embedding them in the indexing request. The
|            indexer would fetch the document using the specified URL.
|
| There would then be entries in the configuration file that map each
| MIME type to a handler that is capable of dealing with that document
| type.

So for case 1 where the source is locally accessible you might have something like this:

<add>
  <doc>
    <field name="author">Alan Burlison</field>
    <field name="revision">1.2</field>
    <field name="date">08-Jan-2007</field>
    <field name="source" mime-type="text/java"
           href="file:///source/org/apache/foo/bar.java"/>
  </doc>
</add>

And for case 2 where the file can't be directly accessed you might have something like this:

<add>
  <doc encoding="base64" mime-type="application/pdf">
    [base64-encoded version of the PDF file]
  </doc>
</add>

-- Alan Burlison --
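For what it's worth, the header-to-field mapping described for a hypothetical Rfc2822UpdateParser might look roughly like this sketch. All class and parameter names are illustrative, not a proposed API, and real RFC-2822 parsing (folded headers, MIME parts) is deliberately ignored:

```java
// Illustrative sketch: map configured RFC-2822-style headers to
// schema fields and dump the body into one configured field.
// Names are hypothetical; folded headers and MIME parts are not handled.
import java.io.*;
import java.util.*;

public class Rfc2822FieldMapper {
    private final Map<String, String> headerToField;  // e.g. Subject -> title
    private final String bodyField;                   // where body text goes

    public Rfc2822FieldMapper(Map<String, String> headerToField,
                              String bodyField) {
        this.headerToField = headerToField;
        this.bodyField = bodyField;
    }

    /** Read headers until the blank line, then treat the rest as body. */
    public Map<String, String> map(BufferedReader in) throws IOException {
        Map<String, String> doc = new LinkedHashMap<>();
        String line;
        while ((line = in.readLine()) != null && !line.isEmpty()) {
            int colon = line.indexOf(':');
            if (colon < 0) continue;                  // skip malformed lines
            String field = headerToField.get(line.substring(0, colon));
            if (field != null)                        // unmapped headers dropped
                doc.put(field, line.substring(colon + 1).trim());
        }
        StringBuilder body = new StringBuilder();
        while ((line = in.readLine()) != null) body.append(line).append('\n');
        doc.put(bodyField, body.toString().trim());
        return doc;
    }
}
```

The header-to-field map and the body field name are exactly the kind of per-parser options that could be given defaults in solrconfig.xml when the parser is registered.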
Re: Handling disparate data sources in Solr
Chris Hostetter wrote: For your purposes, if you've got a system that works and does the Document conversion for you, then you are probably right: Solr may not be a useful addition to your architecture. Solr doesn't really attempt to solve the problem of parsing different kinds of data streams into a unified Document model -- it just tries to expose all of the Lucene goodness through an easy to use, easy to configure, HTTP interface. Besides the configuration, Solr's other means of adding value are its IndexReader management, its caching, and its plugin support for mixing and matching request handlers, output writers, and field types as easily as you can mix and match Analyzers. There has been some discussion about adding plugin support for the update side of things as well -- at a very simple level this could allow for messages to be sent via JSON or CSV instead of just XML -- but there's no reason a more complex update plugin couldn't read in a binary PDF file and parse it into its appropriate fields ... but we aren't quite there yet. Feel free to bring this up on solr-dev if you'd be interested in working on it. I'm interested in discussing this further. I've moved the discussion onto solr-dev, as suggested. -- Alan Burlison --
Re: Handling disparate data sources in Solr
Original problem statement: -- I'm considering using Solr to replace an existing bare-metal Lucene deployment - the current Lucene setup is embedded inside an existing monolithic webapp, and I want to factor out the search functionality into a separate webapp so it can be reused more easily. At present the content of the Lucene index comes from many different sources (web pages, documents, blog posts etc.) and can be in different formats (plaintext, HTML, PDF etc.). All the various content types are rendered to plaintext before being inserted into the Lucene index. The net result is that the data in one field in the index (say "content") may have come from one of a number of source document types. I'm having difficulty understanding how I might map this functionality onto Solr. I understand how (for example) I could use HTMLStripStandardTokenizer to insert the contents of an HTML document into a field called "content", but (assuming I'd written a PDF analyser) how would I insert the content of a PDF document into the same "content" field? I know I could do this by preprocessing the various document types to plaintext in the various Solr clients before inserting the data into the index, but that means that each client would need to know how to do the document transformation. As well as centralising the index, I also want to centralise the handling of the different document types. -- My initial suggestion, to get the discussion started, is to extend the <doc> and <field> elements with the following attributes:

mime-type  MIME type of the document, e.g. application/pdf, text/html
           and so on.

encoding   Encoding of the document, with base64 being the standard
           implementation.

href       The URL of any documents that can be accessed over HTTP,
           instead of embedding them in the indexing request. The
           indexer would fetch the document using the specified URL.

There would then be entries in the configuration file that map each MIME type to a handler that is capable of dealing with that document type.
Thoughts? -- Alan Burlison --
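As a rough illustration of the proposed MIME-type-to-handler mapping, a registry along these lines (all names hypothetical, standing in for whatever the configuration file entries would eventually declare) is what the dispatch would boil down to:

```java
// Illustrative only: dispatch an incoming document to a handler
// registered for its MIME type. Not an existing or proposed Solr API.
import java.io.*;
import java.util.*;

public class MimeHandlerRegistry {
    /** A handler turns one raw document into field-name -> value pairs. */
    public interface DocumentHandler {
        Map<String, String> toFields(InputStream doc) throws IOException;
    }

    private final Map<String, DocumentHandler> handlers = new HashMap<>();

    /** In practice these registrations would come from the config file. */
    public void register(String mimeType, DocumentHandler handler) {
        handlers.put(mimeType, handler);
    }

    /** Look up the handler for the request's MIME type and apply it. */
    public Map<String, String> handle(String mimeType, InputStream doc)
            throws IOException {
        DocumentHandler h = handlers.get(mimeType);
        if (h == null)
            throw new IllegalArgumentException("no handler for " + mimeType);
        return h.toFields(doc);
    }
}
```

The same registry would serve both the embedded (base64) and href cases: the only difference is where the InputStream comes from before dispatch.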