On Wed, Nov 12, 2008 at 10:44 PM, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
> Am I understanding the DIH correctly in that it doesn't work with Blobs
> and/or binary things? I'm basing this off of JdbcDataSource.getARow(), which
> seems to be the place that populates the Map that is then passed to the
> Transformer.

Actually, that switch statement in JdbcDataSource is redundant now. In our initial patches, the "field" in data-config had a type attribute, and we used to attempt type conversion from the SQL type to the field's given type. We found that this was error-prone and switched to using ResultSet#getObject for all columns, keeping the old behavior as a configurable option -- "convertType" in JdbcDataSource (see the first snippet below). The default is ResultSet#getObject, which should handle BLOBs and CLOBs well.

> One of the things that I think might be interesting is, as I'm integrating
> Tika, the notion of a Transformer that takes a blob and feeds it to Tika
> for parsing. In this way, people who store documents in databases (or
> download PDFs, etc.) can use the DIH to bring in more kinds of content.
>
> Thoughts?

I think the best way would be a TikaEntityProcessor which knows how to handle documents. I guess a typical use case would be FileListEntityProcessor->TikaEntityProcessor as parent-child entities.

Also see SOLR-833, which adds a FieldReaderDataSource with which you can pass any field's content to another entity for processing. So you could have a [SqlEntityProcessor, JdbcDataSource] entity producing a blob and a [FieldReaderDataSource, TikaEntityProcessor] entity consuming it (sketched in the second snippet below). I think such an integration would be very interesting.

Let me know if you need a hand; I'm willing to contribute in whatever way possible.

--
Regards,
Shalin Shekhar Mangar.
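A minimal data-config sketch of the convertType flag described above. The driver, URL, user, and table are placeholders, not part of the original message:

    <dataConfig>
      <!-- convertType="false" is the default: every column is read with
           ResultSet#getObject, so BLOB/CLOB values come through untouched.
           Set convertType="true" to restore the old per-field type
           conversion from the early patches. -->
      <dataSource type="JdbcDataSource"
                  driver="org.hsqldb.jdbcDriver"
                  url="jdbc:hsqldb:/tmp/example/ex"
                  user="sa"
                  convertType="false"/>
      <document>
        <entity name="item" query="select id, name from item">
          <field column="id" name="id"/>
          <field column="name" name="name"/>
        </entity>
      </document>
    </dataConfig>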

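And a rough sketch of the blob-to-Tika chain proposed above. TikaEntityProcessor does not exist yet (it is only suggested in this thread), FieldReaderDataSource comes from the SOLR-833 patch, and the dataField attribute and column names are assumptions for illustration:

    <dataConfig>
      <dataSource name="db"   type="JdbcDataSource"
                  driver="org.hsqldb.jdbcDriver"
                  url="jdbc:hsqldb:/tmp/example/ex"/>
      <!-- FieldReaderDataSource (SOLR-833 patch) streams a column produced
           by the parent entity into the child entity. -->
      <dataSource name="blob" type="FieldReaderDataSource"/>
      <document>
        <!-- Parent: pulls rows, including a blob column, from the database. -->
        <entity name="doc" dataSource="db" processor="SqlEntityProcessor"
                query="select id, content from docs">
          <field column="id" name="id"/>
          <!-- Child: the hypothetical TikaEntityProcessor parses the blob;
               dataField="doc.content" points it at the parent's column. -->
          <entity name="tika" dataSource="blob" dataField="doc.content"
                  processor="TikaEntityProcessor">
            <field column="text" name="body"/>
          </entity>
        </entity>
      </document>
    </dataConfig>

The same parent-child nesting would apply to the FileListEntityProcessor->TikaEntityProcessor pairing, with the outer entity listing files instead of querying a database.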