I'd seriously consider going with SolrJ as your indexing strategy; it allows you to do anything you need to do in Java code. You can call the Tika library yourself on the files pointed to by your rows as you see fit, indexing them as you choose: perhaps one Solr doc per attachment, perhaps one per row, whatever.
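To make that concrete, here's a minimal sketch of the per-row indexing loop. The row shape, field names, and class names are hypothetical, and the actual SolrJ and Tika calls are left as comments since they need a running Solr instance and the Tika jars on the classpath; the point is just the one-doc-per-row shape, with all attachment text in a single multivalued field.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RowIndexer {

    // Hypothetical row shape: one DB record plus the paths of its attachments.
    static Map<String, Object> buildDoc(String rowId, String title,
                                        List<String> attachmentPaths) {
        Map<String, Object> doc = new HashMap<>();
        doc.put("id", rowId);        // unique key taken from the DB row
        doc.put("title", title);     // one field per search-form box
        List<String> bodies = new ArrayList<>();
        for (String path : attachmentPaths) {
            // With Tika this would be roughly:
            //   String text = new Tika().parseToString(new File(path));
            // Here we just record the path; all attachment text lands in one
            // multivalued field so the row matches if any of its files match.
            bodies.add("extracted text of " + path);
        }
        doc.put("attachment_text", bodies);
        return doc;
    }

    public static void main(String[] args) {
        Map<String, Object> doc = buildDoc("row-42", "Sample record",
                List.of("a.pdf", "b.doc"));
        // With SolrJ, the map would be copied into a SolrInputDocument:
        //   SolrInputDocument solrDoc = new SolrInputDocument();
        //   doc.forEach(solrDoc::addField);
        //   solrClient.add(solrDoc); solrClient.commit();
        System.out.println(doc.get("id") + " "
                + ((List<?>) doc.get("attachment_text")).size());
    }
}
```

If you'd rather have one doc per attachment, the same loop just emits a doc inside the `for` instead of accumulating into `bodies`, with each doc carrying the parent row's id so the search result can still link back to the record.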
Best
Erick

On Wed, Jul 20, 2011 at 3:27 PM, <tra...@dawnstar.com> wrote:
> [Apologies if this is a duplicate -- I have sent several messages from my
> work email and they just vanish, so I subscribed with my personal email]
>
> Greetings. I am struggling to design a schema and a data import/update
> strategy for some semi-complicated data. I would appreciate any input.
>
> What we have is a bunch of database records that may or may not have files
> attached. Sometimes no files, sometimes 50.
>
> The requirement is to index the database records AND the documents, and the
> search results would be just links to the database records.
>
> I'd love to crawl the site with Nutch and be done with it, but we have a
> complicated search form with various codes and attributes for the database
> records, so we need a detailed schema that will loosely correspond to boxes
> on the search form. I don't think we could easily do that if we just crawl
> the site. But with a detailed schema, I'm having trouble understanding how
> we could import and index from the database, and also index the related
> files, and have the same schema being populated, especially with the number
> of related documents being variable (maybe index them all to one field?).
>
> We have a lot of flexibility on how we can build this, so I'm open to any
> suggestions or pointers for further reading. I've spent a fair amount of
> time on the wiki but I didn't see anything that seemed directly relevant.
>
> An additional difficulty, that I am willing to overlook for the first cut,
> is that some of these files are zipped, and some of the zip files may
> contain other zip files, to maybe 3 or 4 levels deep.
>
> Help, please?
>
> cheers,
>
> Travis
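On the nested-zip wrinkle: if you're driving the indexing from Java anyway, recursing into zips is straightforward with just the JDK. A stdlib-only sketch (assuming archives small enough to buffer in memory; the callback is where a Tika parse would go):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiConsumer;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class ZipWalker {

    // Walk a zip stream; recurse into entries that are themselves zips,
    // hand every other entry's name and bytes to the callback.
    static void walk(InputStream in, String prefix,
                     BiConsumer<String, byte[]> onFile) throws IOException {
        ZipInputStream zin = new ZipInputStream(in);
        ZipEntry e;
        while ((e = zin.getNextEntry()) != null) {
            if (e.isDirectory()) continue;
            byte[] data = zin.readAllBytes();       // bytes of current entry
            String name = prefix + e.getName();
            if (e.getName().toLowerCase().endsWith(".zip")) {
                walk(new ByteArrayInputStream(data), name + "!/", onFile);
            } else {
                onFile.accept(name, data);          // e.g. feed to Tika here
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Build inner.zip containing note.txt, then outer.zip containing inner.zip.
        ByteArrayOutputStream inner = new ByteArrayOutputStream();
        try (ZipOutputStream z = new ZipOutputStream(inner)) {
            z.putNextEntry(new ZipEntry("note.txt"));
            z.write("hello".getBytes());
            z.closeEntry();
        }
        ByteArrayOutputStream outer = new ByteArrayOutputStream();
        try (ZipOutputStream z = new ZipOutputStream(outer)) {
            z.putNextEntry(new ZipEntry("inner.zip"));
            z.write(inner.toByteArray());
            z.closeEntry();
        }
        List<String> seen = new ArrayList<>();
        walk(new ByteArrayInputStream(outer.toByteArray()), "",
             (name, data) -> seen.add(name + "=" + new String(data)));
        System.out.println(seen);   // [inner.zip!/note.txt=hello]
    }
}
```

Depth isn't a problem at 3 or 4 levels; if you ever accept untrusted archives you'd want a depth/size cap to guard against zip bombs.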