[Apologies if this is a duplicate -- I sent several messages from my work
email and they all vanished, so I subscribed with my personal email]
 
Greetings. I am struggling to design a schema and a data import/update
strategy for some semi-complicated data. I would appreciate any input.

We have a set of database records, each of which may or may not have files
attached: sometimes none, sometimes as many as 50.
 
The requirement is to index the database records AND the attached documents,
while the search results should be just links to the database records.

I'd love to crawl the site with Nutch and be done with it, but we have a
complicated search form with various codes and attributes for the database
records, so we need a detailed schema that loosely corresponds to the boxes on
the search form. I don't think we could easily do that by just crawling the
site. But with a detailed schema, I'm having trouble understanding how we
could import and index the records from the database, also index the related
files, and have both populate the same schema, especially since the number of
related documents varies (maybe index them all into one field?).
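
To make the "all in one field" idea concrete, here is a rough sketch of what
I have in mind: one index document per database record, with the extracted
text of every attached file collected into a single multi-valued field. All
of the names here (build_index_doc, attachment_text, the record layout) are
made up for illustration, and the text extraction itself is assumed to have
happened elsewhere.

```python
def build_index_doc(record, attachment_texts):
    """Build one flat index document for a single database record.

    `record` is a hypothetical dict with "id", "url" and an optional
    "attributes" dict (the codes shown on the search form).
    `attachment_texts` is a list of already-extracted text, one entry
    per attached file; it may be empty.
    """
    doc = {
        "id": str(record["id"]),   # unique key, one per database record
        "url": record["url"],      # the link returned in search results
    }
    # Each search-form attribute becomes its own field in the schema.
    for name, value in record.get("attributes", {}).items():
        doc[name] = value
    # Zero, one, or fifty files all land in the same multi-valued field.
    doc["attachment_text"] = list(attachment_texts)
    return doc


if __name__ == "__main__":
    example = {
        "id": 42,
        "url": "/records/42",
        "attributes": {"status_code": "A1", "region": "west"},
    }
    print(build_index_doc(example, ["text of file one", "text of file two"]))
```

With this shape a hit on any attachment still resolves to the one record it
belongs to, but I don't know whether that is the recommended approach.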
 
We have a lot of flexibility on how we can build this, so I'm open to any
suggestions or pointers for further reading. I've spent a fair amount of time
on the wiki but I didn't see anything that seemed directly relevant.
 
An additional difficulty, which I am willing to overlook for the first cut, is
that some of these files are zipped, and some of the zip files may contain
other zip files, to maybe 3 or 4 levels deep.
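
For what it's worth, unpacking the nested zips by itself looks manageable. A
minimal sketch using only the Python standard library, assuming nested
archives can be recognised by a ".zip" file name and that a depth limit of 4
is enough (iter_zip_members and the "!/" path separator are just names I made
up):

```python
import io
import zipfile


def iter_zip_members(data, max_depth=4, _prefix="", _depth=1):
    """Yield (path, bytes) for each file in a zip archive given as bytes.

    Members whose names end in ".zip" are opened recursively until
    max_depth levels of archives have been entered; beyond that limit
    the nested archive is yielded as a plain file.
    """
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # directory entries carry no content
            payload = archive.read(info)
            path = _prefix + info.filename
            is_nested = info.filename.lower().endswith(".zip")
            if is_nested and _depth < max_depth:
                # Descend into the inner archive, remembering where it lives.
                yield from iter_zip_members(
                    payload, max_depth, path + "!/", _depth + 1
                )
            else:
                yield path, payload


def make_zip(files):
    """Helper for the demo: build an in-memory zip from {name: bytes}."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, content in files.items():
            archive.writestr(name, content)
    return buffer.getvalue()


if __name__ == "__main__":
    inner = make_zip({"deep.txt": b"deep"})
    outer = make_zip({"inner.zip": inner, "top.txt": b"top"})
    for path, content in iter_zip_members(outer):
        print(path, len(content))
```

Each yielded file would then go through whatever text extraction is used for
the ordinary attachments, so the open question is really the schema.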

Help, please?
 
cheers,

Travis
