Re: Schema Design/Data Import

2011-07-25 Thread Erick Erickson
I'd seriously consider going with SolrJ as your indexing strategy; it allows you to do anything you need to do in Java code. You can call the Tika library yourself on the files pointed to by your rows as you see fit, indexing them as you choose, perhaps one Solr doc per attachment, perhaps one per

Re: Schema Design/Data Import

2011-07-25 Thread Travis Low
Thanks so much Erick (and Stefan). Yes, I did some reading on SolrJ and Tika and you are spot-on. We will write our own importer using SolrJ and then we can grab the DB records and parse any attachments along the way. Now it comes down to a schema design question. The issue I'm struggling with

Re: Schema Design/Data Import

2011-07-25 Thread Stefan Matheis
Travis, that sounds like a perfect use case for dynamic fields: attachment_* and there you go. It works for no attachments as well as for one, three, or 50. For the user interface, you could iterate over them and show them as a list, or something else that fits your needs. Also, maybe, you

Re: Schema Design/Data Import

2011-07-25 Thread Erick Erickson
Well, the attachment_1, attachment_2 idea would be awkward to form queries (i.e. there would be 100 clauses if there were 100 docs?) Dynamic fields have this same problem. You could certainly index them all into a big field, just make it multivalued and do a SolrDocument.add(bigtextfield,
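
To make the point about query clauses concrete, here is a minimal sketch in plain Java (no Solr dependency) that builds the query string for both layouts. The field name attachment_text, the class name, and both helper methods are hypothetical and chosen only for illustration; the attachment_N names follow the pattern discussed in the thread.

```java
import java.util.ArrayList;
import java.util.List;

public class AttachmentQueryDemo {

    // Layout A (assumed): one field per attachment, attachment_1 .. attachment_n.
    // Searching all attachments needs one clause per field.
    // Note: the term is not escaped here; a real query builder would need to do that.
    public static String perAttachmentQuery(String term, int attachmentCount) {
        List<String> clauses = new ArrayList<>();
        for (int i = 1; i <= attachmentCount; i++) {
            clauses.add("attachment_" + i + ":" + term);
        }
        return String.join(" OR ", clauses);
    }

    // Layout B (assumed): a single multivalued field holding the text of every
    // attachment. The hypothetical field name is attachment_text.
    public static String multiValuedQuery(String term) {
        return "attachment_text:" + term;
    }

    public static void main(String[] args) {
        // Three attachments already need three clauses in layout A.
        System.out.println(perAttachmentQuery("invoice", 3));
        // Layout B needs a single clause regardless of attachment count.
        System.out.println(multiValuedQuery("invoice"));
    }
}
```

With 100 attachments the first method yields 100 OR clauses, while the second stays at one. On the indexing side, a multivalued field in SolrJ is typically filled by calling SolrInputDocument.addField repeatedly with the same field name, provided the field is declared multiValued in the schema.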

Re: Schema Design/Data Import

2011-07-25 Thread Stefan Matheis
On 25.07.2011 16:58, Erick Erickson wrote: "Well, the attachment_1, attachment_2 idea would be awkward to form queries (i.e. there would be 100 clauses if there were 100 docs?) Dynamic fields have this same problem." Oh, yes, correct. I overlooked that part :/ Sorry.

Re: Schema design/data import

2011-07-21 Thread Stefan Matheis
Hey Travis, after reading your mail and thinking about it a bit, I'm not sure I would go with Nutch. Nutch is [from my understanding] more of a crawler, meant to crawl external / unknown sites. But, if I got this correct, you have complete knowledge of your data and could solr exactly