I'd seriously consider going with SolrJ as your indexing strategy; it allows
you to do anything you need to do in Java code. You can call the Tika
library yourself on the files pointed to by your rows as you see fit, indexing
them as you choose, perhaps one Solr doc per attachment, perhaps one per
Thanks so much Erick (and Stefan). Yes, I did some reading on SolrJ and
Tika and you are spot-on. We will write our own importer using SolrJ and
then we can grab the DB records and parse any attachments along the way.
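The row-plus-attachments flow described above could be sketched roughly as below. This is a stdlib-only sketch: the row shape, field names, and the `extractText` stand-in for Tika are all hypothetical, and the actual SolrJ/Tika calls are left as comments.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ImporterSketch {

    // Stand-in for Tika extraction; with Tika this would be roughly
    // new Tika().parseToString(new File(path)).
    static String extractText(String attachmentPath) {
        return "text of " + attachmentPath; // hypothetical content
    }

    // Build one document per attachment for a single DB row
    // (the "one Solr doc per attachment" layout suggested above).
    static List<Map<String, Object>> docsForRow(String rowId, List<String> attachments) {
        List<Map<String, Object>> docs = new ArrayList<>();
        int i = 0;
        for (String path : attachments) {
            Map<String, Object> doc = new LinkedHashMap<>();
            doc.put("id", rowId + "_" + i++);   // unique key per attachment doc
            doc.put("row_id", rowId);           // link back to the parent row
            doc.put("body", extractText(path));
            docs.add(doc);
            // With SolrJ you would copy these entries into a
            // SolrInputDocument via addField(...) and send it with
            // solrClient.add(inputDoc), then commit once at the end.
        }
        return docs;
    }

    public static void main(String[] args) {
        List<Map<String, Object>> docs =
                docsForRow("row42", List.of("a.pdf", "b.doc"));
        System.out.println(docs.size());
        System.out.println(docs.get(0).get("id"));
    }
}
```

The `row_id` field is one (hypothetical) way to group attachment docs back to their database record at query time.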
Now it comes down to a schema design question. The issue I'm struggling with
Travis,
that sounds like a perfect use case for dynamic fields .. attachment_*
and there you go. Works for no attachment, as well as one, three, or 50.
For the user interface, you could iterate over them and show them as a
list - or something else that would fit your need.
also, maybe, you
Well, the attachment_1, attachment_2 idea would make queries awkward
to form (i.e. there would be 100 clauses if there were 100 docs?).
Dynamic fields have this same problem.
You could certainly index them all into one big field; just make it
multivalued and do a SolrInputDocument.addField(bigtextfield,
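To see why a single multivalued field is easier to query than numbered dynamic fields, compare the query strings each design needs. The field names (`attachment_N`, `attachments`) are illustrative, not from a real schema:

```java
import java.util.StringJoiner;

public class QueryShapes {

    // Dynamic-field design: one clause per attachment_N field,
    // so the query grows with the number of attachments.
    static String perFieldQuery(String term, int n) {
        StringJoiner q = new StringJoiner(" OR ");
        for (int i = 1; i <= n; i++) {
            q.add("attachment_" + i + ":" + term);
        }
        return q.toString();
    }

    // Multivalued-field design: one clause matches any of the values.
    static String multivaluedQuery(String term) {
        return "attachments:" + term;
    }

    public static void main(String[] args) {
        System.out.println(perFieldQuery("report", 3));
        // attachment_1:report OR attachment_2:report OR attachment_3:report
        System.out.println(multivaluedQuery("report"));
        // attachments:report
    }
}
```

With 100 attachments the per-field version balloons to 100 clauses, while the multivalued version stays a single clause.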
On 25.07.2011 16:58, Erick Erickson wrote:
> Well, the attachment_1, attachment_2 idea would be awkward
> to form queries (i.e. there would be 100 clauses if there were 100 docs?)
> Dynamic fields have this same problem.
Oh, yes .. correct .. overlooked that part :/ sorry.
Hey Travis,
after reading your mail .. and thinking a bit about it, I'm not sure I
would go with Nutch. Nutch is [from my understanding] more a crawler ..
meant to crawl external / unknown sites.
But, if I got this correct, you have complete knowledge of your data
and could feed Solr exactly