I would *strongly* encourage you to store both sets of data
together as one document. There's no real way of doing
DB-like joins in the underlying Lucene search engine.

But that's generic advice. The question I have for you is
"What's the big deal about coordinating the sources?"
That is, you have to have something that allows you to
make a 1:1 correspondence between your data sources
or you couldn't relate them in the first place. Is it really
that onerous to check?

If it is, why not build a small index keyed on whatever
relates the two sources, and search it whenever you want
to know whether the other source has data for a record?
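
To make the lookup-and-merge step concrete, here's a rough
sketch. It is not Lucene code: a plain Python dict stands in
for the index, and the names (upsert, book_id, the "bib" and
"toc" source labels, the sample values) are all made up for
illustration. It assumes each book has one identifier shared
by both sources.

```python
# A plain dict stands in for the index (hypothetical stand-in, not Lucene).
index = {}

def upsert(index, book_id, source, fields):
    """Merge one source's fields into the combined record for book_id."""
    # Start from the existing combined record, if there is one.
    record = dict(index.get(book_id, {"book_id": book_id}))
    # Drop whatever this source wrote earlier so stale fields don't linger.
    prefix = source + "."
    record = {k: v for k, v in record.items() if not k.startswith(prefix)}
    # Namespace the new fields by source so the two sources never collide.
    for name, value in fields.items():
        record[prefix + name] = value
    # Replace the stored record with the merged one.
    index[book_id] = record
    return record

# Each source updates independently; the other source's fields survive.
upsert(index, "b42", "bib", {"author": "A. Author", "title": "Some Title"})
upsert(index, "b42", "toc", {"text": "Chapter 1 Introduction"})
upsert(index, "b42", "bib", {"author": "Anne Author", "title": "Some Title"})
```

The point is only that an update from either source costs one
lookup by key plus one write, and never needs to know what
the other source contains.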

Surrounding this question is "How often do you
really update data?" If it's once an hour, I submit
that you don't care how difficult it is to find out
whether there's corresponding data in the other data
set. If it's once a second, that may be a different story.

You haven't described enough of your problem
space for me to render any opinion on whether
this is premature optimization or not, but it
sure smells like it from a distance <G>...

Best
Erick

On Jan 17, 2008 2:11 AM, Michael Lackhoff <[EMAIL PROTECTED]> wrote:

> Hello,
>
> I have two sources of data for the same "things" to search. It is book
> data in a library. First there is the usual bibliographic data (author,
> title...) and then I have scanned and OCRed table of contents data about
> the same books. Both are updated independently.
> Now I don't know how to best index and search this data.
> - One option would be to save the data in different records. That would
>   make updates easy because I don't have to worry about the fields
>   from the other source. But searching would be more difficult: I have
>   to do an additional search for every hit in the "contents" data to
>   get the bibliographic data.
> - The other option would be to save everything in one record but then
>   updates would be difficult. Before I can update a record I must first
>   look if there is any data from the other source, merge it into the
>   record and only then update it. This option sounds very time consuming
>   for a complete reindex.
>
> The best solution would be some sort of join: Have two records in the
> index but always give both in the result no matter where the hit was.
> Any ideas on how to best organize this kind of data?
>
> -Michael
>
>
