Re: Specifying multiple documents in DataImportHandler dataConfig

Bertie Shen Sat, 07 Nov 2009 18:55:50 -0800

I have figured out a way to solve this problem: specify a single
<document> element and, under it, multiple top-level <entity> elements,
each of which corresponds to one table.


Each top-level entity then maps each row of its table to one document in
the Lucene index. The <document> element in DIH is *NOT* mapped to a
document in the Lucene index; the top-level entity is. I feel the
<document> tag is redundant and misleading in the data config and thus
should be removed.
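
For illustration, here is a minimal sketch of that layout, reusing the
table names from Rupert's config quoted below. The queries and field lists
are trimmed and the JDBC URL is left elided as in the original, so treat
it as an outline rather than a tested configuration.

```xml
<dataConfig>
  <dataSource driver="com.mysql.jdbc.Driver" url="jdbc:mysql://...." />
  <!-- One <document>; each top-level <entity> yields one index document per row -->
  <document>
    <!-- Rows of blog_entries -->
    <entity name="blog_entries" query="select id,title,data from blog_entries">
      <field column="id" name="id" />
      <field column="title" name="text_t" />
    </entity>
    <!-- Rows of mesh_categories -->
    <entity name="mesh_categories" query="select id,name from mesh_categories">
      <field column="id" name="id" />
      <field column="name" name="text_t" />
    </entity>
  </document>
</dataConfig>
```

One caveat: if "id" is the unique key in the schema, rows from the two
tables that share an id value would overwrite each other, so a real
config would probably need a key that is distinct per table.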

Cheers.

On Sat, Nov 7, 2009 at 9:43 AM, Bertie Shen <[email protected]> wrote:

> I have the same problem. I had thought we could specify multiple <document>
> elements, each of which maps one table in the RDBMS. But I found that was
> not the case: DIH only picks the first <document> element to index.
>
> I think Rupert's request and mine are pretty common. Basically, there are
> multiple tables in the RDBMS, and we want each row in each table to become a
> document in the Lucene index. How can we write one data-config.xml file to let
> DataImportHandler import multiple tables at the same time?
>
> Rupert, have you figured out a way to do it?
>
> Thanks.
>
>
>
> On Tue, Sep 8, 2009 at 3:42 PM, Rupert Fiasco <[email protected]> wrote:
>
>> Maybe I should be more clear: I have multiple tables in my DB that I
>> need to save to my Solr index. In my app code I have logic to persist
>> each table, which maps to an application model, to Solr. This is fine.
>> I am just trying to speed up indexing time by using DIH instead of
>> going through my application. From what I understand of DIH I can
>> specify one dataSource element and then a series of document/entity
>> sets, for each of my models. But like I said before, DIH only appears
>> to want to index the first document declared under the dataSource tag.
>>
>> -Rupert
>>
>> On Tue, Sep 8, 2009 at 4:05 PM, Rupert Fiasco <[email protected]> wrote:
>> > I am using the DataImportHandler with a JDBC datasource. From my
>> > understanding of DIH, for each of my "content types" (e.g. blog posts,
>> > mesh categories, etc.) I would construct a series of document/entity
>> > sets, like
>> >
>> > <dataConfig>
>> > <dataSource driver="com.mysql.jdbc.Driver" url="jdbc:mysql://...." />
>> >
>> >    <!-- BLOG ENTRIES -->
>> >    <document name="blog_entries">
>> >      <entity name="blog_entries" query="select
>> > id,title,keywords,summary,data,title as name_fc,'BlogEntry' as type
>> > from blog_entries">
>> >        <field column="id" name="pk_i" />
>> >        <field column="id" name="id" />
>> >        <field column="title" name="text_t" />
>> >        <field column="data" name="text_t" />
>> >      </entity>
>> >    </document>
>> >
>> >    <!-- MESH CATEGORIES -->
>> >    <document name="mesh_category">
>> >      <entity name="mesh_categories" query="select
>> > id,name,node_key,name as name_fc,'MeshCategory' as type from
>> > mesh_categories">
>> >        <field column="id" name="pk_i" />
>> >        <field column="id" name="id" />
>> >        <field column="name" name="text_t" />
>> >        <field column="node_key" name="string" />
>> >        <field column="name_fc" name="facet_value" />
>> >        <field column="type" name="type_t" />
>> >      </entity>
>> >    </document>
>> > </dataConfig>
>> >
>> >
>> > Solr parses this just fine and allows me to issue a
>> > /dataimport?command=full-import and it runs, but it only runs against
>> > the "first" document (blog_entries). It doesn't run against the 2nd
>> > document (mesh_category).
>> >
>> > If I remove the 2 document elements and wrap both entity sets in just
>> > one document tag, then both sets get indexed, which seemingly achieves
>> > my goal. This just doesn't make sense from my understanding of how DIH
>> > works. My 2 content types are indeed separate so they logically
>> > represent two document types, not one.
>> >
>> > Is this correct? What am I missing here?
>> >
>> > Thanks
>> > -Rupert
>> >
>>
>
>
