There is a more fundamental problem here: a Solr/Lucene index only
implements one table. If you have data from multiple tables in a
normalized schema, you have to denormalize the multi-table DB schema to
make a single-table Solr/Lucene index.

Your indexing will probably be faster if you use a join in SQL to supply
your entire set of fields per database request.
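(For anyone finding this thread later: here is a minimal data-config sketch of the working approach discussed below, i.e. a single <document> containing multiple top-level entities, one per table. Table and field names are taken from Rupert's example and may need adjusting for your schema.)

```xml
<dataConfig>
  <dataSource driver="com.mysql.jdbc.Driver" url="jdbc:mysql://...." />
  <document>
    <!-- each top-level entity maps each of its rows to one Lucene document -->
    <entity name="blog_entries"
            query="select id,title,data,'BlogEntry' as type from blog_entries">
      <field column="id" name="id" />
      <field column="title" name="text_t" />
      <field column="data" name="text_t" />
      <field column="type" name="type_t" />
    </entity>
    <entity name="mesh_categories"
            query="select id,name,'MeshCategory' as type from mesh_categories">
      <field column="id" name="id" />
      <field column="name" name="text_t" />
      <field column="type" name="type_t" />
    </entity>
  </document>
</dataConfig>
```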

2009/11/7 Noble Paul നോബിള്‍  नोब्ळ् <noble.p...@corp.aol.com>:
> On Sun, Nov 8, 2009 at 8:25 AM, Bertie Shen <bertie.s...@gmail.com> wrote:
>> I have figured out a way to solve this problem: just specify a
>> single <document> blah blah blah </document>. Under <document>, specify
>> multiple top-level entity entries, each of which corresponds to one table's
>> data.
>>
>> So each top-level entity maps each of its rows to a document in the Lucene
>> index. <document> in DIH is *NOT* mapped to a document in the Lucene index;
>> the top-level entity is. I feel the <document> tag is redundant and misleading
>> in the data config and thus should be removed.
>
> There are some common attributes specified at the <document> level.
> It still acts as a container tag.
>>
>> Cheers.
>>
>> On Sat, Nov 7, 2009 at 9:43 AM, Bertie Shen <bertie.s...@gmail.com> wrote:
>>
>>> I have the same problem. I had thought we could specify multiple <document>
>>> blah blah blah</document>s, each of which maps one table in the RDBMS.
>>> But I found that was not the case. It only picks the first <document>blah blah
>>> blah</document> to do indexing.
>>>
>>> I think Rupert's and my requests are pretty common. Basically there are
>>> multiple tables in an RDBMS, and we want each row in each table to become a
>>> document in the Lucene index. How can we write one data-config.xml file to let
>>> DataImportHandler import multiple tables at the same time?
>>>
>>> Rupert, have you figured out a way to do it?
>>>
>>> Thanks.
>>>
>>>
>>>
>>> On Tue, Sep 8, 2009 at 3:42 PM, Rupert Fiasco <rufia...@gmail.com> wrote:
>>>
>>>> Maybe I should be more clear: I have multiple tables in my DB that I
>>>> need to save to my Solr index. In my app code I have logic to persist
>>>> each table, which maps to an application model to Solr. This is fine.
>>>> I am just trying to speed up indexing time by using DIH instead of
>>>> going through my application. From what I understand of DIH I can
>>>> specify one dataSource element and then a series of document/entity
>>>> sets, for each of my models. But like I said before, DIH only appears
>>>> to want to index the first document declared under the dataSource tag.
>>>>
>>>> -Rupert
>>>>
>>>> On Tue, Sep 8, 2009 at 4:05 PM, Rupert Fiasco<rufia...@gmail.com> wrote:
>>>> > I am using the DataImportHandler with a JDBC datasource. From my
>>>> > understanding of DIH, for each of my "content types" e.g. Blog posts,
>>>> > Mesh Categories, etc I would construct a series of document/entity
>>>> > sets, like
>>>> >
>>>> > <dataConfig>
>>>> > <dataSource driver="com.mysql.jdbc.Driver" url="jdbc:mysql://...." />
>>>> >
>>>> >    <!-- BLOG ENTRIES -->
>>>> >    <document name="blog_entries">
>>>> >      <entity name="blog_entries" query="select
>>>> > id,title,keywords,summary,data,title as name_fc,'BlogEntry' as type
>>>> > from blog_entries">
>>>> >        <field column="id" name="pk_i" />
>>>> >        <field column="id" name="id" />
>>>> >        <field column="title" name="text_t" />
>>>> >        <field column="data" name="text_t" />
>>>> >      </entity>
>>>> >    </document>
>>>> >
>>>> >    <!-- MESH CATEGORIES -->
>>>> >    <document name="mesh_category">
>>>> >      <entity name="mesh_categories" query="select
>>>> > id,name,node_key,name as name_fc,'MeshCategory' as type from
>>>> > mesh_categories">
>>>> >        <field column="id" name="pk_i" />
>>>> >        <field column="id" name="id" />
>>>> >        <field column="name" name="text_t" />
>>>> >        <field column="node_key" name="string" />
>>>> >        <field column="name_fc" name="facet_value" />
>>>> >        <field column="type" name="type_t" />
>>>> >      </entity>
>>>> >    </document>
>>>> > </dataConfig>
>>>> >
>>>> >
>>>> > Solr parses this just fine and allows me to issue a
>>>> > /dataimport?command=full-import and it runs, but it only runs against
>>>> > the "first" document (blog_entries). It doesn't run against the 2nd
>>>> > document (mesh_categories).
>>>> >
>>>> > If I remove the 2 document elements and wrap both entity sets in just
>>>> > one document tag, then both sets get indexed, which seemingly achieves
>>>> > my goal. This just doesn't make sense from my understanding of how DIH
>>>> > works. My 2 content types are indeed separate, so they logically
>>>> > represent two document types, not one.
>>>> >
>>>> > Is this correct? What am I missing here?
>>>> >
>>>> > Thanks
>>>> > -Rupert
>>>> >
>>>>
>>>
>>>
>>
>
>
>
> --
> -----------------------------------------------------
> Noble Paul | Principal Engineer| AOL | http://aol.com
>



-- 
Lance Norskog
goks...@gmail.com
