I have figured out a way to solve this problem: specify just a single <document> blah blah blah </document>. Under that <document>, specify multiple top-level entity entries, each of which corresponds to one table's data.
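For example, here is a rough sketch of what I mean, reusing the config from Rupert's original message quoted below. The table, column, and field names are copied from his mail, the JDBC url is left elided as he wrote it, and the only change is that both entities sit under one <document>. I have not run this exact file, so treat it as an illustration rather than a verified config:

```xml
<dataConfig>
  <dataSource driver="com.mysql.jdbc.Driver" url="jdbc:mysql://...." />

  <!-- One <document> only; it is just a container, not a Lucene document -->
  <document>

    <!-- Top-level entity: each row of blog_entries becomes one document -->
    <entity name="blog_entries"
            query="select id,title,keywords,summary,data,title as name_fc,'BlogEntry' as type from blog_entries">
      <field column="id" name="pk_i" />
      <field column="id" name="id" />
      <field column="title" name="text_t" />
      <field column="data" name="text_t" />
    </entity>

    <!-- Top-level entity: each row of mesh_categories becomes one document -->
    <entity name="mesh_categories"
            query="select id,name,node_key,name as name_fc,'MeshCategory' as type from mesh_categories">
      <field column="id" name="pk_i" />
      <field column="id" name="id" />
      <field column="name" name="text_t" />
      <field column="node_key" name="string" />
      <field column="name_fc" name="facet_value" />
      <field column="type" name="type_t" />
    </entity>

  </document>
</dataConfig>
```

With this layout a single /dataimport?command=full-import should walk both entities, which matches what Rupert saw when he wrapped both entity sets in one document tag.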
So each top-level entity maps each row of its table to one document in the Lucene index. The <document> element in DIH is *NOT* mapped to a document in the Lucene index, while a top-level entity is. I feel the <document> tag is redundant and misleading in the data config and thus should be removed.

Cheers.

On Sat, Nov 7, 2009 at 9:43 AM, Bertie Shen <bertie.s...@gmail.com> wrote:

> I have the same problem. I had thought we could specify multiple
> <document> blah blah blah </document>s, each of which maps one table in
> the RDBMS. But I found that is not the case. It only picks the first
> <document> blah blah blah </document> to index.
>
> I think Rupert's and my request are pretty common. Basically there are
> multiple tables in the RDBMS, and we want each row in each table to become
> a document in the Lucene index. How can we write one data config.xml file
> to let DataImportHandler import multiple tables at the same time?
>
> Rupert, have you figured out a way to do it?
>
> Thanks.
>
> On Tue, Sep 8, 2009 at 3:42 PM, Rupert Fiasco <rufia...@gmail.com> wrote:
>
>> Maybe I should be more clear: I have multiple tables in my DB that I
>> need to save to my Solr index. In my app code I have logic to persist
>> each table, which maps to an application model, to Solr. This is fine.
>> I am just trying to speed up indexing time by using DIH instead of
>> going through my application. From what I understand of DIH I can
>> specify one dataSource element and then a series of document/entity
>> sets, one for each of my models. But like I said before, DIH only appears
>> to want to index the first document declared under the dataSource tag.
>>
>> -Rupert
>>
>> On Tue, Sep 8, 2009 at 4:05 PM, Rupert Fiasco <rufia...@gmail.com> wrote:
>> > I am using the DataImportHandler with a JDBC datasource. From my
>> > understanding of DIH, for each of my "content types", e.g. Blog posts,
>> > Mesh Categories, etc., I would construct a series of document/entity
>> > sets, like
>> >
>> > <dataConfig>
>> >   <dataSource driver="com.mysql.jdbc.Driver" url="jdbc:mysql://...." />
>> >
>> >   <!-- BLOG ENTRIES -->
>> >   <document name="blog_entries">
>> >     <entity name="blog_entries" query="select
>> >       id,title,keywords,summary,data,title as name_fc,'BlogEntry' as type
>> >       from blog_entries">
>> >       <field column="id" name="pk_i" />
>> >       <field column="id" name="id" />
>> >       <field column="title" name="text_t" />
>> >       <field column="data" name="text_t" />
>> >     </entity>
>> >   </document>
>> >
>> >   <!-- MESH CATEGORIES -->
>> >   <document name="mesh_category">
>> >     <entity name="mesh_categories" query="select
>> >       id,name,node_key,name as name_fc,'MeshCategory' as type from
>> >       mesh_categories">
>> >       <field column="id" name="pk_i" />
>> >       <field column="id" name="id" />
>> >       <field column="name" name="text_t" />
>> >       <field column="node_key" name="string" />
>> >       <field column="name_fc" name="facet_value" />
>> >       <field column="type" name="type_t" />
>> >     </entity>
>> >   </document>
>> > </dataConfig>
>> >
>> > Solr parses this just fine and allows me to issue a
>> > /dataimport?command=full-import and it runs, but it only runs against
>> > the "first" document (blog_entries). It doesn't run against the 2nd
>> > document (mesh_categories).
>> >
>> > If I remove the 2 document elements and wrap both entity sets in just
>> > one document tag, then both sets get indexed, which seemingly achieves
>> > my goal. This just doesn't make sense from my understanding of how DIH
>> > works. My 2 content types are indeed separate so they logically
>> > represent two document types, not one.
>> >
>> > Is this correct? What am I missing here?
>> >
>> > Thanks
>> > -Rupert