Hi Lance, I think you are discussing a different issue here. We are talking about having each row from each table represent a document in the index. You seem to be discussing the case where some documents have multi-valued fields that are stored in a separate table in the RDBMS because of normalization.
On Mon, Nov 9, 2009 at 6:01 PM, Lance Norskog <goks...@gmail.com> wrote:
> There is a more fundamental problem here: a Solr/Lucene index only
> implements one table. If you have data from multiple tables in a
> normalized schema, you have to denormalize the multi-table DB schema to
> make a single-table Solr/Lucene index.
>
> Your indexing will probably be faster if you use a join in SQL to supply
> your entire set of fields per database request.
>
> 2009/11/7 Noble Paul നോബിള് नोब्ळ् <noble.p...@corp.aol.com>:
> > On Sun, Nov 8, 2009 at 8:25 AM, Bertie Shen <bertie.s...@gmail.com> wrote:
> >> I have figured out a way to solve this problem: just specify a
> >> single <document> blah blah blah </document>. Under <document>, specify
> >> multiple top-level entity entries, each of which corresponds to one
> >> table's data.
> >>
> >> So each top-level entity will map each of its rows to a document in the
> >> Lucene index. <document> in DIH is *NOT* mapped to a document in the Lucene
> >> index, while a top-level entity is. I feel the <document> tag is redundant
> >> and misleading in the data config and thus should be removed.
> >
> > There are some common attributes specified at the <document> level.
> > It still acts as a container tag.
> >>
> >> Cheers.
> >>
> >> On Sat, Nov 7, 2009 at 9:43 AM, Bertie Shen <bertie.s...@gmail.com> wrote:
> >>
> >>> I have the same problem. I had thought we could specify multiple
> >>> <document> blah blah blah </document>s, each of which maps to one table
> >>> in the RDBMS. But I found that was not the case. It only picks the first
> >>> <document>blah blah blah</document> to do the indexing.
> >>>
> >>> I think Rupert's and my requests are pretty common. Basically there are
> >>> multiple tables in the RDBMS, and we want each row in each table to become
> >>> a document in the Lucene index. How can we write one data-config.xml file
> >>> to let DataImportHandler import multiple tables at the same time?
> >>>
> >>> Rupert, have you figured out a way to do it?
> >>>
> >>> Thanks.
> >>>
> >>>
> >>> On Tue, Sep 8, 2009 at 3:42 PM, Rupert Fiasco <rufia...@gmail.com> wrote:
> >>>
> >>>> Maybe I should be more clear: I have multiple tables in my DB that I
> >>>> need to save to my Solr index. In my app code I have logic to persist
> >>>> each table, which maps an application model to Solr. This is fine.
> >>>> I am just trying to speed up indexing time by using DIH instead of
> >>>> going through my application. From what I understand of DIH, I can
> >>>> specify one dataSource element and then a series of document/entity
> >>>> sets, one for each of my models. But like I said before, DIH only
> >>>> appears to want to index the first document declared under the
> >>>> dataSource tag.
> >>>>
> >>>> -Rupert
> >>>>
> >>>> On Tue, Sep 8, 2009 at 4:05 PM, Rupert Fiasco <rufia...@gmail.com> wrote:
> >>>> > I am using the DataImportHandler with a JDBC datasource. From my
> >>>> > understanding of DIH, for each of my "content types", e.g. blog
> >>>> > posts, MeSH categories, etc., I would construct a series of
> >>>> > document/entity sets, like
> >>>> >
> >>>> > <dataConfig>
> >>>> >   <dataSource driver="com.mysql.jdbc.Driver" url="jdbc:mysql://...." />
> >>>> >
> >>>> >   <!-- BLOG ENTRIES -->
> >>>> >   <document name="blog_entries">
> >>>> >     <entity name="blog_entries" query="select
> >>>> >         id,title,keywords,summary,data,title as name_fc,'BlogEntry' as type
> >>>> >         from blog_entries">
> >>>> >       <field column="id" name="pk_i" />
> >>>> >       <field column="id" name="id" />
> >>>> >       <field column="title" name="text_t" />
> >>>> >       <field column="data" name="text_t" />
> >>>> >     </entity>
> >>>> >   </document>
> >>>> >
> >>>> >   <!-- MESH CATEGORIES -->
> >>>> >   <document name="mesh_category">
> >>>> >     <entity name="mesh_categories" query="select
> >>>> >         id,name,node_key,name as name_fc,'MeshCategory' as type
> >>>> >         from mesh_categories">
> >>>> >       <field column="id" name="pk_i" />
> >>>> >       <field column="id" name="id" />
> >>>> >       <field column="name" name="text_t" />
> >>>> >       <field column="node_key" name="string" />
> >>>> >       <field column="name_fc" name="facet_value" />
> >>>> >       <field column="type" name="type_t" />
> >>>> >     </entity>
> >>>> >   </document>
> >>>> > </dataConfig>
> >>>> >
> >>>> >
> >>>> > Solr parses this just fine and allows me to issue a
> >>>> > /dataimport?command=full-import and it runs, but it only runs
> >>>> > against the "first" document (blog_entries). It doesn't run against
> >>>> > the 2nd document (mesh_categories).
> >>>> >
> >>>> > If I remove the 2 document elements and wrap both entity sets in
> >>>> > just one document tag, then both sets get indexed, which seemingly
> >>>> > achieves my goal. This just doesn't make sense from my understanding
> >>>> > of how DIH works. My 2 content types are indeed separate, so they
> >>>> > logically represent two document types, not one.
> >>>> >
> >>>> > Is this correct? What am I missing here?
> >>>> >
> >>>> > Thanks
> >>>> > -Rupert
> >>>> >
> >>>>
> >>>
> >>>
> >>
> >
> >
> > --
> > -----------------------------------------------------
> > Noble Paul | Principal Engineer | AOL | http://aol.com
>
>
> --
> Lance Norskog
> goks...@gmail.com
>
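To summarize the workaround the thread converges on, here is a minimal data-config.xml sketch: a single <document> containing one top-level entity per table, so each row of each table becomes its own Lucene document. Table names and SQL are taken from Rupert's example; the destination field names (text_t, facet_value, type_t, etc.) are assumed to exist in the target schema.xml and may need adjusting.

```xml
<dataConfig>
  <dataSource driver="com.mysql.jdbc.Driver" url="jdbc:mysql://...." />
  <document>
    <!-- One top-level entity per table; DIH processes each entity
         in turn, and each returned row becomes one index document. -->
    <entity name="blog_entries"
            query="select id, title, data, title as name_fc,
                   'BlogEntry' as type from blog_entries">
      <field column="id" name="id" />
      <field column="title" name="text_t" />
      <field column="data" name="text_t" />
      <field column="name_fc" name="facet_value" />
      <field column="type" name="type_t" />
    </entity>
    <entity name="mesh_categories"
            query="select id, name, node_key, name as name_fc,
                   'MeshCategory' as type from mesh_categories">
      <field column="id" name="id" />
      <field column="name" name="text_t" />
      <field column="node_key" name="string" />
      <field column="name_fc" name="facet_value" />
      <field column="type" name="type_t" />
    </entity>
  </document>
</dataConfig>
```

With this layout, /dataimport?command=full-import indexes both tables in one run; a 'type' field populated per entity (as both posters already do) keeps the two content types distinguishable at query time.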