Instead of handling it from within solr, I'd suggest writing an external application (e.g. in python using pysolr) that wraps the (fast) SQL query you like. Then retrieve a batch of documents, and write them to solr. For extra speed, don't commit until you're done.
/Martin On Wed, Dec 14, 2011 at 11:18 AM, Finotti Simone <tech...@yoox.com> wrote: > Hello, > I have a very large dataset (> 1 Mrecords) on the RDBMS which I want my > Solr application to pull data from. > > Problem is that the document fields which I have to index aren't in the > same table, but I have to join records with two other tables. Well, in fact > they are views, but I don't think that this makes any difference. > > That's the data import handler that I've actually written: > > <?xml version="1.0"?> > <dataConfig> > <dataSource type="JdbcDataSource" > driver="net.sourceforge.jtds.jdbc.Driver" > url="jdbc:jtds:sqlserver://YSQLDEV01BLQ/YooxProcessCluster1" > instance="SVCSQLDEV" /> > <document name="Products"> > <entity name="fd" query="SELECT * FROM clust_w_fast_dump ORDER BY > endeca_id;"> > <entity name="fd2" query="SELECT macrocolor_id, color_descr, > gsize_descr, size_descr FROM clust_w_fast_dump2_ByMarkets WHERE > endeca_id='${fd.Endeca_ID}' ORDER BY endeca_id;"/> > <entity name="cpd" query="SELECT DepartmentCode, Ranking, > DepartmentPriceRangeCode FROM clust_w_CatalogProductsDepartments_ByMarket > WHERE endeca_id='${fd.Endeca_ID}' ORDER BY endeca_id;"/> > <entity name="env" query="SELECT Environment FROM clust_w_Environment > WHERE endeca_id='${fd.Endeca_ID}' ORDER BY endeca_id;"/> > </entity> > </document> > </dataConfig> > > It works, but it takes 1'38" to parse 100 records: it means 1 rec/s! That > means that digesting the whole dataset would take 1 Ms (=> 12 days). > > The problem is that for each record in "fd", Solr makes three distinct > SELECT on the other three tables. Of course, this is absolutely inefficient. > > Is there a way to have Solr loading every record in the four tables and > join them when they are already loaded in memory? > > TIA >