Re: Large RDBMS dataset

Martin Koch Wed, 14 Dec 2011 07:16:34 -0800

Instead of handling it from within solr, I'd suggest writing an external
application (e.g. in python using pysolr) that wraps the (fast) SQL query
you like. Then retrieve a batch of documents, and write them to solr. For
extra speed, don't commit until you're done.


/Martin

On Wed, Dec 14, 2011 at 11:18 AM, Finotti Simone <tech...@yoox.com> wrote:

> Hello,
> I have a very large dataset (> 1 Mrecords) on the RDBMS which I want my
> Solr application to pull data from.
>
> Problem is that the document fields which I have to index aren't in the
> same table, but I have to join records with two other tables. Well, in fact
> they are views, but I don't think that this makes any difference.
>
> That's the data import handler that I've actually written:
>
> <?xml version="1.0"?>
> <dataConfig>
>  <dataSource type="JdbcDataSource"
> driver="net.sourceforge.jtds.jdbc.Driver"
> url="jdbc:jtds:sqlserver://YSQLDEV01BLQ/YooxProcessCluster1"
> instance="SVCSQLDEV" />
>  <document name="Products">
>    <entity name="fd" query="SELECT * FROM clust_w_fast_dump ORDER BY
> endeca_id;">
>      <entity name="fd2" query="SELECT macrocolor_id, color_descr,
> gsize_descr, size_descr FROM clust_w_fast_dump2_ByMarkets WHERE
> endeca_id='${fd.Endeca_ID}' ORDER BY endeca_id;"/>
>      <entity name="cpd" query="SELECT DepartmentCode, Ranking,
> DepartmentPriceRangeCode FROM clust_w_CatalogProductsDepartments_ByMarket
> WHERE endeca_id='${fd.Endeca_ID}' ORDER BY endeca_id;"/>
>      <entity name="env" query="SELECT Environment FROM clust_w_Environment
> WHERE endeca_id='${fd.Endeca_ID}' ORDER BY endeca_id;"/>
>    </entity>
>  </document>
> </dataConfig>
>
> It works, but it takes 1'38" to parse 100 records: it means 1 rec/s! That
> means that digesting the whole dataset would take 1 Ms (=> 12 days).
>
> The problem is that for each record in "fd", Solr makes three distinct
> SELECT on the other three tables. Of course, this is absolutely inefficient.
>
> Is there a way to have Solr loading every record in the four tables and
> join them when they are already loaded in memory?
>
> TIA
>

Re: Large RDBMS dataset

Reply via email to