Yep, good ol’ ETL. My solution was dumping the data as one JSON object per 
document, an optional transform step, then a multi-threaded Python loader that 
was schema-independent. The multi-threaded loader ran way faster than DIH. 

This approach also easily supports re-indexing and redoing a load after a failure.
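A minimal sketch of that kind of schema-independent, multi-threaded loader (the Solr URL, batch size, and thread count here are assumptions, not my original code):

```python
# Sketch: multi-threaded loader that POSTs batches of JSON docs to Solr.
# SOLR_UPDATE_URL is an assumed example endpoint; adjust core name and commit policy.
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from itertools import islice

SOLR_UPDATE_URL = "http://localhost:8983/solr/mycore/update?commit=false"

def batches(iterable, size):
    """Yield lists of up to `size` items from any iterable of documents."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

def post_batch(docs):
    """POST one batch of JSON documents to Solr's update handler."""
    body = json.dumps(docs).encode("utf-8")
    req = urllib.request.Request(
        SOLR_UPDATE_URL, data=body,
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return resp.status

def load(doc_iter, threads=8, batch_size=500):
    """Fan batches out across a thread pool; order of batches doesn't matter."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return list(pool.map(post_batch, batches(doc_iter, batch_size)))
```

Because the loader never looks inside the documents, re-indexing after a schema change or a failure is just re-running it over the same JSON dump.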

Back with Solr 1.3, before DIH, I wrote a Java program to fetch from the 
database, then load. That did some transformation, mostly making queue adds 
comparable with views (this was at Netflix).

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On May 29, 2025, at 10:05 AM, Dmitri Maziuk <dmitri.maz...@gmail.com> wrote:
> 
> On 5/29/25 11:43, Sarah Weissman wrote:
> 
>> I’ve been banging my head against this all week and I’m trying to figure out 
>> the best way forward at this point. Is DIH still a viable option or should I 
> be moving off of that to something else? Any advice or perspectives on this 
>> would be appreciated.
> 
> All you need is a DB reader script, a Solr POSTer script, and a filter in 
> between where you can do all your transforms. It can easily take less time 
> and effort to write than figuring out how to get the DIH working "in the 
> cloud".
> 
> Dima
> 
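The reader / filter / POSTer split Dima describes can be sketched as three small generator stages (function names and the example transform are illustrative assumptions, not a specific implementation):

```python
# Sketch: DB reader -> filter -> serialized docs ready for a Solr POSTer.
import json

def read_rows(cursor):
    """DB reader: yield one dict per row from a DB-API cursor."""
    cols = [d[0] for d in cursor.description]
    for row in cursor:
        yield dict(zip(cols, row))

def transform(doc):
    """Filter step: put per-document transforms here (example: lowercase keys)."""
    return {k.lower(): v for k, v in doc.items()}

def to_json_lines(docs):
    """Serialize each transformed doc; a POSTer script can stream these to Solr."""
    for doc in docs:
        yield json.dumps(transform(doc))
```

Keeping the transform as its own stage means the reader and POSTer never change when the mapping does.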
