I did this at Netflix with Solr 1.3: read data out of various databases and sent it all to Solr. I’m not sure DIH even existed then.
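Roughly like this, as a sketch only: plain SolrJ pulling rows over JDBC and adding them as documents. The JDBC URL, table, column names, and collection URL below are made up, and the HttpSolrClient.Builder call assumes a 6.x-style SolrJ.

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class DbIndexer {
    public static void main(String[] args) throws Exception {
        // Hypothetical database and collection URLs; substitute your own.
        String jdbcUrl = "jdbc:mysql://dbhost/catalog";
        String solrUrl = "http://localhost:8983/solr/titles";

        try (Connection db = DriverManager.getConnection(jdbcUrl, "user", "password");
             SolrClient solr = new HttpSolrClient.Builder(solrUrl).build();
             Statement stmt = db.createStatement();
             // Hypothetical table and columns, just to show the shape.
             ResultSet rs = stmt.executeQuery("SELECT id, title, description FROM titles")) {

            while (rs.next()) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", rs.getString("id"));
                doc.addField("title", rs.getString("title"));
                doc.addField("description", rs.getString("description"));
                solr.add(doc);
            }
            solr.commit();
        }
    }
}

A real indexer would batch the adds and page through the table rather than run one big SELECT, but that is the basic shape.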
At Chegg, we have a slightly more elaborate system because we have so many collections and data sources. Each content owner writes an “extractor” that makes a JSONL feed with the documents to index. We validate those, then have a common “loader” that reads the JSONL and sends it to Solr with multiple connections (a rough sketch of that kind of loader is below the quoted message). Solr-specific stuff is done in update request processors.

Document parsing is always in a separate process. I’ve implemented it that way three times with three different parser packages on two engines. Never on Solr, though.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Jul 10, 2017, at 12:40 PM, Allison, Timothy B. <talli...@mitre.org> wrote:
>
>> 4. Write an external program that fetches the file, fetches the metadata,
>> combines them, and send them to Solr.
>
> I've done this with some custom crawls. Thanks to Erick Erickson, this is a snap:
> https://lucidworks.com/2012/02/14/indexing-with-solrj/
>
> With the caveat that Tika should really be in a separate vm in production [1].
>
> [1] http://events.linuxfoundation.org/sites/events/files/slides/ApacheConMiami2017_tallison_v2.pdf
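For the loader piece, here is a minimal sketch of the idea, not our actual code. It assumes one flat JSON object per line in the feed and uses SolrJ’s ConcurrentUpdateSolrClient to get the multiple connections; the collection URL, queue size, and thread count are placeholders.

import com.fasterxml.jackson.core.type.TypeReference;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient;
import org.apache.solr.common.SolrInputDocument;

import java.io.BufferedReader;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Map;

public class JsonlLoader {
    public static void main(String[] args) throws Exception {
        String feedPath = args[0];  // path to a validated JSONL feed
        // Placeholder collection URL; substitute the real one.
        String solrUrl = "http://localhost:8983/solr/mycollection";

        ObjectMapper mapper = new ObjectMapper();

        // ConcurrentUpdateSolrClient queues documents and streams them to Solr
        // on several threads/connections; queue size and thread count are made up.
        try (ConcurrentUpdateSolrClient solr =
                 new ConcurrentUpdateSolrClient.Builder(solrUrl)
                     .withQueueSize(1000)
                     .withThreadCount(4)
                     .build();
             BufferedReader reader = Files.newBufferedReader(Paths.get(feedPath))) {

            String line;
            while ((line = reader.readLine()) != null) {
                if (line.trim().isEmpty()) {
                    continue;
                }
                // One document per line: a flat JSON object of field name to value.
                Map<String, Object> fields =
                    mapper.readValue(line, new TypeReference<Map<String, Object>>() {});
                SolrInputDocument doc = new SolrInputDocument();
                fields.forEach(doc::addField);
                solr.add(doc);  // queued; sent asynchronously by the client
            }
            solr.blockUntilFinished();  // wait for the queue to drain
            solr.commit();
        }
    }
}

Validation and error handling happen before this step in our setup; the sketch only shows the send side, with the Solr-specific transformations left to update request processors.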