I did this at Netflix with Solr 1.3: read data out of various databases and sent it all to Solr. I’m not sure DIH even existed then.
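Roughly like this, as a sketch only: plain SolrJ pulling rows over JDBC and adding them as documents. The JDBC URL, table, column names, and collection URL below are made up, and the HttpSolrClient.Builder call assumes a 6.x-style SolrJ.

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class DbIndexer {
    public static void main(String[] args) throws Exception {
        // Hypothetical database and collection URLs; substitute your own.
        String jdbcUrl = "jdbc:mysql://dbhost/catalog";
        String solrUrl = "http://localhost:8983/solr/titles";

        try (Connection db = DriverManager.getConnection(jdbcUrl, "user", "password");
             SolrClient solr = new HttpSolrClient.Builder(solrUrl).build();
             Statement stmt = db.createStatement();
             // Hypothetical table and columns, just to show the shape.
             ResultSet rs = stmt.executeQuery("SELECT id, title, description FROM titles")) {

            while (rs.next()) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", rs.getString("id"));
                doc.addField("title", rs.getString("title"));
                doc.addField("description", rs.getString("description"));
                solr.add(doc);
            }
            solr.commit();
        }
    }
}

A real indexer would batch the adds and page through the table rather than run one big SELECT, but that is the basic shape.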
At Chegg, we have a slightly more elaborate system because we have so many collections and data sources. Each content owner writes an “extractor” that makes a JSONL feed with the documents to index. We validate those, then have a common “loader” that reads the JSONL and sends it to Solr with multiple connections (a rough sketch of that kind of loader is below the quoted message). Solr-specific stuff is done in update request processors.

Document parsing is always in a separate process. I’ve implemented it that way three times with three different parser packages on two engines. Never on Solr, though.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Jul 10, 2017, at 12:40 PM, Allison, Timothy B. <talli...@mitre.org> wrote:
>
>> 4. Write an external program that fetches the file, fetches the metadata,
>> combines them, and send them to Solr.
>
> I've done this with some custom crawls. Thanks to Erick Erickson, this is a snap:
> https://lucidworks.com/2012/02/14/indexing-with-solrj/
>
> With the caveat that Tika should really be in a separate vm in production [1].
>
> [1] http://events.linuxfoundation.org/sites/events/files/slides/ApacheConMiami2017_tallison_v2.pdf
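For the loader piece, here is a minimal sketch of the idea, not our actual code. It assumes one flat JSON object per line in the feed and uses SolrJ’s ConcurrentUpdateSolrClient to get the multiple connections; the collection URL, queue size, and thread count are placeholders.

import com.fasterxml.jackson.core.type.TypeReference;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient;
import org.apache.solr.common.SolrInputDocument;

import java.io.BufferedReader;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Map;

public class JsonlLoader {
    public static void main(String[] args) throws Exception {
        String feedPath = args[0];  // path to a validated JSONL feed
        // Placeholder collection URL; substitute the real one.
        String solrUrl = "http://localhost:8983/solr/mycollection";

        ObjectMapper mapper = new ObjectMapper();

        // ConcurrentUpdateSolrClient queues documents and streams them to Solr
        // on several threads/connections; queue size and thread count are made up.
        try (ConcurrentUpdateSolrClient solr =
                 new ConcurrentUpdateSolrClient.Builder(solrUrl)
                     .withQueueSize(1000)
                     .withThreadCount(4)
                     .build();
             BufferedReader reader = Files.newBufferedReader(Paths.get(feedPath))) {

            String line;
            while ((line = reader.readLine()) != null) {
                if (line.trim().isEmpty()) {
                    continue;
                }
                // One document per line: a flat JSON object of field name to value.
                Map<String, Object> fields =
                    mapper.readValue(line, new TypeReference<Map<String, Object>>() {});
                SolrInputDocument doc = new SolrInputDocument();
                fields.forEach(doc::addField);
                solr.add(doc);  // queued; sent asynchronously by the client
            }
            solr.blockUntilFinished();  // wait for the queue to drain
            solr.commit();
        }
    }
}

Validation and error handling happen before this step in our setup; the sketch only shows the send side, with the Solr-specific transformations left to update request processors.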