Thank you guys for your advice! I would rather take as much advantage as possible of the existing handlers/processors.
I just realised that nested entities in DIH are extremely slow. I fixed that with a view on the DB that joins the 2 tables.

The other thing I have to do is chain the extraction of the file content with the DIH. Tomorrow I will experiment with the different data sources and processors supported by DIH. I have the feeling I will end up writing a separate service that extracts the content and puts it in the DB for faster indexing…

I will report my results here in case others find them useful.

> On 10 Jul 2017, at 22:06, Walter Underwood <wun...@wunderwood.org> wrote:
>
> I did this at Netflix with Solr 1.3, read stuff out of various databases and
> sent it all to Solr. I’m not sure DIH even existed then.
>
> At Chegg, we have slightly more elaborate system because we have so many
> collections and data sources. Each content owner writes an “extractor” that
> makes a JSONL feed with the documents to index. We validate those, then have
> a common “loader” that reads the JSONL and sends it to Solr with multiple
> connections. Solr-specific stuff is done in update request processors.
>
> Document parsing is always in a separate process. I’ve implemented it that
> way three times with three different parser packages on two engines. Never on
> Solr, though.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/ (my blog)
>
>
>> On Jul 10, 2017, at 12:40 PM, Allison, Timothy B. <talli...@mitre.org> wrote:
>>
>>> 4. Write an external program that fetches the file, fetches the metadata,
>>> combines them, and send them to Solr.
>>
>> I've done this with some custom crawls. Thanks to Erick Erickson, this is a
>> snap:
>> https://lucidworks.com/2012/02/14/indexing-with-solrj/
>>
>> With the caveat that Tika should really be in a separate vm in production
>> [1].
>>
>> [1]
>> http://events.linuxfoundation.org/sites/events/files/slides/ApacheConMiami2017_tallison_v2.pdf
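
For anyone who wants to try the extractor/loader split that Walter describes above, here is a rough Python sketch of the loader side only. It is not Chegg's code, just my reading of the idea: read a JSONL feed, validate each document, and POST the documents in batches to Solr's JSON update handler. The collection URL, the required-field list, the batch size, and all function names are assumptions made up for illustration. It also uses a single connection and sends no commit, on the assumption that autoCommit is configured on the Solr side.

```python
import json
import urllib.request
from typing import Iterable, Iterator

# Hypothetical values for illustration; they do not come from the thread.
SOLR_UPDATE_URL = "http://localhost:8983/solr/mycollection/update"
REQUIRED_FIELDS = ("id",)


def parse_feed(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one validated document per non-empty JSONL line."""
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines in the feed
        doc = json.loads(line)
        if not isinstance(doc, dict):
            raise ValueError(f"line {number}: expected a JSON object")
        missing = [name for name in REQUIRED_FIELDS if name not in doc]
        if missing:
            raise ValueError(f"line {number}: missing fields {missing}")
        yield doc


def batches(docs: Iterable[dict], size: int) -> Iterator[list]:
    """Group documents into lists of at most `size` items."""
    batch = []
    for doc in docs:
        batch.append(doc)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def build_request(batch: list, url: str = SOLR_UPDATE_URL) -> urllib.request.Request:
    """Build a POST whose body is a JSON array of documents."""
    body = json.dumps(batch).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def load(lines: Iterable[str], size: int = 500) -> None:
    """Send the whole feed to Solr, one batch per request (sequentially)."""
    for batch in batches(parse_feed(lines), size):
        with urllib.request.urlopen(build_request(batch)) as response:
            response.read()  # drain the response; HTTP errors raise here
```

With something like this, the extractor (Tika or whatever parser ends up being used) can run in its own process and only has to write one JSON object per line to a file. The loader would then be called as `load(open("feed.jsonl", encoding="utf-8"))`, where `feed.jsonl` is again a made-up name.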