Thank you guys for your advice!

I would rather take as much advantage as possible of the existing 
handlers/processors.

I just realised that nested entities in DIH are extremely slow: I fixed that 
with a DB view that joins 2 tables.

The other thing I have to do is chain the extraction of the file content with 
the DIH: tomorrow I will experiment with the different datasources and 
processors supported by DIH.
I have the feeling I will end up writing a separate service that extracts the 
content and puts it in the DB for faster indexing…
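
A minimal sketch of what such a separate service might look like, purely as 
an illustration. Everything named here is an assumption: the file_content 
table, the extract_text/init_db/store_content helpers, and SQLite standing in 
for the real DB. The extraction step is a placeholder that just reads the file 
as text; a real service would call a parser such as Tika there.

```python
import sqlite3
from pathlib import Path


def extract_text(path: Path) -> str:
    """Hypothetical extraction step.

    Placeholder only: reads the file as UTF-8 text. A real service would
    hand the file to a document parser (e.g. Tika), ideally running in its
    own process, and return the extracted text.
    """
    return path.read_text(encoding="utf-8", errors="replace")


def init_db(conn: sqlite3.Connection) -> None:
    """Create the (hypothetical) table that holds the extracted content."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS file_content ("
        "doc_id TEXT PRIMARY KEY, content TEXT NOT NULL)"
    )
    conn.commit()


def store_content(conn: sqlite3.Connection, doc_id: str, path: Path) -> None:
    """Extract one file and upsert its content, keyed by document id."""
    text = extract_text(path)
    conn.execute(
        "INSERT OR REPLACE INTO file_content (doc_id, content) VALUES (?, ?)",
        (doc_id, text),
    )
    conn.commit()
```

With the content sitting in a table like this, the indexing side would only 
have to read one more column, which is the reason I expect it to be faster 
than extracting during the import.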

I will report my results here in case others find them useful.



> On 10 Jul 2017, at 22:06, Walter Underwood <wun...@wunderwood.org> wrote:
> 
> I did this at Netflix with Solr 1.3, read stuff out of various databases and 
> sent it all to Solr. I’m not sure DIH even existed then.
> 
> At Chegg, we have a slightly more elaborate system because we have so many 
> collections and data sources. Each content owner writes an “extractor” that 
> makes a JSONL feed with the documents to index. We validate those, then have 
> a common “loader” that reads the JSONL and sends it to Solr with multiple 
> connections. Solr-specific stuff is done in update request processors.
> 
> Document parsing is always in a separate process. I’ve implemented it that 
> way three times with three different parser packages on two engines. Never on 
> Solr, though.
> 
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
> 
> 
>> On Jul 10, 2017, at 12:40 PM, Allison, Timothy B. <talli...@mitre.org> wrote:
>> 
>>> 4. Write an external program that fetches the file, fetches the metadata, 
>>> combines them, and sends them to Solr.
>> 
>> I've done this with some custom crawls. Thanks to Erick Erickson, this is a 
>> snap:
>> https://lucidworks.com/2012/02/14/indexing-with-solrj/
>> 
>> With the caveat that Tika should really be in a separate vm in production 
>> [1].
>> 
>> [1] 
>> http://events.linuxfoundation.org/sites/events/files/slides/ApacheConMiami2017_tallison_v2.pdf
>>  
>> 
> 
