> Maybe in the future we end up having as a core part of Solr some sort of offline processing capability so folks don't have to deploy "yet another system" ;-)

This has long felt like a major gap for Solr in general, but I don't think it makes sense to bake it directly into Solr, which should remain focused on being a better server.
Providing processing designed explicitly for search is why I started my JesterJ project in 2013 (https://github.com/nsoft/jesterj). I've been tweaking it on and off for a while and it basically works, but there is still one thing I had planned to add before declaring 1.0: a solution for fault tolerance in complex processing. What I have now works for simple linear cases, but branching/joining (full DAG) processing, where documents are split or cloned, is not handled well. (These complex designs work just fine apart from fault tolerance.)

Since it supports DAG structures (including disconnected DAGs), it could easily have a branch or a parallel set of steps to produce what Joel wants, or just be run separately as needed. This would mean implementing a DocumentProcessor to calculate the binary file, and perhaps a step to write the binary to disk, or one to send it to Solr if/when Solr gains a way of accepting the binary output directly.

I've long intended that if it gained traction, non-trivial contributors, and signs that people other than me were actually using it, I'd submit it as an Apache project and hope for it to become the go-to answer for small, medium, and medium-large production-level use cases (the thing in between DIH or SolrCell and a full build-out of a Spark streaming processing system). I think any sort of pre-processing for search should be feasible in it, and it is meant to be trivial to run, requiring only the configuration (or authoring) of the desired steps. It provides its own persistence and premade components for things like Tika, pulling data from a database via JDBC, scanning for documents on a filesystem, pre-analyzing fields based on a provided Solr schema, and sending documents to Solr (plus some other basic manipulations).

I just upgraded its dependencies to current versions on head, and if you want to check it out, ****DO NOT**** use the last release (it's ancient, buggy, and Java 8 only).
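To make the "implement a DocumentProcessor" idea concrete, here's a rough sketch of what such a step might look like. The interface shapes below (Document, DocumentProcessor, and the processor class name) are simplified assumptions for illustration only, not JesterJ's actual API; see the repo for the real interfaces.

```java
// Purely illustrative shapes -- assumptions for this email, not JesterJ's real API.
interface Document {
  byte[] getRawData();
  void setRawData(byte[] data);
}

interface DocumentProcessor {
  // A processor takes one document and returns zero or more documents,
  // which is what lets a step split, clone, or drop documents in the DAG.
  Document[] processDocument(Document doc);
}

/** Hypothetical step that replaces a document's raw bytes with a computed binary artifact. */
class BinaryArtifactProcessor implements DocumentProcessor {
  @Override
  public Document[] processDocument(Document doc) {
    byte[] binary = buildBinary(doc.getRawData());
    doc.setRawData(binary);
    // Hand the document to the next step (e.g. write-to-disk or send-to-Solr).
    return new Document[] { doc };
  }

  // Placeholder for whatever computation produces the binary output.
  private byte[] buildBinary(byte[] input) {
    return input == null ? new byte[0] : input;
  }
}
```

In a real topology this step would sit on a branch of the DAG, followed by a step that writes the bytes to disk or posts them to Solr.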
Instead, clone the latest and run:

  ./gradlew packUnoJar

That should build, test, and package everything, generating the full-dependencies jar needed to run. If it gives you issues, please report them on GitHub :)

Unfortunately it doesn't have a mailing list with archives yet, but I do have a free-tier Slack channel for it. (Actually, I should probably switch to Discord, shouldn't I, since that solves the sign-up problem... hmm, maybe tonight.)

Note: only tested on Mac/Linux; BYO Windows support for now (contributions welcome).

The only thing standing between HEAD and a fresh Beta3 release (or perhaps a 1.0, punting FTI DAG support to 1.1) is me finding time to re-document the licenses of the dependencies and verify that nothing contrary to Apache 2.0 has been pulled in (I'm endeavoring to follow https://www.apache.org/legal/resolved.html).

If any of the above makes you curious, give it a spin and let me know how it goes.

-Gus