> Maybe in the future we end up having as a core part of Solr some sort of offline processing capability so folks don't have to deploy "yet another system" ;-)

This has long felt like a major gap for Solr in general, but I don't think it makes sense to bake it directly into Solr, which should remain focused on being a better server.
Providing processing designed explicitly for search is why I started my JesterJ project in 2013 (https://github.com/nsoft/jesterj). I've been tweaking it on and off for a while and it basically works, but there is still one thing I had planned to add before declaring 1.0: a solution for fault tolerance in complex processing. What I have now works for simple linear cases, but branching/joining (full DAG) processing, where documents are split or cloned, is not handled well. (These complex designs work just fine apart from fault tolerance.)

Since it supports DAG structures (including disconnected DAGs), it could easily have a branch or a parallel set of steps to produce what Joel wants, or just be run separately as needed. This would mean implementing a DocumentProcessor to calculate the binary file, and perhaps a step to write the binary to disk, or one to send it to Solr if/when Solr gains a way of accepting the binary output directly.

I've long intended that if it gained traction, non-trivial contributors, and signs that people other than me were actually using it, I'd submit it as an Apache project and hope for it to become the go-to answer for small, medium, and medium-large production-level use cases (the thing in between DIH or SolrCell and a full build-out of a Spark streaming processing system). I think any sort of pre-processing for search should be feasible in it, and it is meant to be trivial to run, requiring only the configuration (or authoring) of the desired steps. It provides its own persistence and premade components for things like Tika, pulling data from a database via JDBC, scanning for documents on a filesystem, pre-analyzing fields based on a provided Solr schema, and sending documents to Solr (plus some other basic manipulations).

I just upgraded its dependencies to current versions on head, and if you want to check it out, ****DO NOT**** use the last release (it's ancient, buggy, and Java 8 only).
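To make the "implement a DocumentProcessor" idea concrete, here's a rough sketch of what such a step might look like. The interface shapes below (Document, DocumentProcessor, and the processor class name) are simplified assumptions for illustration only, not JesterJ's actual API; see the repo for the real interfaces.

```java
// Purely illustrative shapes -- assumptions for this email, not JesterJ's real API.
interface Document {
  byte[] getRawData();
  void setRawData(byte[] data);
}

interface DocumentProcessor {
  // A processor takes one document and returns zero or more documents,
  // which is what lets a step split, clone, or drop documents in the DAG.
  Document[] processDocument(Document doc);
}

/** Hypothetical step that replaces a document's raw bytes with a computed binary artifact. */
class BinaryArtifactProcessor implements DocumentProcessor {
  @Override
  public Document[] processDocument(Document doc) {
    byte[] binary = buildBinary(doc.getRawData());
    doc.setRawData(binary);
    // Hand the document to the next step (e.g. write-to-disk or send-to-Solr).
    return new Document[] { doc };
  }

  // Placeholder for whatever computation produces the binary output.
  private byte[] buildBinary(byte[] input) {
    return input == null ? new byte[0] : input;
  }
}
```

In a real topology this step would sit on a branch of the DAG, followed by a step that writes the bytes to disk or posts them to Solr.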
Instead, clone the latest and run:

  ./gradlew packUnoJar

That should build, test, and package everything, generating the full-dependencies jar needed to run. If it gives you issues, please report them on GitHub :)

Unfortunately it doesn't have a mailing list with archives yet, but I do have a free-tier Slack channel for it. (Actually, I should probably switch to Discord, shouldn't I, since that solves the sign-up problem... hmm, maybe tonight.)

Note: only tested on Mac/Linux; BYO Windows support for now (contributions welcome).

The only thing standing between HEAD and a fresh Beta3 release (or perhaps a 1.0, punting FTI DAG support to 1.1) is me finding time to re-document the licenses of the dependencies and verify that nothing contrary to Apache 2.0 has been pulled in (I'm endeavoring to follow https://www.apache.org/legal/resolved.html).

If any of the above makes you curious, give it a spin and let me know how it goes.

-Gus