Mike, it is very nice. Thank you.
On Sun, Dec 17, 2023 at 4:49 PM Mike Drob <md...@mdrob.com> wrote:

> You can!
> https://lists.apache.org/thread/brw7r0cf0t0m1wltxg5sky6t6d9crgxm
>
> On Sun, Dec 17, 2023 at 3:12 PM Vince McMahon <sippingonesandze...@gmail.com> wrote:
>
> > Thanks, Gus. I wish I could "bookmark" this reply. lol.
> >
> > On Sat, Dec 16, 2023 at 11:10 PM Gus Heck <gus.h...@gmail.com> wrote:
> >
> > > Yes, see the detectChangesViaHashing option here:
> > > https://github.com/nsoft/jesterj/wiki/Scanners
> > >
> > > In any Lucene index there's not really such a thing as an incremental
> > > update. When you want to do an "update" you send the whole document,
> > > and it's really a delete/insert under the covers (there are some
> > > esoteric exceptions, but generally this is true). So the way to think
> > > about this in search ingestion is "did the document change?" If so,
> > > send it again. There are two general strategies: either consulting a
> > > "modifiedTime" field (which has to be trustworthy, and still requires
> > > persistence to handle deletes) or hashing the document bytes (which
> > > always works without schema changes but can be sensitive to trivial
> > > changes).
> > >
> > > JesterJ has an embedded Cassandra which tracks document processing
> > > statuses and (if you configure it) will also use this embedded
> > > Cassandra to remember the hash of the document bytes (in your case
> > > the contents of a row) and then on subsequent scans check each row
> > > against the prior hash. The main caveat is that the document IDs must
> > > contain sufficient information to reproducibly retrieve the document.
> > > By default the concept is a document URL, though one could
> > > conceivably customize things to use some other scheme. By default a
> > > URL like jdbc:mydb:localhost:1234/sometable/42 would be used. In that
> > > example, the table primary key value is '42'.
> > > Of course it takes more CPU to hash the contents, so depending on the
> > > size of your data this may or may not be practical. However, for a
> > > few tens of thousands of documents, like you describe, this should
> > > work fine.
> > >
> > > Actual deletes *should* be supported too (if it's not working let me
> > > know). Order is only guaranteed for a given path through the DAG, so
> > > if you manage to design a system where documents linger a long time
> > > in some of the ingest paths but not others (longer than a typical
> > > scan interval) then there will be the possibility of a delete winning
> > > the race with an update or vice versa, but that is maybe a point at
> > > which you should look at your processing more carefully anyway. Right
> > > now everything is scan/pull based; push-based document sources are
> > > also possible but haven't been needed yet, and thus haven't been
> > > added.
> > >
> > > -Gus
> > >
> > > On Fri, Dec 15, 2023 at 8:35 PM Vince McMahon <sippingonesandze...@gmail.com> wrote:
> > >
> > > > I am impressed, Gus. Does it handle incremental changes from the
> > > > source db tables, such as insert, update, and delete?
> > > >
> > > > On Fri, Dec 15, 2023 at 12:58 PM Gus Heck <gus.h...@gmail.com> wrote:
> > > >
> > > > > Have you considered trying an existing document ingestion
> > > > > framework? I wrote this one: https://github.com/nsoft/jesterj
> > > > > It already has a database connector. If you do check it out and
> > > > > find difficulty, please let me know by leaving bug reports (if
> > > > > bug) or feedback (if confusion) in the discussions section here:
> > > > > https://github.com/nsoft/jesterj/discussions
> > > > >
> > > > > As Mikhail noted, it's not easy to build a robust ingestion
> > > > > system from scratch.
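[Editor's note: Gus's hash-based change-detection strategy above can be sketched in a few lines. This is a minimal illustration of the idea, not JesterJ's actual (Java/Cassandra) implementation; the `seen_hashes` dict stands in for the persistent hash store, and the ID format merely mimics the jdbc-URL example from the message.]

```python
import hashlib

# Stand-in for JesterJ's embedded Cassandra: any persistent key->hash
# store would do. Keys are reproducible document IDs, e.g. the
# jdbc-style URL "jdbc:mydb:localhost:1234/sometable/42".
seen_hashes = {}

def needs_reindex(doc_id: str, row_bytes: bytes) -> bool:
    """Return True if the row is new or changed since the last scan."""
    digest = hashlib.sha256(row_bytes).hexdigest()
    if seen_hashes.get(doc_id) == digest:
        return False  # unchanged: skip sending it to the index
    seen_hashes[doc_id] = digest
    return True       # new or modified: resend the whole document
```

Note this always flags a byte-level change, even a trivial one (extra whitespace, reordered columns), which is the sensitivity caveat Gus mentions.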
> > > > > -Gus
> > > > >
> > > > > On Fri, Dec 15, 2023 at 11:11 AM Dmitri Maziuk <dmitri.maz...@gmail.com> wrote:
> > > > >
> > > > > > On 12/15/23 05:41, Vince McMahon wrote:
> > > > > > > Ishan, you are right. Doing multithreaded indexing is going
> > > > > > > much faster. I found out after the remote machine became
> > > > > > > unresponsive very quickly; it crashed. lol.
> > > > > >
> > > > > > FWIW I got better results posting docs in batches from a
> > > > > > single thread. Work is in a "private org" on GitLab so I can't
> > > > > > post the link to the code, but the basic layout is a DB reader
> > > > > > that yields rows and a writer that does requests.post() of a
> > > > > > list of JSON docs, with the DB row -> JSON doc transformer in
> > > > > > between.
> > > > > >
> > > > > > I played with the size of the batch as well as an async/await
> > > > > > queue before leaving it single-threaded w/ a batch size of 5K
> > > > > > docs: I had no speed advantage with larger batches in our
> > > > > > setup. And it doesn't DDoS the index. ;)
> > > > > >
> > > > > > Dima
> > > > >
> > > > > --
> > > > > http://www.needhamsoftware.com (work)
> > > > > https://a.co/d/b2sZLD9 (my fantasy fiction book)
> > >
> > > --
> > > http://www.needhamsoftware.com (work)
> > > https://a.co/d/b2sZLD9 (my fantasy fiction book)
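[Editor's note: Dima's single-threaded batch layout (DB reader yields rows, transformer maps row -> JSON doc, writer posts a list of docs per batch) can be sketched roughly as below. It uses only the standard library instead of requests; the Solr update URL, the `row_to_doc` field mapping, and the 5K batch size are assumptions drawn from the message, not from Dima's private code.]

```python
import itertools
import json
import urllib.request

SOLR_UPDATE_URL = "http://localhost:8983/solr/mycore/update"  # assumed endpoint
BATCH_SIZE = 5000  # the batch size Dima settled on

def batches(rows, size=BATCH_SIZE):
    """Yield successive lists of at most `size` rows from any iterable."""
    it = iter(rows)
    while chunk := list(itertools.islice(it, size)):
        yield chunk

def row_to_doc(row):
    """Hypothetical DB-row -> JSON-doc transformer; adapt to your schema."""
    return {"id": row[0], "name_s": row[1]}

def post_batch(docs):
    """POST one list of JSON docs to Solr (requests.post works the same way)."""
    req = urllib.request.Request(
        SOLR_UPDATE_URL,
        data=json.dumps(docs).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status

def index(rows):
    """Single-threaded loop: read, transform, post in batches."""
    for chunk in batches(rows):
        post_batch([row_to_doc(r) for r in chunk])
```

Keeping this single-threaded with a fixed batch size is what avoided overwhelming ("DDoS-ing") the index in Dima's setup.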