Mike, it is very nice.

Thank you.

On Sun, Dec 17, 2023 at 4:49 PM Mike Drob <md...@mdrob.com> wrote:

> You can!
> https://lists.apache.org/thread/brw7r0cf0t0m1wltxg5sky6t6d9crgxm
>
> On Sun, Dec 17, 2023 at 3:12 PM Vince McMahon <
> sippingonesandze...@gmail.com>
> wrote:
>
> > Thanks, Gus.  I wish I could "bookmark" this reply. lol.
> >
> > On Sat, Dec 16, 2023 at 11:10 PM Gus Heck <gus.h...@gmail.com> wrote:
> >
> > > Yes. See the detectChangesViaHashing option here:
> > > https://github.com/nsoft/jesterj/wiki/Scanners
> > >
> > > In any Lucene index there's not really such a thing as an incremental
> > > update. When you want to do an "update" you send the whole document,
> > > and it's really a delete/insert under the covers (there are some
> > > esoteric exceptions, but generally this is true). So the way to think
> > > about this in search ingestion is "did the document change?" If so,
> > > send it again. There are two general strategies: either consult a
> > > "modifiedTime" field (which has to be trustworthy, and still requires
> > > persistence to handle deletes) or hash the document bytes (which
> > > always works without schema changes but can be sensitive to trivial
> > > changes).
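[Editor's note: a minimal sketch of the hashing strategy described above. The function name and in-memory dict are illustrative assumptions, not JesterJ's actual API; JesterJ persists the hashes in its embedded Cassandra rather than a dict.]

```python
import hashlib

def changed_rows(rows, prior_hashes):
    """Yield (row_id, row_bytes) for rows whose content hash differs from
    the hash recorded on the previous scan. `rows` is an iterable of
    (row_id, row_bytes) pairs; `prior_hashes` maps row_id -> hex digest
    and is updated in place so it reflects the current scan afterwards."""
    for row_id, row_bytes in rows:
        digest = hashlib.sha256(row_bytes).hexdigest()
        if prior_hashes.get(row_id) != digest:
            prior_hashes[row_id] = digest
            yield row_id, row_bytes

# First scan: everything is new, so every row is "changed".
prior = {}
scan1 = [(1, b"alice"), (2, b"bob")]
print([rid for rid, _ in changed_rows(scan1, prior)])  # [1, 2]

# Second scan: only the edited row is re-sent.
scan2 = [(1, b"alice"), (2, b"bobby")]
print([rid for rid, _ in changed_rows(scan2, prior)])  # [2]
```

Note the trade-off mentioned above: a single flipped byte (even a trivial formatting change) produces a new digest, so hashing re-sends documents a "modifiedTime" check might skip.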
> > >
> > > JesterJ has an embedded Cassandra which tracks document processing
> > > statuses and (if you configure it) will also use this embedded
> > > Cassandra to remember the hash of the document bytes (in your case
> > > the contents of a row) and then on subsequent scans check each row
> > > against the prior hash. The main caveat is that the document IDs must
> > > contain sufficient information to reproducibly retrieve the document.
> > > By default the concept is a document url, though one could
> > > conceivably customize things to use some other scheme. By default a
> > > url like jdbc:mydb:localhost:1234/sometable/42 would be used. In that
> > > example, the table primary key value is '42'. Of course it takes more
> > > CPU to hash the contents, so depending on the size of your data this
> > > may or may not be practical. However, for a few tens of thousands of
> > > documents, like you describe, this should work fine.
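[Editor's note: the ID scheme above can be sketched as follows; the helper name is made up, and only the example url itself comes from the text above.]

```python
def document_id(db_url, table, primary_key):
    """Build a reproducible document ID from a JDBC-style URL, a table
    name, and a primary key value, mirroring the
    jdbc:mydb:localhost:1234/sometable/42 pattern described above."""
    return f"{db_url}/{table}/{primary_key}"

print(document_id("jdbc:mydb:localhost:1234", "sometable", 42))
# jdbc:mydb:localhost:1234/sometable/42
```

The point of the scheme is that the same row always yields the same ID, so a later scan can fetch the row again, re-hash it, and compare against the stored hash.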
> > >
> > > Actual deletes *should* be supported too (if they're not working,
> > > let me know). Order is only guaranteed for a given path through the
> > > DAG, so if you manage to design a system where documents linger a
> > > long time in some of the ingest paths but not others (longer than a
> > > typical scan interval), then there is the possibility of a delete
> > > winning the race with an update or vice versa, but that may be a
> > > point at which you should look at your processing more carefully
> > > anyway. Right now everything is scan/pull based; push-based document
> > > sources are also possible but haven't been needed yet, and thus
> > > haven't been added.
> > >
> > > -Gus
> > >
> > >
> > > On Fri, Dec 15, 2023 at 8:35 PM Vince McMahon <
> > > sippingonesandze...@gmail.com>
> > > wrote:
> > >
> > > > I am impressed, Gus.  Does it handle incremental changes from the
> > > > source db tables, such as insert, update, and delete?
> > > >
> > > > On Fri, Dec 15, 2023 at 12:58 PM Gus Heck <gus.h...@gmail.com>
> wrote:
> > > >
> > > > > Have you considered trying an existing document ingestion
> > > > > framework? I wrote this one: https://github.com/nsoft/jesterj
> > > > > It already has a database connector. If you do check it out and
> > > > > find difficulty, please let me know by leaving bug reports (if
> > > > > bug) or feedback (if confusion) in the discussions section here:
> > > > > https://github.com/nsoft/jesterj/discussions
> > > > >
> > > > > As Mikhail noted, it's not easy to build a robust ingestion
> > > > > system from scratch.
> > > > >
> > > > > -Gus
> > > > >
> > > > > On Fri, Dec 15, 2023 at 11:11 AM Dmitri Maziuk <
> > > dmitri.maz...@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > On 12/15/23 05:41, Vince McMahon wrote:
> > > > > > > Ishan, you are right.  Doing multithreaded indexing is going
> > > > > > > much faster. I found out after the remote machine became
> > > > > > > unresponsive very quickly; it crashed.  lol.
> > > > > > FWIW I got better results posting docs in batches from a single
> > > > > > thread. Work is in a "private org" on gitlab so I can't post the
> > > > > > link to the code, but the basic layout is a DB reader that
> > > > > > yields rows and a writer that does requests.post() of a list of
> > > > > > JSON docs, with the DB row -> JSON doc transformer in-between.
> > > > > >
> > > > > > I played with the size of the batch as well as an async/await
> > > > > > queue before leaving it single-threaded w/ a batch size of 5K
> > > > > > docs: I had no speed advantage with larger batches in our
> > > > > > setup. And it doesn't DDoS the index. ;)
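[Editor's note: a sketch of the single-threaded batch layout described above, using only the Python standard library. The Solr URL, core name, field names, and helper names are illustrative assumptions; Dima's actual code uses requests.post() and is not public.]

```python
import json
import urllib.request

SOLR_UPDATE_URL = "http://localhost:8983/solr/mycore/update?commit=true"  # assumed
BATCH_SIZE = 5000  # the size that worked well in the setup described above

def read_rows():
    """Stand-in for the DB reader; real code would iterate a DB cursor."""
    yield from ({"id": str(i), "title_s": f"doc {i}"} for i in range(12000))

def to_doc(row):
    """Stand-in for the row -> JSON doc transformer."""
    return row  # real code would rename/convert fields here

def batches(docs, size):
    """Group an iterable of docs into lists of at most `size` docs."""
    batch = []
    for doc in docs:
        batch.append(doc)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch

def post_batch(batch):
    """POST one batch of JSON docs to Solr's update handler."""
    req = urllib.request.Request(
        SOLR_UPDATE_URL,
        data=json.dumps(batch).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status

# Driver (not run here, since it needs a live Solr):
#   for batch in batches(map(to_doc, read_rows()), BATCH_SIZE):
#       post_batch(batch)
```

Keeping the reader, transformer, and writer as separate generators/functions is what makes the batch size a one-line tuning knob, which matches the experiment described above.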
> > > > > >
> > > > > > Dima
> > > > > >
> > > > > >
> > > > >
> > > > > --
> > > > > http://www.needhamsoftware.com (work)
> > > > > https://a.co/d/b2sZLD9 (my fantasy fiction book)
> > > > >
> > > >
> > >
> > >
> > > --
> > > http://www.needhamsoftware.com (work)
> > > https://a.co/d/b2sZLD9 (my fantasy fiction book)
> > >
> >
>