Thanks for pointing that out.

I work with relatively large data volumes.  The last time I used verbose
logging via the Solr UI, it crashed.  What is a better way to gather the logs
without crashing?


On Fri, Dec 8, 2023 at 5:11 PM Mikhail Khludnev <m...@apache.org> wrote:

> Vince,
> Regardless of DIH, LogUpdateProcessorFactory
> <
> https://solr.apache.org/guide/solr/latest/configuration-guide/update-request-processors.html#update-processor-factories-you-should-not-modify-or-remove
> >
> should log the deleteQuery which wiped the docs. You can enable verbose
> logging and find out what happened.
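For reference, LogUpdateProcessorFactory runs as part of Solr's default update chain, but a custom chain in solrconfig.xml must keep that processor for the delete logging to appear. A minimal sketch (the chain name and processor list are illustrative, not taken from this thread):

```xml
<!-- solrconfig.xml sketch: "logging-chain" is a placeholder name -->
<updateRequestProcessorChain name="logging-chain" default="true">
  <!-- logs each add/delete/deleteQuery at INFO level -->
  <processor class="solr.LogUpdateProcessorFactory"/>
  <!-- must come last: actually applies the update to the index -->
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```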
>
> On Fri, Dec 8, 2023 at 4:29 PM Vince McMahon <
> sippingonesandze...@gmail.com>
> wrote:
>
> > Hi,  ufuk
> >
> > I was thinking along the same lines, to broaden the choice of tools
> > for handling the delta load.  Flume looks like an interesting option.
> >
> > I'm so blessed to be working with so many smart and kind people in this
> > mailing list.
> >
> > Thank you.  Happy Friday.
> >
> > On Fri, Dec 8, 2023 at 1:48 AM ufuk yılmaz <uyil...@vivaldi.net.invalid>
> > wrote:
> >
> > > Hi Vince,
> > >
> > > It shouldn’t take too much time to write a simple loop in your favorite
> > > language which fetches rows from the db and sends them to Solr over http
> > > to the /update handler. IMO it’s easier than trying to figure out DIH’s
> > > particularities, especially if, in the future, you need to modify the
> > > documents based on some logical conditions before indexing.
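To make the suggestion above concrete, here is a minimal sketch of such a loop in Python. The core name "mycore", the table and column names, and the use of sqlite3 as the database driver are all assumptions for illustration; a real setup would substitute its own DB driver, schema, and Solr URL:

```python
import json
import sqlite3
import urllib.request

# Hypothetical Solr core; adjust host/core to your deployment.
SOLR_UPDATE_URL = "http://localhost:8983/solr/mycore/update?commit=true"
BATCH_SIZE = 1000

def fetch_batches(conn, batch_size=BATCH_SIZE):
    """Yield lists of row dicts from the db, batch_size rows at a time."""
    cur = conn.execute("SELECT id, title, body FROM docs")
    cols = [d[0] for d in cur.description]
    while True:
        rows = cur.fetchmany(batch_size)
        if not rows:
            break
        yield [dict(zip(cols, r)) for r in rows]

def post_batch(docs, url=SOLR_UPDATE_URL):
    """POST one batch of documents to Solr's /update handler as JSON."""
    req = urllib.request.Request(
        url,
        data=json.dumps(docs).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status

if __name__ == "__main__":
    conn = sqlite3.connect("source.db")  # stand-in for your real database
    for batch in fetch_batches(conn):
        post_batch(batch)
```

Because each batch is an independent POST, it is also easy to transform or filter documents before indexing, which is the flexibility ufuk mentions.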
> > >
> > > If you don’t mind learning yet another tool, we used Apache Flume to
> > > index data to Solr. It supports moving data from various sources into
> > > various destinations. For your use case, maybe you can use SQL as the
> > > source and MorphlineSolrSink as the destination (ctrl+f here:
> > > https://flume.apache.org/releases/content/1.11.0/FlumeUserGuide.html).
> > > There is an SQL source plugin here which looks a bit old but may work:
> > > https://github.com/keedio/flume-ng-sql-source
> > > You can also write your own source plugin. Flume just helps with
> > > guaranteed delivery, if you understand its way of working.
> > >
> > > I don’t know your business case but I’d prefer the first option most of
> > > the time.
> > >
> > > -ufuk yilmaz
> > >
> > > —
> > >
> > > > On 8 Dec 2023, at 02:22, Vince McMahon <sippingonesandze...@gmail.com>
> > > > wrote:
> > > >
> > > > Thanks, Shawn.
> > > >
> > > > DIH full-import, by itself, works very well.  It is a bummer that my
> > > > incremental load runs into the millions.  When specifying batchSize on
> > > > the data source, the delta-import will honor that batch size only once,
> > > > for the first fetch, then will loop through the rest at hundreds of
> > > > rows per second.  That doesn't help get all the indexing done within
> > > > the day I need.
> > > >
> > > > I hope this finding helps the maintainers of the code improve it.  It
> > > > took me days to realize what was happening.
> > > >
> > > > Thanks, again.
> > > >
> > > >
> > > >
> > > > On Thu, Dec 7, 2023, 4:49 PM Shawn Heisey <apa...@elyograg.org.invalid>
> > > > wrote:
> > > >
> > > >>> On 12/7/23 07:56, Vince McMahon wrote:
> > > >>> {
> > > >>>   "responseHeader": {
> > > >>>     "status": 0,
> > > >>>     "QTime": 0
> > > >>>   },
> > > >>>   "initArgs": [
> > > >>>     "defaults",
> > > >>>     [
> > > >>>       "config",
> > > >>>       "db-data-config.xml"
> > > >>>     ]
> > > >>>   ],
> > > >>>   "command": "status",
> > > >>>   "status": "idle",
> > > >>>   "importResponse": "",
> > > >>>   "statusMessages": {
> > > >>>     "Total Requests made to DataSource": "1",
> > > >>>     "Total Rows Fetched": "915000",
> > > >>>     "Total Documents Processed": "915000",
> > > >>>     "Total Documents Skipped": "0",
> > > >>>     "Full Dump Started": "2023-12-07 02:54:29",
> > > >>>     "": "Indexing completed. Added/Updated: 915000 documents.
> Deleted
> > > >>> 0 documents.",
> > > >>>     "Committed": "2023-12-07 02:54:51",
> > > >>>     "Time taken": "0:0:21.831"
> > > >>>   }
> > > >>> }
> > > >>
> > > >> There's no way Solr can index 915000 docs in 21 seconds without a LOT
> > > >> of threads in the indexing program, and DIH is single-threaded.  As
> > > >> you've already noted, it didn't actually index most of the documents.
> > > >> I don't have an answer as to why it didn't work.
> > > >>
> > > >> DIH lacks decent logging, error handling, and multi-threading.  It is
> > > >> not the most reliable way to index.  This is why it was deprecated a
> > > >> while back and then removed from 9.x.  You would be far better off
> > > >> writing your own indexing program rather than using DIH.
> > > >>
> > > >> I have an idea for a multi-threaded database->solr indexing program,
> > > >> but haven't had much time to spend on it.  If I can ever get it done,
> > > >> it will be freely available.
> > > >>
> > > >> On the entity, "rows" is not a valid attribute.  To control how many
> > > >> DB rows are fetched at a time, set batchSize on the dataSource
> > > >> element.  The default batchSize is 500.
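A hedged sketch of where batchSize goes in a DIH data config; the driver class, JDBC URL, credentials, and query below are placeholders, not values from this thread:

```xml
<!-- db-data-config.xml sketch: connection details are illustrative only -->
<dataConfig>
  <!-- batchSize belongs on the dataSource element, not on the entity -->
  <dataSource type="JdbcDataSource"
              driver="org.postgresql.Driver"
              url="jdbc:postgresql://localhost/mydb"
              user="solr"
              password="changeme"
              batchSize="10000"/>
  <document>
    <entity name="docs" query="SELECT id, title, body FROM docs"/>
  </document>
</dataConfig>
```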
> > > >>
> > > >> Thanks,
> > > >> Shawn
> > > >>
> > > >>
> > >
> >
>
>
> --
> Sincerely yours
> Mikhail Khludnev
>
