Thanks for pointing that out. I work with relatively large data volumes. Last time I used verbose logging via the Solr UI, it crashed. What is a better way to gather the logs without crashing?
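For example, would setting the logger level through the admin Logging API, and then tailing solr.log on disk, be safer than viewing the output in the UI? A rough sketch of what I mean (the host/port and the exact logger name are assumptions for my setup):

```python
# Sketch: build the Solr admin Logging API call that raises one logger to
# DEBUG, so verbose output lands in solr.log on disk instead of the admin UI.
# SOLR_BASE and LOGGER are placeholders for my environment.
from urllib.parse import urlencode

SOLR_BASE = "http://localhost:8983/solr"
LOGGER = "org.apache.solr.update.processor.LogUpdateProcessorFactory"

def set_level_url(base, logger, level):
    """URL for Solr's Logging API: /admin/info/logging?set=<logger>:<level>."""
    return f"{base}/admin/info/logging?" + urlencode({"set": f"{logger}:{level}"})

print(set_level_url(SOLR_BASE, LOGGER, "DEBUG"))
# Fetch this URL (e.g. with curl), then tail solr.log rather than the UI.
```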
On Fri, Dec 8, 2023 at 5:11 PM Mikhail Khludnev <m...@apache.org> wrote:

> Vince,
> Regardless of DIH, LogUpdateProcessorFactory
> <https://solr.apache.org/guide/solr/latest/configuration-guide/update-request-processors.html#update-processor-factories-you-should-not-modify-or-remove>
> should log the deleteQuery which wiped the docs. You can enable verbose
> logging and find out what happened.
>
> On Fri, Dec 8, 2023 at 4:29 PM Vince McMahon <sippingonesandze...@gmail.com> wrote:
>
> > Hi, ufuk
> >
> > I was thinking along the same lines to broaden the tool of choice
> > for handling delta-load. Flume looks like an interesting option.
> >
> > I'm so blessed to be working with so many smart and kind people on this
> > mailing list.
> >
> > Thank you. Happy Friday.
> >
> > On Fri, Dec 8, 2023 at 1:48 AM ufuk yılmaz <uyil...@vivaldi.net.invalid> wrote:
> >
> > > Hi Vince,
> > >
> > > It shouldn't take too much time to write a simple loop in your favorite
> > > language which fetches rows from the db and sends them to Solr over http
> > > to the /update handler. IMO it's easier than trying to figure out DIH's
> > > particularities, especially if, in the future, you need to modify the
> > > documents based on some logical conditions before indexing.
> > >
> > > If you don't mind learning yet another tool, we used Apache Flume to
> > > index data to Solr. It supports moving data from various sources into
> > > various destinations. For your use case, maybe you can use SQL as the
> > > source and MorphlineSolrSink as the destination (ctrl+f here:
> > > https://flume.apache.org/releases/content/1.11.0/FlumeUserGuide.html).
> > > There is an SQL source plugin here which looks a bit old but may work:
> > > https://github.com/keedio/flume-ng-sql-source
> > > You can also write your own source plugin. Flume just helps with
> > > guaranteed delivery, if you understand its way of working.
> > > I don't know your business case, but I'd prefer the first option most
> > > of the time.
> > >
> > > -ufuk yilmaz
> > >
> > > —
> > >
> > > > On 8 Dec 2023, at 02:22, Vince McMahon <sippingonesandze...@gmail.com> wrote:
> > > >
> > > > Thanks, Shawn.
> > > >
> > > > DIH full-import, by itself, works very well. It is a bummer that my
> > > > incremental load itself runs into the millions. When specifying
> > > > batchSize on the data source, the delta-import will honor that batch
> > > > size once, for the first fetch, then will loop through the rest at
> > > > hundreds per second. That doesn't help get all the indexing done in a
> > > > day for my need.
> > > >
> > > > I hope this finding may help the maintainers of the code to improve it.
> > > > It took me days to realize it.
> > > >
> > > > Thanks, again.
> > > >
> > > > On Thu, Dec 7, 2023, 4:49 PM Shawn Heisey <apa...@elyograg.org.invalid> wrote:
> > > >
> > > > > On 12/7/23 07:56, Vince McMahon wrote:
> > > > > > {
> > > > > >   "responseHeader": {
> > > > > >     "status": 0,
> > > > > >     "QTime": 0
> > > > > >   },
> > > > > >   "initArgs": [
> > > > > >     "defaults",
> > > > > >     [
> > > > > >       "config",
> > > > > >       "db-data-config.xml"
> > > > > >     ]
> > > > > >   ],
> > > > > >   "command": "status",
> > > > > >   "status": "idle",
> > > > > >   "importResponse": "",
> > > > > >   "statusMessages": {
> > > > > >     "Total Requests made to DataSource": "1",
> > > > > >     "Total Rows Fetched": "915000",
> > > > > >     "Total Documents Processed": "915000",
> > > > > >     "Total Documents Skipped": "0",
> > > > > >     "Full Dump Started": "2023-12-07 02:54:29",
> > > > > >     "": "Indexing completed. Added/Updated: 915000 documents. Deleted 0 documents.",
> > > > > >     "Committed": "2023-12-07 02:54:51",
> > > > > >     "Time taken": "0:0:21.831"
> > > > > >   }
> > > > > > }
> > > > >
> > > > > There's no way Solr can index 915000 docs in 21 seconds without a LOT
> > > > > of threads in the indexing program, and DIH is single-threaded.
> > > > > As you've already noted, it didn't actually index most of the
> > > > > documents. I don't have an answer as to why it didn't work.
> > > > >
> > > > > DIH lacks decent logging, error handling, and multi-threading. It is
> > > > > not the most reliable way to index. This is why it was deprecated a
> > > > > while back and then removed from 9.x. You would be far better off
> > > > > writing your own indexing program rather than using DIH.
> > > > >
> > > > > I have an idea for a multi-threaded database->Solr indexing program,
> > > > > but haven't had much time to spend on it. If I can ever get it done,
> > > > > it will be freely available.
> > > > >
> > > > > On the entity, "rows" is not a valid attribute. To control how many
> > > > > DB rows are fetched at a time, set batchSize on the dataSource
> > > > > element. The default batchSize is 500.
> > > > >
> > > > > Thanks,
> > > > > Shawn

--
Sincerely yours
Mikhail Khludnev
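P.S. To check my understanding of the "simple loop" suggested in the thread above, I sketched roughly what I'd write. SQLite stands in for my real source database, and the field names, batch size, and /update URL are illustrative placeholders, not my actual setup:

```python
# Sketch of the "fetch rows from the db, send them to Solr's /update handler"
# loop. An in-memory SQLite table stands in for the real source database;
# the Solr URL, fields, and batch size below are placeholders.
import json
import sqlite3
import urllib.request

def batches(cursor, size):
    """Yield lists of row-dicts of at most `size` rows from an open DB cursor."""
    cols = [d[0] for d in cursor.description]
    while True:
        rows = cursor.fetchmany(size)
        if not rows:
            return
        yield [dict(zip(cols, r)) for r in rows]

def send_batch(solr_update_url, docs):
    """POST one JSON batch of documents to Solr's /update handler."""
    req = urllib.request.Request(
        solr_update_url,
        data=json.dumps(docs).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:  # raises on HTTP errors
        return resp.status

# Demo data in place of the real table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id TEXT, title TEXT)")
conn.executemany("INSERT INTO docs VALUES (?, ?)",
                 [(str(i), f"title {i}") for i in range(5)])

cur = conn.execute("SELECT id, title FROM docs")
for batch in batches(cur, 2):
    print(len(batch))
    # In the real loop, each batch would be sent with something like:
    # send_batch("http://localhost:8983/solr/mycollection/update?commit=false", batch)
```

Splitting the batching from the HTTP call would also make it easy to add the per-document transformations mentioned in the thread before sending.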