Re: Fastest way to index data to solr

2022-09-30 Thread Joel Bernstein
Unless something has changed recently, you will have a memory leak if you don't at least soft commit during the load. This is due to the in-memory tlog data used for real-time get. This in-memory tlog data is released when a new searcher is opened. So, if you're having memory issues while bulk
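A minimal sketch of the periodic soft commit described above, assuming a client that posts batches over HTTP. The base URL, the collection name and the "every N batches" policy are hypothetical; `commit=true` and `softCommit=true` are standard parameters of Solr's update handler.

```python
from urllib.parse import urlencode


def soft_commit_url(base_url: str, collection: str) -> str:
    """Build an update URL that asks Solr for a soft commit (opens a new searcher)."""
    params = urlencode({"commit": "true", "softCommit": "true"})
    return f"{base_url.rstrip('/')}/{collection}/update?{params}"


def should_soft_commit(batch_number: int, every_n_batches: int) -> bool:
    """Hypothetical policy: soft commit after every N batches sent."""
    return batch_number > 0 and batch_number % every_n_batches == 0


# Example with assumed values: a local Solr and a collection named "docs".
print(soft_commit_url("http://localhost:8983/solr", "docs"))
```

The interval is something to tune by experiment: committing too often costs indexing throughput, while committing too rarely lets the in-memory tlog data grow.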

Re: Fastest way to index data to solr

2022-09-30 Thread Andy Lester
I can’t imagine a case where the speed in parsing the input data won’t be dwarfed by the time spent on everything else. You’re talking about an in-memory operation that does a ton of I/O. It’s not going to make a noticeable difference one way or the other. > I have a followup question. Is

Re: Fastest way to index data to solr

2022-09-30 Thread Dave
I don’t have any tests, but I know anything is faster than XML. You may as well stick to text files. XML is garbage; that’s why they made YAML, which is the parent of JSON. > On Sep 30, 2022, at 3:47 AM, Thomas Corthals wrote: > > Hi Gus, > > I have a followup question. Is JSON parsed faster

Re: Fastest way to index data to solr

2022-09-30 Thread Thomas Corthals
Hi Gus, I have a followup question. Is JSON parsed faster than XML by Solr if they represent the exact same documents? Thomas On Fri, 30 Sep 2022 at 06:58, Gus Heck wrote: > If you are using a non-Java language you can use JSON. >
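To make "the exact same documents" concrete, here is a small sketch that serializes one list of documents into both of Solr's update formats: a JSON array of objects, and an XML `<add><doc><field name="...">` message. It does not measure Solr's parsing speed. It only produces equivalent payloads that could be used for such a benchmark. The field names and values are made up for illustration.

```python
import json
import xml.etree.ElementTree as ET


def to_solr_json(docs):
    """Serialize documents as a JSON array, as accepted by Solr's JSON update format."""
    return json.dumps(docs)


def to_solr_xml(docs):
    """Serialize the same documents as an <add> message in Solr's XML update format."""
    add = ET.Element("add")
    for doc in docs:
        doc_el = ET.SubElement(add, "doc")
        for name, value in doc.items():
            # Multi-valued fields become repeated <field> elements.
            values = value if isinstance(value, list) else [value]
            for v in values:
                field = ET.SubElement(doc_el, "field", name=name)
                field.text = str(v)
    return ET.tostring(add, encoding="unicode")


docs = [{"id": "1", "title": "hello"}]  # hypothetical document
print(to_solr_json(docs))
print(to_solr_xml(docs))
```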

Re: Fastest way to index data to solr

2022-09-29 Thread Gus Heck
70 million can be a lot or a little. Doc count is not even half the story. How much storage space do these documents occupy in the database? Is the text tweet sized, or multi-megabyte sized CLOBs, or links to files on a file store that need to be fetched and parsed (or OCR'd or converted from

Re: Fastest way to index data to solr

2022-09-29 Thread Shawn Heisey
On 9/29/22 22:28, Gus Heck wrote: * Do NOT commit during the bulk load, wait until the end Unless something changed this is slightly risky. It can lead to very large transaction logs and very long playback of the tx log on startup. It is always good practice to have autoCommit configured with

Re: Fastest way to index data to solr

2022-09-29 Thread Gus Heck
> > * Do NOT commit during the bulk load, wait until the end > Unless something changed this is slightly risky. It can lead to very large transaction logs and very long playback of the tx log on startup. If Solr goes down during indexing due to something like an OOM, it could take a very long time

Re: Fastest way to index data to solr

2022-09-29 Thread Dave
Another way to handle this is to have your indexing code fork out to as many cores as the Solr indexing server has. It’s way less work to force the code to run itself that many times in parallel, and as long as your SQL queries and said tables are properly indexed the database shouldn’t be a

Re: Fastest way to index data to solr

2022-09-29 Thread Andy Lester
> On Sep 29, 2022, at 4:17 AM, Jan Høydahl wrote: > > * Index with multiple threads on the client, experiment to find a good number > based on the number of CPUs on receiving side That may also mean having multiple clients. We went from taking about 8 hours to index our entire 42M rows to

Re: Fastest way to index data to solr

2022-09-29 Thread Jan Høydahl
Hi, If you want to index fast you should
* Make sure you have enough hardware on the Solr side to handle the bulk load
* Index with multiple threads on the client, experiment to find a good number based on the number of CPUs on receiving side
* If using Java on client, use CloudSolrClient which
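The multi-threaded client advice above could be sketched roughly as follows for a non-Java client that posts JSON. This is an illustration under assumptions, not a tuned loader: the function names, batch size and worker count are hypothetical, the sender is injected so the threading logic can be tried without a running Solr, and the number of in-flight batches is capped so a very large source is not read into memory at once.

```python
import json
import urllib.request
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from itertools import islice


def batches(docs, size):
    """Yield lists of up to `size` documents from any iterable."""
    it = iter(docs)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def post_batch(update_url, batch):
    """POST one batch as a JSON array to an update URL (the URL is an assumption)."""
    req = urllib.request.Request(
        update_url,
        data=json.dumps(batch).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        resp.read()
    return len(batch)


def index_parallel(docs, send, batch_size=1000, workers=4):
    """Send batches from `workers` threads; return the number of documents sent."""
    indexed = 0
    pending = set()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for batch in batches(docs, batch_size):
            # Cap in-flight batches so the source is consumed lazily.
            if len(pending) >= workers * 2:
                done, pending = wait(pending, return_when=FIRST_COMPLETED)
                indexed += sum(f.result() for f in done)
            pending.add(pool.submit(send, batch))
        indexed += sum(f.result() for f in pending)
    return indexed
```

A caller might pass `functools.partial(post_batch, "http://localhost:8983/solr/mycollection/update")` as `send`, where the host and collection name are placeholders, and then vary `workers` and `batch_size` while watching CPU usage on the receiving side.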

Fastest way to index data to solr

2022-09-29 Thread Shankar R
Hi, We have roughly 70-80 million records that need to be indexed in Solr 8.6.1. We want to choose between the Java binary format and direct JSON format. Our source data is structured data in a DBMS. Regards Ravi