Re: Solr 6.6.0 - DIH - Multiple entities - Multiple DBs
Thanks for the help.

Looks like I've managed to get some semblance of this working. Indexing is much faster, but the RAM usage by SolrJ is quite high. Is it normal to see around 6GB of RAM usage? (My test is indexing 250,000 records with the 50 child entities.)

In short, I'm running a loop against a DB 50 times (to mimic 50 entities) and adding the results to a Map, then looping through that Map and committing the values to Solr.

Jörn Franke wrote
> Ideally you use scripts that can use JVM/Java - in this way you can always
> use the latest SolrJ client library but also other libraries that are
> relevant (eg Tika for unstructured content).
> This does not have to be Java directly but can be based also on Scala or
> JVM script languages, such as Groovy.
>
> There are also wrappers for Python etc, but those may not always leverage
> the latest version of the library.

--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
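One plausible cause of the high memory use is that every record is held in the Map before anything is sent. A minimal sketch of sending fixed-size batches instead, so that only one batch is resident at a time, might look like this. The names `batched`, `index_in_batches` and `send` are hypothetical; `send` stands in for whatever client call adds documents to Solr, and the batch size of 1000 is an arbitrary assumption.

```python
from itertools import islice
from typing import Callable, Dict, Iterable, Iterator, List

Doc = Dict[str, object]


def batched(items: Iterable[Doc], size: int) -> Iterator[List[Doc]]:
    """Yield lists of at most `size` items without materialising the input."""
    it = iter(items)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def index_in_batches(
    docs: Iterable[Doc],
    send: Callable[[List[Doc]], None],  # hypothetical: wraps the Solr client call
    batch_size: int = 1000,             # assumption: tune for the real workload
) -> int:
    """Send documents batch by batch and return how many were sent."""
    sent = 0
    for batch in batched(docs, batch_size):
        send(batch)        # the batch can be garbage-collected after this call
        sent += len(batch)
    return sent
```

This only reduces memory if `docs` is itself lazy, for example a generator that builds each document from database rows as they are read, rather than a fully populated Map.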
Re: Solr 6.6.0 - DIH - Multiple entities - Multiple DBs
Thanks again.

I guess I'll have to start researching how to create such custom indexing scripts and determine which language would be best based on the environment I'm using (Azure in this case).

Appreciate the help greatly.

Charlie Hull-3 wrote
> On 05/07/2019 14:33, Joseph_Tucker wrote:
>> Thanks for your help / suggestion.
>>
>> I'm not sure I completely follow in this case.
>> SolrJ looks like a method to allow Java applications to talk to Solr, or
>> any other third party application would simply be a communication method
>> between Solr and the language of your choosing.
>>
>> I guess what I'm after is, how would using SolrJ improve performance when
>> indexing?
>
> It's not just about improving performance (although DIH is single
> threaded, so you could obtain a marked indexing performance gain using a
> client such as SolrJ). With DIH you will embed a lot of SQL code into
> Solr's configuration files, and the more sources you add the more
> complicated, hard to debug and unmaintainable it's going to be. You
> should thus consider writing a proper indexing script in Java, Python or
> whatever language you are most familiar with - this has always been our
> approach.
>
> Best
>
> Charlie
>
>> *** I could be wrong in my assumptions as I'm still learning a great deal
>> about Solr. ***
>>
>> I appreciate your help
>>
>> Regards,
>>
>> Joe
>
> --
> Charlie Hull
> Flax - Open Source Enterprise Search
>
> tel/fax: +44 (0)8700 118334
> mobile: +44 (0)7767 825828
> web: www.flax.co.uk
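Since DIH is single threaded, one thing an external indexing script can do is query the sources concurrently. A rough sketch, assuming each source is wrapped in a zero-argument function that returns its rows; `fetch_all` and the fetcher functions are hypothetical names, and a real fetcher would run SQL through a database driver:

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

Row = Dict[str, object]
Fetcher = Callable[[], List[Row]]  # hypothetical: one function per data source


def fetch_all(sources: Dict[str, Fetcher], max_workers: int = 8) -> Dict[str, List[Row]]:
    """Run every fetcher in its own worker thread and collect rows per source."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit everything first so the queries overlap instead of running in turn.
        futures = {name: pool.submit(fetch) for name, fetch in sources.items()}
        # result() blocks until that source is done and re-raises any exception.
        return {name: future.result() for name, future in futures.items()}
```

Threads suit this case because the work is mostly waiting on the databases. For large result sets each fetcher would need to page through its rows rather than return them all at once.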
Re: Solr 6.6.0 - DIH - Multiple entities - Multiple DBs
Thanks for your help / suggestion.

I'm not sure I completely follow in this case.
SolrJ looks like a way for Java applications to talk to Solr; any other third-party client would likewise simply be a communication layer between Solr and the language of your choosing.

I guess what I'm after is, how would using SolrJ improve performance when indexing?

*** I could be wrong in my assumptions as I'm still learning a great deal about Solr. ***

I appreciate your help

Regards,

Joe
Solr 6.6.0 - DIH - Multiple entities - Multiple DBs
What is the best way, performance-wise, to index data from multiple databases?

I'm potentially going to have around 50 different data sources, each supplying unique data.

Here's what I've roughly designed:

...

I've excluded the fields, but each entity would have a number of fields within it.

The issue I'm seeing is that the full-index is exceedingly slow. Is there a better way to go about this?
Solr 6.6.0 - Multiple DataSources - Performance / Delta Issues - MSSQL(Azure)
I've currently got a data configuration that uses multiple dataSources. I have a main dataSource that contains shared inventory data, and individual dataSources that contain price data that differs from database to database. (I have little to no say in how the databases are structured.)

The scenario is:
I have multiple shops (x amount, but for the sake of this example, say 10 shops).
Each shop will contain the same inventory data about products.
However, each shop will contain different price data per product.
Example: Shop1 has Chocolate for $1 and Shop2 has Chocolate for $0.95

My configuration looks something like this:

...
...

A few issues I've noticed when testing this on a local machine:

1) Performance on full-indexes degrades with each price entity that I add. With only three prices, I'm seeing indexing run as slowly as 25 records per second. Is there a better way to go about gathering the price data?

2) When performing deltas, I cannot use dataimporter.last_index_time as I do not have anything to compare it to (at least not that I'm immediately aware of).
I have a table that contains a column called "LastTime" of type bigint. I update this column with the value of the global variable @@DBTS after each full-index and after each deltaQuery, i.e.

query="
  DECLARE @LatestUpdate AS bigint;
  SET @LatestUpdate = (SELECT @@DBTS);
  Select ...   <- main select to get all the data
  UPDATE [SolrQueue] SET [LastTime] = @LatestUpdate FROM [SolrQueue]
"
^ similar in deltaQuery

I have a parentDeltaQuery in each price entity that sends the product IDs back to the root entity
( select id from Products where id = '${price2.id}' )

The issue that comes up here is that when the delta is running, I get a table lock.
Is there a better method to retrieve which prices have changed?

Any assistance would be greatly appreciated.
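One option that may avoid the lock is to keep the watermark outside the import query, so the query itself stays read-only. The sketch below assumes an external indexing script rather than DIH, and assumes the price tables expose a rowversion-style column that can be compared against @@DBTS, which the post does not confirm. `WatermarkStore`, `run_delta` and the three callables are hypothetical names.

```python
import json
from pathlib import Path
from typing import Callable, Dict, List

Row = Dict[str, object]


class WatermarkStore:
    """Keeps the last indexed version per source in a local JSON file."""

    def __init__(self, path: Path) -> None:
        self.path = path

    def load(self) -> Dict[str, int]:
        if not self.path.exists():
            return {}
        return json.loads(self.path.read_text())

    def save(self, marks: Dict[str, int]) -> None:
        # Write to a temp file, then rename, so a crash cannot leave a half-written file.
        tmp = self.path.with_suffix(".tmp")
        tmp.write_text(json.dumps(marks))
        tmp.replace(self.path)


def run_delta(
    source: str,
    store: WatermarkStore,
    current_mark: Callable[[], int],                 # e.g. a read-only SELECT @@DBTS
    fetch_changed: Callable[[int, int], List[Row]],  # rows with low < version <= high
    index: Callable[[List[Row]], None],              # hypothetical: sends rows to Solr
) -> int:
    """Index rows changed since the stored mark; advance the mark only on success."""
    marks = store.load()
    low = marks.get(source, 0)
    high = current_mark()
    if high <= low:
        return 0  # nothing has changed since the last run
    rows = fetch_changed(low, high)
    index(rows)
    marks[source] = high  # reached only if index() did not raise
    store.save(marks)
    return len(rows)
```

Because the mark is saved only after indexing succeeds, a failed run is simply retried from the old mark on the next delta.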
Solr 6.6.0 - Indexing Multiple DataSources multiple child entities
[Using Solr 6.6.0]

I've currently got a few databases that I'm indexing.

To give the scenario:
I have 10 different shops.
Each shop will have the same inventory, but different price tags on each item.
(i.e. Shop 1 sells Chocolate for $1, and Shop 2 sells Chocolate for $0.95... etc)

I'm connecting to an SQL database for the inventory information and a separate database for each individual shop's price information. (I don't have much control over how the databases are structured.)

The way my db-config.xml file is structured is something like this:

...

A few problems I'm running into...

Firstly: indexing gets really slow the more shops I add. Is there a better way to go about this?

Secondly: how can I ensure I get prices updated if the only DB that changes when I run a delta is Shop3?

... etc.

Thirdly: I can't seem to use dataimporter.last_index_time as there is no "last updated" column in the database. I have a separate table that stores the @@DBTS (from MSSQL) after each full-import *or* delta-import.
The problem is that I need to run this on each DB, and therefore in each entity under the root entity, which can cause lock issues for each SQL update that's run.

This is far from efficient and far from best practice... however, I'm not sure of a better way to go about this.

Any help will be more than appreciated.

Thanks
Joe
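For the second question, one possibility is Solr's atomic updates, which change a single field without resending the whole document. A small sketch of building such an update for one shop follows. It assumes one price field per shop named like `price_shop3` (a hypothetical naming scheme) and a schema and update log configured so that atomic updates are allowed. `price_updates` and `to_update_body` are illustrative helpers, not part of any library.

```python
import json
from typing import Dict, List


def price_updates(shop: str, changed: Dict[str, float]) -> List[dict]:
    """Build one atomic-update document per changed product for a single shop."""
    field = f"price_{shop}"  # hypothetical field naming scheme
    # {"set": value} tells Solr to replace only this field on the existing document.
    return [{"id": product_id, field: {"set": price}} for product_id, price in changed.items()]


def to_update_body(docs: List[dict]) -> str:
    """Serialise the documents as a JSON array for Solr's update handler."""
    return json.dumps(docs)
```

Calling `price_updates("shop3", {"chocolate": 0.95})` produces a single document that sets only `price_shop3`, so the prices from the other shops stay as they are in the index.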