Re: Solr 6.6.0 - DIH - Multiple entities - Multiple DBs

2019-07-11 Thread Joseph_Tucker
Thanks for the help.

Looks like I've managed to get some semblance of this working.
Indexing is much faster, but RAM usage by the SolrJ client is quite high. Is it
normal to see around 6GB of RAM usage?
(My test indexes 250,000 records with the 50 child entities.)

In short, I'm running through a loop against a DB 50 times (to mimic 50
entities) and adding the results to a Map, then using that map to loop
through and commit values to Solr.
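
On the memory question: collecting the results of all 50 queries into one Map
before sending anything keeps the whole data set in the client's heap at once,
which is a plausible cause of usage in the gigabyte range. A minimal sketch of
one alternative, assuming every source query can be ordered by product id:
merge the sorted streams, build one document per id, and send fixed-size
batches. It is written in Python for brevity. The shop names, the
`price_<shop>` field naming and the `send` callable (a stand-in for a client
call such as SolrJ's `SolrClient.add(Collection<SolrInputDocument>)`) are
illustrative assumptions, not part of the original setup.

```python
import heapq
from itertools import groupby, islice
from operator import itemgetter


def _tag(shop, rows):
    """Attach the (hypothetical) shop name to every (product_id, price) row."""
    for product_id, price in rows:
        yield product_id, shop, price


def merged_documents(sources):
    """Yield one document per product id.

    `sources` maps a shop name to an iterable of (product_id, price) rows
    that is ALREADY sorted by product_id (assumption: ORDER BY in the query).
    Only the rows for the current product id are held in memory.
    """
    streams = [_tag(shop, rows) for shop, rows in sources.items()]
    merged = heapq.merge(*streams, key=itemgetter(0))
    for product_id, group in groupby(merged, key=itemgetter(0)):
        doc = {"id": product_id}
        for _, shop, price in group:
            doc[f"price_{shop}"] = price  # hypothetical field naming
        yield doc


def batches(docs, size):
    """Split any iterable of documents into lists of at most `size` items."""
    iterator = iter(docs)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def index_all(sources, send, batch_size=1000):
    """Send documents batch by batch; `send` stands in for the Solr client."""
    total = 0
    for batch in batches(merged_documents(sources), batch_size):
        send(batch)
        total += len(batch)
    return total
```

With this shape the peak memory is roughly one batch plus one open cursor per
source, instead of all 250,000 records times 50 entities.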


Jörn Franke wrote
> Ideally you use scripts that can use JVM/Java - in this way you can always
> use the latest SolrJ client library but also other libraries that are
> relevant (e.g. Tika for unstructured content).
> This does not have to be Java directly but can also be based on Scala or
> JVM script languages, such as Groovy.
> 
> There are also wrappers for Python etc, but those may not always leverage
> the latest version of the library.





--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Solr 6.6.0 - DIH - Multiple entities - Multiple DBs

2019-07-08 Thread Joseph_Tucker
Thanks again.

I guess I'll have to start researching how to create such custom indexing
scripts and determine which language would be best based on the environment
I'm using (Azure in this case). 

I greatly appreciate the help.




Charlie Hull-3 wrote
> On 05/07/2019 14:33, Joseph_Tucker wrote:
>> Thanks for your help / suggestion.
>>
>> I'm not sure I completely follow in this case.
>> SolrJ looks like a way for Java applications to talk to Solr; any other
>> third-party client would likewise just be a communication layer between
>> Solr and the language of your choosing.
>>
>> I guess what I'm after is, how would using SolrJ improve performance when
>> indexing?
> 
> It's not just about improving performance (although DIH is single 
> threaded, so you could obtain a marked indexing performance gain using a 
> client such as SolrJ).  With DIH you will embed a lot of SQL code into 
> Solr's configuration files, and the more sources you add the more 
> complicated, hard to debug and unmaintainable it's going to be. You 
> should thus consider writing a proper indexing script in Java, Python or 
> whatever language you are most familiar with - this has always been our 
> approach.
> 
> Best
> 
> 
> Charlie
> 
>>
>> *** I could be wrong in my assumptions as I'm still learning a great deal
>> about Solr. ***
>>
>> I appreciate your help
>>
>> Regards,
>>
>> Joe
> 
> 
> -- 
> Charlie Hull
> Flax - Open Source Enterprise Search
> 
> tel/fax: +44 (0)8700 118334
> mobile:  +44 (0)7767 825828
> web: www.flax.co.uk
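
The point above that DIH is single-threaded, while a client-side script is
free to fetch from several sources at once, can be sketched as follows. This
is a minimal illustration in Python, not a definitive implementation: the
`fetchers` mapping and the source names are hypothetical, and each callable
stands in for a real database query.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(fetchers, max_workers=8):
    """Run every source's fetch function in a worker thread.

    `fetchers` maps a (hypothetical) source name to a zero-argument callable
    that returns that source's rows. Threads fit here because the work is
    dominated by waiting on the database, not by CPU.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {name: pool.submit(fn) for name, fn in fetchers.items()}
        # result() re-raises any exception raised inside the worker.
        return {name: future.result() for name, future in futures.items()}
```

Note that this version returns complete result lists, so it trades memory for
wall-clock time; with very large sources it would need to be combined with
some form of batching.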







Re: Solr 6.6.0 - DIH - Multiple entities - Multiple DBs

2019-07-05 Thread Joseph_Tucker
Thanks for your help / suggestion.

I'm not sure I completely follow in this case.
SolrJ looks like a way for Java applications to talk to Solr; any other
third-party client would likewise just be a communication layer between
Solr and the language of your choosing.

I guess what I'm after is, how would using SolrJ improve performance when
indexing?

*** I could be wrong in my assumptions as I'm still learning a great deal
about Solr. ***

I appreciate your help 

Regards,

Joe





Solr 6.6.0 - DIH - Multiple entities - Multiple DBs

2019-07-05 Thread Joseph_Tucker
What is the best way, performance-wise, to index data from multiple
databases?
I'm potentially going to have around 50 different data sources, each
supplying unique data.
Here's what I've roughly designed:


 
 
 
 
 
 
 
 ... 
 


I've left out the fields, but each entity would contain a number of them.
The issue I'm seeing is that the full index is exceedingly slow. Is there a
better way to go about this?
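
One common reason a layout like this gets slow is that an uncached DIH child
entity runs its query once per parent row, so 50 child entities mean roughly
50 extra queries for every product. A hedged sketch of the alternative, in
Python: run one bulk query per source, keep the result as a lookup keyed by
product id, and join in memory. The `id` key, the shop names and the
`price_<shop>` field naming are assumptions made for illustration.

```python
def join_prices(products, price_rows_by_shop):
    """Attach per-shop prices to each product using in-memory lookups.

    `products` is an iterable of dicts that each contain an "id" key.
    `price_rows_by_shop` maps a (hypothetical) shop name to an iterable of
    (product_id, price) rows obtained from ONE bulk query per shop.
    """
    # One dict per shop replaces one query per product per shop.
    lookups = {shop: dict(rows) for shop, rows in price_rows_by_shop.items()}
    for product in products:
        doc = dict(product)
        for shop, lookup in lookups.items():
            if product["id"] in lookup:
                doc[f"price_{shop}"] = lookup[product["id"]]
        yield doc
```

The trade-off is memory: every lookup table lives in the client for the
duration of the run, so this suits sources that fit comfortably in RAM.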






Solr 6.6.0 - Multiple DataSources - Performance / Delta Issues - MSSQL(Azure)

2019-06-26 Thread Joseph_Tucker
I've currently got a data configuration that uses multiple dataSources.
I have a main dataSource that contains shared inventory data, and individual
dataSources that contain price data that differs from database to database.
(I have little to no say in how the Databases can be structured)

The scenario is: I have multiple shops (x amount, but for the sake of this
example, say 10 shops).
Each shop holds the same inventory data about products. However, each
shop holds different price data per product.
Example: Shop1 has Chocolate for $1 and Shop2 has Chocolate for $0.95.

My configuration looks something like this:



 ...
  
   
  
  
   
  
  
   
  
  ...


A few issues I've noticed when testing this on a local machine:
1) Performance on full-indexes degrades with each price entity that I add.
With only three prices, indexing drops to as little as 25 records per
second.
Is there a better way to go about gathering the price data?

2) When performing deltas, I cannot use dataimporter.last_index_time, as I do
not have anything to compare it to (at least not that I'm immediately aware
of).
I do have a table with a column called "LastTime" of type BigInt.
I set this column to the value of the global variable @@DBTS after each
full index and after each deltaQuery,
i.e.
query="
DECLARE @LatestUpdate AS bigint;
SET @LatestUpdate = (SELECT @@DBTS);

Select ... <- main select to get all the data

UPDATE [SolrQueue]
SET [LastTime] = @LatestUpdate
FROM [SolrQueue] "

^ similar in deltaQuery


I have a parentDeltaQuery in each price entity that sends the product
IDs back to the root entity:
( select id from Products where id = '${price2.id}' )

The issue here is that I get a table lock while the delta is running. Is
there a better way to find out which prices have changed?
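
One option that may sidestep the lock is to stop writing to [SolrQueue] from
inside the import query and keep the high-water mark outside the source
database instead. The sketch below, in Python, stores the last processed
@@DBTS value per database in a local JSON file. It assumes the value is
handled as an integer (as the BigInt column above suggests); the file layout
and the names `CheckpointStore`, `last_version` and `advance` are hypothetical.

```python
import json
import os
import tempfile


class CheckpointStore:
    """Persist the last processed row version per database in a JSON file."""

    def __init__(self, path):
        self.path = path

    def load(self):
        try:
            with open(self.path, encoding="utf-8") as handle:
                return json.load(handle)
        except FileNotFoundError:
            return {}  # first run: nothing processed yet

    def save(self, checkpoints):
        # Write to a temp file, then rename, so a crash cannot leave a
        # half-written checkpoint file behind.
        directory = os.path.dirname(os.path.abspath(self.path))
        fd, tmp_path = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(checkpoints, handle)
        os.replace(tmp_path, self.path)


def last_version(store, database):
    """Lower bound for the next delta query against `database` (0 if none)."""
    return store.load().get(database, 0)


def advance(store, database, version):
    """Record `version` only after the delta has been indexed successfully."""
    checkpoints = store.load()
    checkpoints[database] = version
    store.save(checkpoints)
```

The delta query then only reads from the source database; the write happens
on the indexing side once the batch has been sent.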


Any assistance would be greatly appreciated.





Solr 6.6.0 - Indexing Multiple DataSources multiple child entities

2019-06-26 Thread Joseph_Tucker
[Using Solr 6.6.0]

I've currently got a few databases that I'm indexing.
To give the scenario: I have 10 different shops.
Each shop has the same inventory, but different price tags on each
item
(e.g. Shop 1 sells Chocolate for $1, and Shop 2 sells Chocolate for $0.95,
etc.).
I'm connecting to an SQL Database for the Inventory information and a
separate Database for each individual Shop price information (I don't have
much control over how the database is structured)

The way my db-config.xml file is structured is something like this:


 
   ...
 
   
  
 
   
  
 
   
  



A few problems I'm running into:

Firstly: indexing gets noticeably slower with every shop I add. Is there a
better way to go about this?

Secondly: how can I make sure prices get updated when the only DB that has
changed by the time I run a delta is, say, Shop3?

Thirdly: I can't seem to use dataimporter.last_index_time, as there is no
"last updated" column in the database.
I have a separate table that stores the @@DBTS (from mssql) after each
full-import *or* delta-import.

The problem is that I need to run this on each DB, and therefore in each entity
under the root entity, which can cause lock issues for each SQL update that's run.
This is far from efficient and far from best practice; however, I'm not sure
of a better way to go about this.
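
For the second and third points, one possible shape for a client-side delta
is sketched below in Python: ask each shop database which product ids changed
since that shop's own checkpoint, take the union, and reindex only those
products. It assumes each price row carries a version number comparable to
@@DBTS (for example a rowversion column read as an integer), which the thread
does not confirm; the function and parameter names are hypothetical.

```python
def changed_product_ids(changes_by_shop, checkpoints):
    """Return (ids_to_reindex, new_checkpoints).

    `changes_by_shop` maps a shop name to an iterable of
    (product_id, row_version) rows from that shop's database.
    `checkpoints` maps a shop name to the last row_version already indexed.
    A product is reindexed if ANY shop reports a newer version for it, so a
    change in Shop3 alone is enough to pick the product up.
    """
    ids = set()
    new_checkpoints = dict(checkpoints)
    for shop, rows in changes_by_shop.items():
        last = checkpoints.get(shop, 0)
        for product_id, version in rows:
            if version > last:
                ids.add(product_id)
                new_checkpoints[shop] = max(new_checkpoints.get(shop, 0), version)
    return ids, new_checkpoints
```

Because every shop has its own checkpoint, a database that has not changed
contributes nothing to the delta and needs no write on the source side.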

Any help will be more than appreciated

Thanks

Joe





