RE: Speed up import of Hierarchical Data

2013-05-22 Thread O. Olson
Just an update for others reading this thread: I had some trouble with
CachedSqlEntityProcessor and had it addressed in the thread "How do I use
CachedSqlEntityProcessor?"
(http://lucene.472066.n3.nabble.com/How-do-I-use-CachedSqlEntityProcessor-td4064919.html)

I basically had to declare the child entities in the db-data-config.xml
like: 

<entity name="Cat1"
        query="SELECT CategoryName, SKU from CAT_TABLE WHERE CategoryLevel=1"
        cacheKey="SKU" cacheLookup="Product.SKU"
        processor="CachedSqlEntityProcessor">
    <field column="CategoryName" name="Category1" />
</entity>

Thanks to James and others for their help.
O. O.




--
View this message in context: 
http://lucene.472066.n3.nabble.com/Speed-up-import-of-Hierarchical-Data-tp4063924p4065400.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: Speed up import of Hierarchical Data

2013-05-17 Thread Dyer, James
Using SqlEntityProcessor with cacheImpl=SortedMapBackedCache is the same as 
specifying CachedSqlEntityProcessor.  Because the pluggable caches are only 
partially committed, I never added details to the wiki, so it still refers to 
CachedSEP.  But it's the same thing.

What is new here, though, is that you don't have to use SortedMapBackedCache 
(this is an in-memory cache and can only scale to what fits in heap.)  You can 
use an alternate cache (but none are included in the Solr distribution).  Also, 
you can cache data that doesn't come from SQL.  So it's more flexible this way 
than the older CachedSEP.
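
As a rough sketch of that equivalence (a fragment only, not copied from the wiki): the entity, table, and column names are borrowed from the config posted in this thread, and the cacheKey/cacheLookup values are an assumption about how the child rows join to the parent. The two declarations are alternatives, so only one of them would appear in a real config.

```xml
<!-- Older form: caching is implied by the processor class. -->
<entity name="Cat1"
        processor="CachedSqlEntityProcessor"
        query="SELECT SKU, CategoryName FROM CAT_TABLE WHERE CategoryLevel=1"
        cacheKey="SKU" cacheLookup="Product.SKU">
  <field column="CategoryName" name="Category1" />
</entity>

<!-- Newer form: plain SqlEntityProcessor plus an explicit cache implementation. -->
<entity name="Cat1"
        processor="SqlEntityProcessor"
        cacheImpl="SortedMapBackedCache"
        query="SELECT SKU, CategoryName FROM CAT_TABLE WHERE CategoryLevel=1"
        cacheKey="SKU" cacheLookup="Product.SKU">
  <field column="CategoryName" name="Category1" />
</entity>
```

In both forms the query selects the SKU column, since the cache needs that column in order to key its rows.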

Here's the wiki link with an example:  
http://wiki.apache.org/solr/DataImportHandler#CachedSqlEntityProcessor 

James Dyer
Ingram Content Group
(615) 213-4311


-----Original Message-----
From: O. Olson [mailto:olson_...@yahoo.it] 
Sent: Thursday, May 16, 2013 5:06 PM
To: solr-user@lucene.apache.org
Subject: RE: Speed up import of Hierarchical Data

Thank you James. Are there any examples of SortedMapBackedCache? I am new to
Solr and I do not find many tutorials in this regard. I just modified the
examples and they worked for me.  What is a good way to learn these basics?
O. O.



Dyer, James-2 wrote
 See https://issues.apache.org/jira/browse/SOLR-2943 .  You can set up 2
 DIH handlers.  The first would query the CAT_TABLE and save it to a
 disk-backed cache, using DIHCacheWriter.  You then would replace your 3
 child entities in the 2nd DIH handler to use DIHCacheProcessor to read
 back the cached data.  This is a little complicated to do, but it would
 let you just cache the data once and because it is disk-backed, will scale
 to whatever size the CAT_TABLE is.  (For some details, see this thread:
 http://lucene.472066.n3.nabble.com/DIH-nested-entities-don-t-work-tt4015514.html)
 
 A simpler method is simply to specify cacheImpl=SortedMapBackedCache on
 the 3 child entities.  (This is the same as using
 CachedSqlEntityProcessor.)  It would generate 3 in-memory caches, each
 with the same data.  If CAT_TABLE is small, this would be adequate.  
 
 An in-between option would be to create a disk-backed cache implementation
 (or use the ones at SOLR-2613 or SOLR-2948) and specify it in cacheImpl.  It
 would still create 3 identical caches, but they would be disk-backed and
 could scale beyond what an in-memory cache can handle.
 
 James Dyer
 Ingram Content Group
 (615) 213-4311









RE: Speed up import of Hierarchical Data

2013-05-17 Thread O. Olson
Thank you James. I think I got this to work using CachedSqlEntityProcessor –
and it seems extremely fast. I will try SortedMapBackedCache on Monday :-). 
Thank you,
O. O.



Dyer, James-2 wrote
 Using SqlEntityProcessor with cacheImpl=SortedMapBackedCache is the same
 as specifying CachedSqlEntityProcessor.  Because the pluggable caches
 are only partially committed, I never added details to the wiki, so it
 still refers to CachedSEP.  But it's the same thing.
 
 What is new here, though, is that you don't have to use
 SortedMapBackedCache (this is an in-memory cache and can only scale to
 what fits in heap.)  You can use an alternate cache (but none are included
 in the Solr distribution).  Also, you can cache data that doesn't come
 from SQL.  So it's more flexible this way than the older CachedSEP.
 
 Here's the wiki link with an example: 
 http://wiki.apache.org/solr/DataImportHandler#CachedSqlEntityProcessor 
 
 James Dyer
 Ingram Content Group
 (615) 213-4311







Re: Speed up import of Hierarchical Data

2013-05-16 Thread Stefan Matheis
That sounds like a perfect match for 
http://wiki.apache.org/solr/DataImportHandler#CachedSqlEntityProcessor :)

On Thursday, May 16, 2013 at 6:01 PM, O. Olson wrote:

 I am using the DataImportHandler to Query a SQL Server and populate Solr.
 Unfortunately, SQL does not have an understanding of hierarchical
 relationships, and hence I use Table Joins. The following is an outline of
 my table structure:  
  
  
 PROD_TABLE
 - SKU (Primary Key)
 - Title (varchar)
 - Descr (varchar)
  
 CAT_TABLE
 - SKU (Foreign Key)
 - CategoryLevel (int i.e. 1, 2, 3 …)
 - CategoryName (varchar)
  
 I specify the SQL Query in the db-data-config.xml file – a snippet of which
 looks like:  
  
 <dataConfig>
   <dataSource driver="com.microsoft.sqlserver.jdbc.SQLServerDriver"
               url="jdbc:sqlserver://localhost\" />
   <document>
     <entity name="Product"
             query="SELECT SKU, Title, Descr FROM PROD_TABLE">
       <field column="SKU" name="SKU" />
       <field column="Title" name="Title" />
       <field column="Descr" name="Descr" />
  
       <entity name="Cat1"
               query="SELECT CategoryName from CAT_TABLE where SKU='${Product.SKU}' AND CategoryLevel=1">
         <field column="CategoryName" name="Category1" />
       </entity>
       <entity name="Cat2"
               query="SELECT CategoryName from CAT_TABLE where SKU='${Product.SKU}' AND CategoryLevel=2">
         <field column="CategoryName" name="Category2" />
       </entity>
       <entity name="Cat3"
               query="SELECT CategoryName from CAT_TABLE where SKU='${Product.SKU}' AND CategoryLevel=3">
         <field column="CategoryName" name="Category3" />
       </entity>
  
     </entity>
   </document>
 </dataConfig>
  
 It seems like the DataImportHandler sends out three or four queries
 for each Product. This results in a very slow import. Is there any way to
 speed this up? I would not mind an intermediate step of first extracting SQL
 and then putting it into Solr.
  
 Thank you for all your help.  
 O. O.
  
  
  
  
  
  




RE: Speed up import of Hierarchical Data

2013-05-16 Thread Dyer, James
See https://issues.apache.org/jira/browse/SOLR-2943 .  You can set up 2 DIH 
handlers.  The first would query the CAT_TABLE and save it to a disk-backed 
cache, using DIHCacheWriter.  You then would replace your 3 child entities in 
the 2nd DIH handler to use DIHCacheProcessor to read back the cached data.  
This is a little complicated to do, but it would let you just cache the data 
once and because it is disk-backed, will scale to whatever size the CAT_TABLE 
is.  (For some details, see this thread: 
http://lucene.472066.n3.nabble.com/DIH-nested-entities-don-t-work-tt4015514.html)

A simpler method is simply to specify cacheImpl=SortedMapBackedCache on the 3 
child entities.  (This is the same as using CachedSqlEntityProcessor.)  It 
would generate 3 in-memory caches, each with the same data.  If CAT_TABLE is 
small, this would be adequate.  
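
A minimal sketch of this simpler method, assuming the Product/CAT_TABLE layout from the original question. The cacheKey and cacheLookup attributes are an assumption about how each cached row is matched to its parent Product; they are not spelled out above. The dataSource element is left out and would stay as in the existing config.

```xml
<entity name="Product"
        query="SELECT SKU, Title, Descr FROM PROD_TABLE">
  <field column="SKU" name="SKU" />
  <field column="Title" name="Title" />
  <field column="Descr" name="Descr" />

  <!-- Each child query no longer filters on ${Product.SKU}. The rows are
       held in an in-memory cache keyed on SKU and looked up per Product. -->
  <entity name="Cat1" cacheImpl="SortedMapBackedCache"
          query="SELECT SKU, CategoryName FROM CAT_TABLE WHERE CategoryLevel=1"
          cacheKey="SKU" cacheLookup="Product.SKU">
    <field column="CategoryName" name="Category1" />
  </entity>
  <entity name="Cat2" cacheImpl="SortedMapBackedCache"
          query="SELECT SKU, CategoryName FROM CAT_TABLE WHERE CategoryLevel=2"
          cacheKey="SKU" cacheLookup="Product.SKU">
    <field column="CategoryName" name="Category2" />
  </entity>
  <entity name="Cat3" cacheImpl="SortedMapBackedCache"
          query="SELECT SKU, CategoryName FROM CAT_TABLE WHERE CategoryLevel=3"
          cacheKey="SKU" cacheLookup="Product.SKU">
    <field column="CategoryName" name="Category3" />
  </entity>
</entity>
```

In this sketch each child query filters on its own CategoryLevel, so each cache would hold only that level's rows. The point of the change is that the per-Product SQL round trips are replaced by cache lookups.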

An in-between option would be to create a disk-backed cache implementation (or 
use the ones at SOLR-2613 or SOLR-2948) and specify it in cacheImpl.  It would 
still create 3 identical caches, but they would be disk-backed and could scale 
beyond what an in-memory cache can handle.
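
A hedged sketch of what a disk-backed variant might look like for one child entity. The class name com.example.dih.MyDiskBackedCache is hypothetical and stands for any DIHCache implementation available on Solr's classpath; the cacheKey/cacheLookup values are likewise assumptions, and any extra attributes such a cache needs (for example a storage directory) depend on the implementation and are not shown.

```xml
<!-- Hypothetical disk-backed cache: only the cacheImpl value differs
     from the in-memory SortedMapBackedCache setup. -->
<entity name="Cat1"
        cacheImpl="com.example.dih.MyDiskBackedCache"
        query="SELECT SKU, CategoryName FROM CAT_TABLE WHERE CategoryLevel=1"
        cacheKey="SKU" cacheLookup="Product.SKU">
  <field column="CategoryName" name="Category1" />
</entity>
```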

James Dyer
Ingram Content Group
(615) 213-4311

-----Original Message-----
From: O. Olson [mailto:olson_...@yahoo.it] 
Sent: Thursday, May 16, 2013 11:01 AM
To: solr-user@lucene.apache.org
Subject: Speed up import of Hierarchical Data

I am using the DataImportHandler to Query a SQL Server and populate Solr.
Unfortunately, SQL does not have an understanding of hierarchical
relationships, and hence I use Table Joins. The following is an outline of
my table structure: 


PROD_TABLE
- SKU (Primary Key)
- Title  (varchar)
- Descr (varchar)

CAT_TABLE
- SKU (Foreign Key)
-  CategoryLevel (int i.e. 1, 2, 3 …)
- CategoryName  (varchar)

I specify the SQL Query in the db-data-config.xml file – a snippet of which
looks like: 

<dataConfig>
  <dataSource driver="com.microsoft.sqlserver.jdbc.SQLServerDriver"
              url="jdbc:sqlserver://localhost\" />
  <document>
    <entity name="Product"
            query="SELECT SKU, Title, Descr FROM PROD_TABLE">
      <field column="SKU" name="SKU" />
      <field column="Title" name="Title" />
      <field column="Descr" name="Descr" />

      <entity name="Cat1"
              query="SELECT CategoryName from CAT_TABLE where SKU='${Product.SKU}' AND CategoryLevel=1">
        <field column="CategoryName" name="Category1" />
      </entity>
      <entity name="Cat2"
              query="SELECT CategoryName from CAT_TABLE where SKU='${Product.SKU}' AND CategoryLevel=2">
        <field column="CategoryName" name="Category2" />
      </entity>
      <entity name="Cat3"
              query="SELECT CategoryName from CAT_TABLE where SKU='${Product.SKU}' AND CategoryLevel=3">
        <field column="CategoryName" name="Category3" />
      </entity>

    </entity>
  </document>
</dataConfig>

It seems like the DataImportHandler sends out three or four queries
for each Product. This results in a very slow import. Is there any way to
speed this up? I would not mind an intermediate step of first extracting SQL
and then putting it into Solr.

Thank you for all your help. 
O. O.







Re: Speed up import of Hierarchical Data

2013-05-16 Thread O. Olson
Thank you Stefan. I am new to Solr and I would need to read up more on
CachedSqlEntityProcessor. Do you have any clue where to begin? There do not
seem to be any tutorials online.

The link you provided seems to have a very short and unclear explanation.
After “Example 1” you have “The usage is exactly same as the other one.”
What does “other one” refer to? I did not understand the description
completely.

This description seems to say that if the query is the same as a prior query,
the result would be fetched from the cache. In my case each of the Category
queries is unique because it has a unique SKU and Category Level. Would
CachedSqlEntityProcessor then help me?

Thank you,
O. O.



Stefan Matheis-2 wrote
 That sounds like a perfect match for
 http://wiki.apache.org/solr/DataImportHandler#CachedSqlEntityProcessor :)







RE: Speed up import of Hierarchical Data

2013-05-16 Thread O. Olson
Thank you James. Are there any examples of SortedMapBackedCache? I am new to
Solr and I do not find many tutorials in this regard. I just modified the
examples and they worked for me.  What is a good way to learn these basics?
O. O.



Dyer, James-2 wrote
 See https://issues.apache.org/jira/browse/SOLR-2943 .  You can set up 2
 DIH handlers.  The first would query the CAT_TABLE and save it to a
 disk-backed cache, using DIHCacheWriter.  You then would replace your 3
 child entities in the 2nd DIH handler to use DIHCacheProcessor to read
 back the cached data.  This is a little complicated to do, but it would
 let you just cache the data once and because it is disk-backed, will scale
 to whatever size the CAT_TABLE is.  (For some details, see this thread:
 http://lucene.472066.n3.nabble.com/DIH-nested-entities-don-t-work-tt4015514.html)
 
 A simpler method is simply to specify cacheImpl=SortedMapBackedCache on
 the 3 child entities.  (This is the same as using
 CachedSqlEntityProcessor.)  It would generate 3 in-memory caches, each
 with the same data.  If CAT_TABLE is small, this would be adequate.  
 
 An in-between option would be to create a disk-backed cache implementation
 (or use the ones at SOLR-2613 or SOLR-2948) and specify it in cacheImpl.  It
 would still create 3 identical caches, but they would be disk-backed and
 could scale beyond what an in-memory cache can handle.
 
 James Dyer
 Ingram Content Group
 (615) 213-4311




