RE: Speed up import of Hierarchical Data
Just an update for others reading this thread: I had some CachedSqlEntityProcessor and had it addressed in the thread How do I use CachedSqlEntityProcessor? (http://lucene.472066.n3.nabble.com/How-do-I-use-CachedSqlEntityProcessor-td4064919.html) I basically had to declare the child entities in the db-data-config.xml like: entity name=Cat1 query=SELECT CategoryName, SKU from CAT_TABLE WHERE CategoryLevel=1 cacheKey=SKU cacheLookup=Product.SKU processor=CachedSqlEntityProcessor field column=CategoryName name=Category1 / /entity Thanks to James and others for their help. O. O. -- View this message in context: http://lucene.472066.n3.nabble.com/Speed-up-import-of-Hierarchical-Data-tp4063924p4065400.html Sent from the Solr - User mailing list archive at Nabble.com.
RE: Speed up import of Hierarchical Data
Using SqlEntityProcessor with cacheImpl=SortedMapBackedCache is the same as specifying CachedSqlEntityProcessor. Because the pluggable caches are only partially committed, I never added details to the wiki, so it still refers to CachedSEP. But its the same thing. What is new here, though, is that you don't have to use SortedMapBackedCache (this is an in-memory cache and can only scale to what fits in heap.) You can use an alternate cache (but none are included in the Solr distribution). Also, you can cache data this doesn't come from SQL. So its more flexible this way rather than the older CachedSEP. Here's the wiki link with an example: http://wiki.apache.org/solr/DataImportHandler#CachedSqlEntityProcessor James Dyer Ingram Content Group (615) 213-4311 -Original Message- From: O. Olson [mailto:olson_...@yahoo.it] Sent: Thursday, May 16, 2013 5:06 PM To: solr-user@lucene.apache.org Subject: RE: Speed up import of Hierarchical Data Thank you James. Are there any examples of SortedMapBackedCache? I am new to Solr and I do not find many tutorials in this regard. I just modified the examples and they worked for me. What is a good way to learn these basics? O. O. Dyer, James-2 wrote See https://issues.apache.org/jira/browse/SOLR-2943 . You can set up 2 DIH handlers. The first would query the CAT_TABLE and save it to a disk-backed cache, using DIHCacheWriter. You then would replace your 3 child entities in the 2nd DIH handler to use DIHCacheProcessor to read back the cached data. This is a little complicated to do, but it would let you just cache the data once and because it is disk-backed, will scale to whatever size the CAT_TABLE is. (For some details, see this thread: http://lucene.472066.n3.nabble.com/DIH-nested-entities-don-t-work-tt4015514.html) A simpler method is simply to specify cacheImpl=SortedMapBackedCache on the 3 child entities. (This is the same as using CachedSqlEntityProcessor.) It would generate 3 in-memory caches, each with the same data. If CAT_TABLE is small, this would be adequate. In between this would be to create a disk-backed cache Impl (or use the ones at SOLR-2613 or SOLR-2948) and specify it on cacheImpl. It would still create 3 identical caches, but they would be disk-backed and could scale beyond what in-memory can handle. James Dyer Ingram Content Group (615) 213-4311 -- View this message in context: http://lucene.472066.n3.nabble.com/Speed-up-import-of-Hierarchical-Data-tp4063924p4064040.html Sent from the Solr - User mailing list archive at Nabble.com.
RE: Speed up import of Hierarchical Data
Thank you James. I think I got this to work using CachedSqlEntityProcessor – and it seems extremely fast. I will try SortedMapBackedCache on Monday :-). Thank you, O. O. Dyer, James-2 wrote Using SqlEntityProcessor with cacheImpl=SortedMapBackedCache is the same as specifying CachedSqlEntityProcessor. Because the pluggable caches are only partially committed, I never added details to the wiki, so it still refers to CachedSEP. But its the same thing. What is new here, though, is that you don't have to use SortedMapBackedCache (this is an in-memory cache and can only scale to what fits in heap.) You can use an alternate cache (but none are included in the Solr distribution). Also, you can cache data this doesn't come from SQL. So its more flexible this way rather than the older CachedSEP. Here's the wiki link with an example: http://wiki.apache.org/solr/DataImportHandler#CachedSqlEntityProcessor James Dyer Ingram Content Group (615) 213-4311 -- View this message in context: http://lucene.472066.n3.nabble.com/Speed-up-import-of-Hierarchical-Data-tp4063924p4064297.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Speed up import of Hierarchical Data
That sounds like a perfect match for http://wiki.apache.org/solr/DataImportHandler#CachedSqlEntityProcessor :) On Thursday, May 16, 2013 at 6:01 PM, O. Olson wrote: I am using the DataImportHandler to Query a SQL Server and populate Solr. Unfortunately, SQL does not have an understanding of hierarchical relationships, and hence I use Table Joins. The following is an outline of my table structure: PROD_TABLE - SKU (Primary Key) - Title (varchar) - Descr (varchar) CAT_TABLE - SKU (Foreign Key) - CategoryLevel (int i.e. 1, 2, 3 …) - CategoryName (varchar) I specify the SQL Query in the db-data-config.xml file – a snippet of which looks like: dataConfig dataSource driver=com.microsoft.sqlserver.jdbc.SQLServerDriver url=jdbc:sqlserver://localhost\/ document entity name=Product query=SELECT SKU, Title, Descr FROM PROD_TABLE field column=SKU name=SKU / field column=Title name=Title / field column=Descr name=Descr / entity name=Cat1 query=SELECT CategoryName from CAT_TABLE where SKU='${Product.SKU}' AND CategoryLevel=1 field column=CategoryName name=Category1 / /entity entity name=Cat2 query=SELECT CategoryName from CAT_TABLE where SKU='${Product.SKU}' AND CategoryLevel=2 field column=CategoryName name=Category2 / /entity entity name=Cat3 query=SELECT CategoryName from CAT_TABLE where SKU='${Product.SKU}' AND CategoryLevel=3 field column=CategoryName name=Category3 / /entity /entity /document /dataConfig It seems like the DataImportHandler handler sends out three or four queries for each Product. This results in a very slow import. Is there any way to speed this up? I would not mind an intermediate step of first extracting SQL and then putting it into Solr. Thank you for all your help. O. O. -- View this message in context: http://lucene.472066.n3.nabble.com/Speed-up-import-of-Hierarchical-Data-tp4063924.html Sent from the Solr - User mailing list archive at Nabble.com (http://Nabble.com).
RE: Speed up import of Hierarchical Data
See https://issues.apache.org/jira/browse/SOLR-2943 . You can set up 2 DIH handlers. The first would query the CAT_TABLE and save it to a disk-backed cache, using DIHCacheWriter. You then would replace your 3 child entities in the 2nd DIH handler to use DIHCacheProcessor to read back the cached data. This is a little complicated to do, but it would let you just cache the data once and because it is disk-backed, will scale to whatever size the CAT_TABLE is. (For some details, see this thread: http://lucene.472066.n3.nabble.com/DIH-nested-entities-don-t-work-tt4015514.html) A simpler method is simply to specify cacheImpl=SortedMapBackedCache on the 3 child entities. (This is the same as using CachedSqlEntityProcessor.) It would generate 3 in-memory caches, each with the same data. If CAT_TABLE is small, this would be adequate. In between this would be to create a disk-backed cache Impl (or use the ones at SOLR-2613 or SOLR-2948) and specify it on cacheImpl. It would still create 3 identical caches, but they would be disk-backed and could scale beyond what in-memory can handle. James Dyer Ingram Content Group (615) 213-4311 -Original Message- From: O. Olson [mailto:olson_...@yahoo.it] Sent: Thursday, May 16, 2013 11:01 AM To: solr-user@lucene.apache.org Subject: Speed up import of Hierarchical Data I am using the DataImportHandler to Query a SQL Server and populate Solr. Unfortunately, SQL does not have an understanding of hierarchical relationships, and hence I use Table Joins. The following is an outline of my table structure: PROD_TABLE - SKU (Primary Key) - Title (varchar) - Descr (varchar) CAT_TABLE - SKU (Foreign Key) - CategoryLevel (int i.e. 1, 2, 3 …) - CategoryName (varchar) I specify the SQL Query in the db-data-config.xml file – a snippet of which looks like: dataConfig dataSource driver=com.microsoft.sqlserver.jdbc.SQLServerDriver url=jdbc:sqlserver://localhost\/ document entity name=Product query=SELECT SKU, Title, Descr FROM PROD_TABLE field column=SKU name=SKU / field column=Title name=Title / field column=Descr name=Descr / entity name=Cat1 query=SELECT CategoryName from CAT_TABLE where SKU='${Product.SKU}' AND CategoryLevel=1 field column=CategoryName name=Category1 / /entity entity name=Cat2 query=SELECT CategoryName from CAT_TABLE where SKU='${Product.SKU}' AND CategoryLevel=2 field column=CategoryName name=Category2 / /entity entity name=Cat3 query=SELECT CategoryName from CAT_TABLE where SKU='${Product.SKU}' AND CategoryLevel=3 field column=CategoryName name=Category3 / /entity /entity /document /dataConfig It seems like the DataImportHandler handler sends out three or four queries for each Product. This results in a very slow import. Is there any way to speed this up? I would not mind an intermediate step of first extracting SQL and then putting it into Solr. Thank you for all your help. O. O. -- View this message in context: http://lucene.472066.n3.nabble.com/Speed-up-import-of-Hierarchical-Data-tp4063924.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Speed up import of Hierarchical Data
Thank you Stefan. I am new to Solr and I would need to read up more on CachedSqlEntityProcessor. Do you have any clue where to begin? There do not seem to be any tutorials online. The link you provided seems to have a very short and unclear explanation. After “Example 1” you have “The usage is exactly same as the other one.” What does “other one” refer to? I did not understand the description completely. This description seems to say that if the query is the same as a prior query it would fetched from the cache. From my case each of the Category queries are unique because they have a unique SKU and Category Level. Would CachedSqlEntityProcessor then help me? Thank you, O. O. Stefan Matheis-2 wrote That sounds like a perfect match for http://wiki.apache.org/solr/DataImportHandler#CachedSqlEntityProcessor :) -- View this message in context: http://lucene.472066.n3.nabble.com/Speed-up-import-of-Hierarchical-Data-tp4063924p4064034.html Sent from the Solr - User mailing list archive at Nabble.com.
RE: Speed up import of Hierarchical Data
Thank you James. Are there any examples of SortedMapBackedCache? I am new to Solr and I do not find many tutorials in this regard. I just modified the examples and they worked for me. What is a good way to learn these basics? O. O. Dyer, James-2 wrote See https://issues.apache.org/jira/browse/SOLR-2943 . You can set up 2 DIH handlers. The first would query the CAT_TABLE and save it to a disk-backed cache, using DIHCacheWriter. You then would replace your 3 child entities in the 2nd DIH handler to use DIHCacheProcessor to read back the cached data. This is a little complicated to do, but it would let you just cache the data once and because it is disk-backed, will scale to whatever size the CAT_TABLE is. (For some details, see this thread: http://lucene.472066.n3.nabble.com/DIH-nested-entities-don-t-work-tt4015514.html) A simpler method is simply to specify cacheImpl=SortedMapBackedCache on the 3 child entities. (This is the same as using CachedSqlEntityProcessor.) It would generate 3 in-memory caches, each with the same data. If CAT_TABLE is small, this would be adequate. In between this would be to create a disk-backed cache Impl (or use the ones at SOLR-2613 or SOLR-2948) and specify it on cacheImpl. It would still create 3 identical caches, but they would be disk-backed and could scale beyond what in-memory can handle. James Dyer Ingram Content Group (615) 213-4311 -- View this message in context: http://lucene.472066.n3.nabble.com/Speed-up-import-of-Hierarchical-Data-tp4063924p4064040.html Sent from the Solr - User mailing list archive at Nabble.com.