Re: DataImportHandler: Problems with delta-import and CachedSqlEntityProcessor
Hi,

thanks for the other ideas. I worked around the problem using Noble Paul's idea, and it is working really well for me right now. My full-import takes around 40 minutes and my delta-import runs in less than 10 seconds, because it runs every minute. So that configuration seems to be pretty much optimal for my setup.

Idea 1 (SqlEntityProcessor with a parameterized cacheImpl): I will try it out some time soon.

Idea 2 (parameterizing the child queries with a where clause): I tried that one, but it slows down the full-import in my case too. One table I use the cache for has more rows than the root entity's table, so it has multiple rows per row of the root entity. A cache that is built up during the import does not help here, since cached rows are only used once. The only benefit I get from the cache is through the prefilling, which is a lot faster than reading the rows on demand.

Idea 3 (delta imports via full-import): This would probably also be slower than the current configuration, since the prefilling takes around 2 minutes. Since my delta-import currently runs every minute, that would not make sense.

Thanks and regards
Constantin

In reply to James Dyer's message of Thursday, 20 June 2013, 18:51.
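For reference, the delta-via-full-import approach (Idea 3, described on the wiki page DataImportHandlerDeltaQueryViaFullImport) might look roughly like the sketch below for the article entity from this thread. It is only an illustration under assumptions: the subquery against articleHistory is one guess at how to fold the thread's deltaQuery into the main query, and it has not been run against the poster's schema.

```xml
<!-- Sketch only: one root entity serves both full and delta runs. -->
<!-- With clean=false on the request, only recently modified rows match. -->
<entity name="article" dataSource="ds1" pk="myownid"
        query="SELECT * FROM article
               WHERE '${dataimporter.request.clean}' != 'false'
                  OR myownid IN (SELECT myownid FROM articleHistory
                                 WHERE modified_date &gt; '${dataimporter.last_index_time}')">
  <field column="myownid" name="id"/>
  <!-- child entities (supplier, attributes) would go here unchanged -->
</entity>
```

Under this scheme a delta run would be requested as command=full-import&clean=false, and a complete rebuild as command=full-import&clean=true. Because every run is then a full-import, cached child entities would presumably prefill on each run, which is the cost mentioned above.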
DataImportHandler: Problems with delta-import and CachedSqlEntityProcessor
Hi,

I searched for a solution for quite some time but did not manage to find any real hints on how to fix it.

I'm using Solr 4.3.0 (1477023 - simonw - 2013-04-29 15:10:12) running in a Tomcat 6 container.

My data import setup is basically the following:

data-config.xml:

    <entity name="article" dataSource="ds1"
            query="SELECT * FROM article"
            deltaQuery="SELECT myownid FROM articleHistory WHERE modified_date &gt; '${dih.last_index_time}'"
            deltaImportQuery="SELECT * FROM article WHERE myownid=${dih.delta.myownid}"
            pk="myownid">
      <field column="myownid" name="id"/>
      <entity name="supplier" dataSource="ds2"
              query="SELECT * FROM supplier WHERE status=1"
              processor="CachedSqlEntityProcessor"
              cacheKey="SUPPLIER_ID"
              cacheLookup="article.ARTICLE_SUPPLIER_ID">
      </entity>
      <entity name="attributes" dataSource="ds1"
              query="SELECT ARTICLE_ID,'Key:'+ATTRIBUTE_KEY+' Value:'+ATTRIBUTE_VALUE FROM attributes"
              cacheKey="ARTICLE_ID"
              cacheLookup="article.myownid"
              processor="CachedSqlEntityProcessor">
      </entity>
    </entity>

OK, now for the problem:

At first I tried everything without the cache, but the full-import took a very long time because the attributes query is pretty slow compared to the rest. As a result I got a processing speed of around 150 documents/s. When switching everything to the CachedSqlEntityProcessor, the full-import processed at a speed of 4000 documents/s. So the full-import is running quite fine.

Now I wanted to use the delta-import. When running the delta-import I was expecting the ramp-up time to be about the same as in the full-import, since I need to load the whole supplier and attributes tables into the cache in the first step. But when looking into the log file, the weird thing is that Solr seems to refresh the cache for every single document that is processed. So currently my delta-import is a lot slower than the full-import. I even tried to add the deltaImportQuery parameter to the entity, but it doesn't change the behavior at all (of course I know it is not supposed to change anything in the setup I run).

The following solutions would be possible in my opinion:

1. Is there any way to tell the config to ignore the cache when running a delta-import? That would help already, because we are talking about a maximum of 500 documents changed in 15 minutes compared to over 5 million documents in total.
2. Get Solr to not refresh the cache for every document.

Best regards
Constantin Wolber
Re: DataImportHandler: Problems with delta-import and CachedSqlEntityProcessor
It is possible to create two separate root entities: one for full-import and another for delta-import. For the delta-import you can skip the cache that way.

--
Noble Paul

In reply to Constantin Wolber's message of Thu, Jun 20, 2013 at 1:50 PM.
Re: DataImportHandler: Problems with delta-import and CachedSqlEntityProcessor
Hi,

and thanks for the answer. But I'm a little bit confused about what you are suggesting. I have not really used the rootEntity attribute before. From what I read in the documentation, as far as I can tell that would result in two documents, one for each root entity (maybe with the same id, which would probably result in only one document being stored).

It would be great if you could just sketch the setup with the entities I provided, because currently I have no idea how to do it.

Regards
Constantin

In reply to Noble Paul നോബിള് नोब्ळ्'s message of Thursday, 20 June 2013, 15:42.
Re: DataImportHandler: Problems with delta-import and CachedSqlEntityProcessor
Hi,

I may have been a little too fast with my response. After reading a bit more, I imagine you meant running the full-import with the entity param set to the root entity for full-import, and running the delta-import with the entity param set to the delta entity. Is that correct?

Regards
Constantin

In reply to Constantin Wolber's message of Thursday, 20 June 2013, 16:42.
Re: DataImportHandler: Problems with delta-import and CachedSqlEntityProcessor
Yes, that's right.

--
Noble Paul

In reply to Constantin Wolber's message of Thu, Jun 20, 2013 at 8:16 PM.
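The two-root-entity setup confirmed above could be sketched as follows, reusing the entities from the original question. This is a hedged illustration, not a tested configuration: the entity names article_full, article_delta, supplier_delta and attributes_delta are made up for this example, and the unquoted ${article_delta...} references assume numeric key columns.

```xml
<document>
  <!-- Root entity for full-import: child entities keep the prefilled cache. -->
  <entity name="article_full" dataSource="ds1" pk="myownid"
          query="SELECT * FROM article">
    <field column="myownid" name="id"/>
    <entity name="supplier" dataSource="ds2"
            processor="CachedSqlEntityProcessor"
            query="SELECT * FROM supplier WHERE status=1"
            cacheKey="SUPPLIER_ID"
            cacheLookup="article_full.ARTICLE_SUPPLIER_ID"/>
    <entity name="attributes" dataSource="ds1"
            processor="CachedSqlEntityProcessor"
            query="SELECT ARTICLE_ID,'Key:'+ATTRIBUTE_KEY+' Value:'+ATTRIBUTE_VALUE FROM attributes"
            cacheKey="ARTICLE_ID"
            cacheLookup="article_full.myownid"/>
  </entity>

  <!-- Root entity for delta-import: no cache, one keyed lookup per changed row. -->
  <entity name="article_delta" dataSource="ds1" pk="myownid"
          query="SELECT * FROM article"
          deltaQuery="SELECT myownid FROM articleHistory WHERE modified_date &gt; '${dih.last_index_time}'"
          deltaImportQuery="SELECT * FROM article WHERE myownid=${dih.delta.myownid}">
    <field column="myownid" name="id"/>
    <entity name="supplier_delta" dataSource="ds2"
            query="SELECT * FROM supplier WHERE status=1 AND SUPPLIER_ID=${article_delta.ARTICLE_SUPPLIER_ID}"/>
    <entity name="attributes_delta" dataSource="ds1"
            query="SELECT ARTICLE_ID,'Key:'+ATTRIBUTE_KEY+' Value:'+ATTRIBUTE_VALUE FROM attributes WHERE ARTICLE_ID=${article_delta.myownid}"/>
  </entity>
</document>
```

Each import would then name its root entity in the request, for example command=full-import&entity=article_full and command=delta-import&entity=article_delta (the handler path these are sent to depends on the local solrconfig.xml). A full-import issued without the entity parameter would run both root entities, so the parameter should always be set.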
RE: DataImportHandler: Problems with delta-import and CachedSqlEntityProcessor
Instead of specifying CachedSqlEntityProcessor, you can specify SqlEntityProcessor with cacheImpl='SortedMapBackedCache'. If you parameterize this, to have SortedMapBackedCache for full updates but blank for deltas, I think it will cache only on the full import.

Another option is to parameterize the child queries with a where clause, so if it is creating a new cache with every row, the cache will only contain the data needed for that child row.

A third option is to do your delta imports as described here:
http://wiki.apache.org/solr/DataImportHandlerDeltaQueryViaFullImport
My experience is that this generally performs better than using the delta import feature anyhow. The trick is in handling deletes, which will require its own entity and the $deleteDocById command. See http://wiki.apache.org/solr/DataImportHandler#Special_Commands

But these are all workarounds. This sounds like a bug or some subtle configuration problem. I looked through the JIRA issues and did not see anything like this reported yet, but if you're pretty sure you are doing everything correctly you may want to open a bug ticket. Be sure to flag it as contrib - Dataimporthandler.

James Dyer
Ingram Content Group
(615) 213-4311

In reply to Constantin Wolber's message of Thursday, June 20, 2013 3:21 AM.
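The first option, a parameterized cacheImpl, might be written roughly as below for the supplier child entity. This is a sketch of the suggestion as stated, not a verified configuration: the request parameter name cacheImpl is an arbitrary choice for this example, and whether an empty value really switches caching off is the untested guess ("I think") from the message above.

```xml
<!-- Sketch only: the cache implementation is taken from a request parameter. -->
<!-- Full-import passes cacheImpl=SortedMapBackedCache; delta-import omits it. -->
<entity name="supplier" dataSource="ds2"
        processor="SqlEntityProcessor"
        cacheImpl="${dataimporter.request.cacheImpl}"
        query="SELECT * FROM supplier WHERE status=1"
        cacheKey="SUPPLIER_ID"
        cacheLookup="article.ARTICLE_SUPPLIER_ID"/>
```

A full run would then be requested as command=full-import&cacheImpl=SortedMapBackedCache and a delta run as plain command=delta-import. Note that if caching is in fact disabled for the delta run, this query has no where clause tying it to the parent row, so it would likely need to be combined with the second option to return only the matching supplier.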