Efficient Sharding with date sorted queries
I have a Solr index which is going to grow 3x in the near future. I'm considering using distributed search and am contemplating the best approach to splitting the index. Since most of the searches performed on the index are sorted by date descending, I'm considering splitting the index based on the created date of the documents.

From Yonik Seeley's blog post, http://yonik.wordpress.com/2008/02/27/distributed-search-for-solr/, I've read that there are two phases to a distributed search. The first phase collects matching ids and documents across the shards. The second phase then collects the stored fields for the documents. I'm assuming that the second phase's execution is limited by the number of rows requested and the number of results.

So let's say I have 2 shards. The first shard has docs with creation dates of this year. The second shard contains documents from the previous year. I run a Solr query requesting 10 rows sorted by date and get 11 matches from the first shard and 3 from the second. Will the initial query only execute the first phase on the second shard? If so, that should result in better performance, right?

Thanks,
-Tim
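The scenario in this question can be worked through with a toy simulation. The sketch below is not Solr's code; it only models the two-phase flow as the post describes it, and every name in it (phase_one, distributed_search, the shard names, the document ids and dates) is made up for illustration. Phase one asks each shard for ids and sort values, and phase two fetches stored fields only from shards that own a document in the merged top list.

```python
from collections import defaultdict
from datetime import date


def phase_one(shard_docs, rows):
    """Return up to `rows` (created, doc_id) pairs from one shard, newest first."""
    ranked = sorted(((d["created"], d["id"]) for d in shard_docs), reverse=True)
    return ranked[:rows]


def distributed_search(shards, rows):
    """Merge per-shard top lists, then fetch stored fields only where needed."""
    # Phase 1: every shard is asked for ids and sort values only.
    candidates = []
    for name, docs in shards.items():
        for created, doc_id in phase_one(docs, rows):
            candidates.append((created, doc_id, name))
    top = sorted(candidates, reverse=True)[:rows]

    # Phase 2: group the winning ids by shard; shards with no winners are skipped.
    ids_by_shard = defaultdict(list)
    for _, doc_id, name in top:
        ids_by_shard[name].append(doc_id)

    stored = {}
    for name, ids in ids_by_shard.items():
        by_id = {d["id"]: d for d in shards[name]}
        for doc_id in ids:
            stored[doc_id] = by_id[doc_id]

    results = [stored[doc_id] for _, doc_id, _ in top]
    return results, sorted(ids_by_shard)


# Hypothetical data matching the question: 11 matches this year, 3 last year.
shards = {
    "this_year": [{"id": f"a{i}", "created": date(2009, 1, i + 1)} for i in range(11)],
    "last_year": [{"id": f"b{i}", "created": date(2008, 6, i + 1)} for i in range(3)],
}

docs, contacted = distributed_search(shards, rows=10)
print(len(docs), contacted)
```

In this toy model both shards still pay for phase one, but only the shard holding this year's documents is contacted in phase two, because all ten of the newest documents live there.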
solr_hostname in scripts.conf
I've a question. Is it safe to use 'localhost' as solr_hostname in scripts.conf? -- -Tim
Re: DataImportHandler and delta-import question
Yes, the dataimport.properties file is present in the conf directory from previous imports. I'll try the trunk version as you suggested to see if the problem persists.

Thanks,
Tim

On Wed, Mar 4, 2009 at 7:54 PM, Noble Paul നോബിള് नोब्ळ् noble.p...@gmail.com wrote:

The dataimport.properties file is created only after one successful import, so it is available only from the second import onwards. You can probably create one manually and put it in the conf dir.

On Thu, Mar 5, 2009 at 12:52 AM, Garafola Timothy timgaraf...@gmail.com wrote:

Thanks, I set up another test instance of Solr and ran a full import within the DIH Development Console. I examined the query and found that last_index_time is not getting set in the query. Yet the value does get updated after a full import completes (outside of the development console). Is there some place that I need to set the path to the dataimport.properties file?

On Tue, Mar 3, 2009 at 8:03 PM, Noble Paul നോബിള് नोब्ळ् noble.p...@gmail.com wrote:

I do not see anything wrong with this. It should have worked. Can you check that dataimport.properties is created (by DIH) in the conf directory, and check its content? Are you sure that the query

  select DId from 2_Doc where ModifiedDate > '${dataimporter.last_index_time}'

works with the date format yyyy-MM-dd HH:mm:ss? This is the format in which DIH sends the date. If the format is wrong, you may need to format it using a dateformat function. See http://wiki.apache.org/solr/DataImportHandler#head-5675e913396a42eb7c6c5d3c894ada5dadbb62d7

The trunk DIH can work with Solr 1.3 (you may need to add the DIH jar and slf4j).

On Wed, Mar 4, 2009 at 3:53 AM, Garafola Timothy timgaraf...@gmail.com wrote:

I'm using Solr 1.3 and am trying to get a delta-import with the DIH. Recently the wiki, http://wiki.apache.org/solr/DataImportHandler, was updated to explain that delta import is now a 1.4 feature, but that it is still possible to get a delta using the full-import example here, http://wiki.apache.org/solr/DataImportHandlerFaq#fullimportdelta. I tried this, but each time I run DIH it reimports all rows and updates. Below is my data-config.xml. I set rootEntity to false and issued command=full-import&clean=false&optimize=false through DIH. Am I doing something wrong here, or is the DataImportHandlerFaq incorrect?

<dataConfig>
  <dataSource driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://pencil-somewhere.com:2/SomeDB"
              user="someUser" password="somePassword"/>
  <document name="">
    <entity name="item" rootEntity="false"
            query="select DId from 2_Doc
                   where ModifiedDate &gt; '${dataimporter.last_index_time}'
                   and DocType != 'Research Articles'">
      <entity name="feature" pk="DId" transformer="RegexTransformer"
              query="SELECT d.DId, d.SiteId, d.DocTitle, d.DocURL, d.DocDesc, d.DocType,
                            d.Tags, d.Source, d.Last90DaysRFIsPercent, d.ModifiedDate,
                            d.DocGuid, d.Author, i.Industry
                     FROM 2_Doc d
                     LEFT OUTER JOIN tmp_DocIndustry i
                       ON (d.DocId=i.DocId AND d.SiteId=i.SiteId)
                     where d.DocType != 'Research articles'
                       and d.DId = '${item.DId}'
                       and d.ModifiedDate &gt; '${dataimporter.last_index_time}'">
        <field column="DId" name="did"/>
        <field column="SiteId" name="SiteId"/>
        <field column="DocId" name="DocId"/>
        <field column="DocTitle" name="DocTitle"/>
        <field column="DocURL" name="DocURL"/>
        <field column="DocDesc" name="DocDesc"/>
        <field column="Snippet" regex="^(.{0,800})\b.*$" sourceColName="DocDesc"/>
        <field column="DocType" name="DocType"/>
        <field column="Tags" name="Tags" splitBy=";" sourceColName="Tags"/>
        <field column="Source" name="Source"/>
        <field column="Last90DaysRFIsPercent" name="Last90DaysRFIsPercent"/>
        <field column="ModifiedDate" name="ModifiedDate"/>
        <field column="DocGuid" name="DocGuid"/>
        <field column="Author" name="Author"/>
        <field column="Industry" name="Industry" sourceColName="Industry"/>
      </entity>
    </entity>
  </document>
</dataConfig>

Thanks,
-Tim
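To check by hand what the delta query should look like once last_index_time is filled in, the substitution can be reproduced outside Solr. This is only a sketch of the idea, not DIH's code: the helper names and the sample file content are assumptions, and the file is assumed to be a Java-style properties file in which ':' is escaped as '\:'. The date format is the yyyy-MM-dd HH:mm:ss one mentioned in the thread.

```python
from datetime import datetime

# Python equivalent of the yyyy-MM-dd HH:mm:ss format mentioned in the thread.
DIH_DATE_FORMAT = "%Y-%m-%d %H:%M:%S"


def read_last_index_time(properties_text):
    """Extract last_index_time from properties-style text, or None if absent."""
    for line in properties_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, _, value = line.partition("=")
        if key.strip() == "last_index_time":
            # Assumption: colons are stored escaped, as Java properties files do.
            unescaped = value.strip().replace("\\:", ":")
            return datetime.strptime(unescaped, DIH_DATE_FORMAT)
    return None


def render_query(template, last_index_time):
    """Fill the ${dataimporter.last_index_time} placeholder in a query template."""
    value = last_index_time.strftime(DIH_DATE_FORMAT)
    return template.replace("${dataimporter.last_index_time}", value)


# Hypothetical file content, for illustration only.
SAMPLE_PROPERTIES = "#Wed Mar 04 19:54:00 EST 2009\nlast_index_time=2009-03-04 19\\:54\\:00\n"
TEMPLATE = "select DId from 2_Doc where ModifiedDate > '${dataimporter.last_index_time}'"

query = render_query(TEMPLATE, read_last_index_time(SAMPLE_PROPERTIES))
print(query)
```

Running the rendered query directly against the database is a quick way to confirm that the date literal is accepted and that the filter actually narrows the rows.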
Re: DataImportHandler and delta-import question
I tried updating the Solr instance I'm testing DIH with, adding the dataimport and slf4j jar files to Solr. When I start Solr, I get the following error. Is there something else which needs to be installed for the nightly build version of DIH to work in Solr release 1.3?

Thanks,
Tim

java.lang.NoClassDefFoundError: org/apache/solr/update/RollbackUpdateCommand
  at org.apache.solr.handler.dataimport.DataImportHandler.inform(DataImportHandler.java:95)
  at org.apache.solr.core.SolrResourceLoader.inform(SolrResourceLoader.java:311)
  at org.apache.solr.core.SolrCore.init(SolrCore.java:480)
  at org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:119)
  at org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:69)
  at com.caucho.server.dispatch.FilterManager.createFilter(FilterManager.java:134)
  at com.caucho.server.dispatch.FilterManager.init(FilterManager.java:87)
  at com.caucho.server.webapp.Application.start(Application.java:1655)
  at com.caucho.server.deploy.DeployController.startImpl(DeployController.java:621)
  at com.caucho.server.deploy.StartAutoRedeployAutoStrategy.startOnInit(StartAutoRedeployAutoStrategy.java:72)
  at com.caucho.server.deploy.DeployController.startOnInit(DeployController.java:509)
  at com.caucho.server.deploy.DeployContainer.start(DeployContainer.java:153)
  at com.caucho.server.webapp.ApplicationContainer.start(ApplicationContainer.java:670)
  at com.caucho.server.host.Host.start(Host.java:420)
  at com.caucho.server.deploy.DeployController.startImpl(DeployController.java:621)
  at com.caucho.server.deploy.StartAutoRedeployAutoStrategy.startOnInit(StartAutoRedeployAutoStrategy.java:72)
  at com.caucho.server.deploy.DeployController.startOnInit(DeployController.java:509)
  at com.caucho.server.deploy.DeployContainer.start(DeployContainer.java:153)
  at com.caucho.server.host.HostContainer.start(HostContainer.java:504)
  at com.caucho.server.resin.ServletServer.start(ServletServer.java:971)
  at com.caucho.server.deploy.DeployController.startImpl(DeployController.java:621)
  at com.caucho.server.deploy.AbstractDeployControllerStrategy.start(AbstractDeployControllerStrategy.java:56)
  at com.caucho.server.deploy.DeployController.start(DeployController.java:517)
  at com.caucho.server.resin.ResinServer.start(ResinServer.java:551)
  at com.caucho.server.resin.Resin.init(Resin.java)
  at com.caucho.server.resin.Resin.main(Resin.java:625)
Caused by: java.lang.ClassNotFoundException: org.apache.solr.update.RollbackUpdateCommand
  at com.caucho.loader.DynamicClassLoader.findClass(DynamicClassLoader.java:1130)
  at com.caucho.loader.DynamicClassLoader.loadClass(DynamicClassLoader.java:1072)
  at com.caucho.loader.DynamicClassLoader.loadClass(DynamicClassLoader.java:1021)
  at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:320)
  ... 26 more
Re: DataImportHandler and delta-import question
Thanks. Can you recommend a build I can try?

On Thu, Mar 5, 2009 at 3:09 PM, Marc Sturlese marc.sturl...@gmail.com wrote:

I am not sure whether RollbackUpdateCommand had been developed yet in the official Solr 1.3 release. I think it's only in the nightly builds. It looks like your dataimport package is too new. I think you should try to use that dataimport release with a Solr nightly, or try to grab an older dataimport release.
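One way to confirm a mismatch like this is to look inside the Solr jar on the classpath, since a jar is an ordinary zip archive. The sketch below is a generic check, not a Solr tool; the function name and the demo jar files are invented for illustration, and the demo builds two stand-in jars rather than reading a real Solr distribution.

```python
import os
import tempfile
import zipfile


def jar_contains_class(jar_path, class_name):
    """Return True if the jar (a zip archive) has an entry for class_name."""
    entry = class_name.replace(".", "/") + ".class"
    with zipfile.ZipFile(jar_path) as jar:
        return entry in jar.namelist()


def make_demo_jar(path, entries):
    """Write a stand-in jar holding empty files with the given entry names."""
    with zipfile.ZipFile(path, "w") as jar:
        for entry in entries:
            jar.writestr(entry, b"")


MISSING = "org.apache.solr.update.RollbackUpdateCommand"

# Hypothetical jars: one without the class (like the failing setup), one with it.
demo_dir = tempfile.mkdtemp()
old_jar = os.path.join(demo_dir, "solr-core-demo-old.jar")
new_jar = os.path.join(demo_dir, "solr-core-demo-new.jar")
make_demo_jar(old_jar, ["org/apache/solr/update/UpdateCommand.class"])
make_demo_jar(new_jar, [
    "org/apache/solr/update/UpdateCommand.class",
    "org/apache/solr/update/RollbackUpdateCommand.class",
])

print(jar_contains_class(old_jar, MISSING), jar_contains_class(new_jar, MISSING))
```

Pointing jar_contains_class at the real Solr jar in the webapp's lib directory would show whether the class named in the stack trace is present in the installed release.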
Re: DataImportHandler and delta-import question
Thanks, I set up another test instance of Solr and ran a full import within the DIH Development Console. I examined the query and found that last_index_time is not getting set in the query. Yet the value does get updated after a full import completes (outside of the development console). Is there some place that I need to set the path to the dataimport.properties file?

-Tim