Efficient Sharding with date sorted queries

2009-06-12 Thread Garafola Timothy
I have a solr index which is going to grow 3x in the near future.  I'm
considering using distributed search and was contemplating what would
be the best approach to splitting the index.  Since most of the
searches performed on the index are sorted by date descending, I'm
considering splitting the index based on the created date of the
documents.

From Yonik Seeley's blog post,
http://yonik.wordpress.com/2008/02/27/distributed-search-for-solr/,
I've read that there are two phases to sharding.  The first phase
collects matching ids and documents across the shards.  Then the
second phase collects the stored fields for the documents.  I'm
assuming that this second phase's execution is limited by the number
of rows requested and the number of results.

So let's say I have 2 shards.  The first shard has docs with creation
dates of this year.  The second shard contains documents from the
previous year.  I run a Solr query requesting 10 rows sorted by date
and get 11 matches from the first shard and 3 from the second.  Will the
initial query only execute the first phase on the second shard?  If
so, that should result in better performance, right?
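
The scenario above can be sketched as a toy model. This is not Solr's code: the shard names, ids, dates, and titles are all made up, and it assumes that the second phase only contacts shards that own a document in the merged top rows.

```python
from datetime import date

# Toy shards: each maps a document id to (created_date, stored_fields).
# All ids, dates, and titles here are made up for illustration.
SHARDS = {
    "shard_this_year": {
        f"a{i}": (date(2009, 1, 1 + i), {"title": f"doc a{i}"}) for i in range(11)
    },
    "shard_last_year": {
        f"b{i}": (date(2008, 6, 1 + i), {"title": f"doc b{i}"}) for i in range(3)
    },
}


def phase_one(rows):
    """Ask every shard for the ids and sort values of its top `rows` docs, then merge."""
    candidates = []
    for shard, docs in SHARDS.items():
        top = sorted(docs.items(), key=lambda kv: kv[1][0], reverse=True)[:rows]
        candidates.extend((created, doc_id, shard) for doc_id, (created, _) in top)
    # Global merge: newest first, keep only the requested number of rows.
    return sorted(candidates, reverse=True)[:rows]


def phase_two(merged):
    """Fetch stored fields, contacting only shards that own a merged hit."""
    contacted = set()
    results = []
    for _created, doc_id, shard in merged:
        contacted.add(shard)
        results.append(SHARDS[shard][doc_id][1])
    return results, contacted


merged = phase_one(rows=10)
results, contacted = phase_two(merged)
print(len(results), sorted(contacted))
```

In this model every shard still takes part in the first phase, but only the shard holding this year's documents is asked for stored fields, because it alone fills all 10 rows.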


Thanks,
-Tim


solr_hostname in scripts.conf

2009-03-25 Thread Garafola Timothy
I've a question.  Is it safe to use 'localhost' as solr_hostname in
scripts.conf?

-- 
-Tim


Re: DataImportHandler and delta-import question

2009-03-05 Thread Garafola Timothy
yes, the dataimport.properties file is present in the conf directory
from previous imports.  I'll try the trunk version as you suggested to
see if the problem persists.

Thanks,
Tim

On Wed, Mar 4, 2009 at 7:54 PM, Noble Paul നോബിള്‍  नोब्ळ्
noble.p...@gmail.com wrote:
 The dataimport.properties file is created only after one successful import,
 so it is available only from the second import onwards. You can probably
 create one manually and put it in the conf dir.
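
A minimal sketch of creating the file by hand. It assumes the file holds a single last_index_time key in Java properties syntax; the helper name and the timestamp in the demo are hypothetical, and the demo writes to a temporary directory rather than a real conf dir.

```python
import tempfile
from datetime import datetime
from pathlib import Path


def write_dataimport_properties(conf_dir, last_index_time):
    """Write a minimal dataimport.properties into conf_dir and return its path."""
    stamp = last_index_time.strftime("%Y-%m-%d %H:%M:%S")
    # Java's Properties.store() escapes ':' as '\:', so mirror that here.
    escaped = stamp.replace(":", "\\:")
    path = Path(conf_dir) / "dataimport.properties"
    path.write_text(f"last_index_time={escaped}\n", encoding="utf-8")
    return path


# Demo against a throwaway directory (assumption: point this at your conf dir).
with tempfile.TemporaryDirectory() as demo_conf:
    written = write_dataimport_properties(demo_conf, datetime(2009, 3, 4, 19, 54, 0))
    content = written.read_text(encoding="utf-8")
    print(content)
```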

 On Thu, Mar 5, 2009 at 12:52 AM, Garafola Timothy timgaraf...@gmail.com 
 wrote:
 Thanks,

 I set up another test instance of solr and ran a full import within
 the DIH Development Console.  I examined the query and found that
 last_index_time is not getting set in the query.  Yet the value does
 get updated after a full import completes (outside of the development
 console).  Is there some place that I need to set the path to the
 dataimport.properties file?

 On Tue, Mar 3, 2009 at 8:03 PM, Noble Paul നോബിള്‍  नोब्ळ्
 noble.p...@gmail.com wrote:
 I do not see anything wrong with this. It should have worked. Can you
 check that dataimport.properties is created (by DIH) in the conf
 directory, and check its content?


 Are you sure that the query

 select DId from 2_Doc where ModifiedDate > '${dataimporter.last_index_time}'

 works with the date format yyyy-MM-dd HH:mm:ss? This is the format
 in which DIH sends the date. If the format is wrong, you may need to
 format it using a dateformat function.

 see here

 http://wiki.apache.org/solr/DataImportHandler#head-5675e913396a42eb7c6c5d3c894ada5dadbb62d7
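
One way to check the format question is to render the query by hand and run the result against the database. The sketch below uses plain string substitution as a stand-in for DIH's variable resolution, and the timestamp is a made-up value.

```python
from datetime import datetime

# Java pattern yyyy-MM-dd HH:mm:ss expressed as a strftime format.
DIH_DATE_FORMAT = "%Y-%m-%d %H:%M:%S"

QUERY_TEMPLATE = (
    "select DId from 2_Doc "
    "where ModifiedDate > '${dataimporter.last_index_time}'"
)


def render_query(last_index_time):
    """Fill the placeholder with a timestamp in the yyyy-MM-dd HH:mm:ss format."""
    stamp = last_index_time.strftime(DIH_DATE_FORMAT)
    return QUERY_TEMPLATE.replace("${dataimporter.last_index_time}", stamp)


rendered = render_query(datetime(2009, 3, 3, 20, 3, 0))
print(rendered)
```

If the rendered statement returns every row, or none, when pasted into a MySQL client, the comparison against ModifiedDate is the place to look.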


  The trunk DIH can work with Solr1.3 (you may need to put the DIH jar
 and slf4j). Can
 On Wed, Mar 4, 2009 at 3:53 AM, Garafola Timothy timgaraf...@gmail.com 
 wrote:
 I'm using solr 1.3 and am trying to get a delta-import with the DIH.
 Recently the wiki, http://wiki.apache.org/solr/DataImportHandler, was
 updated explaining that delta import is a 1.4 feature now but it was
 still possible to get a delta using the full import example here,
 http://wiki.apache.org/solr/DataImportHandlerFaq#fullimportdelta.  I
 tried this but each time I run DIH, it reimports all rows and updates.

 Below is my data-config.xml.  I set rootEntity to false and issued
 command=full-import&clean=false&optimize=false through DIH.  Am I
 doing something wrong here or is the DataImportHandlerFaq incorrect?
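
For reference, the request with those parameters can be assembled as below. The host, port, and handler path are assumptions for illustration; only the three parameters come from the message.

```python
from urllib.parse import urlencode

# Hypothetical handler location; adjust host, port, and path to your setup.
DIH_ENDPOINT = "http://localhost:8983/solr/dataimport"

# Parameters as described in the message above.
params = {"command": "full-import", "clean": "false", "optimize": "false"}
url = f"{DIH_ENDPOINT}?{urlencode(params)}"
print(url)
```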

 <dataConfig>
   <dataSource driver="com.mysql.jdbc.Driver"
               url="jdbc:mysql://pencil-somewhere.com:2/SomeDB"
               user="someUser" password="somePassword"/>
   <document name="">
     <entity name="item" rootEntity="false"
             query="select DId from 2_Doc where
                    ModifiedDate > '${dataimporter.last_index_time}'
                    and DocType != 'Research Articles'">
       <entity name="feature" pk="DId" transformer="RegexTransformer"
               query="SELECT d.DId, d.SiteId, d.DocTitle, d.DocURL, d.DocDesc,
                      d.DocType, d.Tags, d.Source, d.Last90DaysRFIsPercent,
                      d.ModifiedDate, d.DocGuid, d.Author,
                      i.Industry FROM 2_Doc d LEFT OUTER JOIN tmp_DocIndustry i
                      ON (d.DocId=i.DocId AND d.SiteId=i.SiteId)
                      where d.DocType != 'Research articles'
                      and d.DId = '${item.DId}' and
                      d.ModifiedDate > '${dataimporter.last_index_time}'">
         <field column="DId" name="did"/>
         <field column="SiteId" name="SiteId"/>
         <field column="DocId" name="DocId"/>
         <field column="DocTitle" name="DocTitle"/>
         <field column="DocURL" name="DocURL"/>
         <field column="DocDesc" name="DocDesc"/>
         <field column="Snippet" regex="^(.{0,800})\b.*$" sourceColName="DocDesc"/>
         <field column="DocType" name="DocType"/>
         <field column="Tags" name="Tags" splitBy=";" sourceColName="Tags"/>
         <field column="Source" name="Source"/>
         <field column="Last90DaysRFIsPercent" name="Last90DaysRFIsPercent"/>
         <field column="ModifiedDate" name="ModifiedDate"/>
         <field column="DocGuid" name="DocGuid"/>
         <field column="Author" name="Author"/>
         <field column="Industry" name="Industry" sourceColName="Industry"/>
       </entity>
     </entity>
   </document>
 </dataConfig>

 Thanks,
 -Tim




 --
 --Noble Paul




 --
 -Tim




 --
 --Noble Paul




-- 
-Tim


Re: DataImportHandler and delta-import question

2009-03-05 Thread Garafola Timothy
I tried updating the solr instance I'm testing DIH with, adding the
dataimport and slf4j jar files to solr.

When I start solr, I get the following error.  Is there something else
which needs to be installed for the nightly build version of DIH to
work in solr release 1.3?

Thanks,
Tim


java.lang.NoClassDefFoundError: org/apache/solr/update/RollbackUpdateCommand
    at org.apache.solr.handler.dataimport.DataImportHandler.inform(DataImportHandler.java:95)
    at org.apache.solr.core.SolrResourceLoader.inform(SolrResourceLoader.java:311)
    at org.apache.solr.core.SolrCore.init(SolrCore.java:480)
    at org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:119)
    at org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:69)
    at com.caucho.server.dispatch.FilterManager.createFilter(FilterManager.java:134)
    at com.caucho.server.dispatch.FilterManager.init(FilterManager.java:87)
    at com.caucho.server.webapp.Application.start(Application.java:1655)
    at com.caucho.server.deploy.DeployController.startImpl(DeployController.java:621)
    at com.caucho.server.deploy.StartAutoRedeployAutoStrategy.startOnInit(StartAutoRedeployAutoStrategy.java:72)
    at com.caucho.server.deploy.DeployController.startOnInit(DeployController.java:509)
    at com.caucho.server.deploy.DeployContainer.start(DeployContainer.java:153)
    at com.caucho.server.webapp.ApplicationContainer.start(ApplicationContainer.java:670)
    at com.caucho.server.host.Host.start(Host.java:420)
    at com.caucho.server.deploy.DeployController.startImpl(DeployController.java:621)
    at com.caucho.server.deploy.StartAutoRedeployAutoStrategy.startOnInit(StartAutoRedeployAutoStrategy.java:72)
    at com.caucho.server.deploy.DeployController.startOnInit(DeployController.java:509)
    at com.caucho.server.deploy.DeployContainer.start(DeployContainer.java:153)
    at com.caucho.server.host.HostContainer.start(HostContainer.java:504)
    at com.caucho.server.resin.ServletServer.start(ServletServer.java:971)
    at com.caucho.server.deploy.DeployController.startImpl(DeployController.java:621)
    at com.caucho.server.deploy.AbstractDeployControllerStrategy.start(AbstractDeployControllerStrategy.java:56)
    at com.caucho.server.deploy.DeployController.start(DeployController.java:517)
    at com.caucho.server.resin.ResinServer.start(ResinServer.java:551)
    at com.caucho.server.resin.Resin.init(Resin.java)
    at com.caucho.server.resin.Resin.main(Resin.java:625)
Caused by: java.lang.ClassNotFoundException: org.apache.solr.update.RollbackUpdateCommand
    at com.caucho.loader.DynamicClassLoader.findClass(DynamicClassLoader.java:1130)
    at com.caucho.loader.DynamicClassLoader.loadClass(DynamicClassLoader.java:1072)
    at com.caucho.loader.DynamicClassLoader.loadClass(DynamicClassLoader.java:1021)
    at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:320)
    ... 26 more


Re: DataImportHandler and delta-import question

2009-03-05 Thread Garafola Timothy
Thanks.  Can you recommend a build I can try?

On Thu, Mar 5, 2009 at 3:09 PM, Marc Sturlese marc.sturl...@gmail.com wrote:

 I am not sure whether RollbackUpdateCommand had been developed yet in the official solr
 1.3 release. I think it's just in the nightly builds. Looks like your
 dataimport package is too new. I think you should try to use that dataimport
 release with a solr nightly or try to grab an older dataimport release.
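
A quick way to confirm a mismatch like this is to look inside the Solr jar for the missing class. The sketch below is a generic jar check, not part of Solr; the helper name is hypothetical and the demo builds a throwaway jar so it runs anywhere.

```python
import tempfile
import zipfile
from pathlib import Path


def jar_contains_class(jar_path, class_name):
    """Return True if the jar has an entry for the fully qualified class name."""
    entry = class_name.replace(".", "/") + ".class"
    with zipfile.ZipFile(jar_path) as jar:
        return entry in jar.namelist()


# Demo with a throwaway jar that lacks the class from the stack trace.
with tempfile.TemporaryDirectory() as tmp:
    demo_jar = Path(tmp) / "demo.jar"
    with zipfile.ZipFile(demo_jar, "w") as jar:
        jar.writestr("org/apache/solr/update/UpdateCommand.class", b"")
    found = jar_contains_class(demo_jar, "org.apache.solr.update.RollbackUpdateCommand")
    print(found)
```

Pointing jar_contains_class at the core Solr jar of the running installation shows whether that release ships the class the newer DIH jar expects.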


Re: DataImportHandler and delta-import question

2009-03-04 Thread Garafola Timothy
Thanks,

I set up another test instance of solr and ran a full import within
the DIH Development Console.  I examined the query and found that
last_index_time is not getting set in the query.  Yet the value does
get updated after a full import completes (outside of the development
console).  Is there some place that I need to set the path to the
dataimport.properties file?

-- 
-Tim