Re: Interesting stuff; Solr as a syslog store.
Thanks Antonio for sharing this. I believe this could be one of the interesting case studies for Solr In Action - if you are interested in sharing a bit more, I am sure the authors would be very interested in it for upcoming revisions. -- K K.

On 02/12/2010 06:02 PM, Antonio Lobato wrote: Hey everyone, I don't actually have a question, but I just thought I'd share something really cool that I did with Solr for our company. We run a good number of servers, well into the several hundreds, and naturally we need a way to centralize all of the system logs. For a while we used a commercial solution to centralize and search our logs, but they wanted to charge us tens of thousands of dollars for just one gigabyte/day more of indexed data. So I said forget it, I'll write my own solution! We already use Solr for some of our other backend search systems, so I came up with an idea to index all of our logs into Solr. I wrote a daemon in Perl that listens on the syslog port, and pointed every single system's syslog to forward to this single server. From there, the daemon parses each message into fields such as date/time, host, program, pid and text, and writes them to a Solr indexing server. I then wrote a cool javascript/ajax web front end for Solr searching, and bam: real-time searching of all of our syslogs from a web interface, for no cost! Just thought this would be a neat story to share with you all. I've really grown to love Solr, it's something else! Thanks, -Antonio
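As a rough illustration of the indexing side of such a setup - here using SolrJ rather than the Perl daemon described above - with the field names and the Solr URL as assumptions:

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class SyslogIndexer {
        public static void main(String[] args) throws Exception {
            // Hypothetical URL of the central Solr indexing server.
            SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");

            // One syslog line, already split into fields by the receiving daemon.
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("timestamp", "2010-02-12T18:02:00Z");
            doc.addField("host", "web01");
            doc.addField("program", "sshd");
            doc.addField("pid", "4242");
            doc.addField("message", "Accepted publickey for deploy from 10.0.0.5");
            solr.add(doc);
            solr.commit(); // or rely on autoCommit to batch commits for near-real-time search
        }
    }

In practice the daemon would keep one SolrServer instance and add documents continuously, letting autoCommit control how often new log lines become searchable.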
Re: Searching .msg files
I remember seeing a similar thread on the Lucene user mailing list; you can check its archives. As for strategies, there are two obvious ones: * you can create an index per user, store the email content involving that user in it, and use it for search; (or) * you can have one gigantic index with the To/Cc names as fields, and every search by a given user goes through an initial filter pass on those fields. Solr can of course index a variety of content (see the Tika project) and is not restricted to XML at all. You would need to weigh the pros and cons of each approach depending on the corpus of data you are talking about and the usage / performance expectations of the search. Once you identify the appropriate strategy, you can define the Solr schema for the fields accordingly.

Abhishek Srivastava wrote: Hello Everyone, In my company, we store a lot of old emails (.msg files) in a database (done for the purpose of legal compliance). The users have been asking us to give search functionality on the old emails. One of the primary requirements is that when people search, they should only be able to search their own emails (emails in which they were in the to, cc or bcc list). How can Solr be used? From what I know about this product, it only searches XML content... so I will have to extract the body of the email and convert it to XML, right? How will I limit the search results to only those emails where the user who is searching was in the to, cc or bcc list? Please do recommend an approach for providing a solution to our requirement.
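A minimal sketch of the second strategy (one big index, filter pass per user) using SolrJ; the field names (to, cc, bcc, body) and the URL are assumptions for illustration:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.common.SolrDocumentList;

    public class OwnMailSearch {
        public static void main(String[] args) throws Exception {
            SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");

            String user = "jane.doe@example.com";
            SolrQuery q = new SolrQuery("body:contract");
            // Restrict every search to mails the searching user was addressed on.
            q.addFilterQuery("to:\"" + user + "\" OR cc:\"" + user + "\" OR bcc:\"" + user + "\"");

            SolrDocumentList results = solr.query(q).getResults();
            System.out.println("hits: " + results.getNumFound());
        }
    }

The filter query is applied server-side, so the user-visibility rule never depends on the free-text part of the query.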
Re: latency in solr response is observed after index is updated
What is the average doc size? What is the autoCommit frequency set in solrconfig.xml? Another place to look is the field cache size and the nature of the warmup queries run after a new searcher is created (which happens on a commit).

Bharath Venkatesh wrote: Hi Kalidoss, I am not aware of using solr-config for committing the document, but I have mentioned below how we update and commit documents: curl http://solr_url/update --data-binary @feeds.xml -H 'Content-type:text/xml; charset=utf-8' curl http://solr_url/update --data-binary '<commit/>' -H 'Content-type:text/xml; charset=utf-8' where feeds.xml contains the documents in XML format. We have master and slave replication for the Solr servers: updates happen on the master, and snappuller and snapinstaller are run on the slaves periodically. Queries don't happen on the master, only on the slaves. Is there anything which can be said from the above information? Thanks, Bharath -Original Message- From: kalidoss [mailto:kalidoss.muthuramalin...@sifycorp.com] Sent: Tue 12/1/2009 2:38 PM To: solr-user@lucene.apache.org Subject: Re: latency in solr response is observed after index is updated Are you using solr-config for committing the document? bharath venkatesh wrote: Hi, We are observing latency (sometimes huge latency, up to 10-20 secs) in Solr responses after the index is updated. What is the reason for this latency and how can it be minimized? Note: our index size is pretty large. Any help would be appreciated as we are largely affected by it. Thanks in advance. Bharath
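For reference, a sketch of the two solrconfig.xml areas mentioned above (the autoCommit block lives inside updateHandler); the values and the warming query are placeholders, not recommendations:

    <!-- how often pending documents are committed automatically -->
    <autoCommit>
      <maxDocs>10000</maxDocs>
      <maxTime>60000</maxTime> <!-- milliseconds -->
    </autoCommit>

    <!-- queries run against a new searcher (opened after a commit) before it serves traffic -->
    <listener event="newSearcher" class="solr.QuerySenderListener">
      <arr name="queries">
        <lst> <str name="q">popular query here</str> <str name="start">0</str> <str name="rows">10</str> </lst>
      </arr>
    </listener>

Warming the new searcher with representative queries is usually what hides the post-commit latency the original question describes.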
Multiple DisMax Queries spanning across multiple fields
For a particular requirement we have, we need to do a query that is a combination of multiple dismax queries behind the scenes (using a Solr 1.4 nightly). The DisMaxQParser, org.apache.solr.search.DisMaxQParser (details at http://wiki.apache.org/solr/DisMaxRequestHandler ), takes in the /qf/ parameters, applies the parser to /q/ and computes relevance based on the same. We need a case where the final query is a combination of { (q = keywords, qf = map of field weights), (q1, qf1), (q2, qf2), etc. }, with the individual queries combined by a boolean AND. Creating a custom QParser works right away, as below. public class MultiTermDisMaxQParser extends DisMaxQParser { .. .. .. @Override public Query parse() throws ParseException { BooleanQuery finalQuery = new BooleanQuery(true); Query superQuery = super.parse(); // Handles the { (q, qf) } combination. ... ... // finalQuery adds superQuery with a weight. return finalQuery; } } Curious to see if there is an alternate way to implement the same, or any other suggestions for the problem itself.
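For reference, a slightly fuller sketch of the approach above; the extra parameter names (q1/qf1, q2/qf2), the qf-overriding helper and the use of the QParser base-class fields are assumptions for illustration, not the actual implementation:

    import org.apache.lucene.queryParser.ParseException;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.Query;
    import org.apache.solr.common.params.ModifiableSolrParams;
    import org.apache.solr.common.params.SolrParams;
    import org.apache.solr.request.SolrQueryRequest;
    import org.apache.solr.search.DisMaxQParser;

    public class MultiTermDisMaxQParser extends DisMaxQParser {

        public MultiTermDisMaxQParser(String qstr, SolrParams localParams,
                                      SolrParams params, SolrQueryRequest req) {
            super(qstr, localParams, params, req);
        }

        @Override
        public Query parse() throws ParseException {
            BooleanQuery finalQuery = new BooleanQuery(true);

            // (q, qf) is handled by the stock DisMax parser.
            finalQuery.add(super.parse(), BooleanClause.Occur.MUST);

            // (q1, qf1), (q2, qf2), ...: each extra pair becomes another DisMax
            // sub-query, AND-ed into the final query.
            for (int i = 1; params.get("q" + i) != null; i++) {
                DisMaxQParser sub = new DisMaxQParser(params.get("q" + i), null,
                        withQf(params, params.get("qf" + i)), req);
                finalQuery.add(sub.parse(), BooleanClause.Occur.MUST);
            }
            return finalQuery;
        }

        // Hypothetical helper: same request params, but qf replaced for the sub-query.
        private SolrParams withQf(SolrParams base, String qf) {
            ModifiableSolrParams p = new ModifiableSolrParams(base);
            p.set("qf", qf);
            return p;
        }
    }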
Re: Flipping data dirs for an (/multiple) SolrCore without affecting search / IndexReaders
Chris Hostetter wrote: : We have an architecture where we want to flip the solr data.dir (massive : dataset) while running and serving search requests with minimal downtime. ... : 1) What is the fastest / best possible way to get step 1 done ,through a : pluggable architecture. : : Currently - I have written a request handler as follows, that takes care of : creating the core. What is the best way to change dataDir (got as input from : SolrQueryRequest) before creating SolrCore-s. you shouldn't need any custom plugin code to achieve this ... what you describe sounds like exactly what the SWAP command on the CoreAdmin handler was designed for -- CREATE your new core (using the new data dir), warm it however you want (either via the solrconfig.xml or by explicitly hitting it with queries), and once it's warmed up send the SWAP command to replace the existing core with the name you want to use. : 2) When a close() happens on an existing SolrCore - what happens when there is : a long running IndexReader query on that SolrCore . Is that terminated : abruptly / would the close wait until the IndexReaders completes the Query. any existing uses of the Core will continue to finish (i'm not sure of the timeline of your question, but i'm guessing this was before the recent jira issue about the close() method and ref counts where this was better explained, correct?) -Hoss Thanks Hoss for the explanation regarding changing the data directory. Yes - this was before the jira issue discussions for the close() method.
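For reference, a sketch of that sequence against the CoreAdmin HTTP interface; the host, core names and paths are placeholders, and the dataDir parameter on CREATE assumes a version that supports it (otherwise point instanceDir at a config that names the new data dir):

    # 1. create a new core pointing at the new data directory
    curl 'http://localhost:8983/solr/admin/cores?action=CREATE&name=core-new&instanceDir=/opt/solr/myapp&dataDir=/data/index-20081218'

    # 2. warm it up (solrconfig warming queries, or explicit queries against /solr/core-new)

    # 3. swap it with the live core; clients keep using the old core name
    curl 'http://localhost:8983/solr/admin/cores?action=SWAP&core=core-live&other=core-new'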
Approximate release date for 1.4
Just curious - do we have an approximate target release date for 1.4, or a list of milestones / feature sets for it?
Re: Nightly build - 2008-12-17.tgz - build error - java.lang.NoClassDefFoundError: org/mozilla/javascript/tools/shell/Main
Thanks Toby. Alternatively: under contrib/javascript/build.xml, in the dist target, I removed the dependency on 'docs' to circumvent the problem. But maybe it would be great to have js.jar from the Rhino library distributed (if there are no license contradictions) to avoid this. Toby Cole wrote: I came across this too earlier, I just deleted the contrib/javascript directory. Of course, if you need the javascript library then you'll have to get it building. Sorry, probably not that helpful. :) Toby. On 17 Dec 2008, at 17:03, Kay Kay wrote: I downloaded the latest .tgz and ran $ ant dist docs: [mkdir] Created dir: /opt/src/apache-solr-nightly/contrib/javascript/dist/doc [java] Exception in thread "main" java.lang.NoClassDefFoundError: org/mozilla/javascript/tools/shell/Main [java] at JsRun.main(Unknown Source) [java] Caused by: java.lang.ClassNotFoundException: org.mozilla.javascript.tools.shell.Main [java] at java.net.URLClassLoader$1.run(URLClassLoader.java:200) [java] at java.security.AccessController.doPrivileged(Native Method) [java] at java.net.URLClassLoader.findClass(URLClassLoader.java:188) [java] at java.lang.ClassLoader.loadClass(ClassLoader.java:307) [java] at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301) [java] at java.lang.ClassLoader.loadClass(ClassLoader.java:252) [java] at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:320) [java] ... 1 more BUILD FAILED /opt/src/apache-solr-nightly/common-build.xml:335: The following error occurred while executing this line: /opt/src/apache-solr-nightly/common-build.xml:212: The following error occurred while executing this line: /opt/src/apache-solr-nightly/contrib/javascript/build.xml:74: Java returned: 1 and came across the above mentioned error. The class seems to be from the Rhino (Mozilla JS) library. Is it supposed to be packaged by default, or is there a license restriction that prevents it from being so? Toby Cole Software Engineer Semantico Lees House, Floor 1, 21-23 Dyke Road, Brighton BN1 3FE T: +44 (0)1273 358 238 F: +44 (0)1273 723 232 E: toby.c...@semantico.com W: www.semantico.com
Nightly build - 2008-12-17.tgz - build error - java.lang.NoClassDefFoundError: org/mozilla/javascript/tools/shell/Main
I downloaded the latest .tgz and ran $ ant dist docs: [mkdir] Created dir: /opt/src/apache-solr-nightly/contrib/javascript/dist/doc [java] Exception in thread "main" java.lang.NoClassDefFoundError: org/mozilla/javascript/tools/shell/Main [java] at JsRun.main(Unknown Source) [java] Caused by: java.lang.ClassNotFoundException: org.mozilla.javascript.tools.shell.Main [java] at java.net.URLClassLoader$1.run(URLClassLoader.java:200) [java] at java.security.AccessController.doPrivileged(Native Method) [java] at java.net.URLClassLoader.findClass(URLClassLoader.java:188) [java] at java.lang.ClassLoader.loadClass(ClassLoader.java:307) [java] at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301) [java] at java.lang.ClassLoader.loadClass(ClassLoader.java:252) [java] at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:320) [java] ... 1 more BUILD FAILED /opt/src/apache-solr-nightly/common-build.xml:335: The following error occurred while executing this line: /opt/src/apache-solr-nightly/common-build.xml:212: The following error occurred while executing this line: /opt/src/apache-solr-nightly/contrib/javascript/build.xml:74: Java returned: 1 and came across the above mentioned error. The class seems to be from the Rhino (Mozilla JS) library. Is it supposed to be packaged by default, or is there a license restriction that prevents it from being so?
Flipping data dirs for an (/multiple) SolrCore without affecting search / IndexReaders
We have an architecture where we want to flip the solr data.dir (massive dataset) while running and serving search requests with minimal downtime. Some additional requirements. * While ideally we want the Solr search clients to continue to serve from the indices as soon as possible, the overriding requirement is that the downtime for the search Solr instances should be as low as possible. So when a new set of (Lucene) indices comes in, the algorithm we are experimenting with is: - create a new SolrCore instance with the revised data directory. - warm up the SolrCore instance with some test queries. - register the new SolrCore instance with the same name as the old one, so that all new queries from the clients go to the new SolrCore instance. - As part of register(String, SolrCore, boolean), the third parameter, when set to false, closes the old core. I am trying to understand more about the first and the fourth (last) steps. 1) What is the fastest / best possible way to get step 1 done, through a pluggable architecture? Currently I have written a request handler as follows, that takes care of creating the core. What is the best way to change dataDir (got as input from SolrQueryRequest) before creating SolrCore-s? public class CustomRequestHandler extends RequestHandlerBase implements SolrCoreAware { private CoreDescriptor coreDescriptor; private String coreName; @Override public void handleRequestBody(SolrQueryRequest req, SolrQueryResponse rsp) throws Exception { CoreContainer container = this.coreDescriptor.getCoreContainer(); // TODO: Parse XML to extract data // container.reload(this.coreName); // or // 2. // TODO: Set the new configuration for the data directory / before creating the new core. SolrCore newCore = container.create(this.coreDescriptor); container.register(this.coreName, newCore, false); } @Override public void inform(SolrCore core) { coreDescriptor = core.getCoreDescriptor(); coreName = core.getName(); } } 2) When a close() happens on an existing SolrCore - what happens when there is a long-running IndexReader query on that SolrCore? Is it terminated abruptly, or would the close wait until the IndexReaders complete the query? * The same process is repeated potentially for multiple SolrCores as well, with additional closeHooks that might do some heavy I/O tasks - talking over the network etc. Right now these long-running processes are done in an independent thread so that they do not block SolrCore.close(), with the current nightly builds.
Solrj client - CommonsHttpSolrServer - getting solr.solr.home
I am reading the wiki here at http://wiki.apache.org/solr/Solrj . Is there a requestHandler (maybe some admin handler) already present that can retrieve solr.solr.home for a given CommonsHttpSolrServer instance (for a given Solr endpoint), through an API?
Solrj - SolrQuery - specifying SolrCore - when the Solr Server has multiple cores
Hi - I am looking at the article here with a brief introduction to SolrJ . http://www.ibm.com/developerworks/library/j-solr-update/index.html?ca=dgr-jw17SolrS_Tact=105AGX59S_CMP=GRsitejw17#solrj . In case we have multiple SolrCores in the server application - (since 1.3) - how do I specify as part of SolrQuery as to which core needs to be used for the given query. I am trying to dig out the information from the code. Meanwhile, if someone is aware of the same - please suggest some pointers.
Re: Solrj - SolrQuery - specifying SolrCore - when the Solr Server has multiple cores
Thanks Yonik for the clarification. Yonik Seeley wrote: A solr core is like a separate solr server... so create a new CommonsHttpSolrServer that points at the core. You probably want to create and reuse a single HttpClient instance for the best efficiency. -Yonik On Mon, Dec 15, 2008 at 11:06 AM, Kay Kay kaykay.uni...@gmail.com wrote: Hi - I am looking at the article here with a brief introduction to SolrJ . http://www.ibm.com/developerworks/library/j-solr-update/index.html?ca=dgr-jw17SolrS_Tact=105AGX59S_CMP=GRsitejw17#solrj . In case we have multiple SolrCores in the server application - (since 1.3) - how do I specify as part of SolrQuery as to which core needs to be used for the given query. I am trying to dig out the information from the code. Meanwhile, if someone is aware of the same - please suggest some pointers.
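A minimal sketch of this suggestion; the base URL and core names are placeholders:

    import org.apache.commons.httpclient.HttpClient;
    import org.apache.commons.httpclient.MultiThreadedHttpConnectionManager;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

    public class MultiCoreQuery {
        public static void main(String[] args) throws Exception {
            // One HttpClient shared by all per-core servers, as Yonik suggests.
            HttpClient httpClient = new HttpClient(new MultiThreadedHttpConnectionManager());

            // One CommonsHttpSolrServer per core; the core name is part of the URL.
            SolrServer core0 = new CommonsHttpSolrServer("http://localhost:8983/solr/core0", httpClient);
            SolrServer core1 = new CommonsHttpSolrServer("http://localhost:8983/solr/core1", httpClient);

            SolrQuery q = new SolrQuery("ipod");
            System.out.println("core0 hits: " + core0.query(q).getResults().getNumFound());
            System.out.println("core1 hits: " + core1.query(q).getResults().getNumFound());
        }
    }

So the core is chosen by which SolrServer instance you send the SolrQuery to, not by a parameter on the query itself.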
Re: Stopping / Starting IndexReaders in Solr 1.3+
Erik Hatcher wrote: Maybe the PingRequestHandler can help? It can check for the existence of a file (see solrconfig.xml for healthcheck) and return an error if it is not there. This wouldn't prevent Solr from responding to requests, but if a client used that information to determine whether to make requests or not it'd do the trick. Thanks Erik. The checking of a configuration file is helpful to see if there are any active clients. But my intention is to stop the client servicing altogether and then restart / warm up the IndexReaders from scratch. Erik On Dec 13, 2008, at 12:54 AM, Kay Kay wrote: For a particular application of ours - we need to suspend the Solr server from doing any query operation ( IndexReader-s) for sometime, and then after sometime in the near future ( in minutes ) - reinitialize / warm IndexReaders once again and get moving. It is a little bit different from optimize since this server is only supposed to read the data and not add create segments . But we want to suspend it as an initial test case for one of our load balancers. (Restarting Solr is an option though we want to get to that as a last resort ).
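For reference, a sketch of the healthcheck hook Erik mentions, as it appears in the 1.3-era example solrconfig.xml (the file name is illustrative):

    <admin>
      <defaultQuery>solr</defaultQuery>
      <!-- /admin/ping returns an error unless this file exists, so a load balancer
           can be told to stop sending traffic to this node by removing the file -->
      <healthcheck type="file">server-enabled</healthcheck>
    </admin>

Removing (and later recreating) the server-enabled file lets a load balancer take the node out of rotation without restarting Solr, which is close to the suspend/resume test case described in the original question, although Solr itself keeps answering direct requests.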
Re: How can i indexing MS-Outlook files?
You can check out the format of the MS-Outlook files. If they happen to be plain text, maybe a little bit of parsing to remove the protocol headers would be needed. Otherwise, you can check with the Thunderbird / OpenOffice teams to see how they parse the data when they import from MS-Outlook (if they support that, that is). RaghavPrabhu wrote: Hi Folks, I want to index MS-Outlook mails in my data directory. How can I do this? Please help me and give the solution as soon as possible. Thanks in advance Prabhu.K
Solr - DataImportHandler - Large Dataset results ?
As per the example in the wiki - http://wiki.apache.org/solr/DataImportHandler - I am seeing the following fragment. <dataConfig> <dataSource driver="org.hsqldb.jdbcDriver" url="jdbc:hsqldb:/temp/example/ex" user="sa" /> <document name="products"> <entity name="item" query="select * from item"> <field column="ID" name="id" /> <field column="NAME" name="name" /> .. </entity> </document> </dataConfig> My scaled-down application looks very similar along these lines, but my resultset is so big that it cannot fit within main memory by any chance. So I was planning to split this single query into multiple subqueries - with another conditional based on the id ( id > 0 and id < 100 , say ). I am curious if there is any way to specify another conditional clause, (<splitData column="id" batch="1" />, where the column is supposed to be an integer value) - and internally, the implementation could actually generate the subqueries - i) get the min, max of the numeric column, and send queries to the database based on the batch size ii) Add Documents for each batch and close the resultset. This might end up putting more load on the database (but at least the dataset would fit in the main memory). Let me know if anyone else had run into similar issues and how this was encountered.
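A rough sketch of the batching idea described above, done outside DataImportHandler with plain JDBC and SolrJ; the table/column names, batch size, connection URL and Solr URL are all placeholders:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class BatchedImport {
        public static void main(String[] args) throws Exception {
            SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");
            Connection conn = DriverManager.getConnection("jdbc:mysql://dbhost/example", "sa", "");
            Statement st = conn.createStatement();

            // i) find the min/max of the numeric id column
            ResultSet mm = st.executeQuery("select min(id), max(id) from item");
            mm.next();
            long min = mm.getLong(1), max = mm.getLong(2);
            mm.close();

            // ii) pull one id-range batch at a time, so only one batch is in memory
            long batch = 10000;
            for (long lo = min; lo <= max; lo += batch) {
                ResultSet rs = st.executeQuery(
                    "select id, name from item where id >= " + lo + " and id < " + (lo + batch));
                while (rs.next()) {
                    SolrInputDocument doc = new SolrInputDocument();
                    doc.addField("id", rs.getLong("id"));
                    doc.addField("name", rs.getString("name"));
                    solr.add(doc);
                }
                rs.close();
            }
            solr.commit();
            conn.close();
        }
    }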
Re: Solr - DataImportHandler - Large Dataset results ?
I am using MySQL. I believe MySQL (since version 5) supports streaming. More about streaming: can we assume that when the database driver supports streaming, the resultset iterator is a forward-only iterator? If, say, the streaming size is 10K records and we are trying to retrieve a total of 100K records, what exactly happens when the threshold is reached (say, the first 10K records were retrieved)? Is the previous set of records thrown away and replaced in memory by the new batch of records? --- On Fri, 12/12/08, Shalin Shekhar Mangar shalinman...@gmail.com wrote: From: Shalin Shekhar Mangar shalinman...@gmail.com Subject: Re: Solr - DataImportHandler - Large Dataset results ? To: solr-user@lucene.apache.org Date: Friday, December 12, 2008, 9:41 PM DataImportHandler is designed to stream rows one by one to create Solr documents. As long as your database driver supports streaming, you should be fine. Which database are you using? On Sat, Dec 13, 2008 at 2:20 AM, Kay Kay kaykay.uni...@yahoo.com wrote: As per the example in the wiki - http://wiki.apache.org/solr/DataImportHandler - I am seeing the following fragment. <dataConfig> <dataSource driver="org.hsqldb.jdbcDriver" url="jdbc:hsqldb:/temp/example/ex" user="sa" /> <document name="products"> <entity name="item" query="select * from item"> <field column="ID" name="id" /> <field column="NAME" name="name" /> .. </entity> </document> </dataConfig> My scaled-down application looks very similar along these lines, but my resultset is so big that it cannot fit within main memory by any chance. So I was planning to split this single query into multiple subqueries - with another conditional based on the id ( id > 0 and id < 100 , say ). I am curious if there is any way to specify another conditional clause, (<splitData column="id" batch="1" />, where the column is supposed to be an integer value) - and internally, the implementation could actually generate the subqueries - i) get the min, max of the numeric column, and send queries to the database based on the batch size ii) Add Documents for each batch and close the resultset. This might end up putting more load on the database (but at least the dataset would fit in the main memory). Let me know if anyone else had run into similar issues and how this was encountered. -- Regards, Shalin Shekhar Mangar.
Re: Solr - DataImportHandler - Large Dataset results ?
Thanks Bryan. That clarifies a lot. But even with streaming, retrieving one document at a time and adding it to the IndexWriter seems to make the process serialized. So maybe the DataImportHandler could be optimized to retrieve a bunch of results from the query and add the Documents in a separate thread, from an Executor pool (and make this number configurable, or maybe retrieved from the system as the number of physical cores, to exploit maximum parallelism), since that seems like the bottleneck. Any comments on the same. --- On Fri, 12/12/08, Bryan Talbot btal...@aeriagames.com wrote: From: Bryan Talbot btal...@aeriagames.com Subject: Re: Solr - DataImportHandler - Large Dataset results ? To: solr-user@lucene.apache.org Date: Friday, December 12, 2008, 5:26 PM It only supports streaming if properly enabled which is completely lame: http://dev.mysql.com/doc/refman/5.0/en/connector-j-reference-implementation-notes.html By default, ResultSets are completely retrieved and stored in memory. In most cases this is the most efficient way to operate, and due to the design of the MySQL network protocol is easier to implement. If you are working with ResultSets that have a large number of rows or large values, and can not allocate heap space in your JVM for the memory required, you can tell the driver to stream the results back one row at a time. To enable this functionality, you need to create a Statement instance in the following manner: stmt = conn.createStatement(java.sql.ResultSet.TYPE_FORWARD_ONLY, java.sql.ResultSet.CONCUR_READ_ONLY); stmt.setFetchSize(Integer.MIN_VALUE); The combination of a forward-only, read-only result set, with a fetch size of Integer.MIN_VALUE serves as a signal to the driver to stream result sets row-by-row. After this any result sets created with the statement will be retrieved row-by-row. -Bryan On Dec 12, 2008, at Dec 12, 2:15 PM, Kay Kay wrote: I am using MySQL. I believe MySQL (since version 5) supports streaming. More about streaming: can we assume that when the database driver supports streaming, the resultset iterator is a forward-only iterator? If, say, the streaming size is 10K records and we are trying to retrieve a total of 100K records, what exactly happens when the threshold is reached (say, the first 10K records were retrieved)? Is the previous set of records thrown away and replaced in memory by the new batch of records? --- On Fri, 12/12/08, Shalin Shekhar Mangar shalinman...@gmail.com wrote: From: Shalin Shekhar Mangar shalinman...@gmail.com Subject: Re: Solr - DataImportHandler - Large Dataset results ? To: solr-user@lucene.apache.org Date: Friday, December 12, 2008, 9:41 PM DataImportHandler is designed to stream rows one by one to create Solr documents. As long as your database driver supports streaming, you should be fine. Which database are you using? On Sat, Dec 13, 2008 at 2:20 AM, Kay Kay kaykay.uni...@yahoo.com wrote: As per the example in the wiki - http://wiki.apache.org/solr/DataImportHandler - I am seeing the following fragment. <dataConfig> <dataSource driver="org.hsqldb.jdbcDriver" url="jdbc:hsqldb:/temp/example/ex" user="sa" /> <document name="products"> <entity name="item" query="select * from item"> <field column="ID" name="id" /> <field column="NAME" name="name" /> .. </entity> </document> </dataConfig> My scaled-down application looks very similar along these lines, but my resultset is so big that it cannot fit within main memory by any chance. So I was planning to split this single query into multiple subqueries - with another conditional based on the id
( id > 0 and id < 100 , say ). I am curious if there is any way to specify another conditional clause, (<splitData column="id" batch="1" />, where the column is supposed to be an integer value) - and internally, the implementation could actually generate the subqueries - i) get the min, max of the numeric column, and send queries to the database based on the batch size ii) Add Documents for each batch and close the resultset. This might end up putting more load on the database (but at least the dataset would fit in the main memory). Let me know if anyone else had run into similar issues and how this was encountered. -- Regards, Shalin Shekhar Mangar.
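For reference, Bryan's streaming setup as a small self-contained sketch; the connection URL, table and the process() helper are placeholders. With these statement options the MySQL Connector/J driver hands rows to the client one at a time instead of buffering the whole result set:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class StreamingRead {
        public static void main(String[] args) throws Exception {
            Connection conn = DriverManager.getConnection("jdbc:mysql://dbhost/example", "sa", "");

            // Forward-only, read-only, fetchSize = Integer.MIN_VALUE is the signal
            // the MySQL driver needs to stream rows instead of loading them all into memory.
            Statement stmt = conn.createStatement(ResultSet.TYPE_FORWARD_ONLY,
                                                  ResultSet.CONCUR_READ_ONLY);
            stmt.setFetchSize(Integer.MIN_VALUE);

            ResultSet rs = stmt.executeQuery("select id, name from item");
            while (rs.next()) {
                // each row is discarded once read; there is no scrolling back
                process(rs.getLong("id"), rs.getString("name"));
            }
            rs.close();
            stmt.close();
            conn.close();
        }

        private static void process(long id, String name) {
            System.out.println(id + " " + name);
        }
    }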
Re: Solr - DataImportHandler - Large Dataset results ?
Thanks Shalin for the clarification. The case about Lucene taking more time to index the Document when compared to DataImportHandler creating the input is definitely intuitive. But I am just curious about the underlying architecture on which the test was being run. Was this performed on a multi-core machine? If so, how many cores were there? What architecture would they be? It might be useful to know more about them to understand the results better and see where they could be improved. As for the query - select * from table LIMIT 0, 5000 - how database / vendor / driver neutral is this statement? I believe MySQL supports this, but I am just curious how generic this statement is going to be. Shalin Shekhar Mangar wrote: On Sat, Dec 13, 2008 at 4:51 AM, Kay Kay kaykay.uni...@yahoo.com wrote: Thanks Bryan. That clarifies a lot. But even with streaming, retrieving one document at a time and adding it to the IndexWriter seems to make the process serialized. We have experimented with making DataImportHandler multi-threaded in the past. We found that the improvement was very small (5-10%) because, with databases on the local network, the bottleneck is Lucene's ability to index documents rather than DIH's ability to create documents. Since that made the implementation much more complex, we did not go with it. So maybe the DataImportHandler could be optimized to retrieve a bunch of results from the query and add the Documents in a separate thread, from an Executor pool (and make this number configurable, or maybe retrieved from the system as the number of physical cores, to exploit maximum parallelism), since that seems like the bottleneck. For now, you can try creating multiple root entities with a LIMIT clause to fetch rows in batches. For example: <entity name="first" query="select * from table LIMIT 0, 5000"> </entity> <entity name="second" query="select * from table LIMIT 5000, 1" ...> </entity> and so on. An alternate solution would be to use request parameters as variables in the LIMIT clause and call DIH full import with different start and offset. For example: <entity name="x" query="select * from x LIMIT ${dataimporter.request.startAt}, ${dataimporter.request.count}" ...> </entity> Then call: http://host:port/solr/dataimport?command=full-import&startAt=0&count=5000 Wait for it to complete the import (you'll have to monitor the output to figure out when the import ends), and then call: http://host:port/solr/dataimport?command=full-import&startAt=5000&count=1 and so on. Note, start and rows are parameters used by DIH, so don't use these parameter names. I guess this will be more complex than using multiple root entities. Any comments on the same. A workaround for the streaming bug with the MySql JDBC driver is detailed here: http://wiki.apache.org/solr/DataImportHandlerFaq If you try any of these tricks, do let us know if it improves the performance. If there is something which gives a lot of improvement, we can figure out ways to implement them inside DataImportHandler itself.
Stopping / Starting IndexReaders in Solr 1.3+
For a particular application of ours we need to suspend the Solr server from doing any query operations (IndexReader-s) for some time, and then, sometime in the near future (in minutes), reinitialize / warm the IndexReaders once again and get moving. It is a little bit different from optimize since this server is only supposed to read the data and not add / create segments. But we want to suspend it as an initial test case for one of our load balancers. (Restarting Solr is an option, though we want to keep that as a last resort.)
Re: Solr - DataImportHandler - Large Dataset results ?
Shalin Shekhar Mangar wrote: On Sat, Dec 13, 2008 at 11:03 AM, Kay Kay kaykay.uni...@gmail.com wrote: Thanks Shalin for the clarification. The case about Lucene taking more time to index the Document when compared to DataImportHandler creating the input is definitely intuitive. But I am just curious about the underlying architecture on which the test was being run. Was this performed on a multi-core machine? If so, how many cores were there? What architecture would they be? It might be useful to know more about them to understand the results better and see where they could be improved. This was with 4 CPU 64-bit Xeon dual core boxes with 6GB dedicated to the JVM. IIRC, the dataset was 3 million documents joining 3 tables from MySQL (index size on disk 1.3 gigs). Both the Solr and MySQL boxes were the same configuration and running on a gigabit network. This was done a long time back so these may not be the exact values but should be pretty close. Thanks for the detailed configuration on which the tests were performed. Our current architecture looks more or less the same. As for the query - select * from table LIMIT 0, 5000 - how database / vendor / driver neutral is this statement? I believe MySQL supports this, but I am just curious how generic this statement is going to be. This is for MySQL. I believe we are discussing these workarounds only because the MySQL driver does not support batch streaming. It fetches rows either one-by-one or all-at-once. You probably wouldn't need these tricks for other databases. True - currently I am playing around with MySQL. But I was trying to understand more about how the Statement object is getting created (in the case of a platform / vendor specific query like this). Are we going through JPA internally in Solr to create the Statements for the queries? Where can I look in the Solr source code to understand more about this?