Re: Greater-than and less-than in data import SQL queries
On Mon, Nov 2, 2009 at 11:34 AM, Amit Nithian anith...@gmail.com wrote: A thought I had on this from a DIH design perspective. Would it be better to have the SQL queries stored in an element rather than an attribute, so that you can wrap it in a CDATA block without having to mess up the look of the query with &lt; and &gt;? Makes debugging easier (I know find and replace is trivial, but it can be annoying when debugging SQL issues :-)).

Actually most of the parsers are forgiving in this aspect. I mean '<' and '>' are ok in the xml parser shipped with the jdk.

On Wed, Oct 28, 2009 at 5:15 PM, Lance Norskog goks...@gmail.com wrote: It is easier to put SQL select statements in a view, and just use that view from the DIH configuration file.

On Tue, Oct 27, 2009 at 12:30 PM, Andrew Clegg andrew.cl...@gmail.com wrote: Heh, eventually I decided "where 4 > node_depth" was the most pleasing (if slightly WTF-ish) way of writing it... Cheers, Andrew.

Erik Hatcher-4 wrote: Use &lt; instead of < in that attribute. That should fix the issue. Remember, it's an XML file, so it has to obey XML encoding rules, which makes it ugly, but whatcha gonna do? Erik

On Oct 27, 2009, at 11:50 AM, Andrew Clegg wrote: Hi, If I have a DataImportHandler query with a greater-than sign in it, like this:

<entity name="higher_node" dataSource="database" query="select *, title as keywords from cathnode_text where node_depth > 4">

everything's fine. However, if it contains a less-than sign:

<entity name="higher_node" dataSource="database" query="select *, title as keywords from cathnode_text where node_depth < 4">

I get this exception:

INFO: Processing configuration from solrconfig.xml: {config=dataconfig.xml}
[Fatal Error] :240:129: The value of attribute "query" associated with an element type "null" must not contain the '<' character.
27-Oct-2009 15:30:49 org.apache.solr.handler.dataimport.DataImportHandler inform
SEVERE: Exception while loading DataImporter
org.apache.solr.handler.dataimport.DataImportHandlerException: Exception occurred while initializing context
        at org.apache.solr.handler.dataimport.DataImporter.loadDataConfig(DataImporter.java:184)
        at org.apache.solr.handler.dataimport.DataImporter.init(DataImporter.java:101)
        at org.apache.solr.handler.dataimport.DataImportHandler.inform(DataImportHandler.java:113)
        at org.apache.solr.core.SolrResourceLoader.inform(SolrResourceLoader.java:424)
        at org.apache.solr.core.SolrCore.init(SolrCore.java:588)
        at org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:137)
        at org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:83)
        at org.apache.catalina.core.ApplicationFilterConfig.getFilter(ApplicationFilterConfig.java:275)
        at org.apache.catalina.core.ApplicationFilterConfig.setFilterDef(ApplicationFilterConfig.java:397)
        at org.apache.catalina.core.ApplicationFilterConfig.init(ApplicationFilterConfig.java:108)
        at org.apache.catalina.core.StandardContext.filterStart(StandardContext.java:3709)
        at org.apache.catalina.core.StandardContext.start(StandardContext.java:4356)
        at org.apache.catalina.manager.ManagerServlet.start(ManagerServlet.java:1244)
        at org.apache.catalina.manager.HTMLManagerServlet.start(HTMLManagerServlet.java:604)
        at org.apache.catalina.manager.HTMLManagerServlet.doGet(HTMLManagerServlet.java:129)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:690)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:803)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:290)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
        at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
        at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:175)
        at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:525)
        at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:568)
        at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128)
        at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
        at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
        at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:286)
        at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:844)
        at ...
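For what it's worth, with Erik's suggestion applied, the failing entity would just become something like this (a sketch of the same config, not tested):

<entity name="higher_node" dataSource="database"
        query="select *, title as keywords from cathnode_text where node_depth &lt; 4">
  ...
</entity>

The XML parser decodes &lt; back to < before the query ever reaches the database, so the SQL itself is unchanged.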
Problems downloading lucene 2.9.1
Hi folks, as we are using a snapshot dependency on solr1.4, today we are getting problems when Maven tries to download lucene 2.9.1 (there isn't any 2.9.1 there). Which repository can I use to download it? Thx -- Lici
RE: CPU utilization and query time high on Solr slave when snapshot install
Hi Solr Gurus,

We have Solr in a 1 master, 2 slave configuration. A snapshot is created post commit and post optimization. We have autocommit after 50 documents or 5 minutes. The snapshot puller runs as a cron every 10 minutes. What we have observed is that whenever a snapshot is installed on a slave, the solrj client used to query the slave gets timed out and there is high CPU usage/load avg. on the slave server. If we stop the snapshot puller, then the slaves work with no issues. The system has been running for 2 months and this issue has started to occur only now that load on the website is increasing. Following are some details:

Solr Details:
apache-solr Version: 1.3.0
Lucene - 2.4-dev

Master/Slave configurations:
Master:
- for indexing data, HTTP requests are made on the Solr server
- autocommit feature is enabled for 50 docs and 5 minutes
- caching params are disabled for this server
- mergeFactor of 10 is set
- we were running the optimize script every 2 hours, but have now reduced that to twice a day; the issue still persists
Slave1/Slave2:
- standard requestHandler is being used
- default values of caching are set

Machine Specifications:
Master:
- 4GB RAM
- 1GB JVM heap memory is allocated to Solr
Slave1/Slave2:
- 4GB RAM
- 2GB JVM heap memory is allocated to Solr

Master and Slave1 (solr1) are on a single box and Slave2 (solr2) is on a different box. We use HAProxy to load balance query requests between the 2 slaves. The master is only used for indexing.

Please let us know if somebody has ever faced a similar kind of issue or has some insight into it, as we are literally stuck at the moment with a very unstable production environment. As a workaround, we have started running optimize on the master every 7 minutes. This seems to have reduced the severity of the problem, but the issue still occurs every 2 days now. Please suggest what could be the root cause of this. Thanks, Bipul
Re: Indexing multiple entities
I'm using a code generator for my entities, and I cannot modify the generation. I need to work out another option :( shouldn't code generators help development and not make it more complex and difficult? oO (sry off topic) chantal
Re: StreamingUpdateSolrServer - indexing process stops in a couple of hours
I'm able to reproduce this issue consistently using JDK 1.6.0_16. After an optimize is called, only one thread keeps adding documents and the rest wait on StreamingUpdateSolrServer line 196.

On Sun, Oct 25, 2009 at 8:03 AM, Dadasheva, Olga olga_dadash...@harvard.edu wrote: I am using java 1.6.0_05. To illustrate what is happening I wrote this test program that has 10 threads adding a collection of documents and one thread optimizing the index every 10 sec. I am seeing that after the first optimize there is only one thread that keeps adding documents. The other ones are locked. In the real code I ended up adding synchronized around add and optimize to avoid this.

public static void main(String[] args) {
    final JettySolrRunner jetty = new JettySolrRunner("/solr", 8983);
    try {
        jetty.start();
        // setup the server...
        String url = "http://localhost:8983/solr";
        final StreamingUpdateSolrServer server = new StreamingUpdateSolrServer(url, 2, 5) {
            @Override
            public void handleError(Throwable ex) {
                // do something...
            }
        };
        server.setConnectionTimeout(1000);
        server.setDefaultMaxConnectionsPerHost(100);
        server.setMaxTotalConnections(100);
        int i = 0;
        while (i++ < 10) {
            new Thread("add-thread" + i) {
                public void run() {
                    int j = 0;
                    while (true) {
                        try {
                            List<SolrInputDocument> docs = new ArrayList<SolrInputDocument>();
                            for (int n = 0; n < 50; n++) {
                                SolrInputDocument doc = new SolrInputDocument();
                                String docID = this.getName() + "_doc_" + j++;
                                doc.addField("id", docID);
                                doc.addField("content", "document_" + docID);
                                docs.add(doc);
                            }
                            server.add(docs);
                            System.out.println(this.getName() + " added " + docs.size() + " documents");
                            Thread.sleep(100);
                        } catch (Exception e) {
                            e.printStackTrace();
                            System.err.println(this.getName() + " " + e.getLocalizedMessage());
                            System.exit(0);
                        }
                    }
                }
            }.start();
        }
        new Thread("optimizer-thread") {
            public void run() {
                while (true) {
                    try {
                        Thread.sleep(10000);
                        server.optimize();
                        System.out.println(this.getName() + " optimized");
                    } catch (Exception e) {
                        e.printStackTrace();
                        System.err.println("optimizer " + e.getLocalizedMessage());
                        System.exit(0);
                    }
                }
            }
        }.start();
    } catch (Exception e) {
        e.printStackTrace();
    }
}

-Original Message-
From: Lance Norskog [mailto:goks...@gmail.com]
Sent: Tuesday, October 13, 2009 8:59 PM
To: solr-user@lucene.apache.org
Subject: Re: StreamingUpdateSolrServer - indexing process stops in a couple of hours

Which Java release is this? There are known thread-blocking problems in Java 1.5. Also, what sockets are used during this time? Try 'netstat -s | fgrep 8983' (or your Solr URL port #) and watch the active, TIME_WAIT, CLOSE_WAIT sockets build up. This may give a hint.

On Tue, Oct 13, 2009 at 8:47 AM, Dadasheva, Olga olga_dadash...@harvard.edu wrote: Hi, I am indexing documents using StreamingUpdateSolrServer. My 'setup' code is almost a copy of the junit test of the Solr trunk.

try {
    StreamingUpdateSolrServer streamingServer = new StreamingUpdateSolrServer(url, 2, 5) {
        @Override
        public void handleError(Throwable ex) {
            System.out.println("new StreamingUpdateSolrServer error " + ex);
Lock problems: Lock obtain timed out
Hi, I've got a few machines which post documents concurrently to a solr instance. They do not issue the commit themselves; instead, I've got autocommit set up on the solr server side:

<autoCommit>
  <maxDocs>5</maxDocs> <!-- commit at least every 5 docs -->
  <maxTime>60000</maxTime> <!-- Stays max 60s without commit -->
</autoCommit>

This usually works fine, but sometimes the server goes into a deadlock state. Here are the errors I get from the log (these go on forever until I delete the index and restart all from zero):

02-Nov-2009 10:35:27 org.apache.solr.update.SolrIndexWriter finalize
SEVERE: SolrIndexWriter was not closed prior to finalize(), indicates a bug -- POSSIBLE RESOURCE LEAK!!!
... [ multiple messages like this ] ...
02-Nov-2009 10:35:27 org.apache.solr.common.SolrException log
SEVERE: org.apache.lucene.store.LockObtainFailedException: Lock obtain timed out: NativeFSLock@/home/solrdata/jobs/index/lucene-703db99881e56205cb910a2e5fd816d3-write.lock
        at org.apache.lucene.store.Lock.obtain(Lock.java:85)
        at org.apache.lucene.index.IndexWriter.init(IndexWriter.java:1538)
        at org.apache.lucene.index.IndexWriter.init(IndexWriter.java:1395)
        at org.apache.solr.update.SolrIndexWriter.init(SolrIndexWriter.java:190)
        at org.apache.solr.update.UpdateHandler.createMainIndexWriter(UpdateHandler.java:98)
        at org.apache.solr.update.DirectUpdateHandler2.openWriter(DirectUpdateHandler2.java:173)
        at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:220)
        at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:61)
        at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:139)
        at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:69)
        at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)

I'm wondering what could be the reason for this (if a commit takes more than 60 seconds, for instance?), and if I should use better locking or autocommit options. Here's the locking conf I've got at the moment:

<writeLockTimeout>1000</writeLockTimeout>
<commitLockTimeout>10000</commitLockTimeout>
<lockType>native</lockType>

I'm using solr trunk from 12 oct 2009 within tomcat. Thanks for any help. Jerome.
--
Jerome Eteve. http://www.eteve.net jer...@eteve.net
Re: Spell check suggestion and correct way of implementation and some Questions
On Wed, Oct 28, 2009 at 8:57 PM, darniz rnizamud...@edmunds.com wrote: Question. Should I build the dictionary only once, and after that, as new words are indexed, will the dictionary be updated? Or do I have to do that manually at a certain interval?

No. The dictionary is built only when spellcheck.build=true is specified as a request parameter. You will need to explicitly send spellcheck.build=true again when the document changes, or you can use the buildOnCommit or buildOnOptimize parameters to re-build the index automatically. http://wiki.apache.org/solr/SpellCheckComponent#Building_on_Commits

add the spellcheck component to the handler, in my case as of now the standard request handler. I might also start adding some more dismax handlers depending on my requirements:

<requestHandler name="standard" class="solr.SearchHandler" default="true">
  <!-- default values for query parameters -->
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <!--
    <int name="rows">10</int>
    <str name="fl">*</str>
    <str name="version">2.1</str>
    -->
  </lst>
  <arr name="last-components">
    <str>spellcheck</str>
  </arr>
</requestHandler>

run the query with parameter spell.check=true, and also specify against which dictionary you want to run spell check; in my case my spellcheck.dictionary parameter is mySpellChecker.

The parameter is spellcheck=true, not spell.check=true. If you do not give a name to your dictionary then you do not need to add the spellcheck.dictionary parameter.

--
Regards, Shalin Shekhar Mangar.
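For reference, a minimal spellchecker definition with build-on-commit enabled might look like this; the field name and index dir here are just placeholders, adjust them to your schema:

<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <lst name="spellchecker">
    <str name="name">mySpellChecker</str>
    <str name="field">spell</str>
    <str name="spellcheckIndexDir">./spellchecker</str>
    <str name="buildOnCommit">true</str>
  </lst>
</searchComponent>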
tracking solr response time
Hi, We are using Solr for many of our products and it is doing quite well. But since the number of hits is becoming high, we are experiencing latency in certain requests; about 15% of our requests are suffering from latency. We are trying to identify the problem. It may be a network issue, or the solr server may be taking time to process the request. Other than qtime, which is returned along with the response, is there any other way to track the solr server's performance? How is qtime calculated? Is it the total time from when the solr server got the request till it gave the response? Can we do some extra logging to track the solr server's performance? Ideally I would want to pass some log id along with the request (query) to the solr server, and the solr server must log the response time along with that log id. Thanks in advance.. Bharath
Re: Problems downloading lucene 2.9.1
On Nov 2, 2009, at 12:12 AM, Licinio Fernández Maurelo wrote: Hi folks, as we are using a snapshot dependency on solr1.4, today we are getting problems when maven tries to download lucene 2.9.1 (there isn't any 2.9.1 there). Which repository can i use to download it?

They won't be there until 2.9.1 is officially released. We are trying to speed up the Solr release by piggybacking on the Lucene release, but this little bit is the one downside. -Grant
NullPointerException with TermVectorComponent
Hi, I've recently added the TermVectorComponent as a separate handler, following the example in the supplied config file, i.e.:

<searchComponent name="tvComponent" class="org.apache.solr.handler.component.TermVectorComponent"/>

<requestHandler name="/tvrh" class="org.apache.solr.handler.component.SearchHandler">
  <lst name="defaults">
    <bool name="tv">true</bool>
  </lst>
  <arr name="last-components">
    <str>tvComponent</str>
  </arr>
</requestHandler>

It works, but with one quirk. When you use tv.all=true, you get the tf*idf scores in the output just fine (along with tf and df). But if you use tv.tf_idf=true you get an NPE:

http://server:8080/solr/tvrh/?q=1cuk&version=2.2&indent=on&tv.tf_idf=true

HTTP Status 500 - null java.lang.NullPointerException
        at org.apache.solr.handler.component.TermVectorComponent$TVMapper.getDocFreq(TermVectorComponent.java:253)
        at org.apache.solr.handler.component.TermVectorComponent$TVMapper.map(TermVectorComponent.java:245)
        at org.apache.lucene.index.TermVectorsReader.readTermVector(TermVectorsReader.java:522)
        at org.apache.lucene.index.TermVectorsReader.readTermVectors(TermVectorsReader.java:401)
        at org.apache.lucene.index.TermVectorsReader.get(TermVectorsReader.java:378)
        at org.apache.lucene.index.SegmentReader.getTermFreqVector(SegmentReader.java:1253)
        at org.apache.lucene.index.DirectoryReader.getTermFreqVector(DirectoryReader.java:474)
        at org.apache.solr.search.SolrIndexReader.getTermFreqVector(SolrIndexReader.java:244)
        at org.apache.solr.handler.component.TermVectorComponent.process(TermVectorComponent.java:125)
        at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:195)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
        at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
        at (etc.)

Is this a bug, or am I doing it wrong? Cheers, Andrew.
Re: tracking solr response time
On Mon, Nov 2, 2009 at 8:13 AM, bharath venkatesh bharathv6.proj...@gmail.com wrote: We are using solr for many of ur products it is doing quite well . But since no of hits are becoming high we are experiencing latency in certain requests ,about 15% of our requests are suffering a latency How much of a latency compared to normal, and what version of Solr are you using? . We are trying to identify the problem . It may be due to network issue or solr server is taking time to process the request . other than qtime which is returned along with the response is there any other way to track solr servers performance ? how is qtime calculated , is it the total time from when solr server got the request till it gave the response ? QTime is the time spent in generating the in-memory representation for the response before the response writer starts streaming it back in whatever format was requested. The stored fields of returned documents are also loaded at this point (to enable handling of huge response lists w/o storing all in memory). There are normally servlet container logs that can be configured to spit out the real total request time. can we do some extra logging to track solr servers performance . ideally I would want to pass some log id along with the request (query ) to solr server and solr server must log the response time along with that log id . Yep - Solr isn't bothered by params it doesn't know about, so just put logid=xxx and it should also be logged with the other request params. -Yonik http://www.lucidimagination.com
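For example, a request tagged that way might look like http://localhost:8983/solr/select?q=ipod&logid=req-12345 (the parameter name and value being whatever your client wants to correlate on); Solr ignores the extra param for query purposes, but it shows up in the request log line alongside the other parameters.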
Re: tracking solr response time
On Mon, Nov 2, 2009 at 8:41 AM, Yonik Seeley yo...@lucidimagination.comwrote: On Mon, Nov 2, 2009 at 8:13 AM, bharath venkatesh bharathv6.proj...@gmail.com wrote: We are using solr for many of ur products it is doing quite well . But since no of hits are becoming high we are experiencing latency in certain requests ,about 15% of our requests are suffering a latency How much of a latency compared to normal, and what version of Solr are you using? . We are trying to identify the problem . It may be due to network issue or solr server is taking time to process the request . other than qtime which is returned along with the response is there any other way to track solr servers performance ? how is qtime calculated , is it the total time from when solr server got the request till it gave the response ? QTime is the time spent in generating the in-memory representation for the response before the response writer starts streaming it back in whatever format was requested. The stored fields of returned documents are also loaded at this point (to enable handling of huge response lists w/o storing all in memory). There are normally servlet container logs that can be configured to spit out the real total request time. can we do some extra logging to track solr servers performance . ideally I would want to pass some log id along with the request (query ) to solr server and solr server must log the response time along with that log id . Yep - Solr isn't bothered by params it doesn't know about, so just put logid=xxx and it should also be logged with the other request params. -Yonik http://www.lucidimagination.com If you are not using Java then you may have to track the elapsed time manually. If you are using the SolrJ Java client you may have the following options: There is a method called getElapsedTime() in org.apache.solr.client.solrj.response.SolrResponseBase which is available to all the subclasses I have not used it personally but I think this should return the time spent on the client side for that request. The QTime is not the time on the client side but the time spent internally at the Solr server to process the request. http://lucene.apache.org/solr//api/solrj/org/apache/solr/client/solrj/response/SolrResponseBase.html http://lucene.apache.org/solr//api/solrj/org/apache/solr/client/solrj/response/QueryResponse.html Most likely it could be as a result of an internal network issue between the two servers or the Solr server is competing with other applications for resources. What operating system is the Solr server running on? Is you client application connection to a Solr server on the same network or over the internet? Are there other applications like database servers etc running on the same machine? If so, then the DB server (or any other application) and the Solr server could be competing for resources like CPU, memory etc. If you are using Tomcat, you can take a look in $CATALINA_HOME/logs/catalina.out, there are timestamps there that can also guide you. -- Good Enough is not good enough. To give anything less than your best is to sacrifice the gift. Quality First. Measure Twice. Cut Once.
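If it helps, here is a rough SolrJ sketch for comparing the two numbers, assuming a plain CommonsHttpSolrServer pointed at one of the slaves:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class TimingCheck {
    public static void main(String[] args) throws Exception {
        CommonsHttpSolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
        QueryResponse rsp = server.query(new SolrQuery("ipod"));
        // QTime: time spent inside Solr building the in-memory response
        System.out.println("QTime (server): " + rsp.getQTime() + " ms");
        // elapsed time: wall-clock time seen by the client, including network and serialization
        System.out.println("Elapsed (client): " + rsp.getElapsedTime() + " ms");
    }
}

A large gap between the two numbers on a given request points at the network or at response streaming rather than query execution.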
Re: tracking solr response time
On Nov 2, 2009, at 5:41 AM, Yonik Seeley wrote: QTime is the time spent in generating the in-memory representation for the response before the response writer starts streaming it back in whatever format was requested. The stored fields of returned documents are also loaded at this point (to enable handling of huge response lists w/o storing all in memory). There are normally servlet container logs that can be configured to spit out the real total request time.

It might be nice to add a flag to DebugComponent to spit out timings only. Thus, one could skip the explains, etc. and just see the timings. Seems like that would have pretty low overhead and still see the timings.
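Until then, debugQuery=true already returns a per-component timing section alongside the explain data, e.g. http://localhost:8983/solr/select?q=ipod&debugQuery=true (host and query just for illustration); the proposed flag would essentially return that timing block without the rest of the debug output.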
Re: NullPointerException with TermVectorComponent
I think it might be to do with the library itself. I downloaded semanticvectors-1.22 and compiled from source. Then I created a demo corpus using java org.apache.lucene.demo.IndexFiles against the lucene src directory. I then ran java pitt.search.semanticvectors.BuildIndex against the index and got the following:

Seedlength = 10
Dimension = 200
Minimum frequency = 0
Number non-alphabet characters = 0
Contents fields are: [contents]
Creating semantic term vectors ...
Populating basic sparse doc vector store, number of vectors: 774
Creating store of sparse vectors ...
Created 774 sparse random vectors.
Creating term vectors ...
There are 36881 terms (and 774 docs)
0 ... 1000 ... 2000 ... 3000 ... 4000 ...
Exception in thread "main" java.lang.NullPointerException
        at org.apache.lucene.index.DirectoryReader$MultiTermDocs.freq(DirectoryReader.java:1068)
        at pitt.search.semanticvectors.LuceneUtils.getGlobalTermFreq(LuceneUtils.java:70)
        at pitt.search.semanticvectors.LuceneUtils.termFilter(LuceneUtils.java:187)
        at pitt.search.semanticvectors.TermVectorsFromLucene.init(TermVectorsFromLucene.java:163)
        at pitt.search.semanticvectors.BuildIndex.main(BuildIndex.java:138)

I am still digging, but when you look at the source code it references lucene calls dating back to lucene 2.4, a lot of which are deprecated; it might need some refreshing.

Cheers, Dave

On 02 November 2009 at 14:40 Andrew Clegg andrew.cl...@gmail.com wrote: Hi, I've recently added the TermVectorComponent as a separate handler, following the example in the supplied config file, i.e.:

<searchComponent name="tvComponent" class="org.apache.solr.handler.component.TermVectorComponent"/>

<requestHandler name="/tvrh" class="org.apache.solr.handler.component.SearchHandler">
  <lst name="defaults">
    <bool name="tv">true</bool>
  </lst>
  <arr name="last-components">
    <str>tvComponent</str>
  </arr>
</requestHandler>

It works, but with one quirk. When you use tv.all=true, you get the tf*idf scores in the output just fine (along with tf and df). But if you use tv.tf_idf=true you get an NPE:

http://server:8080/solr/tvrh/?q=1cuk&version=2.2&indent=on&tv.tf_idf=true

HTTP Status 500 - null java.lang.NullPointerException at org.apache.solr.handler.component.TermVectorComponent$TVMapper.getDocFreq(TermVectorComponent.java:253) at org.apache.solr.handler.component.TermVectorComponent$TVMapper.map(TermVectorComponent.java:245) at org.apache.lucene.index.TermVectorsReader.readTermVector(TermVectorsReader.java:522) at org.apache.lucene.index.TermVectorsReader.readTermVectors(TermVectorsReader.java:401) at org.apache.lucene.index.TermVectorsReader.get(TermVectorsReader.java:378) at org.apache.lucene.index.SegmentReader.getTermFreqVector(SegmentReader.java:1253) at org.apache.lucene.index.DirectoryReader.getTermFreqVector(DirectoryReader.java:474) at org.apache.solr.search.SolrIndexReader.getTermFreqVector(SolrIndexReader.java:244) at org.apache.solr.handler.component.TermVectorComponent.process(TermVectorComponent.java:125) at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:195) at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131) at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316) at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241) at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235) at (etc.)

Is this a bug, or am I doing it wrong? Cheers, Andrew.
Re: Problems downloading lucene 2.9.1
On Nov 2, 2009, at 8:29 AM, Grant Ingersoll wrote: On Nov 2, 2009, at 12:12 AM, Licinio Fernández Maurelo wrote: Hi folks, as we are using an snapshot dependecy to solr1.4, today we are getting problems when maven try to download lucene 2.9.1 (there isn't a any 2.9.1 there). Which repository can i use to download it? They won't be there until 2.9.1 is officially released. We are trying to speed up the Solr release by piggybacking on the Lucene release, but this little bit is the one downside. Until then, you can add a repo to: http://people.apache.org/~mikemccand/staging-area/rc3_lucene2.9.1/maven/
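Until the official artifacts show up, a temporary <repositories> entry along these lines should let Maven resolve them (the id is arbitrary; untested):

<repositories>
  <repository>
    <id>lucene-291-staging</id>
    <url>http://people.apache.org/~mikemccand/staging-area/rc3_lucene2.9.1/maven/</url>
  </repository>
</repositories>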
Re: adding and updating a lot of document to Solr, metadata extraction etc
Hi Eugene,

- ability to iterate over all documents, returned in search, as Lucene does provide within a HitCollector instance. We would need to extract and aggregate various fields, stored in index, to group results and aggregate them in some way. Also I did not find any way in the tutorial to access the search results with all fields to be processed by our application.

http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Faceted-Search-Solr

Check out Faceted Search; probably you can achieve your goal by using the Facet Component. There's also the Field Collapsing patch: http://wiki.apache.org/solr/FieldCollapsing

Alex
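As a quick illustration of the facet route, a request like http://localhost:8983/solr/select?q=*:*&rows=0&facet=true&facet.field=category (field name made up) returns the aggregated counts per value of that field without pulling the individual documents back to the client.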
RE: Solr YUI autocomplete
Hey Amit, My index (i.e. Solr) was on a different domain, so I can't use XHR (XHR does not work with cross-domain proxyless data fetches). I tried using YUI's DS_ScriptNode but it didn't work. I completed my task by using jQuery and it worked well with solr. -Ankit

-Original Message-
From: Amit Nithian [mailto:anith...@gmail.com]
Sent: Monday, November 02, 2009 1:00 AM
To: solr-user@lucene.apache.org
Subject: Re: Solr YUI autocomplete

I've used the YUI auto complete (albeit not with Solr, which shouldn't matter here) and it should work with JSON. I did one that simply made XHR calls over to a method on my server which returned pipe-delimited text, which worked fine. Are you using the XHR data source, and if so, what type are you telling it to expect? One of the examples on the YUI site is text based and I'm sure you can specify TYPE_JSON or JS_ARRAY too. - Amit

On Fri, Oct 30, 2009 at 7:04 AM, Ankit Bhatnagar abhatna...@vantage.com wrote: Does Solr support JSONP (JSON with Padding) in the response? -Ankit

-Original Message-
From: Ankit Bhatnagar [mailto:abhatna...@vantage.com]
Sent: Friday, October 30, 2009 10:27 AM
To: 'solr-user@lucene.apache.org'
Subject: Solr YUI autocomplete

Hi Guys, I have a question regarding - how to specify the I am using the YUI autocomplete widget and it expects a JSONP response.

http://localhost:8983/solr/select/?q=monitor&version=2.2&start=0&rows=10&indent=on&wt=json&json.wrf=

I am not sure how I should specify the json.wrf=function. Thanks Ankit
question about collapse.type = adjacent
Hi, I would like to confirm if 'adjacent' in collapse.type means the documents (with the same collapse field value) are considered adjacent *after* the 'sort' param from the query has been applied, or *before*? I would think it would be *after*, since the collapse feature is primarily meant for presentation use. Thanks, Michael
Re: tracking solr response time
Thanks for the quick response @yonik How much of a latency compared to normal, and what version of Solr are you using? latency is usually around 2-4 secs (some times it goes more than that ) which happens to only 15-20% of the request other 80-85% of request are very fast it is in milli secs ( around 200,000 requests happens every day ) @Israel we are not using java client .. we r using python at the client with response formatted in json @yonikn @Israel does qtime measure the total time taken at the solr server ? I am already measuring the time to get the response at client end . I would want a means to know how much time the solr server is taking to respond (process ) once it gets the request . so that I could identify whether it is a solr server issue or internal network issue @Israel we are using rhel server 5 on both client and server .. we have 6 solr sever . one is acting as master . both client and solr sever are on the same network . those servers are dedicated solr server except 2 severs which have DB and memcahce running .. we have adjusted the load accordingly On 11/2/09, Israel Ekpo israele...@gmail.com wrote: On Mon, Nov 2, 2009 at 8:41 AM, Yonik Seeley yo...@lucidimagination.comwrote: On Mon, Nov 2, 2009 at 8:13 AM, bharath venkatesh bharathv6.proj...@gmail.com wrote: We are using solr for many of ur products it is doing quite well . But since no of hits are becoming high we are experiencing latency in certain requests ,about 15% of our requests are suffering a latency How much of a latency compared to normal, and what version of Solr are you using? . We are trying to identify the problem . It may be due to network issue or solr server is taking time to process the request . other than qtime which is returned along with the response is there any other way to track solr servers performance ? how is qtime calculated , is it the total time from when solr server got the request till it gave the response ? QTime is the time spent in generating the in-memory representation for the response before the response writer starts streaming it back in whatever format was requested. The stored fields of returned documents are also loaded at this point (to enable handling of huge response lists w/o storing all in memory). There are normally servlet container logs that can be configured to spit out the real total request time. can we do some extra logging to track solr servers performance . ideally I would want to pass some log id along with the request (query ) to solr server and solr server must log the response time along with that log id . Yep - Solr isn't bothered by params it doesn't know about, so just put logid=xxx and it should also be logged with the other request params. -Yonik http://www.lucidimagination.com If you are not using Java then you may have to track the elapsed time manually. If you are using the SolrJ Java client you may have the following options: There is a method called getElapsedTime() in org.apache.solr.client.solrj.response.SolrResponseBase which is available to all the subclasses I have not used it personally but I think this should return the time spent on the client side for that request. The QTime is not the time on the client side but the time spent internally at the Solr server to process the request. 
http://lucene.apache.org/solr//api/solrj/org/apache/solr/client/solrj/response/SolrResponseBase.html http://lucene.apache.org/solr//api/solrj/org/apache/solr/client/solrj/response/QueryResponse.html Most likely it could be as a result of an internal network issue between the two servers or the Solr server is competing with other applications for resources. What operating system is the Solr server running on? Is you client application connection to a Solr server on the same network or over the internet? Are there other applications like database servers etc running on the same machine? If so, then the DB server (or any other application) and the Solr server could be competing for resources like CPU, memory etc. If you are using Tomcat, you can take a look in $CATALINA_HOME/logs/catalina.out, there are timestamps there that can also guide you. -- Good Enough is not good enough. To give anything less than your best is to sacrifice the gift. Quality First. Measure Twice. Cut Once.
Re: Solr YUI autocomplete
It does, have you looked at http://wiki.apache.org/solr/SolJSON?highlight=%28json%29#Using_Solr.27s_JSON_output_for_AJAX? Also, in my book on Solr there is an example, but using the jQuery autocomplete, which I think was answered earlier on the thread! Hope that helps.

ANKITBHATNAGAR wrote: Does Solr support JSONP (JSON with Padding) in the response? -Ankit

-Original Message-
From: Ankit Bhatnagar [mailto:abhatna...@vantage.com]
Sent: Friday, October 30, 2009 10:27 AM
To: 'solr-user@lucene.apache.org'
Subject: Solr YUI autocomplete

Hi Guys, I have a question regarding - how to specify the I am using the YUI autocomplete widget and it expects a JSONP response.

http://localhost:8983/solr/select/?q=monitor&version=2.2&start=0&rows=10&indent=on&wt=json&json.wrf=

I am not sure how I should specify the json.wrf=function. Thanks Ankit
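For the JSONP piece specifically, adding a callback name to json.wrf is usually all it takes, e.g. http://localhost:8983/solr/select/?q=monitor&wt=json&json.wrf=handleResponse wraps the JSON body in handleResponse(...); handleResponse here is just a placeholder for whatever callback your widget registers.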
Re: Solr Cell on web-based files?
e.g. (doesn't work):

curl http://localhost:8983/solr/update/extract?extractOnly=true --data-binary @http://myweb.com/mylocalfile.htm -H Content-type:text/html

You might try remote streaming with Solr (see http://wiki.apache.org/solr/SolrConfigXml).

Yes, curl example:

curl 'http://localhost:8080/solr/main_index/extract/?extractOnly=true&indent=on&resource.name=lecture12&stream.url=http%3A//myweb.com/lecture12.ppt'

It works great for me. Alex
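Note that stream.url only works if remote streaming is switched on in solrconfig.xml, with something along these lines (the upload limit value here is arbitrary):

<requestParsers enableRemoteStreaming="true" multipartUploadLimitInKB="2048" />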
RE: Solr YUI autocomplete
Hey Eric, That's correct, however it didn't work with the YUI widget. I changed my approach to use jQuery for now. -Ankit

-Original Message-
From: Eric Pugh [mailto:ep...@opensourceconnections.com]
Sent: Monday, November 02, 2009 10:20 AM
To: solr-user@lucene.apache.org
Subject: Re: Solr YUI autocomplete

It does, have you looked at http://wiki.apache.org/solr/SolJSON?highlight=%28json%29#Using_Solr.27s_JSON_output_for_AJAX? Also, in my book on Solr there is an example, but using the jQuery autocomplete, which I think was answered earlier on the thread! Hope that helps.

ANKITBHATNAGAR wrote: Does Solr support JSONP (JSON with Padding) in the response? -Ankit

-Original Message-
From: Ankit Bhatnagar [mailto:abhatna...@vantage.com]
Sent: Friday, October 30, 2009 10:27 AM
To: 'solr-user@lucene.apache.org'
Subject: Solr YUI autocomplete

Hi Guys, I have a question regarding - how to specify the I am using the YUI autocomplete widget and it expects a JSONP response.

http://localhost:8983/solr/select/?q=monitor&version=2.2&start=0&rows=10&indent=on&wt=json&json.wrf=

I am not sure how I should specify the json.wrf=function. Thanks Ankit
storing other files in index directory
Are there any pitfalls to storing an arbitrary text file in the same directory as the solr index? We're slinging different versions of the index around while we're testing and it's hard to keep them straight. I'd like to put a readme.txt file in the directory that contains some history about how that index came to be. Is that harmless? Will it be ignored by solr, including during optimizations and any other operation, and will solr not delete it?
Re: tracking solr response time
Also, how about a sample of a fast and slow query? And is a slow query only slow the first time it's executed or every time? Best Erick On Mon, Nov 2, 2009 at 9:52 AM, bharath venkatesh bharathv6.proj...@gmail.com wrote: Thanks for the quick response @yonik How much of a latency compared to normal, and what version of Solr are you using? latency is usually around 2-4 secs (some times it goes more than that ) which happens to only 15-20% of the request other 80-85% of request are very fast it is in milli secs ( around 200,000 requests happens every day ) @Israel we are not using java client .. we r using python at the client with response formatted in json @yonikn @Israel does qtime measure the total time taken at the solr server ? I am already measuring the time to get the response at client end . I would want a means to know how much time the solr server is taking to respond (process ) once it gets the request . so that I could identify whether it is a solr server issue or internal network issue @Israel we are using rhel server 5 on both client and server .. we have 6 solr sever . one is acting as master . both client and solr sever are on the same network . those servers are dedicated solr server except 2 severs which have DB and memcahce running .. we have adjusted the load accordingly On 11/2/09, Israel Ekpo israele...@gmail.com wrote: On Mon, Nov 2, 2009 at 8:41 AM, Yonik Seeley yo...@lucidimagination.comwrote: On Mon, Nov 2, 2009 at 8:13 AM, bharath venkatesh bharathv6.proj...@gmail.com wrote: We are using solr for many of ur products it is doing quite well . But since no of hits are becoming high we are experiencing latency in certain requests ,about 15% of our requests are suffering a latency How much of a latency compared to normal, and what version of Solr are you using? . We are trying to identify the problem . It may be due to network issue or solr server is taking time to process the request . other than qtime which is returned along with the response is there any other way to track solr servers performance ? how is qtime calculated , is it the total time from when solr server got the request till it gave the response ? QTime is the time spent in generating the in-memory representation for the response before the response writer starts streaming it back in whatever format was requested. The stored fields of returned documents are also loaded at this point (to enable handling of huge response lists w/o storing all in memory). There are normally servlet container logs that can be configured to spit out the real total request time. can we do some extra logging to track solr servers performance . ideally I would want to pass some log id along with the request (query ) to solr server and solr server must log the response time along with that log id . Yep - Solr isn't bothered by params it doesn't know about, so just put logid=xxx and it should also be logged with the other request params. -Yonik http://www.lucidimagination.com If you are not using Java then you may have to track the elapsed time manually. If you are using the SolrJ Java client you may have the following options: There is a method called getElapsedTime() in org.apache.solr.client.solrj.response.SolrResponseBase which is available to all the subclasses I have not used it personally but I think this should return the time spent on the client side for that request. The QTime is not the time on the client side but the time spent internally at the Solr server to process the request. 
http://lucene.apache.org/solr//api/solrj/org/apache/solr/client/solrj/response/SolrResponseBase.html http://lucene.apache.org/solr//api/solrj/org/apache/solr/client/solrj/response/QueryResponse.html Most likely it could be as a result of an internal network issue between the two servers or the Solr server is competing with other applications for resources. What operating system is the Solr server running on? Is you client application connection to a Solr server on the same network or over the internet? Are there other applications like database servers etc running on the same machine? If so, then the DB server (or any other application) and the Solr server could be competing for resources like CPU, memory etc. If you are using Tomcat, you can take a look in $CATALINA_HOME/logs/catalina.out, there are timestamps there that can also guide you. -- Good Enough is not good enough. To give anything less than your best is to sacrifice the gift. Quality First. Measure Twice. Cut Once.
tokenize after filters
Is it possible to tokenize a field on whitespace after some filters have been applied? Ex: "A + W Root Beer": the field uses a keyword tokenizer to keep the string together, then it gets converted to "aw root beer" by a custom filter I've made. I now want to split that up into 3 tokens (aw, root, beer), but it seems like you can't use a tokenizer after a filter ... so what's the best way of accomplishing this? thx much --joe
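One approach that might work, if WordDelimiterFilterFactory's splitting behaviour suits you, is to let it break the single keyword token apart after your custom filter has run. A rough sketch, where mycompany.MyBrandFilterFactory stands in for your own filter:

<fieldType name="brand_text" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <!-- your filter: "A + W Root Beer" becomes "aw root beer" as a single token -->
    <filter class="mycompany.MyBrandFilterFactory"/>
    <!-- splits that token on the remaining non-alphanumeric characters (the spaces) -->
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" catenateWords="0" splitOnCaseChange="0"/>
  </analyzer>
</fieldType>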
Re: Annotations and reference types
On Thu, Oct 29, 2009 at 7:57 PM, M. Tinnemeyer marc-...@gmx.net wrote: Dear listusers, Is there a way to store an instance of class A (including the fields from myB) via solr using annotations? The index should look like: id; name; b_id; b_name

class A {
  @Field private String id;
  @Field private String name;
  @Field private B myB;
}

class B {
  @Field("b_id") private String id;
  @Field("B_name") private String name;
}

No. I guess you want to represent certain fields in class B and have them as an attribute in Class A (but all fields belong to the same schema), then it can be a worthwhile addition to Solrj. Can you open an issue? A patch would be even better :)

--
Regards, Shalin Shekhar Mangar.
Re: Question about DIH execution order
Hi Noble, I tried to understand your suggestions and played with different variations according to your reply. But none of them work. Can you explain it in more detail? Thanks a lot! BTW, do you mean your solution is as follows?

<document>
  <entity name="Course" transformer="TemplateTransformer" query="select * from Course">
    <field column="TmpCourseId" name="CourseId" template="Course:${Course.CourseId}" name="id"/>
    <entity name="Rating" query="select comment from Rating where Rating.CourseId = ${Course.CourseId}">
      <field column="comment" name="review"/>
    </entity>
  </entity>
</document>

But 1) There is no TmpCourseId field column. 2) Can we put two name attributes, CourseId and id, in the same map? It seems not.

2009/11/1 Noble Paul നോബിള് नोब्ळ् noble.p...@corp.aol.com

On Sun, Nov 1, 2009 at 11:59 PM, Bertie Shen bertie.s...@gmail.com wrote: Hi folks, I have the following data-config.xml. Is there a way to let transformation take place after executing the SQL "select comment from Rating where Rating.CourseId = ${Course.CourseId}"? In the MySQL database, column CourseId in table Course is an integer 1, 2, etc; template transformation will make them like Course:1, Course:2; column CourseId in table Rating is also an integer 1, 2, etc. If transformation happens before executing "select comment from Rating where Rating.CourseId = ${Course.CourseId}", then there will be no match for the SQL statement execution.

<document>
  <entity name="Course" transformer="TemplateTransformer" query="select * from Course">
    <field column="CourseId" template="Course:${Course.CourseId}" name="id"/>
    <entity name="Rating" query="select comment from Rating where Rating.CourseId = ${Course.CourseId}">
      <field column="comment" name="review"/>
    </entity>
  </entity>
</document>

keep the field as follows

<field column="TmpCourseId" name="CourseId" template="Course:${Course.CourseId}" name="id"/>

--
- Noble Paul | Principal Engineer| AOL | http://aol.com
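If I read the suggestion correctly, the idea is to leave CourseId itself untouched (so the inner entity's ${Course.CourseId} still matches the raw integer in the database) and emit the templated value under a separate column that maps to the id field. Roughly, and untested:

<document>
  <entity name="Course" transformer="TemplateTransformer" query="select * from Course">
    <!-- templated copy goes to "id"; the raw CourseId stays available for the join below -->
    <field column="id" template="Course:${Course.CourseId}"/>
    <entity name="Rating" query="select comment from Rating where Rating.CourseId = ${Course.CourseId}">
      <field column="comment" name="review"/>
    </entity>
  </entity>
</document>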
Re: tracking solr response time
On Mon, Nov 2, 2009 at 9:52 AM, bharath venkatesh bharathv6.proj...@gmail.com wrote: Thanks for the quick response @yonik How much of a latency compared to normal, and what version of Solr are you using? latency is usually around 2-4 secs (some times it goes more than that ) which happens to only 15-20% of the request other 80-85% of request are very fast it is in milli secs ( around 200,000 requests happens every day ) @Israel we are not using java client .. we r using python at the client with response formatted in json @yonikn @Israel does qtime measure the total time taken at the solr server ? I am already measuring the time to get the response at client end . I would want a means to know how much time the solr server is taking to respond (process ) once it gets the request . so that I could identify whether it is a solr server issue or internal network issue It is the time spent at the Solr server. I think Yonik already answered this part in his response to your thread : This is what he said : QTime is the time spent in generating the in-memory representation for the response before the response writer starts streaming it back in whatever format was requested. The stored fields of returned documents are also loaded at this point (to enable handling of huge response lists w/o storing all in memory). @Israel we are using rhel server 5 on both client and server .. we have 6 solr sever . one is acting as master . both client and solr sever are on the same network . those servers are dedicated solr server except 2 severs which have DB and memcahce running .. we have adjusted the load accordingly On 11/2/09, Israel Ekpo israele...@gmail.com wrote: On Mon, Nov 2, 2009 at 8:41 AM, Yonik Seeley yo...@lucidimagination.comwrote: On Mon, Nov 2, 2009 at 8:13 AM, bharath venkatesh bharathv6.proj...@gmail.com wrote: We are using solr for many of ur products it is doing quite well . But since no of hits are becoming high we are experiencing latency in certain requests ,about 15% of our requests are suffering a latency How much of a latency compared to normal, and what version of Solr are you using? . We are trying to identify the problem . It may be due to network issue or solr server is taking time to process the request . other than qtime which is returned along with the response is there any other way to track solr servers performance ? how is qtime calculated , is it the total time from when solr server got the request till it gave the response ? QTime is the time spent in generating the in-memory representation for the response before the response writer starts streaming it back in whatever format was requested. The stored fields of returned documents are also loaded at this point (to enable handling of huge response lists w/o storing all in memory). There are normally servlet container logs that can be configured to spit out the real total request time. can we do some extra logging to track solr servers performance . ideally I would want to pass some log id along with the request (query ) to solr server and solr server must log the response time along with that log id . Yep - Solr isn't bothered by params it doesn't know about, so just put logid=xxx and it should also be logged with the other request params. -Yonik http://www.lucidimagination.com If you are not using Java then you may have to track the elapsed time manually. 
If you are using the SolrJ Java client you may have the following options: There is a method called getElapsedTime() in org.apache.solr.client.solrj.response.SolrResponseBase which is available to all the subclasses I have not used it personally but I think this should return the time spent on the client side for that request. The QTime is not the time on the client side but the time spent internally at the Solr server to process the request. http://lucene.apache.org/solr//api/solrj/org/apache/solr/client/solrj/response/SolrResponseBase.html http://lucene.apache.org/solr//api/solrj/org/apache/solr/client/solrj/response/QueryResponse.html Most likely it could be as a result of an internal network issue between the two servers or the Solr server is competing with other applications for resources. What operating system is the Solr server running on? Is you client application connection to a Solr server on the same network or over the internet? Are there other applications like database servers etc running on the same machine? If so, then the DB server (or any other application) and the Solr server could be competing for resources like CPU, memory etc. If you are using Tomcat, you can take a look in $CATALINA_HOME/logs/catalina.out, there are timestamps there that can also guide you. -- Good Enough is not good enough. To give
Re: CPU utilization and query time high on Solr slave when snapshot install
If you are going to pull a new index every 10 minutes, try turning off cache autowarming. Your caches are never more than 10 minutes old, so spending a minute warming each new cache is a waste of CPU. Autowarm submits queries to the new Searcher before putting it in service. This will create a burst of query load on the new Searcher, often keeping one CPU pretty busy for several seconds. In solrconfig.xml, set autowarmCount to 0. Also, if you want the slaves to always have an optimized index, create the snapshot only in post-optimize. If you create snapshots in both post-commit and post-optimize, you are creating a non-optimized index (post-commit), then replacing it with an optimized one a few minutes later. A slave might get a non-optimized index one time, then an optimized one the next. wunder On Nov 2, 2009, at 1:45 AM, biku...@sapient.com wrote: Hi Solr Gurus, We have solr in 1 master, 2 slave configuration. Snapshot is created post commit, post optimization. We have autocommit after 50 documents or 5 minutes. Snapshot puller runs as a cron every 10 minutes. What we have observed is that whenever snapshot is installed on the slave, we see solrj client used to query slave solr, gets timedout and there is high CPU usage/load avg. on slave server. If we stop snapshot puller, then slaves work with no issues. The system has been running since 2 months and this issue has started to occur only now when load on website is increasing. Following are some details: Solr Details: apache-solr Version: 1.3.0 Lucene - 2.4-dev Master/Slave configurations: Master: - for indexing data HTTPRequests are made on Solr server. - autocommit feature is enabled for 50 docs and 5 minutes - caching params are disable for this server - mergeFactor of 10 is set - we were running optimize script after every 2 hours, but now have reduced the duration to twice a day but issue still persists Slave1/Slave2: - standard requestHandler is being used - default values of caching are set Machine Specifications: Master: - 4GB RAM - 1GB JVM Heap memory is allocated to Solr Slave1/Slave2: - 4GB RAM - 2GB JVM Heap memory is allocated to Solr Master and Slave1 (solr1)are on single box and Slave2(solr2) on different box. We use HAProxy to load balance query requests between 2 slaves. Master is only used for indexing. Please let us know if somebody has ever faced similar kind of issue or has some insight into it as we guys are literally struck at the moment with a very unstable production environment. As a workaround, we have started running optimize on master every 7 minutes. This seems to have reduced the severity of the problem but still issue occurs every 2days now. please suggest what could be the root cause of this. Thanks, Bipul
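In solrconfig.xml terms that is just the autowarmCount attribute on the caches, e.g. (sizes here are only illustrative):

<filterCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>
<queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>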
Re: tracking solr response time
@Israel: yes I got that point which yonik mentioned .. but is qtime the total time taken by solr server for that request or is it part of time taken by the solr for that request ( is there any thing that a solr server does for that particulcar request which is not included in that qtime bracket ) ? I am sorry for dragging in to this qtime. I just want to be sure, as we observed many times there is huge mismatch between qtime and time measured at the client for the response ( does this imply it is due to internal network issue ) @Erick: yes, many times query is slow first time its executed is there any solution to improve upon this factor .. for querying we use DisMaxRequestHandler , queries are quite long with many faceting parameters . On Mon, Nov 2, 2009 at 10:46 PM, Israel Ekpo israele...@gmail.com wrote: On Mon, Nov 2, 2009 at 9:52 AM, bharath venkatesh bharathv6.proj...@gmail.com wrote: Thanks for the quick response @yonik How much of a latency compared to normal, and what version of Solr are you using? latency is usually around 2-4 secs (some times it goes more than that ) which happens to only 15-20% of the request other 80-85% of request are very fast it is in milli secs ( around 200,000 requests happens every day ) @Israel we are not using java client .. we r using python at the client with response formatted in json @yonikn @Israel does qtime measure the total time taken at the solr server ? I am already measuring the time to get the response at client end . I would want a means to know how much time the solr server is taking to respond (process ) once it gets the request . so that I could identify whether it is a solr server issue or internal network issue It is the time spent at the Solr server. I think Yonik already answered this part in his response to your thread : This is what he said : QTime is the time spent in generating the in-memory representation for the response before the response writer starts streaming it back in whatever format was requested. The stored fields of returned documents are also loaded at this point (to enable handling of huge response lists w/o storing all in memory). @Israel we are using rhel server 5 on both client and server .. we have 6 solr sever . one is acting as master . both client and solr sever are on the same network . those servers are dedicated solr server except 2 severs which have DB and memcahce running .. we have adjusted the load accordingly On 11/2/09, Israel Ekpo israele...@gmail.com wrote: On Mon, Nov 2, 2009 at 8:41 AM, Yonik Seeley yo...@lucidimagination.comwrote: On Mon, Nov 2, 2009 at 8:13 AM, bharath venkatesh bharathv6.proj...@gmail.com wrote: We are using solr for many of ur products it is doing quite well . But since no of hits are becoming high we are experiencing latency in certain requests ,about 15% of our requests are suffering a latency How much of a latency compared to normal, and what version of Solr are you using? . We are trying to identify the problem . It may be due to network issue or solr server is taking time to process the request . other than qtime which is returned along with the response is there any other way to track solr servers performance ? how is qtime calculated , is it the total time from when solr server got the request till it gave the response ? QTime is the time spent in generating the in-memory representation for the response before the response writer starts streaming it back in whatever format was requested. 
The stored fields of returned documents are also loaded at this point (to enable handling of huge response lists w/o storing all in memory). There are normally servlet container logs that can be configured to spit out the real total request time. can we do some extra logging to track solr servers performance . ideally I would want to pass some log id along with the request (query ) to solr server and solr server must log the response time along with that log id . Yep - Solr isn't bothered by params it doesn't know about, so just put logid=xxx and it should also be logged with the other request params. -Yonik http://www.lucidimagination.com If you are not using Java then you may have to track the elapsed time manually. If you are using the SolrJ Java client you may have the following options: There is a method called getElapsedTime() in org.apache.solr.client.solrj.response.SolrResponseBase which is available to all the subclasses I have not used it personally but I think this should return the time spent on the client side for that request. The QTime is not the time on the client side but the time spent internally at the Solr server to process the request.
RE: Lucene FieldCache memory requirements
Any thoughts regarding the subject? I hope FieldCache doesn't use more than 6 bytes per document-field instance... I am too lazy to research Lucene source code, I hope someone can provide exact answer... Thanks Subject: Lucene FieldCache memory requirements Hi, Can anyone confirm Lucene FieldCache memory requirements? I have 100 millions docs with non-tokenized field country (10 different countries); I expect it requires array of (int, long), size of array 100,000,000, without any impact of country field length; it requires 600,000,000 bytes: int is pointer to document (Lucene document ID), and long is pointer to String value... Am I right, is it 600Mb just for this country (indexed, non-tokenized, non-boolean) field and 100 millions docs? I need to calculate exact minimum RAM requirements... I believe it shouldn't depend on cardinality (distribution) of field... Thanks, Fuad
Re: Lucene FieldCache memory requirements
Which FieldCache API are you using? getStrings? or getStringIndex (which is used, under the hood, if you sort by this field). Mike On Mon, Nov 2, 2009 at 2:27 PM, Fuad Efendi f...@efendi.ca wrote: Any thoughts regarding the subject? I hope FieldCache doesn't use more than 6 bytes per document-field instance... I am too lazy to research Lucene source code, I hope someone can provide exact answer... Thanks Subject: Lucene FieldCache memory requirements Hi, Can anyone confirm Lucene FieldCache memory requirements? I have 100 millions docs with non-tokenized field country (10 different countries); I expect it requires array of (int, long), size of array 100,000,000, without any impact of country field length; it requires 600,000,000 bytes: int is pointer to document (Lucene document ID), and long is pointer to String value... Am I right, is it 600Mb just for this country (indexed, non-tokenized, non-boolean) field and 100 millions docs? I need to calculate exact minimum RAM requirements... I believe it shouldn't depend on cardinality (distribution) of field... Thanks, Fuad
LocalSolr, Maven, build files and release candidates (Just for info) and spatial radius (A question)
Hallo All. I've been trying to prepare a project using localsolr for the impending (I hope) arrival of solr 1.4 and Lucene 2.9.1.. Here are some notes in case anyone else is suffering similarly. Obviously everything here may change by next week. First problem has been the lack of any stable maven based lucene and solr artifacts to wire into my poms. Because of that, and as an interim only measure, I've built the latest branches of the lucene 2.9.1 and solr 1.4 trees and made them into a *temporary* maven repository at http://developer.k-int.com/m2snapshots/. in there you can find all the jar artifacts tagged as xxx-ki-rc1 (For solr) and xxx-ki-rc3 (For lucene) and finally, a localsolr.localsolr build tagged as 1.5.2-rc1. Sorry for the naming, but I don't want these artifacts to clash with the real ones when they come along. This is really just for my own use, but I've seen messages and spoken to people who are really struggling to get their maven deps right, if this helps anyone, please feel free to use these until the real apache artifacts appear. I can't take any responsibility for their quality. All the poms have been altered to look for the correct dependent artifacts in the same repository, adding the stanza !-- Emergency repository for storing interim builds of lucene and solr whilst they sort their act out -- repositories repository idk-int-m2-snapshots/id nameK-int M2 Snapshots/name urlhttp://developer.k-int.com/m2snapshots/url releases enabledtrue/enabled /releases /repository /repositories to your pom will let you use these deps temporarily until we see an official build. If you're a maven developer and I've gone way around the houses with this, please tell me of an easier solution :) This repo *will* go away when the real builds turn up. The localsolr in this repo also contains the patches I've submitted (A good while ago) to the localsolr project to make it build with the lucene 2.9.1 rc3 as the downloadable dist is currently built against an older 2.9 release that had a different API (IE won't work with the new lucene and solr) All this means that there is a working localsolr build. Second up, I've also seen emails (And seen the exception myself) around asking about the following when trying to get all these revisions working together. java.lang.NumberFormatException: Invalid shift value in prefixCoded string (is encoded value really a LONG?) There are some threads out there telling you that the Lucene indexes are not binary compatible between versions, but if you're using localsolr, what you really need to know is: 1) Make sure that your schema.xml contains at least the following fieldType defs fieldType name=tdouble class=solr.TrieDoubleField precisionStep=8 omitNorms=true positionIncrementGap=0/ 2) Convert your old solr sdouble fields to tdoubles: field name=lat type=tdouble indexed=true stored=true/ field name=lng type=tdouble indexed=true stored=true/ dynamicField name=_local* type=tdouble indexed=true stored=true/ Pretty sure you would need to rebuild your indexes. Ok, with those changes I managed to get a working spatial search. My only problem now is that the radius param on the command line seems to need to be way bigger than it needs to be in order to find anything. Specifically, if I search with a radius of 220 I get a record back which marks it's geo_distance as 83.76888211666025. Shuffling the radius around ends up that a radius of 205 returns that doc, 204 and it's filtered. 
I'm going to dig into this now, but if anyone knows about this I'd really appreciate any help. Cheers all, hope this is of use to someone out there, if anyone has corrections/comments I'd really appreciate any info. Best, Ian.
Re: question about collapse.type = adjacent
Hi Micheal, Field collapsing is basicly done in two steps. The first step is to get the uncollapsed sorted (whether it is score or a field value) documents and the second step is to apply the collapse algorithm on the uncollapsed documents. So yes, when specifying collapse.type=adjacent the documents can get collapsed after the sort has been applied, but this also the case when not specifying collapse.type=adjacent I hope this answers your question. Cheers, Martijn 2009/11/2 michael8 mich...@saracatech.com: Hi, I would like to confirm if 'adjacent' in collapse.type means the documents (with the same collapse field value) are considered adjacent *after* the 'sort' param from the query has been applied, or *before*? I would think it would be *after* since collapse feature primarily is meant for presentation use. Thanks, Michael -- View this message in context: http://old.nabble.com/question-about-collapse.type-%3D-adjacent-tp26157114p26157114.html Sent from the Solr - User mailing list archive at Nabble.com. -- Met vriendelijke groet, Martijn van Groningen
apply a patch on solr
Hi, First I like to pardon my novice question on patching solr (1.4). What I like to know is, given a patch, like the one for collapse field, how would one go about knowing what solr source that patch is meant for since this is a source level patch? Wouldn't the exact versions of a set of java files to be patched critical for the patch to work properly? So far what I have done is to pull the latest collapse field patch down from http://issues.apache.org/jira/browse/SOLR-236 (field-collapse-5.patch), and then svn up the latest trunk from http://svn.apache.org/repos/asf/lucene/solr/trunk/, then patch and build. Intuitively I was thinking I should be doing svn up to a specific revision/tag instead of just latest. So far everything seems fine, but I just want to make sure I'm doing the right thing and not just being lucky. Thanks, Michael -- View this message in context: http://old.nabble.com/apply-a-patch-on-solr-tp26157826p26157826.html Sent from the Solr - User mailing list archive at Nabble.com.
RE: Lucene FieldCache memory requirements
I am not using Lucene API directly; I am using SOLR which uses Lucene FieldCache for faceting on non-tokenized fields... I think this cache will be lazily loaded, until user executes sorted (by this field) SOLR query for all documents *:* - in this case it will be fully populated... Subject: Re: Lucene FieldCache memory requirements Which FieldCache API are you using? getStrings? or getStringIndex (which is used, under the hood, if you sort by this field). Mike On Mon, Nov 2, 2009 at 2:27 PM, Fuad Efendi f...@efendi.ca wrote: Any thoughts regarding the subject? I hope FieldCache doesn't use more than 6 bytes per document-field instance... I am too lazy to research Lucene source code, I hope someone can provide exact answer... Thanks Subject: Lucene FieldCache memory requirements Hi, Can anyone confirm Lucene FieldCache memory requirements? I have 100 millions docs with non-tokenized field country (10 different countries); I expect it requires array of (int, long), size of array 100,000,000, without any impact of country field length; it requires 600,000,000 bytes: int is pointer to document (Lucene document ID), and long is pointer to String value... Am I right, is it 600Mb just for this country (indexed, non-tokenized, non-boolean) field and 100 millions docs? I need to calculate exact minimum RAM requirements... I believe it shouldn't depend on cardinality (distribution) of field... Thanks, Fuad
Dismax and Standard Queries together
Hi, I have three fields, business_name, category_name, and sub_category_name, in my solrconfig file. My query = pet clinic. Example sub_category_names: Veterinarians, Kennels, Veterinary Clinics Hospitals, Pet Grooming, Pet Stores, Clinics. My ideal requirement is dismax searching where a. dismax matches over three or two of the fields, b. failing that, a Boolean match on any one of the fields is acceptable. I played around with the minimum match attribute, but it doesn't seem to help; I guess dismax requires at least two fields to match. Nested queries take only one qf field, so they don't help much either. Any suggestions will be helpful. Thanks Ram -- View this message in context: http://old.nabble.com/Dismax-and-Standard-Queries-together-tp26157830p26157830.html Sent from the Solr - User mailing list archive at Nabble.com.
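One note on the "nested queries take only one qf field" point: the qf local param accepts a whitespace-separated list of fields just like the regular dismax qf, it only has to be quoted. A hedged sketch of combining a multi-field dismax clause with a plain boolean clause in a single standard-parser query follows; the field names are the ones from the message above, while the boost and mm values are only illustrative, and whether mm as a local param captures the "two of three fields" requirement would need testing:

q=_query_:"{!dismax qf='business_name^2 category_name sub_category_name' mm=2}pet clinic" OR sub_category_name:(pet clinic)

The dismax clause handles the multi-field scoring; the trailing OR clause lets a document that matches only one field still be returned.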
RE: tokenize after filters
I think you want Koji Sekiguchi's Char Filters: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters?highlight=char+filters#Char_Filters Steve -Original Message- From: Joe Calderon [mailto:calderon@gmail.com] Sent: Monday, November 02, 2009 11:25 AM To: solr-user@lucene.apache.org Subject: tokenize after filters Is it possible to tokenize a field on whitespace after some filters have been applied? Ex: A + W Root Beer. The field uses a keyword tokenizer to keep the string together, then it gets converted to aw root beer by a custom filter I've made. I now want to split that up into 3 tokens (aw, root, beer), but it seems like you can't use a tokenizer after a filter ... so what's the best way of accomplishing this? thx much --joe
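A sketch of the shape of the char-filter approach Steve points to: the normalisation runs over the raw character stream before the tokenizer, so a plain whitespace tokenizer can then split the cleaned-up text. The fieldType name and mapping file below are placeholders; note that a simple mapping file may not be able to collapse "A + W" into the single token "aw", in which case the existing custom filter logic might have to be ported to a CharFilter rather than expressed as mappings.

<fieldType name="text_norm" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- applied to the character stream before tokenization -->
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-normalize.txt"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>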
field queries seem slow
I took a look through my Solr logs this weekend and noticed that the longest queries were on particular fields, like author:albert einstein. Is this a result consistent with other setups out there? If not, Is there a trick to make these go faster? I've read up on filter queries and use those when applicable, but they don't really solve all my problems. If anybody wants to take a shot at it but needs to see my solrconfig, etc just let me know. Cheers, Mike
manually creating indices to speed up indexing with app-knowledge
This may seem like a strange question, but here it goes anyway. Im considering the possibility of low-level constructing indices for about 20.000 indexed fields (type sInt) if at all possible . (With indices in this context I mean the inverted indices from term to Documentid just to be 100% complete) These indices have to be recreated each night, along with the normal reindex. Globally it should go something like this (each night) : - documents (consisting of about 20 stored fields and about 10 stored indexed fields) are indexed through the normal 'code-path' (solrJ in my case) - After all docs are persisted (max 200.000) I want to extract the mapping from 'lucene docid' -- 'stored/indexed product key' I believe this should work, because after all docs are persisted the internal docids aren't altered, so the relationship between 'lucene docid' -- 'stored/indexed product key' is invariant from that point forward. (please correct if wrong) - construct the 20.000 inverted indices on such a low enough level that I do not have to go through IndexWriter if possible, so I do not need to construct Documents, I only need to construct the native format of the indices themselves. Ideally this should work on multiple servers so that the indices can be created in parallel and the index-files later simply copied to the index-directory of the master. Basically what it boils down to is that indexing time (a reindex should be done each night) is a big show-stopper at the moment, although we've tried and tested all the more standard optimization tricks techniques, as well as having build a home-grown shard-like indexing strategy which uses 20 pretty big servers in parallel. The 20.000 indexed fields are still simply killing. At the same time the app has a lot of knowledge of the 20.000 indices. - All indices consist of prices (ints) between 0 and 10.000 - and most important: as part of the document construction process the ordening of each of the 20.000 indices is known for all documents that are processed by the document-construction server in question. (This part is needed, and is already performing at light speed) for sake of argument say we have 5 document-construction servers. Each server processes 40.000 documents. Each server has 20.000 ordered indices in its own format readily available for the 40.000 documents it's processing. Something like: LinkedHashMapInteger,SetInteger -- price,{productids} Say we have 20 indexing servers. Each server has to calculate 1.000 indices (totalling the 20.000) We have the 5 doc-construction servers distribute the ordered sub-indices to the correct servers. Each server constructs an index from 5 ordered sub-indices coming from 5 different construction-servers. This can be done efficiently using a mergesort (since the sub-indices are already sorted) All that is missing (oversimplifying here ) is going from the ordered indices in application-format to the index-format of lucene (substituting the productids by the lucene docid's along the way) and stream it to disk. I believe this would quite posisbly give a really big indexing improvement. Is my thinking correct in the steps involved? Do you believe that this indeed would give a big speedup for this specific situation Where would I hook in the SOlr / lucene code to construct the native format? 
Thanks in advance (and for making it to here) Geert-Jan -- View this message in context: http://old.nabble.com/manually-creating-indices-to-speed-up-indexing-with-app-knowledge-tp26157851p26157851.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: apply a patch on solr
You can see what revision the patch was written for at the top of the patch, it will look like this: Index: org/apache/solr/handler/MoreLikeThisHandler.java === --- org/apache/solr/handler/MoreLikeThisHandler.java (revision 772437) +++ org/apache/solr/handler/MoreLikeThisHandler.java (working copy) now check out revision 772437 using the --revision switch in svn, patch away, and then svn up to make sure everything merges cleanly. This is a good guide to follow as well: http://www.mail-archive.com/solr-user@lucene.apache.org/msg10189.html cheers, -mike On Mon, Nov 2, 2009 at 3:55 PM, michael8 mich...@saracatech.com wrote: Hi, First I like to pardon my novice question on patching solr (1.4). What I like to know is, given a patch, like the one for collapse field, how would one go about knowing what solr source that patch is meant for since this is a source level patch? Wouldn't the exact versions of a set of java files to be patched critical for the patch to work properly? So far what I have done is to pull the latest collapse field patch down from http://issues.apache.org/jira/browse/SOLR-236 (field-collapse-5.patch), and then svn up the latest trunk from http://svn.apache.org/repos/asf/lucene/solr/trunk/, then patch and build. Intuitively I was thinking I should be doing svn up to a specific revision/tag instead of just latest. So far everything seems fine, but I just want to make sure I'm doing the right thing and not just being lucky. Thanks, Michael -- View this message in context: http://old.nabble.com/apply-a-patch-on-solr-tp26157827p26157827.html Sent from the Solr - User mailing list archive at Nabble.com.
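For anyone following along, the sequence Mike describes looks roughly like this (a sketch; the revision number is the one from his example patch header, and the -p level has to match the paths inside the actual field-collapse-5.patch):

svn checkout --revision 772437 http://svn.apache.org/repos/asf/lucene/solr/trunk/ solr-trunk
cd solr-trunk
patch -p0 --dry-run -i field-collapse-5.patch    # verify every hunk applies before touching the tree
patch -p0 -i field-collapse-5.patch
svn up                                           # merge up to the latest trunk, resolve any conflicts
ant dist                                         # build the patched Solr

If svn up reports conflicts in patched files, that usually means the patch was written against an older revision than the one you updated to.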
highlighting error using 1.4rc
Hi, I've tried installing the latest (3rd) RC for Solr 1.4 and Lucene 2.9.1. One of our integration tests, which runs against and embedded server appears to be failing on highlighting. I've included the stack trace and the configuration from solrconf. I'd appreciate any insights. Please let me know what additional information would be useful. Caused by: org.apache.solr.client.solrj.SolrServerException: org.apache.solr.client.solrj.SolrServerException: java.lang.ClassCastException: org.apache.lucene.search.spans.SpanOrQuery cannot be cast to org.apache.lucene.search.spans.SpanNearQuery at org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.request(EmbeddedSolrServer.java:153) at org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:89) at org.apache.solr.client.solrj.SolrServer.query(SolrServer.java:118) at org.bookshare.search.solr.SolrSearchServerWrapper.query(SolrSearchServerWrapper.java:96) ... 29 more Caused by: org.apache.solr.client.solrj.SolrServerException: java.lang.ClassCastException: org.apache.lucene.search.spans.SpanOrQuery cannot be cast to org.apache.lucene.search.spans.SpanNearQuery at org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.request(EmbeddedSolrServer.java:141) ... 32 more Caused by: java.lang.ClassCastException: org.apache.lucene.search.spans.SpanOrQuery cannot be cast to org.apache.lucene.search.spans.SpanNearQuery at org.apache.lucene.search.highlight.WeightedSpanTermExtractor.collectSpanQueryFields(WeightedSpanTermExtractor.java:489) at org.apache.lucene.search.highlight.WeightedSpanTermExtractor.collectSpanQueryFields(WeightedSpanTermExtractor.java:484) at org.apache.lucene.search.highlight.WeightedSpanTermExtractor.extractWeightedSpanTerms(WeightedSpanTermExtractor.java:249) at org.apache.lucene.search.highlight.WeightedSpanTermExtractor.extract(WeightedSpanTermExtractor.java:230) at org.apache.lucene.search.highlight.WeightedSpanTermExtractor.extract(WeightedSpanTermExtractor.java:158) at org.apache.lucene.search.highlight.WeightedSpanTermExtractor.getWeightedSpanTerms(WeightedSpanTermExtractor.java:414) at org.apache.lucene.search.highlight.QueryScorer.initExtractor(QueryScorer.java:216) at org.apache.lucene.search.highlight.QueryScorer.init(QueryScorer.java:184) at org.apache.lucene.search.highlight.Highlighter.getBestTextFragments(Highlighter.java:226) at org.apache.solr.highlight.DefaultSolrHighlighter.doHighlighting(DefaultSolrHighlighter.java:335) at org.apache.solr.handler.component.HighlightComponent.process(HighlightComponent.java:89) at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:203) at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131) at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316) at org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.request(EmbeddedSolrServer.java:139) ... 32 more I see in our solrconf the following for highlighting. 
highlighting !-- Configure the standard fragmenter -- !-- This could most likely be commented out in the default case -- fragmenter name=gap class=org.apache.solr.highlight.GapFragmenter default=true lst name=defaults int name=hl.fragsize100/int /lst /fragmenter !-- A regular-expression-based fragmenter (f.i., for sentence extraction) -- fragmenter name=regex class=org.apache.solr.highlight.RegexFragmenter lst name=defaults !-- slightly smaller fragsizes work better because of slop -- int name=hl.fragsize70/int !-- allow 50% slop on fragment sizes -- float name=hl.regex.slop0.5/float !-- a basic sentence pattern -- str name=hl.regex.pattern[-\w ,/\n\']{20,200}/str /lst /fragmenter !-- Configure the standard formatter -- formatter name=html class=org.apache.solr.highlight.HtmlFormatter default=true lst name=defaults str name=hl.simple.pre![CDATA[strong]]/str str name=hl.simple.post![CDATA[/strong]]/str /lst /formatter /highlighting Thanks, Jake
Question regarding snapinstaller
It looks like the snapinstaller script does an atomic remove and replace of the entire solr_home/data_dir/index folder with the contents of the new snapshot before issuing a commit command. I am trying to understand the implication of the same. What happens to queries that come during the time interval between the instant the existing directory is removed and the commit command gets finalized? Does a currently running instance of Solr not need the files in the index folder to serve the query results? Are all the contents of the index folder loaded into memory? Thanks in advance for any help. Regards, Prasanna.
Re: tracking solr response time
So I need someone with better knowledge to chime in here with an opinion on whether autowarming would help since the whole faceting thing is something I'm not very comfortable with... hint, hint, hint Erick On Mon, Nov 2, 2009 at 2:21 PM, bharath venkatesh bharathv6.proj...@gmail.com wrote: @Israel: yes I got that point which yonik mentioned .. but is qtime the total time taken by solr server for that request or is it part of time taken by the solr for that request ( is there any thing that a solr server does for that particulcar request which is not included in that qtime bracket ) ? I am sorry for dragging in to this qtime. I just want to be sure, as we observed many times there is huge mismatch between qtime and time measured at the client for the response ( does this imply it is due to internal network issue ) @Erick: yes, many times query is slow first time its executed is there any solution to improve upon this factor .. for querying we use DisMaxRequestHandler , queries are quite long with many faceting parameters . On Mon, Nov 2, 2009 at 10:46 PM, Israel Ekpo israele...@gmail.com wrote: On Mon, Nov 2, 2009 at 9:52 AM, bharath venkatesh bharathv6.proj...@gmail.com wrote: Thanks for the quick response @yonik How much of a latency compared to normal, and what version of Solr are you using? latency is usually around 2-4 secs (some times it goes more than that ) which happens to only 15-20% of the request other 80-85% of request are very fast it is in milli secs ( around 200,000 requests happens every day ) @Israel we are not using java client .. we r using python at the client with response formatted in json @yonikn @Israel does qtime measure the total time taken at the solr server ? I am already measuring the time to get the response at client end . I would want a means to know how much time the solr server is taking to respond (process ) once it gets the request . so that I could identify whether it is a solr server issue or internal network issue It is the time spent at the Solr server. I think Yonik already answered this part in his response to your thread : This is what he said : QTime is the time spent in generating the in-memory representation for the response before the response writer starts streaming it back in whatever format was requested. The stored fields of returned documents are also loaded at this point (to enable handling of huge response lists w/o storing all in memory). @Israel we are using rhel server 5 on both client and server .. we have 6 solr sever . one is acting as master . both client and solr sever are on the same network . those servers are dedicated solr server except 2 severs which have DB and memcahce running .. we have adjusted the load accordingly On 11/2/09, Israel Ekpo israele...@gmail.com wrote: On Mon, Nov 2, 2009 at 8:41 AM, Yonik Seeley yo...@lucidimagination.comwrote: On Mon, Nov 2, 2009 at 8:13 AM, bharath venkatesh bharathv6.proj...@gmail.com wrote: We are using solr for many of ur products it is doing quite well . But since no of hits are becoming high we are experiencing latency in certain requests ,about 15% of our requests are suffering a latency How much of a latency compared to normal, and what version of Solr are you using? . We are trying to identify the problem . It may be due to network issue or solr server is taking time to process the request . other than qtime which is returned along with the response is there any other way to track solr servers performance ? 
how is qtime calculated , is it the total time from when solr server got the request till it gave the response ? QTime is the time spent in generating the in-memory representation for the response before the response writer starts streaming it back in whatever format was requested. The stored fields of returned documents are also loaded at this point (to enable handling of huge response lists w/o storing all in memory). There are normally servlet container logs that can be configured to spit out the real total request time. can we do some extra logging to track solr servers performance . ideally I would want to pass some log id along with the request (query ) to solr server and solr server must log the response time along with that log id . Yep - Solr isn't bothered by params it doesn't know about, so just put logid=xxx and it should also be logged with the other request params. -Yonik http://www.lucidimagination.com If you are not using Java then you may have to track the elapsed time manually. If you are using the SolrJ Java client you may have the following options:
Re: field queries seem slow
H, are you sorting? And has your readers been reopened? Is the second query of that sort also slow? If the answer to this last question is no, have you tried some autowarming queries? Best Erick On Mon, Nov 2, 2009 at 4:34 PM, mike anderson saidthero...@gmail.comwrote: I took a look through my Solr logs this weekend and noticed that the longest queries were on particular fields, like author:albert einstein. Is this a result consistent with other setups out there? If not, Is there a trick to make these go faster? I've read up on filter queries and use those when applicable, but they don't really solve all my problems. If anybody wants to take a shot at it but needs to see my solrconfig, etc just let me know. Cheers, Mike
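On the autowarming suggestion: besides the cache autowarmCount settings, solrconfig.xml can fire explicit warming queries each time a new searcher is opened, which is the usual cure for "the first query after a commit is slow". A sketch using the stock QuerySenderListener; the query strings are placeholders, not taken from Mike's logs:

<listener event="newSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst><str name="q">author:"albert einstein"</str><str name="rows">10</str></lst>
  </arr>
</listener>
<listener event="firstSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst><str name="q">solr</str><str name="start">0</str><str name="rows">10</str></lst>
  </arr>
</listener>

firstSearcher runs once at startup and newSearcher on every commit/replication, so representative sort and facet queries usually go in both.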
Re: Question about DIH execution order
Bertie, Not sure what you are trying to do, we need a clearer description of what select * returns and what you want to end up in the index. But to answer your question The transformations happen after DIH has performed the SQL statement. In fact the rows output from the SQL command are assigned to the DIH fields and then any transformations are applied. The examples in http://wiki.apache.org/solr/DataImportHandler are quite good. Hi Noble, I tried to understand your suggestions and played different variations according to your reply. But none of them work. Can you explain it in more details? Thanks a lot! BTW, do you mean your solution as follows? document entity name=Course transformer= TemplateTransformer query=select * from Course field column=TmpCourseId name=CourseId template=Course:${Course.CourseId} name=id/ entity name=Rating query=select comment from Rating where Rating.CourseId = ${Course.CourseId} field column=comment name=review/ /entity /entity /document But 1) There is no TmpCourseId field column. 2) Can we put two name CourseId and id in the same map? It seems not. 2009/11/1 Noble Paul ?? Â Ë³Ë noble.p...@corp.aol.com On Sun, Nov 1, 2009 at 11:59 PM, Bertie Shen bertie.s...@gmail.com wrote: Hi folks, I have the following data-config.xml. Is there a way to let transformation take place after executing SQL select comment from Rating where Rating.CourseId = ${Course.CourseId}? In MySQL database, column CourseId in table Course is integer 1, 2, etc; template transformation will make them like Course:1, Course:2; column CourseId in table Rating is also integer 1, 2, etc. If transformation happens before executing select comment from Rating where Rating.CourseId = ${Course.CourseId}, then there will no match for the SQL statement execution. document entity name=Course transformer=TemplateTransformer query=select * from Course field column=CourseId template=Course:${Course.CourseId} name=id/ entity name=Rating query=select comment from Rating where Rating.CourseId = ${Course.CourseId} field column=comment name=review/ /entity /entity /document keep the field as follows field column=TmpCourseId name=CourseId template=Course:${Course.CourseId} name=id/ -- - Noble Paul | Principal Engineer| AOL | http://aol.com -- === Fergus McMenemie Email:fer...@twig.me.uk Techmore Ltd Phone:(UK) 07721 376021 Unix/Mac/Intranets Analyst Programmer ===
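Reading Noble's suggestion, the intent seems to be to keep the raw CourseId column untouched, so that ${Course.CourseId} still resolves to the plain integer for the Rating sub-query, and to have the TemplateTransformer write its output into a separate column that maps onto the Solr id field. A sketch of that reading; the solr_id column name is my own placeholder, and whether this matches Noble's exact intent is an assumption:

<document>
  <entity name="Course" transformer="TemplateTransformer" query="select * from Course">
    <!-- new column created by the transformer; CourseId itself stays available for the child entity -->
    <field column="solr_id" name="id" template="Course:${Course.CourseId}"/>
    <entity name="Rating" query="select comment from Rating where Rating.CourseId = ${Course.CourseId}">
      <field column="comment" name="review"/>
    </entity>
  </entity>
</document>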
Re: Lucene FieldCache memory requirements
OK I think someone who knows how Solr uses the fieldCache for this type of field will have to pipe up. For Lucene directly, simple strings would consume an pointer (4 or 8 bytes depending on whether your JRE is 64bit) per doc, and the string index would consume an int (4 bytes) per doc. (Each also consume negligible (for your case) memory to hold the actual string values). Note that for your use case, this is exceptionally wasteful. If Lucene had simple bit-packed ints (I've opened LUCENE-1990 for this) then it'd take much fewer bits to reference the values, since you have only 10 unique string values. Mike On Mon, Nov 2, 2009 at 3:57 PM, Fuad Efendi f...@efendi.ca wrote: I am not using Lucene API directly; I am using SOLR which uses Lucene FieldCache for faceting on non-tokenized fields... I think this cache will be lazily loaded, until user executes sorted (by this field) SOLR query for all documents *:* - in this case it will be fully populated... Subject: Re: Lucene FieldCache memory requirements Which FieldCache API are you using? getStrings? or getStringIndex (which is used, under the hood, if you sort by this field). Mike On Mon, Nov 2, 2009 at 2:27 PM, Fuad Efendi f...@efendi.ca wrote: Any thoughts regarding the subject? I hope FieldCache doesn't use more than 6 bytes per document-field instance... I am too lazy to research Lucene source code, I hope someone can provide exact answer... Thanks Subject: Lucene FieldCache memory requirements Hi, Can anyone confirm Lucene FieldCache memory requirements? I have 100 millions docs with non-tokenized field country (10 different countries); I expect it requires array of (int, long), size of array 100,000,000, without any impact of country field length; it requires 600,000,000 bytes: int is pointer to document (Lucene document ID), and long is pointer to String value... Am I right, is it 600Mb just for this country (indexed, non-tokenized, non-boolean) field and 100 millions docs? I need to calculate exact minimum RAM requirements... I believe it shouldn't depend on cardinality (distribution) of field... Thanks, Fuad
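Putting rough numbers on that for the 100-million-document, 10-country case (back-of-the-envelope, based only on the per-document costs Mike gives above):

getStringIndex (what sorting/faceting uses): order[] = 100,000,000 x 4 bytes ≈ 400 MB, plus a lookup[] of ~10 unique country strings, which is negligible.
getStrings: 100,000,000 object references x 4 bytes (32-bit JVM) ≈ 400 MB, or x 8 bytes (64-bit) ≈ 800 MB, again plus the ~10 string values themselves.

Either way the cost is driven almost entirely by maxDoc rather than by the number of distinct values, which matches Fuad's intuition that cardinality doesn't matter; the StringIndex case comes in below the 6-bytes-per-document (600 MB) estimate in the original question.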
Re: highlighting error using 1.4rc
Umm - crap. This looks looks like a bug in a fix that just went in. My fault on the review. I'll fix it tonight when I get home - unfortunetly, both lucene and sold are about to be released... - Mark http://www.lucidimagination.com (mobile) On Nov 2, 2009, at 5:17 PM, Jake Brownell ja...@benetech.org wrote: Hi, I've tried installing the latest (3rd) RC for Solr 1.4 and Lucene 2.9.1. One of our integration tests, which runs against and embedded server appears to be failing on highlighting. I've included the stack trace and the configuration from solrconf. I'd appreciate any insights. Please let me know what additional information would be useful. Caused by: org.apache.solr.client.solrj.SolrServerException: org.apache.solr.client.solrj.SolrServerException: java.lang.ClassCastException: org.apache.lucene.search.spans.SpanOrQuery cannot be cast to org.apache.lucene.search.spans.SpanNearQuery at org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.request (EmbeddedSolrServer.java:153) at org.apache.solr.client.solrj.request.QueryRequest.process (QueryRequest.java:89) at org.apache.solr.client.solrj.SolrServer.query (SolrServer.java:118) at org.bookshare.search.solr.SolrSearchServerWrapper.query (SolrSearchServerWrapper.java:96) ... 29 more Caused by: org.apache.solr.client.solrj.SolrServerException: java.lang.ClassCastException: org.apache.lucene.search.spans.SpanOrQuery cannot be cast to org.apache.lucene.search.spans.SpanNearQuery at org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.request (EmbeddedSolrServer.java:141) ... 32 more Caused by: java.lang.ClassCastException: org.apache.lucene.search.spans.SpanOrQuery cannot be cast to org.apache.lucene.search.spans.SpanNearQuery at org.apache.lucene.search.highlight.WeightedSpanTermExtractor.collectSpanQueryFields( WeightedSpanTermExtractor.java:489) at org.apache.lucene.search.highlight.WeightedSpanTermExtractor.collectSpanQueryFields( WeightedSpanTermExtractor.java:484) at org.apache.lucene.search.highlight.WeightedSpanTermExtractor.extractWeightedSpanTerms( WeightedSpanTermExtractor.java:249) at org.apache.lucene.search.highlight.WeightedSpanTermExtractor.extract (WeightedSpanTermExtractor.java:230) at org.apache.lucene.search.highlight.WeightedSpanTermExtractor.extract (WeightedSpanTermExtractor.java:158) at org.apache.lucene.search.highlight.WeightedSpanTermExtractor.getWeightedSpanTerms( WeightedSpanTermExtractor.java:414) at org.apache.lucene.search.highlight.QueryScorer.initExtractor (QueryScorer.java:216) at org.apache.lucene.search.highlight.QueryScorer.init (QueryScorer.java:184) at org.apache.lucene.search.highlight.Highlighter.getBestTextFragments (Highlighter.java:226) at org.apache.solr.highlight.DefaultSolrHighlighter.doHighlighting (DefaultSolrHighlighter.java:335) at org.apache.solr.handler.component.HighlightComponent.process (HighlightComponent.java:89) at org.apache.solr.handler.component.SearchHandler.handleRequestBody (SearchHandler.java:203) at org.apache.solr.handler.RequestHandlerBase.handleRequest (RequestHandlerBase.java:131) at org.apache.solr.core.SolrCore.execute(SolrCore.java: 1316) at org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.request (EmbeddedSolrServer.java:139) ... 32 more I see in our solrconf the following for highlighting. 
highlighting !-- Configure the standard fragmenter -- !-- This could most likely be commented out in the default case -- fragmenter name=gap class=org.apache.solr.highlight.GapFragmenter default=true lst name=defaults int name=hl.fragsize100/int /lst /fragmenter !-- A regular-expression-based fragmenter (f.i., for sentence extraction) -- fragmenter name=regex class=org.apache.solr.highlight.RegexFragmenter lst name=defaults !-- slightly smaller fragsizes work better because of slop -- int name=hl.fragsize70/int !-- allow 50% slop on fragment sizes -- float name=hl.regex.slop0.5/float !-- a basic sentence pattern -- str name=hl.regex.pattern[-\w ,/\n\']{20,200}/str /lst /fragmenter !-- Configure the standard formatter -- formatter name=html class=org.apache.solr.highlight.HtmlFormatter default=true lst name=defaults str name=hl.simple.pre![CDATA[strong]]/str str name=hl.simple.post![CDATA[/strong]]/str /lst /formatter /highlighting Thanks, Jake
Re: Spell check suggestion and correct way of implementation and some Questions
Hello everybody, I am able to use the spell checker but I have some questions, if someone can answer them. If I search the free-text word waranty then I get back the suggestion warranty, which is fine. But if I do a search on a field, for example description:waranty, the output collation element is description:warranty, which I don't want; I want to get back only the text, i.e. warranty. We are using collation to return the results since, if a user types three words, we use the collation in the response element to display the spelling suggestion. Any advice? darniz -- View this message in context: http://old.nabble.com/Spell-check-suggestion-and-correct-way-of-implementation-and-some-Questions-tp26096664p26157893.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: CPU utilization and query time high on Solr slave when snapshot install
Hmm...I think you have to setup warming queries yourself and that autowarm just copies entries from the old cache to the new cache, rather than issuing queries - the value is how many entries it will copy. Though that's still going to take CPU and time. - Mark http://www.lucidimagination.com (mobile) On Nov 2, 2009, at 12:47 PM, Walter Underwood wun...@wunderwood.org wrote: If you are going to pull a new index every 10 minutes, try turning off cache autowarming. Your caches are never more than 10 minutes old, so spending a minute warming each new cache is a waste of CPU. Autowarm submits queries to the new Searcher before putting it in service. This will create a burst of query load on the new Searcher, often keeping one CPU pretty busy for several seconds. In solrconfig.xml, set autowarmCount to 0. Also, if you want the slaves to always have an optimized index, create the snapshot only in post-optimize. If you create snapshots in both post-commit and post-optimize, you are creating a non- optimized index (post-commit), then replacing it with an optimized one a few minutes later. A slave might get a non-optimized index one time, then an optimized one the next. wunder On Nov 2, 2009, at 1:45 AM, biku...@sapient.com wrote: Hi Solr Gurus, We have solr in 1 master, 2 slave configuration. Snapshot is created post commit, post optimization. We have autocommit after 50 documents or 5 minutes. Snapshot puller runs as a cron every 10 minutes. What we have observed is that whenever snapshot is installed on the slave, we see solrj client used to query slave solr, gets timedout and there is high CPU usage/load avg. on slave server. If we stop snapshot puller, then slaves work with no issues. The system has been running since 2 months and this issue has started to occur only now when load on website is increasing. Following are some details: Solr Details: apache-solr Version: 1.3.0 Lucene - 2.4-dev Master/Slave configurations: Master: - for indexing data HTTPRequests are made on Solr server. - autocommit feature is enabled for 50 docs and 5 minutes - caching params are disable for this server - mergeFactor of 10 is set - we were running optimize script after every 2 hours, but now have reduced the duration to twice a day but issue still persists Slave1/Slave2: - standard requestHandler is being used - default values of caching are set Machine Specifications: Master: - 4GB RAM - 1GB JVM Heap memory is allocated to Solr Slave1/Slave2: - 4GB RAM - 2GB JVM Heap memory is allocated to Solr Master and Slave1 (solr1)are on single box and Slave2(solr2) on different box. We use HAProxy to load balance query requests between 2 slaves. Master is only used for indexing. Please let us know if somebody has ever faced similar kind of issue or has some insight into it as we guys are literally struck at the moment with a very unstable production environment. As a workaround, we have started running optimize on master every 7 minutes. This seems to have reduced the severity of the problem but still issue occurs every 2days now. please suggest what could be the root cause of this. Thanks, Bipul
Re: solr search
The problem is in db-dataconfig.xml. You should start with the example DataImportHandler configuration fles. The structure is wrong. First there is a datasource, then there are 'entities' which fetch a document's fields from the datasource. On Fri, Oct 30, 2009 at 9:03 PM, manishkbawne manish.ba...@gmail.com wrote: Hi, I have made following changes in solrconfig.xml requestHandler name=/dataimport class=org.apache.solr.handler.dataimport.DataImportHandler lst name=defaults str name=configC:/Apache-Tomcat/apache-tomcat-6.0.20/solr/conf/db-data-config.xml/str /lst /requestHandler in db-dataconfig.xml dataConfig document name=id1 dataSource type=JdbcDataSource driver=com.microsoft.sqlserver.jdbc.SQLServerDriver url=jdbc:sqlserver://servername:1433/databasename user=sa password=p...@123/ entity name=id1 query=select id from be field column=id name=id1 / /entity /document /dataConfig in schema.xml files field name=id1 type=string indexes=true default=none/ Please suggest me the possible cause of error?? Lance Norskog-2 wrote: Please post your dataimporthandler configuration file. On Fri, Oct 30, 2009 at 4:17 AM, manishkbawne manish.ba...@gmail.com wrote: Thanks for your reply .. I am trying to use the database for solr search but getting this error.. abortOnConfigurationErrorfalse/abortOnConfigurationError in null - java.lang.NullPointerException at org.apache.solr.handler.dataimport.DataImporter.init(DataImporter.java:95) at org.apache.solr.handler.dataimport.DataImportHandler.inform(DataImportHandler.java:106) at org.apache.solr.core.SolrResourceLoader Can you please suggest me some possible solution? Karsten F. wrote: hi manishkbawne, unspecific ideas of search improvements are her: http://wiki.apache.org/solr/SolrPerformanceFactors I really like the last idea in http://wiki.apache.org/lucene-java/ImproveSearchingSpeed : Use a profiler and ask a more specific question in this forum. Best regards Karsten manishkbawne wrote: I am using solr search to search through xml files. As I am working on millions of data, the result output is slower. Can anyone please suggest me some way, by which I can increase the search result output? -- View this message in context: http://old.nabble.com/solr-search-tp26125183p26128341.html Sent from the Solr - User mailing list archive at Nabble.com. -- Lance Norskog goks...@gmail.com -- View this message in context: http://old.nabble.com/solr-search-tp26125183p26139946.html Sent from the Solr - User mailing list archive at Nabble.com. -- Lance Norskog goks...@gmail.com
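For comparison with the structure Lance describes, a minimal db-data-config.xml keeps the dataSource as a direct child of dataConfig (not nested inside document) and puts the entity, with its field mappings, inside document. The driver, URL, and credentials below are just the placeholders from the original message, trimmed to the essentials:

<dataConfig>
  <dataSource type="JdbcDataSource"
              driver="com.microsoft.sqlserver.jdbc.SQLServerDriver"
              url="jdbc:sqlserver://servername:1433/databasename"
              user="sa" password="..."/>
  <document>
    <entity name="be" query="select id from be">
      <field column="id" name="id1"/>
    </entity>
  </document>
</dataConfig>

The field element maps the SQL column id onto the schema field id1, matching the field declared in schema.xml above.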
Re: solr web ui
This is what I meant to mention - Uri's GWT browser, not the Velocity toolkit. On Fri, Oct 30, 2009 at 1:20 PM, Grant Ingersoll gsing...@apache.org wrote: There is also a GWT contribution in JIRA that is pretty handy and will likely be added in 1.5. See http://issues.apache.org/jira/browse/SOLR-1163 -Grant On Oct 29, 2009, at 9:17 PM, scabbage wrote: Hi, I'm a new solr user. I would like to know if there are any easy to setup web UIs for solr. It can be as simple as a search box, term highlighting and basic faceting. Basically I'm using solr to store all our automation testing logs and would like to have a simple searchable UI. I don't wanna spent too much time writing my own. Thanks. -- View this message in context: http://www.nabble.com/solr-web-ui-tp26123604p26123604.html Sent from the Solr - User mailing list archive at Nabble.com. -- Lance Norskog goks...@gmail.com
RE: Lucene FieldCache memory requirements
Thank you very much Mike, I found it: org.apache.solr.request.SimpleFacets ... // TODO: future logic could use filters instead of the fieldcache if // the number of terms in the field is small enough. counts = getFieldCacheCounts(searcher, base, field, offset,limit, mincount, missing, sort, prefix); ... FieldCache.StringIndex si = FieldCache.DEFAULT.getStringIndex(searcher.getReader(), fieldName); final String[] terms = si.lookup; final int[] termNum = si.order; ... So that 64-bit requires more memory :) Mike, am I right here? [(8 bytes pointer) + (4 bytes DocID)] x [Number of Documents (100mlns)] (64-bit JVM) 1.2Gb RAM for this... Or, may be I am wrong: For Lucene directly, simple strings would consume an pointer (4 or 8 bytes depending on whether your JRE is 64bit) per doc, and the string index would consume an int (4 bytes) per doc. [8 bytes (64bit)] x [number of documents (100mlns)]? 0.8Gb Kind of Map between String and DocSet, saving 4 bytes... Key is String, and Value is array of 64-bit pointers to Document. Why 64-bit (for 64-bit JVM)? I always thought it is (int) documentId... Am I right? Thanks for pointing to http://issues.apache.org/jira/browse/LUCENE-1990! Note that for your use case, this is exceptionally wasteful. This is probably very common case... I think it should be confirmed by Lucene developers too... FieldCache is warmed anyway, even when we don't use SOLR... -Fuad -Original Message- From: Michael McCandless [mailto:luc...@mikemccandless.com] Sent: November-02-09 6:00 PM To: solr-user@lucene.apache.org Subject: Re: Lucene FieldCache memory requirements OK I think someone who knows how Solr uses the fieldCache for this type of field will have to pipe up. For Lucene directly, simple strings would consume an pointer (4 or 8 bytes depending on whether your JRE is 64bit) per doc, and the string index would consume an int (4 bytes) per doc. (Each also consume negligible (for your case) memory to hold the actual string values). Note that for your use case, this is exceptionally wasteful. If Lucene had simple bit-packed ints (I've opened LUCENE-1990 for this) then it'd take much fewer bits to reference the values, since you have only 10 unique string values. Mike On Mon, Nov 2, 2009 at 3:57 PM, Fuad Efendi f...@efendi.ca wrote: I am not using Lucene API directly; I am using SOLR which uses Lucene FieldCache for faceting on non-tokenized fields... I think this cache will be lazily loaded, until user executes sorted (by this field) SOLR query for all documents *:* - in this case it will be fully populated... Subject: Re: Lucene FieldCache memory requirements Which FieldCache API are you using? getStrings? or getStringIndex (which is used, under the hood, if you sort by this field). Mike On Mon, Nov 2, 2009 at 2:27 PM, Fuad Efendi f...@efendi.ca wrote: Any thoughts regarding the subject? I hope FieldCache doesn't use more than 6 bytes per document-field instance... I am too lazy to research Lucene source code, I hope someone can provide exact answer... Thanks Subject: Lucene FieldCache memory requirements Hi, Can anyone confirm Lucene FieldCache memory requirements? I have 100 millions docs with non-tokenized field country (10 different countries); I expect it requires array of (int, long), size of array 100,000,000, without any impact of country field length; it requires 600,000,000 bytes: int is pointer to document (Lucene document ID), and long is pointer to String value... 
Am I right, is it 600Mb just for this country (indexed, non-tokenized, non-boolean) field and 100 millions docs? I need to calculate exact minimum RAM requirements... I believe it shouldn't depend on cardinality (distribution) of field... Thanks, Fuad
Re: CPU utilization and query time high on Solr slave when snapshot install
So assuming you set up a few sample sort queries to run in the firstSearcher config, and had very low query volume during that ten minutes so that there were no evictions before a new Searcher was loaded, would those queries run by the firstSearcher be passed along to the cache for the next Searcher as part of the autowarm? If so, it seems like you might want to load a few sort queries for the firstSearcher, but might not need any included in the newSearcher? -Jay On Mon, Nov 2, 2009 at 4:26 PM, Mark Miller markrmil...@gmail.com wrote: Hmm...I think you have to setup warming queries yourself and that autowarm just copies entries from the old cache to the new cache, rather than issuing queries - the value is how many entries it will copy. Though that's still going to take CPU and time. - Mark http://www.lucidimagination.com (mobile) On Nov 2, 2009, at 12:47 PM, Walter Underwood wun...@wunderwood.org wrote: If you are going to pull a new index every 10 minutes, try turning off cache autowarming. Your caches are never more than 10 minutes old, so spending a minute warming each new cache is a waste of CPU. Autowarm submits queries to the new Searcher before putting it in service. This will create a burst of query load on the new Searcher, often keeping one CPU pretty busy for several seconds. In solrconfig.xml, set autowarmCount to 0. Also, if you want the slaves to always have an optimized index, create the snapshot only in post-optimize. If you create snapshots in both post-commit and post-optimize, you are creating a non-optimized index (post-commit), then replacing it with an optimized one a few minutes later. A slave might get a non-optimized index one time, then an optimized one the next. wunder On Nov 2, 2009, at 1:45 AM, biku...@sapient.com wrote: Hi Solr Gurus, We have solr in 1 master, 2 slave configuration. Snapshot is created post commit, post optimization. We have autocommit after 50 documents or 5 minutes. Snapshot puller runs as a cron every 10 minutes. What we have observed is that whenever snapshot is installed on the slave, we see solrj client used to query slave solr, gets timedout and there is high CPU usage/load avg. on slave server. If we stop snapshot puller, then slaves work with no issues. The system has been running since 2 months and this issue has started to occur only now when load on website is increasing. Following are some details: Solr Details: apache-solr Version: 1.3.0 Lucene - 2.4-dev Master/Slave configurations: Master: - for indexing data HTTPRequests are made on Solr server. - autocommit feature is enabled for 50 docs and 5 minutes - caching params are disable for this server - mergeFactor of 10 is set - we were running optimize script after every 2 hours, but now have reduced the duration to twice a day but issue still persists Slave1/Slave2: - standard requestHandler is being used - default values of caching are set Machine Specifications: Master: - 4GB RAM - 1GB JVM Heap memory is allocated to Solr Slave1/Slave2: - 4GB RAM - 2GB JVM Heap memory is allocated to Solr Master and Slave1 (solr1)are on single box and Slave2(solr2) on different box. We use HAProxy to load balance query requests between 2 slaves. Master is only used for indexing. Please let us know if somebody has ever faced similar kind of issue or has some insight into it as we guys are literally struck at the moment with a very unstable production environment. As a workaround, we have started running optimize on master every 7 minutes. 
This seems to have reduced the severity of the problem but still issue occurs every 2days now. please suggest what could be the root cause of this. Thanks, Bipul
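For reference, the concrete change Walter is describing lives in the cache declarations of the slaves' solrconfig.xml: set autowarmCount to 0 so nothing is carried over when a snapshot is installed. The class and size values below are just the stock example settings, not taken from Bipul's config:

<filterCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>
<queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>
<documentCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>

Explicit firstSearcher/newSearcher warming queries (QuerySenderListener) are a separate mechanism from this copy-based autowarming, which is the distinction Mark and Jay are drawing above.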
Re: Lucene FieldCache memory requirements
It also briefly requires more memory than just that - it allocates an array the size of maxdoc+1 to hold the unique terms - and then sizes down. Possibly we can use the getUnuiqeTermCount method in the flexible indexing branch to get rid of that - which is why I was thinking it might be a good idea to drop the unsupported exception in that method for things like multi reader and just do the work to get the right number (currently there is a comment that the user should do that work if necessary, making the call unreliable for this). Fuad Efendi wrote: Thank you very much Mike, I found it: org.apache.solr.request.SimpleFacets ... // TODO: future logic could use filters instead of the fieldcache if // the number of terms in the field is small enough. counts = getFieldCacheCounts(searcher, base, field, offset,limit, mincount, missing, sort, prefix); ... FieldCache.StringIndex si = FieldCache.DEFAULT.getStringIndex(searcher.getReader(), fieldName); final String[] terms = si.lookup; final int[] termNum = si.order; ... So that 64-bit requires more memory :) Mike, am I right here? [(8 bytes pointer) + (4 bytes DocID)] x [Number of Documents (100mlns)] (64-bit JVM) 1.2Gb RAM for this... Or, may be I am wrong: For Lucene directly, simple strings would consume an pointer (4 or 8 bytes depending on whether your JRE is 64bit) per doc, and the string index would consume an int (4 bytes) per doc. [8 bytes (64bit)] x [number of documents (100mlns)]? 0.8Gb Kind of Map between String and DocSet, saving 4 bytes... Key is String, and Value is array of 64-bit pointers to Document. Why 64-bit (for 64-bit JVM)? I always thought it is (int) documentId... Am I right? Thanks for pointing to http://issues.apache.org/jira/browse/LUCENE-1990! Note that for your use case, this is exceptionally wasteful. This is probably very common case... I think it should be confirmed by Lucene developers too... FieldCache is warmed anyway, even when we don't use SOLR... -Fuad -Original Message- From: Michael McCandless [mailto:luc...@mikemccandless.com] Sent: November-02-09 6:00 PM To: solr-user@lucene.apache.org Subject: Re: Lucene FieldCache memory requirements OK I think someone who knows how Solr uses the fieldCache for this type of field will have to pipe up. For Lucene directly, simple strings would consume an pointer (4 or 8 bytes depending on whether your JRE is 64bit) per doc, and the string index would consume an int (4 bytes) per doc. (Each also consume negligible (for your case) memory to hold the actual string values). Note that for your use case, this is exceptionally wasteful. If Lucene had simple bit-packed ints (I've opened LUCENE-1990 for this) then it'd take much fewer bits to reference the values, since you have only 10 unique string values. Mike On Mon, Nov 2, 2009 at 3:57 PM, Fuad Efendi f...@efendi.ca wrote: I am not using Lucene API directly; I am using SOLR which uses Lucene FieldCache for faceting on non-tokenized fields... I think this cache will be lazily loaded, until user executes sorted (by this field) SOLR query for all documents *:* - in this case it will be fully populated... Subject: Re: Lucene FieldCache memory requirements Which FieldCache API are you using? getStrings? or getStringIndex (which is used, under the hood, if you sort by this field). Mike On Mon, Nov 2, 2009 at 2:27 PM, Fuad Efendi f...@efendi.ca wrote: Any thoughts regarding the subject? I hope FieldCache doesn't use more than 6 bytes per document-field instance... 
I am too lazy to research Lucene source code, I hope someone can provide exact answer... Thanks Subject: Lucene FieldCache memory requirements Hi, Can anyone confirm Lucene FieldCache memory requirements? I have 100 millions docs with non-tokenized field country (10 different countries); I expect it requires array of (int, long), size of array 100,000,000, without any impact of country field length; it requires 600,000,000 bytes: int is pointer to document (Lucene document ID), and long is pointer to String value... Am I right, is it 600Mb just for this country (indexed, non-tokenized, non-boolean) field and 100 millions docs? I need to calculate exact minimum RAM requirements... I believe it shouldn't depend on cardinality (distribution) of field... Thanks, Fuad -- - Mark http://www.lucidimagination.com
Why does BinaryRequestWriter force the path to be base URL + /update/javabin
Hi folks, First of all, thanks for Solr. It is a great piece of work. I have a question about BinaryRequestWriter in the solrj project. Why does it force the path of UpdateRequests to be /update/javabin (see BinaryRequestWriter.getPath(String) starting on line 109)? I am extending BinaryRequestWriter specifically to remove this requirement and am interested to know the reasoning behind the initial choice. Thanks for your time, Stuart
Re: highlighting error using 1.4rc
Sorry - it was a bug in the backport from trunk to 2.9.1 - didn't realize that code didn't get hit because we didn't pass a null field - else the tests would have caught it. The fix has been committed, but I don't know whether it will make 2.9.1 or 1.4 because both have gotten the votes and time needed for release. Mark Miller wrote: Umm - crap. This looks like a bug in a fix that just went in. My fault on the review. I'll fix it tonight when I get home - unfortunately, both Lucene and Solr are about to be released... - Mark http://www.lucidimagination.com (mobile) On Nov 2, 2009, at 5:17 PM, Jake Brownell ja...@benetech.org wrote: Hi, I've tried installing the latest (3rd) RC for Solr 1.4 and Lucene 2.9.1. One of our integration tests, which runs against an embedded server, appears to be failing on highlighting. I've included the stack trace and the configuration from solrconfig. I'd appreciate any insights. Please let me know what additional information would be useful. Caused by: org.apache.solr.client.solrj.SolrServerException: org.apache.solr.client.solrj.SolrServerException: java.lang.ClassCastException: org.apache.lucene.search.spans.SpanOrQuery cannot be cast to org.apache.lucene.search.spans.SpanNearQuery at org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.request(EmbeddedSolrServer.java:153) at org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:89) at org.apache.solr.client.solrj.SolrServer.query(SolrServer.java:118) at org.bookshare.search.solr.SolrSearchServerWrapper.query(SolrSearchServerWrapper.java:96) ... 29 more Caused by: org.apache.solr.client.solrj.SolrServerException: java.lang.ClassCastException: org.apache.lucene.search.spans.SpanOrQuery cannot be cast to org.apache.lucene.search.spans.SpanNearQuery at org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.request(EmbeddedSolrServer.java:141) ... 32 more Caused by: java.lang.ClassCastException: org.apache.lucene.search.spans.SpanOrQuery cannot be cast to org.apache.lucene.search.spans.SpanNearQuery at org.apache.lucene.search.highlight.WeightedSpanTermExtractor.collectSpanQueryFields(WeightedSpanTermExtractor.java:489) at org.apache.lucene.search.highlight.WeightedSpanTermExtractor.collectSpanQueryFields(WeightedSpanTermExtractor.java:484) at org.apache.lucene.search.highlight.WeightedSpanTermExtractor.extractWeightedSpanTerms(WeightedSpanTermExtractor.java:249) at org.apache.lucene.search.highlight.WeightedSpanTermExtractor.extract(WeightedSpanTermExtractor.java:230) at org.apache.lucene.search.highlight.WeightedSpanTermExtractor.extract(WeightedSpanTermExtractor.java:158) at org.apache.lucene.search.highlight.WeightedSpanTermExtractor.getWeightedSpanTerms(WeightedSpanTermExtractor.java:414) at org.apache.lucene.search.highlight.QueryScorer.initExtractor(QueryScorer.java:216) at org.apache.lucene.search.highlight.QueryScorer.init(QueryScorer.java:184) at org.apache.lucene.search.highlight.Highlighter.getBestTextFragments(Highlighter.java:226) at org.apache.solr.highlight.DefaultSolrHighlighter.doHighlighting(DefaultSolrHighlighter.java:335) at org.apache.solr.handler.component.HighlightComponent.process(HighlightComponent.java:89) at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:203) at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131) at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316) at org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.request(EmbeddedSolrServer.java:139) ...
32 more
I see the following highlighting configuration in our solrconfig.xml:
<highlighting>
  <!-- Configure the standard fragmenter -->
  <!-- This could most likely be commented out in the default case -->
  <fragmenter name="gap" class="org.apache.solr.highlight.GapFragmenter" default="true">
    <lst name="defaults">
      <int name="hl.fragsize">100</int>
    </lst>
  </fragmenter>
  <!-- A regular-expression-based fragmenter (f.i., for sentence extraction) -->
  <fragmenter name="regex" class="org.apache.solr.highlight.RegexFragmenter">
    <lst name="defaults">
      <!-- slightly smaller fragsizes work better because of slop -->
      <int name="hl.fragsize">70</int>
      <!-- allow 50% slop on fragment sizes -->
      <float name="hl.regex.slop">0.5</float>
      <!-- a basic sentence pattern -->
      <str name="hl.regex.pattern">[-\w ,/\n\']{20,200}</str>
    </lst>
  </fragmenter>
  <!-- Configure the standard formatter -->
  <formatter name="html" class="org.apache.solr.highlight.HtmlFormatter" default="true">
    <lst name="defaults">
      <str name="hl.simple.pre"><![CDATA[<strong>]]></str>
      <str name="hl.simple.post"><![CDATA[</strong>]]></str>
    </lst>
  </formatter>
Re: Programmatically configuring SLF4J for Solr 1.4?
2009/11/1 Ryan McKinley ryan...@gmail.com I'm sure it is possible to configure JDK logging (java.util.logging) programmatically... but I have never had much luck with it. It is very easy to configure log4j programmatically, and this works great with Solr. Don't suppose I could trouble you for an example? I'm not terribly familiar with Java logging frameworks just yet.
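For reference, a minimal sketch of the kind of programmatic log4j setup Ryan describes, assuming the slf4j-log4j12 binding and the log4j jar are on the classpath in place of the default slf4j-jdk14 binding; the class name, pattern, and category levels here are only illustrative:

import org.apache.log4j.ConsoleAppender;
import org.apache.log4j.Level;
import org.apache.log4j.Logger;
import org.apache.log4j.PatternLayout;

public class LoggingSetup {
  public static void configure() {
    Logger root = Logger.getRootLogger();
    root.removeAllAppenders();                            // drop anything picked up from log4j.properties
    root.addAppender(new ConsoleAppender(
        new PatternLayout("%d %-5p [%c] %m%n")));         // date, level, category, message
    root.setLevel(Level.INFO);
    Logger.getLogger("org.apache.solr").setLevel(Level.WARN);  // e.g. quiet Solr's INFO chatter
  }
}

Call configure() before the first Solr/SolrJ class is touched, so the earliest log statements already go through log4j.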
RE: Lucene FieldCache memory requirements
Simple field (10 different values: Canada, USA, UK, ...), 64-bit JVM... no difference between maxdoc and maxdoc + 1 for such estimate... difference is between 0.4Gb and 1.2Gb... So, let's vote ;) A. [maxdoc] x [8 bytes ~ pointer to String object] B. [maxdoc] x [8 bytes ~ pointer to Document object] C. [maxdoc] x [4 bytes ~ (int) Lucene Document ID] - same as [String1_Document_Count + ... + String10_Document_Count] x [4 bytes ~ DocumentID] D. [maxdoc] x [4 bytes + 8 bytes ~ my initial naive thinking...] Please confirm that it is Pointer to Object and not Lucene Document ID... I hope it is (int) Document ID... -Original Message- From: Mark Miller [mailto:markrmil...@gmail.com] Sent: November-02-09 6:52 PM To: solr-user@lucene.apache.org Subject: Re: Lucene FieldCache memory requirements It also briefly requires more memory than just that - it allocates an array the size of maxdoc+1 to hold the unique terms - and then sizes down. Possibly we can use the getUnuiqeTermCount method in the flexible indexing branch to get rid of that - which is why I was thinking it might be a good idea to drop the unsupported exception in that method for things like multi reader and just do the work to get the right number (currently there is a comment that the user should do that work if necessary, making the call unreliable for this). Fuad Efendi wrote: Thank you very much Mike, I found it: org.apache.solr.request.SimpleFacets ... // TODO: future logic could use filters instead of the fieldcache if // the number of terms in the field is small enough. counts = getFieldCacheCounts(searcher, base, field, offset,limit, mincount, missing, sort, prefix); ... FieldCache.StringIndex si = FieldCache.DEFAULT.getStringIndex(searcher.getReader(), fieldName); final String[] terms = si.lookup; final int[] termNum = si.order; ... So that 64-bit requires more memory :) Mike, am I right here? [(8 bytes pointer) + (4 bytes DocID)] x [Number of Documents (100mlns)] (64-bit JVM) 1.2Gb RAM for this... Or, may be I am wrong: For Lucene directly, simple strings would consume an pointer (4 or 8 bytes depending on whether your JRE is 64bit) per doc, and the string index would consume an int (4 bytes) per doc. [8 bytes (64bit)] x [number of documents (100mlns)]? 0.8Gb Kind of Map between String and DocSet, saving 4 bytes... Key is String, and Value is array of 64-bit pointers to Document. Why 64-bit (for 64-bit JVM)? I always thought it is (int) documentId... Am I right? Thanks for pointing to http://issues.apache.org/jira/browse/LUCENE-1990! Note that for your use case, this is exceptionally wasteful. This is probably very common case... I think it should be confirmed by Lucene developers too... FieldCache is warmed anyway, even when we don't use SOLR... -Fuad -Original Message- From: Michael McCandless [mailto:luc...@mikemccandless.com] Sent: November-02-09 6:00 PM To: solr-user@lucene.apache.org Subject: Re: Lucene FieldCache memory requirements OK I think someone who knows how Solr uses the fieldCache for this type of field will have to pipe up. For Lucene directly, simple strings would consume an pointer (4 or 8 bytes depending on whether your JRE is 64bit) per doc, and the string index would consume an int (4 bytes) per doc. (Each also consume negligible (for your case) memory to hold the actual string values). Note that for your use case, this is exceptionally wasteful. 
If Lucene had simple bit-packed ints (I've opened LUCENE-1990 for this) then it'd take much fewer bits to reference the values, since you have only 10 unique string values. Mike On Mon, Nov 2, 2009 at 3:57 PM, Fuad Efendi f...@efendi.ca wrote: I am not using Lucene API directly; I am using SOLR which uses Lucene FieldCache for faceting on non-tokenized fields... I think this cache will be lazily loaded, until user executes sorted (by this field) SOLR query for all documents *:* - in this case it will be fully populated... Subject: Re: Lucene FieldCache memory requirements Which FieldCache API are you using? getStrings? or getStringIndex (which is used, under the hood, if you sort by this field). Mike On Mon, Nov 2, 2009 at 2:27 PM, Fuad Efendi f...@efendi.ca wrote: Any thoughts regarding the subject? I hope FieldCache doesn't use more than 6 bytes per document-field instance... I am too lazy to research Lucene source code, I hope someone can provide exact answer... Thanks Subject: Lucene FieldCache memory requirements Hi, Can anyone confirm Lucene FieldCache memory requirements? I have 100 millions docs with non-tokenized field country (10 different countries); I expect it requires array of (int, long), size of array 100,000,000,
Getting update/extract RequestHandler to work under Tomcat
Hoping someone might help with getting the /update/extract RequestHandler to work under Tomcat. Error 500 happens when trying to access http://localhost:8080/apache-solr-1.4-dev/update/extract/ (see below). Note /update/extract DOES work correctly under the Jetty-provided example. I think I must have a directory path incorrectly specified, but not sure where. No errors in the Catalina log on startup - only this: Nov 2, 2009 7:10:49 PM org.apache.solr.core.RequestHandlers initHandlersFromConfig INFO: created /update/extract: org.apache.solr.handler.extraction.ExtractingRequestHandler Solrconfig.xml under Tomcat is slightly changed from the example with regards to the lib elements: <lib dir="../contrib/extraction/lib" /> <lib dir="../dist/" regex="apache-solr-cell-\d.*\.jar" /> <lib dir="../dist/" regex="apache-solr-clustering-\d.*\.jar" />. The \contrib and \dist directories were copied directly below webapps\apache-solr-1.4-dev, unchanged from the example. In the Catalina log I see all the specified lib dirs added without error: INFO: Adding specified lib dirs to ClassLoader Nov 2, 2009 7:31:20 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader INFO: Adding 'file:/C:/Program%20Files/Apache%20Software%20Foundation/Tomcat%206.0/webapps/apache-solr-1.4-dev/contrib/extraction/lib/asm-3.1.jar' to classloader Nov 2, 2009 7:31:20 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader INFO: Adding 'file:/C:/Program%20Files/Apache%20Software%20Foundation/Tomcat%206.0/webapps/apache-solr-1.4-dev/contrib/extraction/lib/bcmail-jdk14-136.jar' to classloader Nov 2, 2009 7:31:20 PM org.apache.solr.core.SolrResourceLoader replaceClassLoader INFO: Adding 'file:/C:/Program%20Files/Apache%20Software%20Foundation/Tomcat%206.0/webapps/apache-solr-1.4-dev/contrib/extraction/lib/bcprov-jdk14-136.jar' to classloader (...many more...)
Solr Home is mapped to: INFO: SolrDispatchFilter.init() Nov 2, 2009 7:10:47 PM org.apache.solr.core.SolrResourceLoader locateSolrHome INFO: Using JNDI solr.home: .\webapps\apache-solr-1.4-dev\solr Nov 2, 2009 7:10:47 PM org.apache.solr.core.CoreContainer$Initializer initialize INFO: looking for solr.xml: C:\Program Files\Apache Software Foundation\Tomcat 6.0\.\webapps\apache-solr-1.4-dev\solr\solr.xml Nov 2, 2009 7:10:47 PM org.apache.solr.core.SolrResourceLoader init INFO: Solr home set to '.\webapps\apache-solr-1.4-dev\solr\' 500 Error: HTTP Status 500 - lazy loading error org.apache.solr.common.SolrException: lazy loading error at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.getWrappedHandler(RequestHandlers.java:249) at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:231) at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316) at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241) at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235) at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206) at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233) at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191) at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:433) at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128) at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102) at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109) at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:293) at org.apache.coyote.http11.Http11AprProcessor.process(Http11AprProcessor.java:859) at org.apache.coyote.http11.Http11AprProtocol$Http11ConnectionHandler.process(Http11AprProtocol.java:574) at org.apache.tomcat.util.net.AprEndpoint$Worker.run(AprEndpoint.java:1527) at java.lang.Thread.run(Unknown Source) Caused by: org.apache.solr.common.SolrException: Error loading class 'org.apache.solr.handler.extraction.ExtractingRequestHandler' at org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:373) at org.apache.solr.core.SolrCore.createInstance(SolrCore.java:413) at org.apache.solr.core.SolrCore.createRequestHandler(SolrCore.java:449) at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.getWrappedHandler(RequestHandlers.java:240) ... 17 more Caused by: java.lang.ClassNotFoundException: org.apache.solr.handler.extraction.ExtractingRequestHandler at java.net.URLClassLoader$1.run(Unknown Source) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(Unknown Source) at java.lang.ClassLoader.loadClass(Unknown Source) at
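One thing worth checking: ExtractingRequestHandler lives in the apache-solr-cell jar under dist/, not in contrib/extraction/lib (the classloader log above only shows the Tika dependencies being added), and relative <lib dir=.../> paths are resolved against the Solr home shown above, not Tomcat's working directory. Spelling the directories out absolutely removes the ambiguity; the paths below are only a guess based on the log output and are not a confirmed fix:

<lib dir="C:/Program Files/Apache Software Foundation/Tomcat 6.0/webapps/apache-solr-1.4-dev/contrib/extraction/lib" />
<lib dir="C:/Program Files/Apache Software Foundation/Tomcat 6.0/webapps/apache-solr-1.4-dev/dist" regex="apache-solr-cell-\d.*\.jar" />
<lib dir="C:/Program Files/Apache Software Foundation/Tomcat 6.0/webapps/apache-solr-1.4-dev/dist" regex="apache-solr-clustering-\d.*\.jar" />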
Re: Lucene FieldCache memory requirements
Fuad Efendi wrote: Simple field (10 different values: Canada, USA, UK, ...), 64-bit JVM... no difference between maxdoc and maxdoc + 1 for such estimate... difference is between 0.4Gb and 1.2Gb... I'm not sure I understand - but I didn't mean to imply the +1 on maxdoc meant anything. The issue is that in the end, it only needs a String array the size of String[UniqueTerms] - but because it can't easily figure out that number, it first creates an array of String[MaxDoc+1] - so with a ton of docs and a few uniques, you get a temp boost in the RAM reqs until it sizes it down. A pointer for each doc. -- - Mark http://www.lucidimagination.com
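To put rough numbers on that description for the example in this thread (100 million docs, one country field with about 10 unique values), here is a back-of-the-envelope sketch, assuming a 64-bit JVM and the FieldCache.StringIndex layout quoted earlier in the thread; the figures are estimates, not measurements:

public class FieldCacheEstimate {
  public static void main(String[] args) {
    long maxDoc = 100000000L;                 // hypothetical index size from this thread
    long ordArrayBytes   = maxDoc * 4;        // int[] order: one 4-byte ord per document (~400 MB, steady state)
    long tempLookupBytes = (maxDoc + 1) * 8;  // transient String[maxDoc+1] of references while loading (~800 MB)
    long finalLookupBytes = 11 * 8;           // trimmed lookup array: 10 countries + the null slot (negligible)
    System.out.printf("steady state ~%d MB, transient peak ~%d MB%n",
        (ordArrayBytes + finalLookupBytes) / (1024 * 1024),
        (ordArrayBytes + tempLookupBytes) / (1024 * 1024));
  }
}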
SolrJ looping until I get all the results
If I want to do a query and only return X number of rows at a time, but I want to keep querying until I get all the rows, how do I do that? Can I just keep advancing query.setStart(...) and then checking if server.query(query) returns any rows? Or is there a better way? Here's what I'm thinking:
final static int MAX_ROWS = 100;
int start = 0;
query.setRows(MAX_ROWS);
while (true) {
  QueryResponse resp = solrChunkServer.query(query);
  SolrDocumentList docs = resp.getResults();
  if (docs.size() == 0) break;
  start += MAX_ROWS;
  query.setStart(start);
}
-- http://www.linkedin.com/in/paultomblin http://careers.stackoverflow.com/ptomblin
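A sketch of one way to structure that loop, using numFound from the response to decide when to stop; solrChunkServer and the query construction are assumed from the message above:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrDocumentList;

public class PagingExample {
  static final int MAX_ROWS = 100;

  static void visitAll(SolrServer server, SolrQuery query) throws SolrServerException {
    query.setRows(MAX_ROWS);
    int start = 0;
    long numFound;
    do {
      query.setStart(start);
      QueryResponse resp = server.query(query);
      SolrDocumentList docs = resp.getResults();
      numFound = docs.getNumFound();          // total matches, so we know when to stop
      if (docs.isEmpty()) break;              // safety net if the index shrank mid-loop
      for (SolrDocument doc : docs) {
        // handle one document at a time instead of accumulating them all in memory
      }
      start += docs.size();
    } while (start < numFound);
  }
}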
RE: Lucene FieldCache memory requirements
I just did some tests on a completely new index (a slave). Sorting by a non-tokenized field with few distinct values (such as Country) takes milliseconds, but a sort (ascending) on a tokenized field with a heavy distribution took 30 seconds (initially). A second sort (descending) took milliseconds. Generic query *:*; FieldCache is not used for tokenized fields... so how is it sorted :) Fortunately, no OOM. -Fuad
RE: Lucene FieldCache memory requirements
Mark, I don't understand this: "so with a ton of docs and a few uniques, you get a temp boost in the RAM reqs until it sizes it down." Sizes down??? Why is it called a Cache then? And how does SOLR use it if it is not a cache? And this: "A pointer for each doc." Why can't we use an (int) DocumentID? For me, that is natural; a 64-bit pointer to an object in RAM is not natural (in the Lucene world)... So, is it [maxdoc] x [4 bytes], or [maxdoc] x [8 bytes]?... -Fuad
Re: adding and updating a lot of document to Solr, metadata extraction etc
About large XML files and http overhead: you can tell solr to load the file directly from a file system. This will stream thousands of documents in one XML file without loading everything in memory at once. This is a new book on Solr. It will help you through this early learning phase. http://www.packtpub.com/solr-1-4-enterprise-search-server On Mon, Nov 2, 2009 at 6:24 AM, Alexey Serba ase...@gmail.com wrote: Hi Eugene, - ability to iterate over all documents, returned in search, as Lucene does provide within a HitCollector instance. We would need to extract and aggregate various fields, stored in index, to group results and aggregate them in some way. Also I did not find any way in the tutorial to access the search results with all fields to be processed by our application. http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Faceted-Search-Solr Check out Faceted Search, probably you can achieve your goal by using Facet Component There's also Field Collapsing patch http://wiki.apache.org/solr/FieldCollapsing Alex -- Lance Norskog goks...@gmail.com
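As a concrete illustration of the point about loading the file from the file system: with remote streaming enabled in solrconfig.xml (enableRemoteStreaming="true" on the requestDispatcher), a request along these lines asks Solr to read and stream the XML from its own disk; the host, port, and path below are placeholders:

http://localhost:8983/solr/update?stream.file=/data/feeds/docs.xml&stream.contentType=text/xml;charset=utf-8&commit=true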
Re: SolrJ looping until I get all the results
On Mon, Nov 2, 2009 at 8:47 PM, Avlesh Singh avl...@gmail.com wrote: I was doing it that way, but what I'm doing with the documents is do some manipulation and put the new classes into a different list. Because I basically have two times the number of documents in lists, I'm running out of memory. So I figured if I do it 1000 documents at a time, the SolrDocumentList will get garbage collected at least. You are right w.r.t to all that but I am surprised that you would need ALL the documents from the index for a search requirement. This isn't a search, this is a search and destroy. Basically I need the file names of all the documents that I've indexed in Solr so that I can delete them. -- http://www.linkedin.com/in/paultomblin http://careers.stackoverflow.com/ptomblin
Re: SolrJ looping until I get all the results
This isn't a search, this is a search and destroy. Basically I need the file names of all the documents that I've indexed in Solr so that I can delete them. Okay. I am sure you are aware of the fl parameter which restricts the number of fields returned back with a response. If you need limited info, it might be a good idea to use this parameter. Cheers Avlesh On Tue, Nov 3, 2009 at 7:23 AM, Paul Tomblin ptomb...@xcski.com wrote: On Mon, Nov 2, 2009 at 8:47 PM, Avlesh Singh avl...@gmail.com wrote: I was doing it that way, but what I'm doing with the documents is do some manipulation and put the new classes into a different list. Because I basically have two times the number of documents in lists, I'm running out of memory. So I figured if I do it 1000 documents at a time, the SolrDocumentList will get garbage collected at least. You are right w.r.t to all that but I am surprised that you would need ALL the documents from the index for a search requirement. This isn't a search, this is a search and destroy. Basically I need the file names of all the documents that I've indexed in Solr so that I can delete them. -- http://www.linkedin.com/in/paultomblin http://careers.stackoverflow.com/ptomblin
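A small sketch of that suggestion in SolrJ terms, assuming the field holding the file name is literally called filename; combined with the paging loop from earlier in the thread, each page then carries only the one field:

import org.apache.solr.client.solrj.SolrQuery;

public class FilenameQuery {
  public static SolrQuery build() {
    SolrQuery query = new SolrQuery("*:*");
    query.setFields("filename");   // equivalent to adding fl=filename to the request
    query.setRows(1000);
    return query;
  }
}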
RE: Lucene FieldCache memory requirements
I believe this is the correct estimate: C. [maxdoc] x [4 bytes ~ (int) Lucene Document ID] - same as [String1_Document_Count + ... + String10_Document_Count + ...] x [4 bytes per DocumentID] So, for 100 million docs we need 400Mb for each(!) non-tokenized field. Although FieldCacheImpl is based on WeakHashMap (somewhere...), we can't rely on sizing down with SOLR faceting features. I think I finally found the answer...
/** Expert: Stores term text values and document ordering data. */
public static class StringIndex {
  ...
  /** All the term values, in natural order. */
  public final String[] lookup;
  /** For each document, an index into the lookup array. */
  public final int[] order;
  ...
}
Another API:
/** Checks the internal cache for an appropriate entry, and if none
 * is found, reads the term values in <code>field</code> and returns an array
 * of size <code>reader.maxDoc()</code> containing the value each document
 * has in the given field.
 * @param reader Used to get field values.
 * @param field Which field contains the strings.
 * @return The values in the given field for each document.
 * @throws IOException If any error occurs.
 */
public String[] getStrings (IndexReader reader, String field) throws IOException;
Looks similar; the cache size is [maxdoc]; however, the values stored are 8-byte pointers on a 64-bit JVM.
private Map<Class<?>,Cache> caches;
private synchronized void init() {
  caches = new HashMap<Class<?>,Cache>(7);
  ...
  caches.put(String.class, new StringCache(this));
  caches.put(StringIndex.class, new StringIndexCache(this));
  ...
}
StringCache and StringIndexCache use WeakHashMap internally... but the objects won't ever be garbage collected in a faceted production system... SOLR SimpleFacets doesn't use the getStrings API, so the hope is that memory requirements are minimized. However, Lucene may use it internally for some queries (or, for instance, to get access to a non-tokenized cached field without reading the index)... To be safe, use this in your basic memory estimates: [512Mb ~ 1Gb] + [non_tokenized_fields_count] x [maxdoc] x [8 bytes] -Fuad -Original Message- From: Fuad Efendi [mailto:f...@efendi.ca] Sent: November-02-09 7:37 PM To: solr-user@lucene.apache.org Subject: RE: Lucene FieldCache memory requirements Simple field (10 different values: Canada, USA, UK, ...), 64-bit JVM... no difference between maxdoc and maxdoc + 1 for such an estimate... the difference is between 0.4Gb and 1.2Gb... So, let's vote ;) A. [maxdoc] x [8 bytes ~ pointer to String object] B. [maxdoc] x [8 bytes ~ pointer to Document object] C. [maxdoc] x [4 bytes ~ (int) Lucene Document ID] - same as [String1_Document_Count + ... + String10_Document_Count] x [4 bytes ~ DocumentID] D. [maxdoc] x [4 bytes + 8 bytes ~ my initial naive thinking...] Please confirm that it is a Pointer to Object and not a Lucene Document ID... I hope it is an (int) Document ID... -Original Message- From: Mark Miller [mailto:markrmil...@gmail.com] Sent: November-02-09 6:52 PM To: solr-user@lucene.apache.org Subject: Re: Lucene FieldCache memory requirements It also briefly requires more memory than just that - it allocates an array the size of maxdoc+1 to hold the unique terms - and then sizes down. Possibly we can use the getUniqueTermCount method in the flexible indexing branch to get rid of that - which is why I was thinking it might be a good idea to drop the unsupported exception in that method for things like multi readers and just do the work to get the right number (currently there is a comment that the user should do that work if necessary, making the call unreliable for this).
Fuad Efendi wrote: Thank you very much Mike, I found it: org.apache.solr.request.SimpleFacets ... // TODO: future logic could use filters instead of the fieldcache if // the number of terms in the field is small enough. counts = getFieldCacheCounts(searcher, base, field, offset,limit, mincount, missing, sort, prefix); ... FieldCache.StringIndex si = FieldCache.DEFAULT.getStringIndex(searcher.getReader(), fieldName); final String[] terms = si.lookup; final int[] termNum = si.order; ... So that 64-bit requires more memory :) Mike, am I right here? [(8 bytes pointer) + (4 bytes DocID)] x [Number of Documents (100mlns)] (64-bit JVM) 1.2Gb RAM for this... Or, may be I am wrong: For Lucene directly, simple strings would consume an pointer (4 or 8 bytes depending on whether your JRE is 64bit) per doc, and the string index would consume an int (4 bytes) per doc. [8 bytes (64bit)] x [number of documents (100mlns)]? 0.8Gb Kind of Map between String and DocSet, saving 4 bytes... Key is String, and Value is array of 64-bit pointers to Document. Why 64-bit (for
RE: Lucene FieldCache memory requirements
Hi Mark, Yes, I understand it now; however, how will StringIndexCache size down in a production system faceting by Country on a homepage? This is SOLR specific... Lucene specific: Lucene doesn't read from disk if it can retrieve a field value for a specific document ID from the cache. How will it size down in a purely Lucene-based, heavily loaded production system? Especially if this cache is used for query optimizations. -Original Message- From: Mark Miller [mailto:markrmil...@gmail.com] Sent: November-02-09 8:53 PM To: solr-user@lucene.apache.org Subject: Re: Lucene FieldCache memory requirements
static final class StringIndexCache extends Cache {
  StringIndexCache(FieldCache wrapper) {
    super(wrapper);
  }

  @Override
  protected Object createValue(IndexReader reader, Entry entryKey) throws IOException {
    String field = StringHelper.intern(entryKey.field);
    final int[] retArray = new int[reader.maxDoc()];
    String[] mterms = new String[reader.maxDoc()+1];
    TermDocs termDocs = reader.termDocs();
    TermEnum termEnum = reader.terms(new Term(field));
    int t = 0;  // current term number

    // an entry for documents that have no terms in this field
    // should a document with no terms be at top or bottom?
    // this puts them at the top - if it is changed, FieldDocSortedHitQueue
    // needs to change as well.
    mterms[t++] = null;

    try {
      do {
        Term term = termEnum.term();
        if (term == null || term.field() != field) break;

        // store term text
        // we expect that there is at most one term per document
        if (t >= mterms.length) throw new RuntimeException ("there are more terms than " +
            "documents in field \"" + field + "\", but it's impossible to sort on " +
            "tokenized fields");
        mterms[t] = term.text();

        termDocs.seek(termEnum);
        while (termDocs.next()) {
          retArray[termDocs.doc()] = t;
        }

        t++;
      } while (termEnum.next());
    } finally {
      termDocs.close();
      termEnum.close();
    }

    if (t == 0) {
      // if there are no terms, make the term array
      // have a single null entry
      mterms = new String[1];
    } else if (t < mterms.length) {
      // if there are less terms than documents,
      // trim off the dead array space
      String[] terms = new String[t];
      System.arraycopy(mterms, 0, terms, 0, t);
      mterms = terms;
    }

    StringIndex value = new StringIndex(retArray, mterms);
    return value;
  }
};
The formula for a StringIndex FieldCache entry is essentially the String array of unique terms (which does indeed size down at the bottom) and the int array indexing into the String array. Fuad Efendi wrote: To be correct, I analyzed FieldCache awhile ago and I believed it never sizes down...
/**
 * Expert: The default cache implementation, storing all values in memory.
 * A WeakHashMap is used for storage.
 *
 * <p>Created: May 19, 2004 4:40:36 PM
 *
 * @since lucene 1.4
 */
Will it size down? Only if we are not faceting (as in SOLR v.1.3)... And I am still unsure: Document ID vs. Object Pointer. I don't understand this: "so with a ton of docs and a few uniques, you get a temp boost in the RAM reqs until it sizes it down." Sizes down??? Why is it called a Cache then? And how does SOLR use it if it is not a cache? -- - Mark http://www.lucidimagination.com
RE: Lucene FieldCache memory requirements
Even in simplistic scenario, when it is Garbage Collected, we still _need_to_be_able_ to allocate enough RAM to FieldCache on demand... linear dependency on document count... Hi Mark, Yes, I understand it now; however, how will StringIndexCache size down in a production system faceting by Country on a homepage? This is SOLR specific... Lucene specific: Lucene doesn't read from disk if it can retrieve field value for a specific document ID from cache. How will it size down in purely Lucene-based heavy-loaded production system? Especially if this cache is used for query optimizations.
Re: Why does BinaryRequestWriter force the path to be base URL + /update/javabin
Yup, that can be relaxed. It was just a convention. On Tue, Nov 3, 2009 at 5:24 AM, Stuart Tettemer stette...@gmail.com wrote: Hi folks, First of all, thanks for Solr. It is a great piece of work. I have a question about BinaryRequestWriter in the solrj project. Why does it force the path of UpdateRequests to be /update/javabin (see BinaryRequestWriter.getPath(String) starting on line 109)? I am extending BinaryRequestWriter specifically to remove this requirement and am interested to know the reasoning behind the initial choice. Thanks for your time, Stuart -- - Noble Paul | Principal Engineer| AOL | http://aol.com
Re: Question regarding snapinstaller
In POSIX-compliant systems (basically, Unix system calls) a file exists independently of its file names, and there can be multiple names for a file. If a program has a file open, that file can be deleted, but it will still exist until the program closes it (or exits). In the snapinstaller cycle, Solr holds the old index files open while snapinstaller swaps in the new set. The 'commit' operation causes Solr to (eventually) close all of the old index files, and at that point they will go away. On Mon, Nov 2, 2009 at 1:26 PM, Prasanna Ranganathan pranganat...@netflix.com wrote: It looks like the snapinstaller script does an atomic remove and replace of the entire solr_home/data_dir/index folder with the contents of the new snapshot before issuing a commit command. I am trying to understand the implications of this. What happens to queries that come in during the interval between the instant the existing directory is removed and the commit command gets finalized? Does a currently running instance of Solr not need the files in the index folder to serve query results? Are all the contents of the index folder loaded into memory? Thanks in advance for any help. Regards, Prasanna. -- Lance Norskog goks...@gmail.com
Re: Annotations and reference types
I guess this is not a very good idea. The document itself is a flat data structure; it is hard to see it as a nested data structure. If allowed, how deep would we wish to make it? The simple solution would be to write setters for b_id and b_name in class A, and the setters can inject the values into B. On Mon, Nov 2, 2009 at 10:05 PM, Shalin Shekhar Mangar shalinman...@gmail.com wrote: On Thu, Oct 29, 2009 at 7:57 PM, M. Tinnemeyer marc-...@gmx.net wrote: Dear listusers, Is there a way to store an instance of class A (including the fields from myB) via solr using annotations? The index should look like: id; name; b_id; b_name --
Class A {
  @Field private String id;
  @Field private String name;
  @Field private B myB;
}
--
Class B {
  @Field("b_id") private String id;
  @Field("B_name") private String name;
}
No. I guess you want to represent certain fields of class B and have them as an attribute in class A (but all fields belong to the same schema); then it can be a worthwhile addition to Solrj. Can you open an issue? A patch would be even better :) -- Regards, Shalin Shekhar Mangar.
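A sketch of the setter-based workaround suggested at the top of this message: keep the bean flat for Solrj and let annotated setters rebuild the nested B instance. The setter names are illustrative, and B is assumed to gain plain setters and a no-arg constructor (the @Field annotation can be placed on setter methods as well as on fields):

import org.apache.solr.client.solrj.beans.Field;

public class A {
  @Field private String id;
  @Field private String name;

  private final B myB = new B();   // rebuilt from the flat b_* fields below

  @Field("b_id")
  public void setBId(String bId) { myB.setId(bId); }

  @Field("b_name")
  public void setBName(String bName) { myB.setName(bName); }

  public B getMyB() { return myB; }
}

class B {
  private String id;
  private String name;
  void setId(String id) { this.id = id; }
  void setName(String name) { this.name = name; }
}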
Re: field queries seem slow
This searches author:albert and (default text field):einstein. This may not be what you expect. On Mon, Nov 2, 2009 at 2:30 PM, Erick Erickson erickerick...@gmail.com wrote: H, are you sorting? And have your readers been reopened? Is the second query of that sort also slow? If the answer to this last question is no, have you tried some autowarming queries? Best Erick On Mon, Nov 2, 2009 at 4:34 PM, mike anderson saidthero...@gmail.com wrote: I took a look through my Solr logs this weekend and noticed that the longest queries were on particular fields, like author:albert einstein. Is this result consistent with other setups out there? If not, is there a trick to make these go faster? I've read up on filter queries and use those when applicable, but they don't really solve all my problems. If anybody wants to take a shot at it but needs to see my solrconfig, etc., just let me know. Cheers, Mike -- Lance Norskog goks...@gmail.com
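To spell that point out: in the Lucene/Solr query syntax the field prefix only binds to the term immediately after it, so author:albert einstein sends einstein to the default field. Quoting the phrase or grouping the terms keeps both in the author field; the sketch below uses SolrJ, but the same query strings work in a raw q parameter:

import org.apache.solr.client.solrj.SolrQuery;

public class AuthorQueries {
  public static SolrQuery phrase() {
    return new SolrQuery("author:\"albert einstein\"");     // exact phrase in the author field
  }
  public static SolrQuery bothTerms() {
    return new SolrQuery("author:(albert AND einstein)");    // both terms, any order, same field
  }
}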
Re: tracking solr response time
On Mon, Nov 2, 2009 at 2:21 PM, bharath venkatesh bharathv6.proj...@gmail.com wrote: we observed many times there is huge mismatch between qtime and time measured at the client for the response Long times to stream back the result to the client could be due to - client not reading fast enough - network congestion - reading the stored fields takes a long time - this can happen with really big indexes that can't all fit in memory, and stored fields tend to not be cached well by the OS (essentially random access patterns over a huge area). This ends up causing a disk seek per document being streamed back. - locking contention for reading the index (under Solr 1.3, but not under 1.4 on non-windows platforms) I didn't see where you said what Solr version you were using. There are some pretty big concurrency differences between 1.3 and 1.4 too (if your tests involve many concurrent requests). -Yonik http://www.lucidimagination.com
RE: Lucene FieldCache memory requirements
FieldCache internally uses a WeakHashMap... nothing wrong with that, but... no amount of Garbage Collection tuning will help if the allocated RAM is not enough to keep those Weak** entries around as if they were Strong**, especially for SOLR faceting... 10%-15% of CPU taken by GC has been reported... -Fuad
Proper way to set up Multi Core / Core admin
Getting started with multi core setup following http://wiki.apache.org/solr/CoreAdmin and the book. Generally everything makes sense, but I have one question. Here's how easy it was:
1. place the solr.war into the server
2. create your core directories in the newly created solr/ directory
3. set up solr.xml, the config files for a data import handler, the [core]/conf/solrconfig.xml, [core]/conf/schema.xml, etc
4. copy the /admin directory present in /solr into each /solr/[core] directory
Is step 4 a correct step in the setting up of a multi core environment? TIA
Re: Proper way to set up Multi Core / Core admin
Sorry for the confusion - step four is to be avoided, obviously. On Nov 2, 2009, at 11:46 PM, Jonathan Hendler wrote: Getting started with multi core setup following http://wiki.apache.org/solr/CoreAdmin and the book. Generally everything makes sense, but I have one question. Here's how easy it was: place the solr.war into the server create your core directories in the newly created solr/ directory set up solr.xml, the config files for a data import handler, the [core]/conf/solrconfig.xml [core]/conf/schema.xml, etc copy the /admin directory present in /solr into each /solr/[core] directory Is step 4 a correct step in the setting up of a multi core environment? TIA
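For reference, a minimal multi core layout only needs solr.xml plus a conf/ directory per core; the admin pages are served by the webapp itself, so nothing is copied into the core directories. A sketch with two hypothetical cores:

<solr persistent="true">
  <cores adminPath="/admin/cores">
    <core name="core0" instanceDir="core0" />
    <core name="core1" instanceDir="core1" />
  </cores>
</solr>

Each instanceDir then holds its own conf/solrconfig.xml and conf/schema.xml, and the cores are reachable at /solr/core0/... and /solr/core1/...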
Re: Match all terms in doc
On Sun, Nov 1, 2009 at 3:33 AM, Magnus Eklund magnus.ekl...@gmail.com wrote: Hi How do I restrict hits to documents containing all words (regardless of order) of a query in a particular field? Suppose I have two documents with a field called name in my index: doc1 = name: Pink doc2 = name: Pink Floyd When querying for Pink I want only doc1, and when querying for Pink Floyd or Floyd Pink I want doc2. You can query like: +name:Floyd +name:Pink The + character means a must-have condition. This will match documents which have Floyd as well as Pink, in any order. -- Regards, Shalin Shekhar Mangar.
Re: solrj query size limit?
Did you hit the limit for maximum number of characters in a GET request? Cheers Avlesh On Tue, Nov 3, 2009 at 9:36 AM, Gregg Horan greggho...@gmail.com wrote: I'm constructing a query using solrj that has a fairly large number of 'OR' clauses. I'm just adding it as a big string to setQuery(), in the format accountId:(this OR that OR yada). This works all day long with 300 values. When I push it up to 350-400 values, I get a Bad Request SolrServerException. It appears to just be a client error - nothing reaching the server logs. Very repeatable... dial it back down and it goes through again fine. The total string length of the query (including a handful of other faceting entries) is about 9500chars. I do have the maxBooleanClauses jacked up to 2048. Using javabin. 1.4-dev. Are there any other options or settings I might be overlooking? -Gregg
Re: Problems downloading lucene 2.9.1
Thanks guys !!! 2009/11/2 Ryan McKinley ryan...@gmail.com On Nov 2, 2009, at 8:29 AM, Grant Ingersoll wrote: On Nov 2, 2009, at 12:12 AM, Licinio Fernández Maurelo wrote: Hi folks, as we are using a snapshot dependency on solr1.4, today we are getting problems when maven tries to download lucene 2.9.1 (there isn't any 2.9.1 there). Which repository can I use to download it? They won't be there until 2.9.1 is officially released. We are trying to speed up the Solr release by piggybacking on the Lucene release, but this little bit is the one downside. Until then, you can add a repo to: http://people.apache.org/~mikemccand/staging-area/rc3_lucene2.9.1/maven/ -- Lici
Re: Problems downloading lucene 2.9.1
Well, I've solved this problem by executing mvn install:install-file -DgroupId=org.apache.lucene -DartifactId=lucene-analyzers -Dversion=2.9.1 -Dpackaging=jar -Dfile=path_to_jar for each lucene-* artifact. I think there must be an easier way to do this, am I wrong? Hope it helps Thx On 3 November 2009 at 08:03, Licinio Fernández Maurelo licinio.fernan...@gmail.com wrote: Thanks guys !!! 2009/11/2 Ryan McKinley ryan...@gmail.com On Nov 2, 2009, at 8:29 AM, Grant Ingersoll wrote: On Nov 2, 2009, at 12:12 AM, Licinio Fernández Maurelo wrote: Hi folks, as we are using a snapshot dependency on solr1.4, today we are getting problems when maven tries to download lucene 2.9.1 (there isn't any 2.9.1 there). Which repository can I use to download it? They won't be there until 2.9.1 is officially released. We are trying to speed up the Solr release by piggybacking on the Lucene release, but this little bit is the one downside. Until then, you can add a repo to: http://people.apache.org/~mikemccand/staging-area/rc3_lucene2.9.1/maven/ -- Lici -- Lici
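An alternative to installing each jar by hand is to declare the staging repository Ryan pointed to earlier in your pom.xml until 2.9.1 reaches the central repository; a sketch (the repository id is arbitrary):

<repositories>
  <repository>
    <id>lucene-2.9.1-staging</id>
    <url>http://people.apache.org/~mikemccand/staging-area/rc3_lucene2.9.1/maven/</url>
  </repository>
</repositories>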