Re: Solr performance issue
I've definitely had cases in 1.4.1 where, even without an OOM error, Solr was being weirdly slow, and increasing the JVM heap size fixed it. I can't explain why it happened, or exactly how you'd know this was going on; I didn't see anything odd in the logs. I just tried increasing the JVM heap to see what happened, and it worked great. The one case I remember specifically was using the StatsComponent with a stats.facet: pathologically slow, and increasing the heap magically made response time negligible again.

On 3/14/2011 3:38 PM, Markus Jelsma wrote:

Hello,

2011/3/14 Markus Jelsma <markus.jel...@openindex.io>

Hi Doğacan, Are you, at some point, running out of heap space? In my experience, that's the common cause of increased load and excessively high response times (or timeouts).

How much heap would be enough? Our index size is growing slowly, but we did not have this problem a couple of weeks ago, when the index was maybe 100MB smaller.

It isn't easy to say how much heap space is needed. It usually needs to be increased when you run out of memory and get those nasty OOM errors; are you getting them? Replication events will increase heap usage due to cache-warming queries and autowarming.

We left most of the caches in solrconfig at their defaults and only increased filterCache to 1024. We only ask for ids (which are unique) and no other fields during queries (though we do faceting). Btw, 1.6GB of our index is stored fields (we store everything for now, even though we do not fetch them during queries), and about 1GB is index.

Hmm, it seems 4000 would be enough indeed. What about the fieldCache: are there a lot of entries? Is there an insanity count? Do you use boost functions? It might not have anything to do with memory at all; I'm just asking. There may be a bug in your revision causing this.

Anyway, Xmx was 4000m; we tried increasing it to 8000m but did not get any improvement in load.
I can try monitoring with JConsole with 8 gigs of heap to see if it helps. Cheers,

Hello everyone,

First of all, here is our Solr setup:
- Solr nightly build 986158
- Running Solr inside the default Jetty that comes with the Solr build
- 1 write-only master, 4 read-only slaves (quad-core 5640 with 24GB of RAM)
- Index replicated (on optimize) to slaves via Solr replication
- Size of index is around 2.5GB
- No incremental writes; the index is created from scratch (delete old documents - commit new documents - optimize) every 6 hours
- Avg # of requests per second is around 60 (for a single slave)
- Avg time per request is around 25ms (before having problems)
- Load on each slave is around 2

We have been using this setup for months without any problem. However, last week we started to experience very weird performance problems:
- Avg time per request increased from 25ms to 200-300ms (even higher if we don't restart the slaves)
- Load on each slave increased from 2 to 15-20 (Solr uses 400%-600% CPU)

When we profile Solr we see two very strange things:

1 - This is the jconsole output: https://skitch.com/meralan/rwwcf/mail-886x691 As you can see, GC runs every 10-15 seconds and collects more than 1GB of memory. (Actually, if you wait more than 10 minutes you see spikes up to 4GB consistently.)

2 - This is the newrelic output: https://skitch.com/meralan/rwwci/solr-requests-solr-new-relic-rpm As you can see, Solr spends a ridiculously long time in the SolrDispatchFilter.doFilter() method.

Apart from these, when we clean the index directory, re-replicate, and restart each slave one by one, we see some relief in the system, but after some time the servers start to melt down again. Although deleting the index and re-replicating doesn't solve the problem, we think these problems are somehow related to replication, because the symptoms started after a replication and the system temporarily heals itself after re-replication.
I also see lucene-write.lock files on the slaves (we don't have write.lock files on the master), which I think we shouldn't see. If anyone has any ideas, we would appreciate them. Regards, Dogacan Guney
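For reference, the filterCache change mentioned in this thread lives in solrconfig.xml. A sketch of the relevant entry (the class and the autowarm/initial sizes here are illustrative assumptions, not a recommendation):

```xml
<!-- solrconfig.xml: filterCache enlarged from the default, as described above.
     Sizes below are illustrative only. -->
<filterCache class="solr.FastLRUCache"
             size="1024"
             initialSize="512"
             autowarmCount="128"/>
```

Note that larger caches and higher autowarmCount both increase heap pressure during the post-replication warming that this thread suspects.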
Re: Solr performance issue
It's actually, as I understand it, expected JVM behavior for the heap to rise close to its limit before it gets GC'd; that's how Java GC works. Whether that should happen every 20 seconds or what, I don't know. Another option is setting better JVM garbage-collection arguments, so GC doesn't stop the world so often. I have had good luck with my Solr using this: -XX:+UseParallelGC

On 3/14/2011 4:15 PM, Doğacan Güney wrote:

Hello again,

2011/3/14 Markus Jelsma <markus.jel...@openindex.io>

[earlier heap discussion, quoted above, trimmed]

...It usually needs to be increased when you run out of memory and get those nasty OOM errors, are you getting them?

Nope, no OOM errors.

...What about the fieldCache, are there a lot of entries? Is there an insanity count? Do you use boost functions?

Insanity count is 0 and fieldCache has 12 entries. We do use some boosting functions. Btw, I am monitoring output via jconsole with 8GB of RAM and it still goes to 8GB every 20 seconds or so; GC runs, and it falls back down to 1GB.
Btw, our current revision was just a random choice, but up until two weeks ago it had been rock-solid, so we have been reluctant to update to another version. Would you recommend upgrading to latest trunk?

It might not have anything to do with memory at all; I'm just asking. There may be a bug in your revision causing this.

Anyway, Xmx was 4000m; we tried increasing it to 8000m but did not get any improvement in load. I can try monitoring with JConsole with 8 gigs of heap to see if it helps. Cheers,

[original setup message, quoted in full earlier in this thread, trimmed]
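A sketch of the JVM arguments discussed in this thread (the 8 GB heap ceiling plus the parallel collector suggested above); the exact flags and sizes are illustrative and should be checked against your own JVM version's documentation:

```shell
# Illustrative Solr start command: fixed 8 GB heap plus parallel GC,
# with GC logging so jconsole observations can be cross-checked.
java -Xms8g -Xmx8g \
     -XX:+UseParallelGC \
     -verbose:gc -XX:+PrintGCDetails \
     -jar start.jar
```

Pinning -Xms to -Xmx avoids heap-resize pauses; the sawtooth pattern described in the thread (heap climbs to the limit, then drops after a collection) is normal, and only its frequency and pause length are worth tuning.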
Re: dismax query - difference qf qs / pf ps
On 3/10/2011 8:15 AM, Gastone Penzo wrote: Thank you very much. I understand the difference between qs and ps, but not what pf is... is it necessary to use ps?

It's not necessary to use anything, including Solr.

pf: takes the entire query the user entered, makes it into a single phrase, and boosts documents within the already existing result set that match that phrase. pf does not change the result set; it only changes the ranking.

ps: sets the phrase-query slop on that pf query of the entire entered search string, which affects the boosting.
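To make the distinction concrete, here is a hypothetical dismax configuration sketch; the handler name, field names, and boosts are invented for illustration:

```xml
<!-- Hypothetical dismax defaults. qf/qs apply to the matching query itself;
     pf/ps only re-rank the existing result set via an implicit phrase query. -->
<requestHandler name="/search" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <str name="qf">title^2.0 body</str>     <!-- fields that select documents -->
    <str name="qs">1</str>                  <!-- slop for explicit phrases in q -->
    <str name="pf">title^5.0 body^2.0</str> <!-- whole-query phrase boost -->
    <str name="ps">3</str>                  <!-- slop for that pf phrase -->
  </lst>
</requestHandler>
```

With this sketch, a query like "solr performance" matches via qf, and any document containing the whole phrase "solr performance" (within 3 positions, per ps) in title or body is boosted toward the top without changing which documents match.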
Re: True master-master fail-over without data gaps
On 3/9/2011 12:05 PM, Otis Gospodnetic wrote: But check this! In some cases one is not allowed to save content to disk (think copyrights). I'm not making this up; we actually have a customer with this "cannot save to disk (but can index)" requirement.

Do they realize that a Solr index is on disk, and that if you save content to a Solr index it's being saved to disk? If they prohibited you from putting the doc in a stored field in Solr, I guess that would at least be somewhat consistent, although annoying. But I don't think it's our customers' job to tell us HOW to implement our software to get the results they want. They can certainly make you promise not to distribute or use copyrighted material, and they can even ask to see your security procedures to make sure it doesn't get out. But if you need to buffer documents to achieve the application they want, and they won't let you... Solr can't help you with that.

As I suggested before, though, I might rather buffer to a NoSQL store like MongoDB or CouchDB instead of directly to disk. Perhaps your customer won't notice that those stores keep data on disk, just like they haven't noticed Solr does. I am not an expert in the various kinds of NoSQL stores, but I think some of them in fact specialize in the area of concern here: absolute failover reliability through replication.

Solr is not a store. So buffering to disk is not an option, and buffering in memory is not practical because of the input document rate and their size. Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/

From: Otis Gospodnetic [otis_gospodne...@yahoo.com] Sent: Tuesday, March 08, 2011 11:45 PM To: solr-user@lucene.apache.org Subject: True master-master fail-over without data gaps

Hello, What are some common or good ways to handle indexing (master) fail-over?
Imagine you have a continuous stream of incoming documents that you have to index without losing any of them (or losing as few of them as possible). How do you set up your masters? In other words, you can't just have 2 masters where the secondary is the repeater (or slave) of the primary master and replicates the index periodically: you need 2 masters that are in sync at all times! How do you achieve that?

* Do you just put N masters behind an LB VIP, configure them both to point to the index on some shared storage (e.g. a SAN), and count on the LB to fail over to the secondary master when the primary becomes unreachable? If so, how do you deal with index locks? Do you use the Native lock and count on it disappearing when the primary master goes down? That means you count on the whole JVM process dying, which may not be the case...

* Or do you use tools like DRBD, Corosync, Pacemaker, etc. to keep 2 masters with 2 separate indices in sync, while making sure you write to only 1 of them, via LB VIP or otherwise?

* Or ...

This thread is on a similar topic, but is inconclusive: http://search-lucene.com/m/aOsyN15f1qd1 Here is another similar thread, but this one doesn't cover how 2 masters are kept in sync at all times: http://search-lucene.com/m/aOsyN15f1qd1

Thanks, Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/
Re: Excluding results from more like this
Yeah, that just restricts what items are in your main result set (and adding -4 has no real effect). The more-like-this set is constructed from your main result set, for each document in it. As far as I can see from here: http://wiki.apache.org/solr/MoreLikeThis there seems to be no built-in way to customize the 'more like this' results in the way you want, excluding certain document ids. I don't entirely understand what mlt.boost does, but I don't think it does anything useful for this case. So, if that's so, you are out of luck unless you want to write Java code, in which case you could try customizing or adding that feature to the MoreLikeThis search component, and either suggest your new code back as a patch or just use your own customized version of MoreLikeThis.

On 3/9/2011 4:29 PM, Brian Lamb wrote: That doesn't seem to do it. Record 4 is still showing up in the MoreLikeThis results.

On Wed, Mar 9, 2011 at 4:12 PM, Otis Gospodnetic <otis_gospodne...@yahoo.com> wrote: Brian, ...?q=id:(2 3 5) -4 Otis --- Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/

- Original Message From: Brian Lamb <brian.l...@journalexperts.com> To: solr-user@lucene.apache.org Sent: Wed, March 9, 2011 4:05:10 PM Subject: Excluding results from more like this

Hi all, I'm using MoreLikeThis to find similar results, but I'd like to exclude records by id number. For example, I use the following URL: http://localhost:8983/solr/search/?q=id:(2 3 5)&mlt=true&mlt.fl=description,id&fl=*,score How would I exclude record 4 from the MoreLikeThis results? I tried http://localhost:8983/solr/search/?q=id:(2 3 5)&mlt=true&mlt.fl=description,id&fl=*,score&mlt.q=!4 but that still returned record 4 in the MoreLikeThis results.
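Short of patching the MoreLikeThis component as suggested above, the exclusion can also be done client-side after the response comes back. A minimal sketch (the response structure here is a simplified stand-in for Solr's actual moreLikeThis JSON section, and `exclude_ids` is an invented parameter):

```python
def filter_mlt(mlt_section, exclude_ids):
    """Drop excluded ids from each per-document 'more like this' list.

    mlt_section maps a source doc id to its list of similar docs,
    mirroring (in simplified form) the moreLikeThis block of a Solr
    JSON response.
    """
    excluded = set(exclude_ids)
    return {
        source_id: [d for d in docs if d["id"] not in excluded]
        for source_id, docs in mlt_section.items()
    }

# Hypothetical response fragment: record 4 should be dropped everywhere.
mlt = {
    "2": [{"id": "4"}, {"id": "7"}],
    "3": [{"id": "5"}],
    "5": [{"id": "4"}],
}
cleaned = filter_mlt(mlt, ["4"])
```

The drawback relative to a server-side patch is that you may get back fewer similar documents than requested per source, since filtering happens after Solr has already truncated each list.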
Re: Same index is ranking differently on 2 machines
Yes, but the identical index with the identical solrconfig.xml and the identical query and the identical version of Solr on two different machines should produce identical results. So it's a legitimate question why it's not. But perhaps queryNorm isn't enough to answer that; sorry, it's out of my league to try and figure it out. But are you absolutely sure you have identical indexes, identical solrconfig.xml, identical queries, and identical versions of Solr and any other installed Java libraries... on both machines? One of these being different seems more likely than a bug in Solr, although that's possible.

On 3/9/2011 4:34 PM, Jayendra Patil wrote: queryNorm is just a normalizing factor and is the same value across all the results for a query, just to make the scores comparable. So even if it varies between environments, you should not be worried about it. http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/Similarity.html#formula_queryNorm - Definition - queryNorm(q) is just a normalizing factor used to make scores between queries comparable. This factor does not affect document ranking (since all ranked documents are multiplied by the same factor), but rather just attempts to make scores from different queries (or even different indexes) comparable. Regards, Jayendra

On Wed, Mar 9, 2011 at 4:22 PM, Allistair Crossley <a...@roxxor.co.uk> wrote: Hi, I am seeing an issue I do not understand and hope that someone can shed some light on it. The issue is that for a particular search we are seeing a particular result rank in position 3 on one machine and position 8 on the production machine. Position 3 is our desired and roughly expected ranking. I have a local machine with Solr and a version deployed on a production server. My local machine's Solr and the production version are both checked out from our project's SVN trunk. They are identical files except for the data files (not in SVN) and database connection settings.
The index is populated exclusively via data-import-handler queries to a database. I have exported the production database as-is to my local development machine, so that my local machine and production have access to the selfsame data. I execute a total full-import on both. Still, I see a different position for this document, which should surely rank in the same location, all else being equal. I ran a debugQuery diff to see how the scores were being computed; see the appendix at the foot of this email. As far as I can tell, every single query-normalisation block of the debug is marginally different, e.g.

-0.021368012 = queryNorm (local)
+0.009944122 = queryNorm (production)

which leads to a different final score, enough to skew the results from correct to incorrect (in terms of what we expect to see):

-2.286596 (local)
+1.0651637 (production)

I cannot explain this difference. The database is the same. The configuration is the same. I have fully imported from scratch on both servers. What am I missing?
Thank you for your time, Allistair

---- snip APPENDIX ---- debugQuery=on DIFF

--- untitled
+++ (clipboard)
@@ -1,51 +1,49 @@
-<str name="L12411p">
+<str name="L12411">
-2.286596 = (MATCH) sum of:
-  1.6891675 = (MATCH) sum of:
-    1.3198489 = (MATCH) max plus 0.01 times others of:
-      0.023022119 = (MATCH) weight(text:dubai^0.1 in 1551), product of:
-        0.011795795 = queryWeight(text:dubai^0.1), product of:
-          0.1 = boost
+1.0651637 = (MATCH) sum of:
+  0.7871359 = (MATCH) sum of:
+    0.6151879 = (MATCH) max plus 0.01 times others of:
+      0.10713901 = (MATCH) weight(text:dubai in 1551), product of:
+        0.05489459 = queryWeight(text:dubai), product of:
           5.520305 = idf(docFreq=65, maxDocs=6063)
-          0.021368012 = queryNorm
+          0.009944122 = queryNorm
         1.9517226 = (MATCH) fieldWeight(text:dubai in 1551), product of:
           1.4142135 = tf(termFreq(text:dubai)=2)
           5.520305 = idf(docFreq=65, maxDocs=6063)
           0.25 = fieldNorm(field=text, doc=1551)
-      1.3196187 = (MATCH) weight(profile:dubai^2.0 in 1551), product of:
-        0.32609802 = queryWeight(profile:dubai^2.0), product of:
+      0.6141165 = (MATCH) weight(profile:dubai^2.0 in 1551), product of:
+        0.15175761 = queryWeight(profile:dubai^2.0), product of:
           2.0 = boost
           7.6305184 = idf(docFreq=7, maxDocs=6063)
-          0.021368012 = queryNorm
+          0.009944122 = queryNorm
         4.0466933 = (MATCH) fieldWeight(profile:dubai in 1551), product of:
           1.4142135 = tf(termFreq(profile:dubai)=2)
           7.6305184 = idf(docFreq=7, maxDocs=6063)
           0.375 = fieldNorm(field=profile, doc=1551)
-    0.36931866 = (MATCH) max plus 0.01 times others of:
-      0.0018293816 = (MATCH) weight(text:product^0.1 in 1551), product of:
-        0.003954251 = queryWeight(text:product^0.1), product of:
-          0.1 = boost
+    0.17194802 = (MATCH) max
Re: Same index is ranking differently on 2 machines
Wait, if you don't have identical indexes, then why would you expect identical results? If your indexes are different, one would expect the results for the same query to be different: there are different documents in the index! The IDF portion of the TF-IDF-type algorithm at the base of Solr's relevancy will also be different in different indexes. http://en.wikipedia.org/wiki/Tf%E2%80%93idf Maybe I'm misunderstanding you. But if you have different indexes, i.e. not exactly the same collection of documents indexed using exactly the same field definitions and rules, then one should expect different relevance results. Jonathan

On 3/9/2011 4:48 PM, Allistair Crossley wrote: That's what I think; glad I am not going mad. I've spent half a day comparing the config files, checking out from SVN again, and ensuring the databases are identical. I cannot see what else I can do to make them equivalent. Both servers check out directly from SVN; I am convinced the files are the same. The database is definitely the same. Not sure what you mean about having identical indices: that's my problem, I don't. Or do you mean something else I've missed? But yes, everything else you mention is identical, as certain as I can be. I too think there must be a difference I have missed, but I have run out of ideas for what to check! Frustrating :)

On Mar 9, 2011, at 4:38 PM, Jonathan Rochkind wrote: Yes, but the identical index with the identical solrconfig.xml and the identical query and the identical version of Solr on two different machines should produce identical results. So it's a legitimate question why it's not. But perhaps queryNorm isn't enough to answer that. Sorry, it's out of my league to try and figure it out. But are you absolutely sure you have identical indexes, identical solrconfig.xml, identical queries, and identical versions of Solr and any other installed Java libraries... on both machines?
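The idf figures in the debug output above can be reproduced with Lucene's classic similarity formula, idf = 1 + ln(numDocs / (docFreq + 1)), which shows why even slightly different document collections produce different scores. This is a back-of-the-envelope check against the posted explain output, not the full scoring pipeline:

```python
import math

def lucene_idf(doc_freq, num_docs):
    # Classic Lucene DefaultSimilarity: idf = 1 + ln(numDocs / (docFreq + 1))
    return 1.0 + math.log(num_docs / (doc_freq + 1.0))

# Figures taken from the debugQuery appendix in this thread:
idf_text = lucene_idf(65, 6063)      # text:dubai
idf_profile = lucene_idf(7, 6063)    # profile:dubai
```

Both values match the explain output (5.520305 and 7.6305184) to the printed precision; if docFreq or maxDocs differ between two indexes, so does every idf, and hence the ranking.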
Re: NRT in Solr
Interesting; does anyone have a summary of what techniques Zoie uses to do this? I don't see any docs on the technical details.

On 3/9/2011 5:29 PM, Smiley, David W. wrote: Zoie adds NRT to Solr: http://snaprojects.jira.com/wiki/display/ZOIE/Zoie+Solr+Plugin I haven't tried it yet, but it looks cool. ~ David Smiley Author: http://www.packtpub.com/solr-1-4-enterprise-search-server/

On Mar 9, 2011, at 9:01 AM, Jason Rutherglen wrote: Jae, NRT hasn't been implemented in Solr as of yet, I think partially because major features such as replication, caching, and uninverted faceting would suddenly no longer be viable, e.g. it's another round of testing etc. It's doable; however, I think the best approach is a separate request call path, to avoid altering the current [working] API.

On Tue, Mar 8, 2011 at 1:27 PM, Jae Joo <jaejo...@gmail.com> wrote: Hi, Is NRT in Solr 4.0 from trunk? I have checked out from trunk but could not find the configuration for NRT. Regards, Jae
Re: Solr Hanging all of sudden with update/csv
My guess is that you're running out of RAM. Actual Java profiling is beyond me, but I have seen issues on updating that were solved by more RAM. If you are updating every few minutes, and your new index takes more than a few minutes to warm, you could be running into overlapping warming-searcher issues. There is more info on what I mean in this FAQ, although the FAQ isn't actually targeted at this case exactly: http://wiki.apache.org/solr/FAQ#What_does_.22exceeded_limit_of_maxWarmingSearchers.3DX.22_mean.3F Overlapping warming searchers can result in excessive RAM and/or CPU usage. If you haven't given your JVM options to tune garbage collection, that can also help, using the options for concurrent GC. But if your fundamental problem is overlapping warming searchers, you probably need to make that stop.

On 3/8/2011 5:17 PM, danomano wrote: Hi folks, I've been using Solr for about 3 months. Our Solr install is a single node, and we have been injecting logging data into the Solr server every couple of minutes, with each update taking a few minutes. Everything was working fine until this morning, at which point it appeared that all updates were hung. Restarting the Solr server did not help, as all updaters immediately 'hung' again. Poking around in the threads and strace, I do in fact see stuff happening. The index size itself is about 270GB (we are hoping to support up to 500GB-1TB), and we have supplied the system with ~3TB of disk space. Any tips on what could be happening? Notes: we have never run an optimize yet; we have never deleted from the system yet.
The merge thread appears to be the one 'never returning':

Lucene Merge Thread #0 - Thread t@41
java.lang.Thread.State: RUNNABLE
    at sun.nio.ch.FileDispatcher.pread0(Native Method)
    at sun.nio.ch.FileDispatcher.pread(FileDispatcher.java:31)
    at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:234)
    at sun.nio.ch.IOUtil.read(IOUtil.java:210)
    at sun.nio.ch.FileChannelImpl.read(FileChannelImpl.java:622)
    at org.apache.lucene.store.NIOFSDirectory$NIOFSIndexInput.readInternal(NIOFSDirectory.java:161)
    at org.apache.lucene.store.BufferedIndexInput.readBytes(BufferedIndexInput.java:139)
    at org.apache.lucene.store.BufferedIndexInput.readBytes(BufferedIndexInput.java:94)
    at org.apache.lucene.store.DataOutput.copyBytes(DataOutput.java:176)
    at org.apache.lucene.index.FieldsWriter.addRawDocuments(FieldsWriter.java:209)
    at org.apache.lucene.index.SegmentMerger.copyFieldsNoDeletions(SegmentMerger.java:424)
    at org.apache.lucene.index.SegmentMerger.mergeFields(SegmentMerger.java:332)
    at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:153)
    at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4053)
    at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3645)
    at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:339)
    at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:407)

Some strace output:

23178 pread(172, \270\316\276\2\245\371\274\2\271\316\276\2\272\316\276\2\273\316\276\2\274\316\276\2\275\316\276\2\276\316\276\2..., 4096, 98004192) = 40960.09
23178 pread(172, \245\371\274\2\271\316\276\2\272\316\276\2\273\316\276\2\274\316\276\2\275\316\276\2\276\316\276\2\277\316\276\2..., 4096, 98004196) = 40960.09
23178 pread(172, \271\316\276\2\272\316\276\2\273\316\276\2\274\316\276\2\275\316\276\2\276\316\276\2\277\316\276\2\300\316\276\2..., 4096, 98004200) = 40960.08
23178 pread(172, \272\316\276\2\273\316\276\2\274\316\276\2\275\316\276\2\276\316\276\2\277\316\276\2\300\316\276\2\301\316\276\2..., 4096, 98004204) = 40960.08
23178 pread(172, \273\316\276\2\274\316\276\2\275\316\276\2\276\316\276\2\277\316\276\2\300\316\276\2\301\316\276\2\302\316\276\2..., 4096, 98004208) = 40960.08
23178 pread(172, \274\316\276\2\275\316\276\2\276\316\276\2\277\316\276\2\300\316\276\2\301\316\276\2\302\316\276\2\367\343\274\2..., 4096, 98004212) = 40960.09
23178 pread(172, \275\316\276\2\276\316\276\2\277\316\276\2\300\316\276\2\301\316\276\2\302\316\276\2\367\343\274\2\246\371\274\2..., 4096, 98004216) = 40960.08
23178 pread(172, \276\316\276\2\277\316\276\2\300\316\276\2\301\316\276\2\302\316\276\2\367\343\274\2\246\371\274\2\303\316\276\2..., 4096, 98004220) = 40960.09
23178 pread(172, \277\316\276\2\300\316\276\2\301\316\276\2\302\316\276\2\367\343\274\2\246\371\274\2\303\316\276\2\304\316\276\2..., 4096, 98004224) = 40960.13
22688 ... futex resumed ) = -1 ETIMEDOUT (Connection timed out) 0.051276
23178 pread(172, \300\316\276\2\301\316\276\2\302\316\276\2\367\343\274\2\246\371\274\2\303\316\276\2\304\316\276\2\305\316\276\2..., 4096, 98004228) = 40960.10
22688 futex(0x464a9f28, FUTEX_WAKE_PRIVATE, 1
23178 pread(172,
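The overlapping-warming-searchers guard referenced in the FAQ link above is set in solrconfig.xml. A sketch (the value is the commonly cited default and is illustrative, not a tuned recommendation):

```xml
<!-- solrconfig.xml: cap concurrent warming searchers so that frequent
     commits cannot pile up warming work; 2 is the usual default. -->
<maxWarmingSearchers>2</maxWarmingSearchers>
```

If commits arrive faster than a new searcher can warm, Solr rejects the extra warming attempts rather than accumulating them, which surfaces the problem as an "exceeded limit of maxWarmingSearchers" error instead of runaway RAM/CPU usage.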
RE: True master-master fail-over without data gaps
I'd honestly think about buffering the incoming documents in some store that's actually made for fail-over persistence reliability, maybe CouchDB or something. Then that takes care of not losing anything, and the problem becomes how to make sure our Solr master indexes are kept in sync with the actual persistent store, which I'm still not sure about, but I'm thinking it's a simpler problem. The right tool for the right job; that kind of failover persistence is not Solr's specialty.

From: Otis Gospodnetic [otis_gospodne...@yahoo.com] Sent: Tuesday, March 08, 2011 11:45 PM To: solr-user@lucene.apache.org Subject: True master-master fail-over without data gaps

[original message, quoted in full above, trimmed]
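One way to picture the buffer-then-index idea from this thread is a tiny write-ahead buffer: documents are persisted before indexing is attempted, and only dropped once the master acknowledges them. This is a toy sketch with in-memory stand-ins; the store would really be a durable system (CouchDB, a queue, etc.) and `index_fn` a call to the Solr master, both invented here for illustration:

```python
class BufferedIndexer:
    """Persist documents before indexing; drop them only after an ack.

    'store' stands in for a durable buffer (e.g. a document database);
    'index_fn' stands in for a call to the Solr master and may fail.
    """

    def __init__(self, index_fn):
        self.store = []          # durable buffer (in-memory stand-in)
        self.index_fn = index_fn

    def submit(self, doc):
        self.store.append(doc)   # persist first: a crash here loses nothing

    def flush(self):
        remaining = []
        for doc in self.store:
            try:
                self.index_fn(doc)      # ack = no exception raised
            except Exception:
                remaining.append(doc)   # keep for retry after fail-over
        self.store = remaining

# Simulate a master that rejects one document on the first pass.
seen, failures = [], {"b"}
def index_fn(doc):
    if doc in failures:
        failures.discard(doc)
        raise RuntimeError("master down")
    seen.append(doc)

b = BufferedIndexer(index_fn)
for d in ["a", "b", "c"]:
    b.submit(d)
b.flush()   # "b" fails once and stays buffered
b.flush()   # retry succeeds, buffer drains
```

The point of the design is that fail-over of the indexing master only delays indexing; it never loses documents, because the buffer, not Solr, is the system of record for unindexed data.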
Re: dismax, and too much qf?
I use about that many qf's in Solr 1.4.1. It works. I'm not entirely sure whether it has performance implications; I do have searching that is somewhat slower than I'd like, but I'm not sure if the lengthy qf is a contributing factor or other things I'm doing (like a dozen different facet.fields too!). I haven't profiled everything. But it doesn't grind my Solr to a halt or anything; it works.

Separately, I've also been thinking of other ways to get similar highlighting behavior as you describe (give the 'field' that the match was in in the highlight response), but haven't come up with anything great; if your approach works, that's cool. I've been trying to think of a way to store a single stored field in a structured format (CSV? XML?) and somehow have the highlighter return the complete 'field' that matches, not just the surrounding X words. But I haven't gotten anywhere on that; just an idle thought. Jonathan

On 3/4/2011 10:09 AM, Jeff Schmidt wrote: Hello: I'm working on implementing a requirement where, when a document is returned, we want to pithily tell the end user why. That is, with say five documents returned, they may be returned for similar or different reasons. These reasons are the field(s) in which matches occurred. Some are more important than others, and I'll have to return just the most relevant one or two reasons so as not to overwhelm the user. This is a separate goal from Solr's scoring of the returned documents. That is, index/query-time boosting can indicate which fields are more significant in computing the overall document score, but then I need to know which fields matched, with what terms. I do have an application that stands between Solr and the end user (RESTful API), so I figured I can rank the reasons and return more domain-specific names rather than the Solr field names. So I've turned to highlighting, and in the results I can see, for each document ID, the fields matched, the text in the field, etc. Great.
But, to get that to work, I have to specifically query individual fields. That is, the approach of copyField'ing a bunch of fields to a common text field for efficiency purposes is no longer an option. And, using the dismax request handler, I'm querying a lot of fields: <str name="qf">n_nameExact^4.0 n_macromolecule_nameExact^3.0 n_macromolecule_name^2.0 n_macromolecule_id^1.8 n_pathway_nameExact^1.5 n_top_regulates n_top_regulated_by n_top_binds n_top_role_in_cell n_top_disease n_molecular_function n_protein_family n_subcell_location n_pathway_name n_cell_component n_bio_process n_synonym^0.5 n_macromolecule_summary^0.6 p_nameExact^4.0 p_name^2.0 p_description^0.6</str> Is that crazy? Is telling Solr to look at so many individual fields going to be a performance problem? I'm only prototyping at this stage and it works great. :) I've not run anything yet at scale handling lots of requests. There are two document types in that shared index, demarcated using a field named type. So, when configuring the SolrJ SolrQuery, I do set up addFilterQuery() to select one or the other type. Anyway, using dismax with all of those query fields along with highlighting, I get the information I need to render meaningful results for the end user. But, it has a sort of smell to it. :) Shall I look for another way, or am I worrying about nothing? I am currently using Solr 3.1 trunk. Thanks! Jeff -- Jeff Schmidt 535 Consulting j...@535consulting.com http://www.535consulting.com
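For readability, the long qf above can be broken across lines in solrconfig.xml. A minimal sketch of a dismax handler wired up this way, with highlighting enabled (the handler name and the hl defaults are illustrative, not taken from the thread; the elided fields are marked with `...`):

```xml
<requestHandler name="/reasons" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <!-- whitespace inside the str is fine; one field^boost per line -->
    <str name="qf">
      n_nameExact^4.0 n_macromolecule_nameExact^3.0
      n_macromolecule_name^2.0 n_macromolecule_id^1.8
      n_pathway_nameExact^1.5 ...
      p_nameExact^4.0 p_name^2.0 p_description^0.6
    </str>
    <!-- highlighting gives back the per-field match info -->
    <str name="hl">true</str>
    <str name="hl.fl">*</str>
  </lst>
</requestHandler>
```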
RE: Full Text Search with multiple index and complex requirements
While it might be possible to work things out, not just one but several of your requirements are things that are difficult for Solr to do, or which Solr isn't really optimized to do. Are you sure you need an inverted indexing tool like Solr at all, as opposed to some kind of store (rdbms or nosql), for all or some parts of your data? From: Shrinath M [shrinat...@webyog.com] Sent: Sunday, March 06, 2011 11:49 PM To: rajini maski Cc: solr-user@lucene.apache.org Subject: Re: Full Text Search with multiple index and complex requirements On Mon, Mar 7, 2011 at 9:56 AM, rajini maski rajinima...@gmail.com wrote: I just tried to answer your many questions; I liked the type of questions. Answers attached to the questions.. Thank you Rajini, for your interest :) A) The data for every user is totally unrelated to every other user. This gives us a few advantages: 1. we can keep our indexes small in size. (using cores) 2. merging/compacting a fragmented index will take less time. (merging is simple, one query) 3. if some indexes become inaccessible for whatever reason (corruption?), only those users get affected. Other users are unaffected and the service is available for them. yes, it affects only that index; others are unaffected How many cores can we safely have on a machine ? How much is too much in this case ? B) Each user can have a few different types of data. So, our index hierarchy will look something like: /user1/type1/index files /user1/type2/index files /user2/type1/index files /user3/type3/index files I am not clear on the point here.. Example: say you have 2 users user1 types - Name, Emailaddress, Phone number user2 types - Name, Emailaddress, ID So you want to have user1 - 3 indexes plus user2 - 3 indexes, Total = 6 indexes?? 
If user1's phone number type is the only one type in the data index -- then the schema will have only one data type, number type I just meant to say, like this : /myself/docs/index_docs /myself/spreadsheets/index_spreads /yourself/docs/index_docs /yourself/spreadsheets/index_spreads You get the idea right ? C) Often, probably with every iteration, we'll add types of data that can be indexed. So we want to have an efficient/programmatic way to add schemas for different types. We would like to avoid having a fixed schema for indexing. you added a type, say DATE Before you start indexing for this date type, you need to update your schema with this data type to enable indexing .. correct ? So this won't need a fixed schema defined beforehand; we can add this only when you want to add this data type.. But this requires a service restart.. This won't affect the current index other than adding to it.. Today I am adding only docs and spreadsheets, tomorrow I may want to add something else, something from an RDBMS for example, then I don't want to sit tinkering with schema.xml and I wouldn't like a service restart either... D) The users can fire search queries which will search either: - Within a specific type for that user - Across all types for that user: in this case we want to fire a parallel query like Lucene has. (ParallelMultiSearcher http://lucene.apache.org/java/3_0_2/api/all/org/apache/lucene/search/ParallelMultiSearcher.html ) Sharding in Solr works like this : You have phone number details in one index and again phone number details only in the other index too.. You can search across both indexes firing a query as , Ph: across index1 and index2 You cannot fire one search query as : Name:xyz and Ph: across index1 and index2 .. when index1 has the datatype defined for only name and index2 has it only for phone number.. This can only be done if you define in the schema the datatypes for both..(this will create the problem of having the same/fixed schema) E) We require real time update for the index. 
*This is a must.* This can be possible .. Indexing must be enabled every minute , Check if updates were made.. If made, re-index and maintain uniqueness with the userid We were considering Lucene, Sphinx and Solr to do this. This is what we found: - Sphinx: No efficient way to do A, B, C, F. Or is there? - Lucene: Everything looks possible, as it is very low level. But we have to write wrappers to do F and build a communication layer between the web server and the search server. - Solr: Not sure if we can do A, B, C easily. Can we? So, my question is what is the best software for the above requirements? I am inclined more towards Solr and then Lucene if we get all the requirements. Regards, Rajani Maski On Fri, Mar 4, 2011 at 7:16 PM, Shrinath M shrinat...@webyog.com wrote: We are building an application which will require us to index data for each of our users so that we can provide full text search on their data. Here are some notable things about the application: A) The data for every user is totally unrelated to every other user. This gives us
RE: Model foreign key type of search?
Yep, it's tricky to do this sort of thing in Solr. One way to do it would be to try and reindex the main item on some regular basis with the keywords/comments actually flattened into the main record. Maybe along with a field for number_of_comments, so you can boost on that or what have you. If you can figure out a way to do that, it would be easiest/most reliable without fighting Solr. Beware that it's difficult to set up a Solr that has very frequent commits though; you might want to batch the updates every hour or half hour or what have you. Another thing to look at is this patch which supports a limited type of 'join' in Solr. I'm not sure of its current status of maturity, and I'm not sure if it would work in your use case or not. https://issues.apache.org/jira/browse/SOLR-2272 And, if your alternative is writing your own thing from scratch, another option would be instead writing new components in Java for Solr to try and do what you want. If you can understand the structure and features of the lucene index underlying Solr, and figure out a way to get the functionality you want from lucene, then that's the first step to figuring out how to write a component for Solr to expose it. From: alex.d...@gmail.com [alex.d...@gmail.com] On Behalf Of Alex Dong [a...@trunk.ly] Sent: Friday, March 04, 2011 12:56 AM To: Gora Mohanty Cc: solr-user@lucene.apache.org Subject: Re: Model foreign key type of search? Gora, thanks for the quick reply. Yes, I'm aware of the differences between Solr vs. DBMS. We've actually written some c++ analytical engine that can process through a billion tweets with multiple facets drill down. We may end up cooking our own in the end, but so far Solr suits our needs quite well. The multi-lingual tokenizer and tika integration are all too addictive. What you're suggesting is exactly what I'm doing: trying to use dynamic fields and copyField to get all the information into one field, then run the search over that. However, this is not good enough. 
Allow me to elaborate on this using the same Paris example again. Let's say there are two urls; the first has 10 people who bookmarked it and the second has 100. Let's say these two have roughly similar scores if we squeeze them into one single field. Then I'd like to rank the one with more users higher. Another way to look at this is that PageRank relies on the number and anchor text of the incoming links; we're trying to use the number of people and their keywords/comments as a weight for the link. Alex On Fri, Mar 4, 2011 at 6:29 PM, Gora Mohanty g...@mimirtech.com wrote: On Fri, Mar 4, 2011 at 10:24 AM, Alex Dong a...@trunk.ly wrote: Hi there, I need some advice on how to implement this using solr: We have two tables: urls and bookmarks. - Each url has four fields: {guid, title, text, url} - One url will have one or more bookmarks associated with it. Each bookmark has these: {link.guid, user, tags, comment} I'd like to return matched urls based on not only the title and text from the url schema, but also some kind of aggregated popularity score based on all bookmarks for the same url. The popularity score should be based on the number/frequency of bookmarks that match the query. [...] It is best not to think of Solr as a RDBMS, and not to try to graft RDBMS practices on to it. Instead, you should flatten your data, e.g., in the above, you could have: * Four single-valued fields: guid, title, text, url * Four multi-valued fields: bookmark_guid, bookmark_user, bookmark_tags, bookmark_comment Your index would contain one record per guid of the URL, and you would need to populate the multi-valued bookmark fields from all bookmark instances associated with that URL. Then one could either copy the relevant search fields to a full-text search field, and search only on that, or, e.g., search on bookmark_tags and bookmark_comment in addition to searching on title and text. Regards, Gora
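Gora's flattening suggestion can be sketched client-side before sending documents to Solr. A minimal sketch (the field names follow his proposal; the rows and the bookmark_count popularity field are illustrative assumptions, not from the thread):

```python
def flatten(url_row, bookmark_rows):
    """Build one Solr document per URL, folding its bookmarks into
    multi-valued fields, plus a count to boost on at query time."""
    return {
        # single-valued fields straight from the urls table
        "guid": url_row["guid"],
        "title": url_row["title"],
        "text": url_row["text"],
        "url": url_row["url"],
        # multi-valued fields, one entry per bookmark of this URL
        "bookmark_user": [b["user"] for b in bookmark_rows],
        "bookmark_tags": [b["tags"] for b in bookmark_rows],
        "bookmark_comment": [b["comment"] for b in bookmark_rows],
        # popularity signal Alex wants: more bookmarkers ranks higher
        "bookmark_count": len(bookmark_rows),
    }

url = {"guid": "u1", "title": "Paris", "text": "...", "url": "http://example.com/paris"}
bookmarks = [
    {"user": "alice", "tags": "travel", "comment": "nice"},
    {"user": "bob", "tags": "paris", "comment": "been there"},
]
doc = flatten(url, bookmarks)
```

At query time, bookmark_count could then feed a boost function so that, of two URLs with similar text scores, the one with more bookmarkers wins.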
RE: When Index is Updated Frequently
If you can make that solution work for you, I think it is a wise one which will serve you well. In some cases that solution won't work, because you _need_ the frequently changing data in Solr to be searched against in Solr. But if you can get away without that, I think you will be well-served by keeping any data that doesn't need to be searched against by Solr in an external non-Solr store. It's really rarely a bad plan to just put in Solr what needs to be searched against in Solr -- whether or not the 'other' stuff changes frequently. Only you (if anyone!) know enough about your requirements and plans to know how much of a problem it will be to have your 'mutable' data not in Solr, and thus not searchable with Solr. From: Bing Li [lbl...@gmail.com] Sent: Friday, March 04, 2011 3:21 PM To: Michael McCandless Cc: solr-user@lucene.apache.org Subject: Re: When Index is Updated Frequently Dear Michael, Thanks so much for your answer! I have a question. Even if Lucene is good at updating, it must put more load on the Solr cluster. So in my system, I will leave the large amount of crawled data unchanged for ever. Meanwhile, I use a traditional database to keep mutable data. Fortunately, in most Internet systems, the amount of mutable data is much less than that of the immutable data. What do you think about my solution? Best, LB On Sat, Mar 5, 2011 at 2:45 AM, Michael McCandless luc...@mikemccandless.com wrote: On Fri, Mar 4, 2011 at 10:09 AM, Bing Li lbl...@gmail.com wrote: According to my experience, when the Lucene index is updated frequently, its performance must degrade. Is that correct? In fact Lucene can gracefully handle a high rate of updates with low latency turnaround on the readers, using the near-real-time (NRT) API -- IndexWriter.getReader() (or, in the soon-to-be 3.1, IndexReader.open(IndexWriter)). 
NRT is really a hybrid of eventual consistency and immediate consistency, because it lets your app have full control over how quickly changes must be visible by controlling when you pull a new NRT reader. That said, Lucene can't offer true immediate consistency at a high update rate -- the time to open a new NRT reader is usually too costly to do, eg, for every search. But, eg, every 100 msec (say) is reasonable (depending on many variables...). So... for your app you should run some tests and see. And please report back. (But, unfortunately, NRT hasn't been exposed in Solr yet...). -- Mike http://blog.mikemccandless.com
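In raw Lucene terms, the refresh loop Mike describes looks roughly like this (a pseudocode sketch against the Lucene 3.x API, not runnable as-is; error handling and the indexing thread are omitted):

```
IndexWriter writer = new IndexWriter(dir, config);
IndexReader reader = writer.getReader();    // initial NRT reader

// ... updates keep arriving via writer.addDocument(...) / updateDocument(...)

// every ~100 msec (the knob that trades freshness for cost):
IndexReader newReader = reader.reopen();    // cheap if little has changed
if (newReader != reader) {
    reader.close();
    reader = newReader;                     // searches now see recent updates
}
```

The app decides how often to pull a new reader, which is exactly the "full control over how quickly changes must be visible" point above.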
Re: uniqueKey merge documents on commit
Nope, there is not. On 3/3/2011 10:55 AM, Tim Gilbert wrote: Hi, I have a unique key within my index, but rather than the default behaviour of overwriting, I am wondering if there is a method to merge the two different documents on commit of the second document. I have a testcase which explains what I'd like to happen: @Test public void testMerge() throws SolrServerException, IOException { SolrInputDocument doc1 = new SolrInputDocument(); doc1.addField("secid", "testid"); doc1.addField("value1_i", 1); SolrAllSec.GetSolrServer().add(doc1); SolrAllSec.GetSolrServer().commit(); SolrInputDocument doc2 = new SolrInputDocument(); doc2.addField("secid", "testid"); doc2.addField("value2_i", 2); SolrAllSec.GetSolrServer().add(doc2); SolrAllSec.GetSolrServer().commit(); SolrQuery solrQuery = new SolrQuery(); solrQuery = solrQuery.setQuery("secid:testid"); QueryResponse response = SolrAllSec.GetSolrServer().query(solrQuery, METHOD.GET); List<SolrDocument> result = response.getResults(); Assert.isTrue(result.size() == 1); Assert.isTrue(result.contains(value1)); Assert.isTrue(result.contains(value2)); } Other than reading doc1 and adding the fields from doc2 and recommitting, is there another way? Thanks in advance, Tim
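Since Solr simply overwrites on a duplicate uniqueKey, the workaround Tim mentions at the end (read doc1, fold in doc2's fields, re-add) can be sketched generically. This is plain client-side logic, not a Solr API; the field names mirror his testcase:

```python
def merge_docs(existing, incoming, key="secid"):
    """Merge two documents sharing a unique key: keep all of the
    existing fields, add the incoming ones (incoming wins on clashes)."""
    if existing.get(key) != incoming.get(key):
        raise ValueError("documents have different unique keys")
    merged = dict(existing)
    merged.update(incoming)
    return merged

# the two docs from the testcase above
doc1 = {"secid": "testid", "value1_i": 1}
doc2 = {"secid": "testid", "value2_i": 2}
merged = merge_docs(doc1, doc2)
# re-adding `merged` then replaces the stored document with the union
```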
Re: FilterQuery OR statement
You might also consider splitting your two separate AND clauses into two separate fq's: fq=field1:(1 OR 2 OR 3 OR 4) fq=field2:(4 OR 5 OR 6 OR 7) That will cache the two separate clauses separately in the filter cache, which is probably preferable in general, without knowing more about your usage characteristics. ALSO, instead of either supplying the OR explicitly as above, OR changing the default operator in schema.xml for everything, I believe it would work to supply it as a local param: fq={!q.op=OR}field1:(1 2 3 4) If you want to do that. AND, your question, can you search without a 'q'? No, but you can search with a 'q' that selects all documents, to be limited by the fq's: q=*:* On 3/3/2011 1:14 PM, Tanner Postert wrote: That worked; thought I tried it before, not sure why it didn't before. Also, is there a way to query without a q parameter? I'm just trying to pull back all of the field results where field1:(1 OR 2 OR 3) etc. so I figured I'd use the fq param for caching purposes because those queries will likely be run a lot, but if I leave the q parameter off I get a null pointer error. On Thu, Mar 3, 2011 at 11:05 AM, Ahmet Arslan iori...@yahoo.com wrote: Trying to figure out how I can run something similar to this for the fq parameter Field1 in ( 1, 2, 3, 4 ) AND Field2 in ( 4, 5, 6, 7 ) I found some examples on the net that looked like this: fq=+field1:(1 2 3 4) +field2:(4 5 6 7) but that yields no results. Maybe your default operator is set to AND in schema.xml? If yes, try using +field2:(4 OR 5 OR 6 OR 7)
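Building the request with the two separate fq clauses can be done with any HTTP client; a sketch using Python's stdlib (the host and handler path are placeholder assumptions):

```python
from urllib.parse import urlencode

# a list of pairs, because the fq parameter repeats
params = [
    ("q", "*:*"),                         # match everything; fq's restrict
    ("fq", "field1:(1 OR 2 OR 3 OR 4)"),  # cached as its own filter entry
    ("fq", "field2:(4 OR 5 OR 6 OR 7)"),  # cached separately from the first
]
query_string = urlencode(params)
url = "http://localhost:8983/solr/select?" + query_string
```

Because each fq is cached on its own, a later query reusing only one of the two clauses still gets a filter-cache hit.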
Re: multiple localParams for each query clause
Not per clause, no. But you can use the nested queries feature to set local params for each nested query instead. Which is in fact one of the most common use cases for local params. q=_query_:"{!type=x q.field=z}something" AND _query_:"{!type=database}something" URL encode that whole thing though. http://www.lucidimagination.com/blog/2009/03/31/nested-queries-in-solr/ On 3/2/2011 10:24 AM, Roman Chyla wrote: Hi, Is it possible to set local arguments for each query clause? example: {!type=x q.field=z}something AND {!type=database}something I am pulling together result sets coming from two sources, a Solr index and a DB engine - however I realized that local parameters apply only to the whole query - so I don't know how to set the query to mark the second clause as db-searchable. Thanks, Roman
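The "URL encode that whole thing" step can be done with the stdlib; a sketch (the q value mirrors the nested-query example from this thread):

```python
from urllib.parse import urlencode

# two nested queries, each carrying its own local params
q = '_query_:"{!type=x q.field=z}something" AND _query_:"{!type=database}something"'

# urlencode percent-escapes the braces, bangs, quotes and equals signs,
# so the local-param syntax survives the trip inside the URL
query_string = urlencode({"q": q})
```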
Re: multi-core solr, specifying the data directory
Meanwhile, I'm having trouble getting the expected behavior at all. I'll try to give the right details (without overwhelming with too many), if anyone can see what's going on. Solr 1.4.1. Multi-core. 'Main' solr home with solr.xml at /opt/solr/solr_indexer/solr.xml The solr.xml includes actually only one core, let's start out nice and simple: <cores adminPath="/admin/cores"> <core name="master_prod" instanceDir="master_prod"> <property name="enable.master" value="true"/> </core> </cores> [The enable.master thing is a custom property my solrconfig.xml uses in places unrelated to dataDir] 1. First try, the solrconfig at /opt/solr/solr_indexer/master_prod/conf/solrconfig.xml includes NO dataDir element at all. WOAH. It just worked. Go figure. I don't know what I tried differently before; maybe Mike is right that people (including me) get confused by the dataDir element being there, and needing to delete it entirely to get that default behavior. So anyway, yeah. Sorry, thanks, appears to be working, although possibly confusing for the newbie to set up, for reasons that aren't entirely clear, since several of us in this thread had trouble getting it right. On 3/2/2011 2:42 PM, Mike Sokolov wrote: Yes - I commented out the dataDir element in solrconfig.xml and then got the expected behavior: the core used a data subdirectory in the core subdirectory. It seems like the problem arises from using the solrconfig.xml that's distributed as example/solr/conf/solrconfig.xml The solrconfig.xml's in example/multicore/ don't have the dataDir element. -Mike On 03/01/2011 08:24 PM, Chris Hostetter wrote: : <!-- Used to specify an alternate directory to hold all index data : other than the default ./data under the Solr home. : If replication is in use, this should match the replication : configuration : . 
--> : <dataDir>${solr.data.dir:./solr/data}</dataDir> that directive says use the solr.data.dir system property to pick a path; if it is not set, use ./solr/data (relative to the CWD) if you want it to use the default, then you need to eliminate it completely, or you need to change it to the empty string... <dataDir>${solr.data.dir:}</dataDir> or... <dataDir></dataDir> -Hoss
Re: multi-core solr, specifying the data directory
I wonder if what doesn't work is trying to set an explicit relative path there, instead of using the baked-in default data. If you set an explicit relative path, is it relative to the current core's solr.home, or to the main solr.home? Let's try it to see Yep, THAT's what doesn't work, and probably what I was trying to do before. In solrconfig.xml for a core, I do <dataDir>data</dataDir>. I expected that would be interpreted relative to the current core's solr.home, but it is, judging by the log files, instead based on the 'main' solr.home (above the cores, where the solr.xml is) -- or maybe even on some other value, the tomcat base url or something? Is _that_ a bug? On 3/2/2011 3:38 PM, Jonathan Rochkind wrote: Meanwhile, I'm having trouble getting the expected behavior at all. I'll try to give the right details (without overwhelming with too many), if anyone can see what's going on. Solr 1.4.1. Multi-core. 'Main' solr home with solr.xml at /opt/solr/solr_indexer/solr.xml The solr.xml includes actually only one core, let's start out nice and simple: <cores adminPath="/admin/cores"> <core name="master_prod" instanceDir="master_prod"> <property name="enable.master" value="true"/> </core> </cores> [The enable.master thing is a custom property my solrconfig.xml uses in places unrelated to dataDir] 1. First try, the solrconfig at /opt/solr/solr_indexer/master_prod/conf/solrconfig.xml includes NO dataDir element at all. WOAH. It just worked. Go figure. I don't know what I tried differently before, maybe Mike is right that people (including me) get confused by the dataDir element being there, and needing to delete it entirely to get that default behavior. So anyway yeah. Sorry, thanks, appears to be working, although possibly confusing for the newbie to set up for reasons that aren't entirely clear, since several of us in this thread had trouble getting it right. 
On 3/2/2011 2:42 PM, Mike Sokolov wrote: Yes - I commented out the dataDir element in solrconfig.xml and then got the expected behavior: the core used a data subdirectory in the core subdirectory. It seems like the problem arises from using the solrconfig.xml that's distributed as example/solr/conf/solrconfig.xml The solrconfig.xml's in example/multicore/ don't have the dataDir element. -Mike On 03/01/2011 08:24 PM, Chris Hostetter wrote: : <!-- Used to specify an alternate directory to hold all index data : other than the default ./data under the Solr home. : If replication is in use, this should match the replication : configuration : . --> : <dataDir>${solr.data.dir:./solr/data}</dataDir> that directive says use the solr.data.dir system property to pick a path; if it is not set, use ./solr/data (relative to the CWD) if you want it to use the default, then you need to eliminate it completely, or you need to change it to the empty string... <dataDir>${solr.data.dir:}</dataDir> or... <dataDir></dataDir> -Hoss
Re: multi-core solr, specifying the data directory
I did try that, yes. I tried that first in fact! It seems to fall back to a ./data directory relative to the _main_ solr directory (the one above all the cores), not the core instanceDir. Which is not what I expected either. I wonder if this should be considered a bug? I wonder if anyone has considered this and thought of changing/fixing it? On 3/1/2011 4:23 AM, Jan Høydahl wrote: Have you tried removing the dataDir tag from solrconfig.xml? Then it should fall back to the default ./data relative to the core instanceDir. -- Jan Høydahl, search solution architect Cominvent AS - www.cominvent.com On 1. mars 2011, at 00.00, Jonathan Rochkind wrote: Unless I'm doing something wrong, in my experience with multi-core Solr in 1.4.1, you NEED to explicitly provide an absolute path to the 'data' dir. I set up multi-core like this: <cores adminPath="/admin/cores"> <core name="some_core" instanceDir="some_core"> </core> </cores> Now, setting instanceDir like that works for Solr to look for the 'conf' directory in the default location you'd expect, ./some_core/conf. You'd expect it to look for the 'data' dir for an index in ./some_core/data too, by default. But it does not seem to. It's still looking for the 'data' directory in the _main_ solr.home/data, not under the relevant core directory. The only way I can manage to get it to look for the /data directory where I expect is to spell it out with a full absolute path: <core name="some_core" instanceDir="some_core"> <property name="dataDir" value="/path/to/main/solr/some_core/data"/> </core> And then in the solrconfig.xml do a <dataDir>${dataDir}</dataDir> Is this what everyone else does too? Or am I missing a better way of doing this? I would have thought it would just work, with Solr by default looking for a ./data subdir of the specified instanceDir. But it definitely doesn't seem to do that. Should it? Anyone know if Solr in trunk past 1.4.1 has been changed to do what I expect? Or am I wrong to expect it? 
Or does everyone else do multi-core in some different way than me where this doesn't come up? Jonathan
Re: multi-core solr, specifying the data directory
Hmm, okay, I'll have to try to find time to install the example/multicore and see. It's definitely never worked for me, weird. Thanks. On 3/1/2011 2:38 PM, Chris Hostetter wrote: : Unless I'm doing something wrong, in my experience in multi-core Solr in : 1.4.1, you NEED to explicitly provide an absolute path to the 'data' dir. have you looked at the example/multicore directory that was included in the 1.4.1 release? it has a solr.xml that loads two cores w/o specifying a data dir in the solr.xml (or the solrconfig.xml) and it uses the data dir inside the specified instanceDir. If that example works for you, but your own configs do not, then we'll need more details about your own configs -- how are you running solr, what does the solrconfig.xml of the core look like, etc... -Hoss
Re: solr different sizes on master and slave
The slave should not keep multiple copies _permanently_, but it might temporarily after it's fetched the new files from the master, but before it's committed them and fully warmed the new index searchers on the slave. Could that be what's going on: is your slave just still working on committing and warming the new version(s) of the index? [If you do 'commit' to the slave (and a replication pull counts as a 'commit') so quickly that you get overlapping commits before the slave was able to warm a new index... it's going to be trouble all around.] On 3/1/2011 4:27 PM, Mike Franon wrote: ok doing some more research I noticed, on the slave it has multiple folders where it keeps them, for example index index.20110204010900 index.20110204013355 index.20110218125400 and then there is an index.properties that shows which index it is using. I am just curious why does it keep multiple copies? Is there a setting somewhere I can change to only keep one copy so as not to lose space? Thanks On Tue, Mar 1, 2011 at 3:26 PM, Mike Franon kongfra...@gmail.com wrote: No pending commits; what it looks like is there are almost two copies of the index on the master, not sure how that happened. On Tue, Mar 1, 2011 at 3:08 PM, Markus Jelsma markus.jel...@openindex.io wrote: Are there pending commits on the master? I was curious why would the size be dramatically different even though the index versions are the same? One is 1.2 GB, and on the slave it is 512 MB I would think they should both be the same size, no? Thanks
Re: Query on multivalue field
Each token has a position set on it. So if you index the value alpha beta gamma, it winds up stored in Solr as (sort of, for the way we want to look at it) document1: alpha: position 1 beta: position 2 gamma: position 3 If you set the position increment gap large, then after one value in a multi-valued field ends, the position increment gap will be added to the positions for the next value. Solr doesn't actually internally have much of any idea of a multi-valued field; ALL a multi-valued indexed field is, is a position increment gap separating tokens from different 'values'. So index in a multi-valued field, with a large position increment gap, the values: [alpha beta gamma, aleph bet], and you get kind of like: document1: alpha: 1 beta: 2 gamma: 3 aleph: 10004 bet: 10005 A large position increment gap, as far as I know and can tell (please someone correct me if I'm wrong, I am not a Solr developer), has no effect on the size or efficiency of your index on disk. I am not sure why positionIncrementGap doesn't just default to a very large number, to provide behavior that more closely matches what people expect from the idea of a multi-valued field. So maybe there is some flaw in my understanding that justifies some reason for it not to be this way? But I set my positionIncrementGap very large, and haven't seen any issues. On 3/1/2011 5:46 PM, Scott Yeadon wrote: The only trick with this is ensuring the searches return the right results and don't go across value boundaries. If I set the gap to the largest text size we expect (approx 5000 chars), what impact does such a large value have (i.e. does Solr physically separate these fragments in the index or just apply the figure as part of any query)? Scott. 
On 2/03/11 9:01 AM, Ahmet Arslan wrote: In a multiValued field, call it field1, if I have two values indexed to this field, say value 1 = some text...termA...more text and value 2 = some text...termB...more text and do a search such as field1:(termA termB) (where <solrQueryParser defaultOperator="AND"/>) I'm getting a hit returned even though both terms don't occur within a single value in the multiValued field. What I'm wondering is if there is a way of applying the query against each value of the field rather than against the field in its entirety. The reason being is the number of values I want to store is variable and I'd like to avoid the use of dynamic fields or restructuring the index if possible. Your best bet can be using positionIncrementGap and issuing a phrase query (implicit AND) with the appropriate slop value. If you have positionIncrementGap=100, you can simulate this with using q=field1:"termA termB"~100 http://search-lucene.com/m/Hbdvz1og7D71/
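The position arithmetic described in this thread can be simulated to see why a large gap keeps slop-limited phrase matches from crossing value boundaries (a toy model for intuition, not Solr code; the token values come from Jonathan's example):

```python
def assign_positions(values, gap):
    """Mimic how tokens get positions across a multi-valued field:
    consecutive within one value, jumping by `gap` between values."""
    positions, pos = {}, 0
    for value in values:
        for token in value.split():
            positions.setdefault(token, []).append(pos)
            pos += 1
        pos += gap  # the positionIncrementGap between values
    return positions

def within_slop(positions, t1, t2, slop):
    """True if some occurrence of t1 and t2 lie within `slop` positions,
    roughly what a phrase query with that slop requires."""
    return any(abs(a - b) <= slop
               for a in positions.get(t1, [])
               for b in positions.get(t2, []))

pos = assign_positions(["alpha beta gamma", "aleph bet"], gap=10000)
```

With a gap of 10000, "alpha" and "gamma" sit a couple of positions apart and match a slop of 100, while "gamma" and "aleph" are over 10000 positions apart and cannot, which is the cross-value behavior Scott wants to rule out.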
Re: multi-core solr, specifying the data directory
This definitely matches my own experience, and I've heard it from others. I haven't heard of anyone who HAS gotten it to work like that. But apparently there's a distributed multi-core example which claims to work like it doesn't for us. One of us has to try the Solr distro multi-core example, as Hoss suggested/asked, to see if the problem exhibits even there, and if not, figure out what the difference is. Sorry, haven't found time to figure out how to install and start up the demo. I am running in Tomcat; I wonder if the container could matter, and maybe it somehow works in Jetty or something? Jonathan On 3/1/2011 7:05 PM, Michael Sokolov wrote: I tried this in my 1.4.0 installation (commenting out what had been working, hoping the default would be as you said works in the example): <solr persistent="true" sharedLib="lib"> <cores adminPath="/admin/cores"> <core name="bpro" instanceDir="bpro"> <!-- <property name="solr.data.dir" value="solr/bpro/data/"/> --> </core> <core name="pfapp" instanceDir="pfapp"> <property name="solr.data.dir" value="solr/pfapp/data/"/> </core> </cores> </solr> In the log after starting up, I get these messages (among many others): ... Mar 1, 2011 7:51:23 PM org.apache.solr.core.CoreContainer$Initializer initialize INFO: looking for solr.xml: /usr/local/tomcat/solr/solr.xml Mar 1, 2011 7:51:23 PM org.apache.solr.core.SolrResourceLoader locateSolrHome INFO: No /solr/home in JNDI Mar 1, 2011 7:51:23 PM org.apache.solr.core.SolrResourceLoader locateSolrHome INFO: solr home defaulted to 'solr/' (could not find system property or JNDI) Mar 1, 2011 7:51:23 PM org.apache.solr.core.SolrResourceLoader <init> INFO: Solr home set to 'solr/' Mar 1, 2011 7:51:23 PM org.apache.solr.core.SolrResourceLoader <init> INFO: Solr home set to 'solr/bpro/' ... Mar 1, 2011 7:51:24 PM org.apache.solr.core.SolrCore <init> INFO: [bpro] Opening new SolrCore at solr/bpro/, dataDir=./solr/data/ ... Mar 1, 2011 7:51:25 PM org.apache.solr.core.SolrResourceLoader <init> INFO: Solr home set to 'solr/pfapp/' ... 
Mar 1, 2011 7:51:26 PM org.apache.solr.core.SolrCore <init> INFO: [pfapp] Opening new SolrCore at solr/pfapp/, dataDir=solr/pfapp/data/ and it's pretty clearly using the wrong directory at that point. Some more details: /usr/local/tomcat has the usual tomcat distribution (this is 6.0.29) conf/server.xml has: <Host name="localhost" appBase="webapps" unpackWARs="true" autoDeploy="true" xmlValidation="false" xmlNamespaceAware="false"> <Alias>rosen</Alias> <Alias>rosen.ifactory.com</Alias> <Context path="" docBase="/usr/local/tomcat/webapps/solr"/> </Host> There is a solrconfig.xml in each of the core directories (should there only be one of these?). I believe these are pretty generic (and they are identical); the one in the bpro folder has: <!-- Used to specify an alternate directory to hold all index data other than the default ./data under the Solr home. If replication is in use, this should match the replication configuration. --> <dataDir>${solr.data.dir:./solr/data}</dataDir> -Mike On 3/1/2011 4:38 PM, Jonathan Rochkind wrote: Hmm, okay, have to try to find time to install the example/multicore and see. It's definitely never worked for me, weird. Thanks. On 3/1/2011 2:38 PM, Chris Hostetter wrote: : Unless I'm doing something wrong, in my experience in multi-core Solr in : 1.4.1, you NEED to explicitly provide an absolute path to the 'data' dir. have you looked at the example/multicore directory that was included in the 1.4.1 release? it has a solr.xml that loads two cores w/o specifying a data dir in the solr.xml (or the solrconfig.xml) and it uses the data dir inside the specified instanceDir. If that example works for you, but your own configs do not, then we'll need more details about your own configs -- how are you running solr, what does the solrconfig.xml of the core look like, etc... -Hoss
setting different solrconfig.xml for a core
So I think I ought to be able to set up a particular Solr core to use a different file for solrconfig.xml. (The reason I want to do this is so I can have master and slave in replication use the exact same repo checkout for their conf directory, but have the master using a different solrconfig.xml, one set up to be master.) Solr 1.4.1, using this for guidance: http://wiki.apache.org/solr/CoreAdmin

But no matter what I try, while I get no errors in the log file (should I be looking for errors somewhere else?), the core doesn't successfully come up. I am trying, in the solr.xml, to do this:

<core name="master_prod" instanceDir="master_prod" config="master-solrconfig.xml">
  <property name="dataDir" value="/opt/solr/solr_indexer/master_prod/data" />
</core>

Or I try this instead:

<core name="master_prod" instanceDir="master_prod" config="master-solrconfig.xml">
  <property name="dataDir" value="/opt/solr/solr_indexer/master_prod/data" />
  <property name="configName" value="master-solrconfig.xml" />
</core>

With either of these, in the log file things look like they started up successfully, but it doesn't appear to actually be so; the core is actually inaccessible. Maybe there's an error in my master-solrconfig.xml, but I don't think so, and there's nothing in the log on that either. Or maybe I'm not doing things right as far as telling it to use a 'config file' (solrconfig.xml) in a different location. Can anyone confirm for me that this is possible, and what the right way to do it is?
Re: setting different solrconfig.xml for a core
On 2/28/2011 1:09 PM, Ahmet Arslan wrote:
(The reason I want to do this is so I can have master and slave in replication have the exact same repo checkout for their conf directory, but have the master using a different solrconfig.xml, one set up to be master.)

How about using the same solrconfig.xml for the master too? As described here: http://wiki.apache.org/solr/SolrReplication#enable.2BAC8-disable_master.2BAC8-slave_in_a_node

That isn't great, because there are more differences in an optimal solrconfig.xml between master and slave than just the replication handler difference, which that URL covers. A master (which won't be queried against) doesn't need spellcheck running after commits, but the slave does. A master doesn't need slow newSearcher/firstSearcher query warm-ups, but the slave does. The master may be better with different (lower) cache settings, since it won't be used to service live queries.

The documentation clearly suggests it _ought_ to be possible to set the name of a core's config file (default solrconfig.xml) to something other than solrconfig.xml -- but I haven't been able to make it work, and I find the lack of any errors in the log file when it's not working to be frustrating. Has anyone actually done this? Can anyone confirm that it's even possible, and the documentation isn't just taking me for a ride?
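[For reference, the enable/disable pattern that wiki page describes looks roughly like this -- a sketch based on the wiki, with a placeholder masterUrl host; you'd start each node with -Denable.master=true or -Denable.slave=true as appropriate:

```xml
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <!-- only active when the node is started with -Denable.master=true -->
    <str name="enable">${enable.master:false}</str>
    <str name="replicateAfter">commit</str>
  </lst>
  <lst name="slave">
    <!-- only active when the node is started with -Denable.slave=true -->
    <str name="enable">${enable.slave:false}</str>
    <str name="masterUrl">http://master-host:8983/solr/replication</str>
  </lst>
</requestHandler>
```

As noted above, though, this only toggles the replication handler; it doesn't address the spellcheck, warming, and cache differences between master and slave.]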
Re: setting different solrconfig.xml for a core
Okay, I did manage to find a clue from the log that it's not working, when it's not working:

INFO: Jk running ID=0 time=0/66 config=null

config=null -- that's not right. When I try to override the config file name in the solr.xml core config, I can't seem to put a name in there that works to find a file that does actually exist. Unless I put the name solrconfig.xml in there; then it works fine, heh.
Re: setting different solrconfig.xml for a core
Yeah, I'm actually _not_ trying to get replication to copy over the config files. Instead, I'm assuming the config files are all there, and I'm actually trying to get one of the cores to _use_ a file that, on disk in that core, is actually called, e.g., solrconfig_slave.xml. This wiki page: http://wiki.apache.org/solr/CoreAdmin suggests I _ought_ to be able to do that -- to tell a particular core to use a config file of any name I want. But I'm having trouble getting it to work. That could be my own local mistake of some kind too. It just makes it harder to figure out when I'm not even exactly sure how you're _supposed_ to be able to do it -- the CoreAdmin wiki page implies at least two different ways you should be able to, but doesn't include an actual example, so I'm not sure if I'm understanding what it's implying correctly -- or if the actual 1.4.1 behavior matches what's in that wiki page anyway.

On 2/28/2011 3:14 PM, Dyer, James wrote:
Jonathan, when I was first setting up replication a couple weeks ago, I had this working, as described here: http://wiki.apache.org/solr/SolrReplication#Replicating_solrconfig.xml I created the slave's solrconfig.xml and saved it on the master in the conf dir as solrconfig_slave.xml, then began the confFiles parameter on the master with solrconfig_slave.xml:solrconfig.xml,schema.xml,etc. And it was working (v1.4.1). I'm not sure why you haven't had good luck with this, but you can at least know it is possible to get it to work. I think to get the slave up and running for the first time I saved the slave's version on the slave as solrconfig.xml. It then would copy over any changed versions of solrconfig_slave.xml from the master to the slave, saving them on the slave as solrconfig.xml. But I primed it by giving it its config file in-sync to start with. I ended up going the same-config-file-everywhere route, though, because we're using our master to handle requests when it's not indexing (one less server to buy)...
James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311
Re: setting different solrconfig.xml for a core
Aha, wait, I think I've made it work, as simple as this in the solr.xml core config, to make a core use a solrconfig.xml file with a different name:

...
<core name="master_prod" instanceDir="master_prod" config="master-solrconfig.xml">
...

Not sure why that didn't work the first half a dozen times I tried. I may have had a syntax error in my master-solrconfig.xml file, even though the Solr log files didn't report any; maybe when there's a syntax error Solr just silently gives up on the config file and presents an empty index, I dunno.
Re: setting different solrconfig.xml for a core
And in other news of other possibilities: if I DID want to use the same solrconfig.xml for both master and slave, but disable the newSearcher/firstSearcher queries on the master, it _looks_ like I can use the technique here: http://wiki.apache.org/solr/SolrReplication#enable.2BAC8-disable_master.2BAC8-slave_in_a_node applied to newSearcher/firstSearcher too:

<listener event="firstSearcher" class="solr.QuerySenderListener" enable="${enable.slave:false}">

Now that listener will only be turned on if enable.slave is set to true. It might make more sense to use a different property name there, like enable.searcher or something. I'm not entirely sure in what places the enable attribute is recognized and in what places it isn't, but it LOOKS like it's recognized on the <listener> tag. I think.
Re: setting different solrconfig.xml for a core
Hmm, I'm pretty sure I'm seeing that listener can take an 'enable' attribute too, even though that's not a searchComponent or a requestHandler, is it? After toggling enable back and forth on a listener and restarting Solr and watching my logs closely, I am as confident as I can be that it mysteriously is being respected on listener. Go figure. Convenient for me, because I wanted to disable my expensive and time-consuming newSearcher/firstSearcher warming queries on a core marked 'master'.

On 2/28/2011 4:21 PM, Dyer, James wrote:
Just did a quick search for 'enable=' in the 1.4.1 source. Looks like from the example solrconfig.xml, both <searchComponent> and <requestHandler> tags can take the enable attribute. It's only shown with the ClusteringComponent, so I'm not sure if just any SC or RH will honor it. Also see the unit test TestPluginEnable.java, which seems to show that the StandardRequestHandler will honor it.

James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311
suggestion: do not require masterUrl for slave config
Suggestion; curious what other people think of it, and whether I should bother filing a JIRA and/or trying to come up with a patch. Currently, when you configure a replication <lst name="slave">, you HAVE to give it a masterUrl:

SEVERE: org.apache.solr.common.SolrException: 'masterUrl' is required for a slave
  at org.apache.solr.handler.SnapPuller.<init>(SnapPuller.java:126)
  at org.apache.solr.handler.ReplicationHandler.inform(ReplicationHandler.java:775)
  at org.apache.solr.core.SolrResourceLoader.inform(SolrResourceLoader.java:508)
  at org.apache.solr.core.SolrCore.<init>(SolrCore.java:588)

At first this makes sense -- why would you want a slave without a masterUrl? But since you can supply the masterUrl as a query parameter in /replication?command=fetchIndex&masterUrl=X, there's really no reason to require you to specify it in the solrconfig.xml if you are planning on not having automatic polling, but just triggering replication manually and supplying the masterUrl in the command every time. This can sometimes be convenient for letting some other monitor process decide when and how to replicate, instead of having Solr itself be configured for pulling via polling. Does that make any sense?
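[To make the suggestion concrete, here's a sketch of what a manual-only slave config could look like -- not valid in 1.4.1, since it's exactly what the SEVERE above rejects:

```xml
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <!-- no masterUrl and no pollInterval: replication would only happen
         when an external monitor process calls, e.g.,
         /replication?command=fetchindex&masterUrl=http://some-master:8983/solr/replication -->
  </lst>
</requestHandler>
```

The some-master host above is a placeholder for whatever master the monitor process chooses at trigger time.]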
multi-core solr, specifying the data directory
Unless I'm doing something wrong, in my experience in multi-core Solr in 1.4.1, you NEED to explicitly provide an absolute path to the 'data' dir. I set up multi-core like this:

<cores adminPath="/admin/cores">
  <core name="some_core" instanceDir="some_core">
  </core>
</cores>

Now, setting instanceDir like that works for Solr to look for the 'conf' directory in the default location you'd expect, ./some_core/conf. You'd expect it to look for the 'data' dir for an index in ./some_core/data too, by default. But it does not seem to. It's still looking for the 'data' directory in the _main_ solr.home/data, not under the relevant core directory. The only way I can manage to get it to look for the /data directory where I expect is to spell it out with a full absolute path:

<core name="some_core" instanceDir="some_core">
  <property name="dataDir" value="/path/to/main/solr/some_core/data" />
</core>

And then in the solrconfig.xml do a:

<dataDir>${dataDir}</dataDir>

Is this what everyone else does too? Or am I missing a better way of doing this? I would have thought it would just work, with Solr by default looking for a ./data subdir of the specified instanceDir. But it definitely doesn't seem to do that. Should it? Anyone know if Solr in trunk past 1.4.1 has been changed to do what I expect? Or am I wrong to expect it? Or does everyone else do multi-core in some different way than me where this doesn't come up?

Jonathan
RE: Disabling caching for fq param?
As far as I know there is not. It might be beneficial, but it's also worth considering: thousands of users isn't _that_ many, and if that clause is always the same per user, then if the same user does a query a second time, it wouldn't hurt to have their user-specific fq in the cache. A single fq cache entry may not take as much RAM as you think; you could potentially afford to increase your fq cache size to thousands/tens-of-thousands, and win all the way around. The filter cache should be a least-recently-used-out-first cache, so even if the filter cache isn't big enough for all of them, fq's that are used by more than one user will probably stay in the cache as old user-specific fq's fall off the back as least recently used. So in actual practice, one way or another, it may not be a problem.

From: mrw [mikerobertsw...@gmail.com]
Sent: Monday, February 28, 2011 9:06 PM
To: solr-user@lucene.apache.org
Subject: Disabling caching for fq param?

Based on what I've read here and what I could find on the web, it seems that each fq clause essentially gets its own results cache. Is that correct? We have a corporate policy of passing the user's Oracle OLS labels into the index in order to be matched against the labels field. I currently separate this from the user's query text by sticking it into an fq param...

?q=user-entered expression
&fq=labels:the label values expression
&qf=song metadata copy field song lyrics field
&tie=0.1
&defType=dismax

...but since its value (a collection of hundreds of label values) only applies to that user, the accompanying result set won't be reusable by other users. My understanding is that this query will result in two result sets (q and fq) being cached separately, with the union of the two sets being returned to the user. (Is that correct?) There are thousands of users, each with a unique combination of labels, so there seems to be little value in caching the result set created from the fq labels param. It would be beneficial if there were some kind of fq parameter override to indicate to Solr not to cache the results. Thanks!

--
View this message in context: http://lucene.472066.n3.nabble.com/Disabling-caching-for-fq-param-tp2600188p2600188.html
Sent from the Solr - User mailing list archive at Nabble.com.
RE: query results filter
Hmm, depending on what you are actually needing to do, can you do it with a simple fq param to filter out what you want filtered out, instead of needing to write custom Java as you are suggesting? It would be a lot easier to just use an fq. How would you describe the documents you want to filter from the query results page? Can that description be represented by a Solr query you can already represent using the lucene, dismax, or any other existing query? If so, why not just use a negated fq describing what to omit from the results? From: Babak Farhang [farh...@gmail.com] Sent: Thursday, February 24, 2011 6:58 PM To: solr-user Subject: query results filter Hi everyone, I have some existing solr cores that for one reason or another have documents that I need to filter from the query results page. I would like to do this inside Solr instead of doing it on the receiving end, in the client. After searching the mailing list archives and Solr wiki, it appears you do this by registering a custom SearchHandler / SearchComponent with Solr. Still, I don't quite understand how this machinery fits together. Any suggestions / ideas / pointers much appreciated! Cheers, -Babak ~~ Ideally, I'd like to find / code a solution that does the following: 1. A request handler that works like the StandardRequestHandler but which allows an optional DocFilter (say, modeled like the java.io.FileFilter interface) 2. Allows current pagination to work transparently. 3. Works transparently with distributed/sharded queries.
RE: Best way for a query-expander?
I don't think there's any way to do this in Solr, although you could write your own query parser in Java if you wanted to. You can set defaults, invariants, and appends values on your request handler, but I don't think that's flexible enough to do what you want. http://wiki.apache.org/solr/SearchHandler

In general, to my perspective, Solr seems to be written assuming a trusted client. If you are allowing access to untrusted clients, there are probably all sorts of things a client can do that you wouldn't want them to; writing your own query parser might be a good idea.

From: Paul Libbrecht [p...@hoplahup.net]
Sent: Saturday, February 19, 2011 11:01 AM
To: solr-user@lucene.apache.org
Subject: Re: Best way for a query-expander?

Hello list, as Hoss suggests, I'll try to be more detailed. I wish to use http parameters in my requests that define the precise semantics of an advanced search. For example, I can see from sessions that, for a given user's request, not only public resources but also resources private to him should be returned. For example, if there's a parameter ict, I want to expand the query with an extra (mandatory) term-query. I know I could probably do this at the client level, but I do not think this is the best way, in particular regarding the access to private resources... I also think it's better not to rely too heavily on a client's ability to formulate string-queries, since that allows all sorts of tweaking that one may not wish possible, in particular for queries that are service oriented.

paul

On 19 Feb 2011, at 01:18, Chris Hostetter wrote:

: I want to implement a query-expander, one that enriches the input by the
: usage of extra parameters that, for example, a form may provide.
:
: Is the right way to subclass SearchHandler?
: Or rather to subclass QueryComponent?

This smells like the poster child for an X/Y problem (or maybe an X/(Y OR Z) problem)...
if you can elaborate a bit more on the type of enrichment you want to do, it's highly likely that your goal can be met w/o needing to write a custom plugin (i'm thinking particularly of the multitudes of parsers solr already has, local params, and variable substitution) http://people.apache.org/~hossman/#xyproblem XY Problem Your question appears to be an XY Problem ... that is: you are dealing with X, you are assuming Y will help you, and you are asking about Y without giving more details about the X so that we can understand the full issue. Perhaps the best solution doesn't involve Y at all? See Also: http://www.perlmonks.org/index.pl?node_id=542341 -Hoss
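[For what it's worth, the defaults/invariants/appends mechanism mentioned above looks roughly like this in solrconfig.xml -- a sketch, where the handler name and the access field are made-up examples:

```xml
<requestHandler name="/restricted" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>
  </lst>
  <!-- "appends" params are added to every request on top of whatever the
       client sends; "invariants" would instead override client values -->
  <lst name="appends">
    <str name="fq">access:public</str>
  </lst>
</requestHandler>
```

Since these are fixed per handler, they can't vary the added clause per user, which is the flexibility this thread finds lacking.]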
Re: GET or POST for large queries?
Yes, I think it's 1024 by default. I think you can raise it in your config. But your performance may suffer. Best would be to try and find a better way to do what you want without using thousands of clauses. This might require some custom Java plugins to Solr though. On 2/17/2011 3:52 PM, mrw wrote: Yeah, I tried switching to POST. It seems to be handling the size, but apparently Solr has a limit on the number of boolean comparisons -- I'm now getting too many boolean clauses errors emanating from org.apache.solr.handler.component.QueryComponent.prepare(QueryComponent.java:108). :) Thanks for responding. Erik Hatcher-4 wrote: Yes, you may use POST to make search requests to Solr. Erik
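The ceiling being hit above is configurable in solrconfig.xml; a sketch, assuming the stock 1.4-era config layout (4096 is just an example value, and raising it trades memory/CPU for more clauses rather than being a free lunch):

```xml
<query>
  <!-- default is 1024; the "too many boolean clauses" error comes from
       exceeding this limit -->
  <maxBooleanClauses>4096</maxBooleanClauses>
</query>
```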
optimize and mergeFactor
In my own Solr 1.4, I am pretty sure that running an index optimize does give me significantly better performance. Perhaps because I use some largeish (not huge, maybe as large as 200k) stored fields. So I'm interested in always keeping my index optimized. Am I right that if I set mergeFactor to '1', essentially my index will always be optimized after every commit, and actually running 'optimize' will be redundant? What are the possible negative repercussions of setting mergeFactor to 1? Is this a really bad idea? If not 1, what about some other lower-than-usually-recommended value like 2 or 3? Anyone done this? I imagine it will slow down my commits, but if the alternative is running optimize a lot anyway, I wonder at what point I break even (if I optimize after every single commit, clearly I might as well just set the mergeFactor low, right? But if I optimize after every X documents or Y commits, I don't know what values of X/Y are break-even). Jonathan
Re: optimize and mergeFactor
Thanks for the answers, more questions below. On 2/16/2011 3:37 PM, Markus Jelsma wrote: 200,000 stored fields? I assume that number includes your number of documents? Sounds crazy =) Nope, I wasn't clear. I have less than a dozen stored fields, but the value of a stored field can sometimes be as large as 200kb. You can set mergeFactor to 2, not lower. Am I right though that manually running an 'optimize' is the equivalent of a mergeFactor of 1? So there's no way to get Solr to keep the index in an 'always optimized' state, if I'm understanding correctly? Cool. Just want to understand what's going on. This depends on commit rate and whether there are a lot of updates and deletes instead of adds. Setting it very low will indeed cause a lot of merging and slow commits. It will also be very slow in replication because merged files are copied over again and again, causing high I/O on your slaves. There is always a `break even` but it depends (as usual) on your scenario and business demands. There are indeed sadly lots of updates and deletes, which is why I need to run optimize periodically. I am aware that this will cause more work for replication -- I think this is true whether I manually issue an optimize before replication _or_ whether I just keep the mergeFactor very low, right? Same issue either way. So... if I'm going to do lots of updates and deletes, and my other option is running an optimize before replication anyway, is there any reason it would be completely stupid to set the mergeFactor to 2 on the master? I realize it'll mean all index files are going to have to be replicated, but that would be the case if I ran a manual optimize in the same situation before replication too, I think. Jonathan
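For reference, the setting under discussion lives in solrconfig.xml; a sketch assuming a Solr 1.4-era config (2 is the lowest value mergeFactor accepts, per the thread, and as discussed it means slower commits and heavier replication I/O):

```xml
<indexDefaults>
  <!-- low value = aggressive merging, near-optimized index, slow commits -->
  <mergeFactor>2</mergeFactor>
</indexDefaults>
```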
Re: Solr multi cores or not
Solr multi-core essentially just lets you run multiple separate, distinct Solr indexes in the same running Solr instance. It does NOT let you run queries across multiple cores at once. The cores are just like completely separate Solr indexes; they are just conveniently running in the same Solr instance. (Which can be easier and more compact to set up than actually setting up separate Solr instances. And they can share some config more easily. And it _may_ have implications on JVM usage, not sure.) There is no good way in Solr to run a query across multiple Solr indexes; whether they are multi-core or single cores in separate Solr instances doesn't matter. Your first approach should be to try and put all the data in one Solr index (one Solr 'core'). Jonathan On 2/16/2011 3:45 PM, Thumuluri, Sai wrote: Hi, I have a need to index multiple applications using Solr, and I also have the need to share indexes or run a search query across these application indexes. Is Solr multi-core the way to go? My server config is 2 virtual CPUs @ 1.8 GHz and about 32GB of memory. What is the recommendation? Thanks, Sai Thumuluri
minimum Solr slave replication config
Solr 1.4.1. So, from the documentation at http://wiki.apache.org/solr/SolrReplication I was wondering if I could get away without having any actual configuration in my slave at all. The replication handler is turned on, but I'm going to manually trigger replication pulls while supplying the master URL with the command, by: command=fetchIndex&masterUrl=$solr_master Then I was thinking, gee, maybe I don't need any slave config at all. That _appears_ to not be true. In such a situation, when I tell the slave to fetchIndex with masterUrl=$solr_master, the command gives a 200 OK. But when I then go and check /replication?command=details on the slave, I'm actually presented with an exception: message null java.lang.NullPointerException at org.apache.solr.handler.ReplicationHandler.isPollingDisabled(ReplicationHandler.java:412) at So I'm thinking this is probably because you actually can't get away with no slave config at all. So: 1) Is this a bug? Maybe I did something I shouldn't have, but having command=details report a NullPointerException is probably not good, right? If someone who knows better agrees, I'll file it in JIRA. 2) Does anyone know what the minimal slave config is? If I plan to manually trigger replication pulls and supply the masterUrl, is it maybe just an empty <lst name="slave"></lst>? Or are there other parameters I have to set even though I don't plan to use them? (I do not want automatic polling, only manually triggered pulls.) Anyone have any advice, or should I just trial and error?
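For comparison, a guess at a near-minimal slave config, untested; the masterUrl value is a placeholder, and omitting pollInterval is the assumption here for avoiding automatic polling while still letting the handler initialize:

```xml
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <!-- placeholder URL; no pollInterval, so pulls happen only via a
         manual command=fetchIndex request -->
    <str name="masterUrl">http://master-host:8983/solr/replication</str>
  </lst>
</requestHandler>
```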
Re: Solr multi cores or not
Yes, you're right, from now on when I say that, I'll say except shards. It is true. My understanding is that the shards functionality's intended use case is for when your index is so large that you want to split it up for performance. I think it works pretty well for that, with some limitations as you mention. From reading the list, my impression is that when people try to use shards to solve some _other_ problem, they generally run into problems. But maybe that's just because the people with the problems are the ones who appear on the list? My personal advice is still to try and put everything together in one big index; Solr will give you the least trouble with that, it's what Solr likes to do best. Move to shards certainly if your index is so large that moving to shards will give you the performance advantage you need, that's what they're for; be very cautious moving to shards for other challenges that 'one big index' is giving you that you're thinking shards will solve. Shards is, as I understand it, _not_ intended as a general-purpose federation function; it's specifically intended to split an index across multiple hosts for performance. Jonathan On 2/16/2011 4:37 PM, Bob Sandiford wrote: Hmmm. Maybe I'm not understanding what you're getting at, Jonathan, when you say 'There is no good way in Solr to run a query across multiple Solr indexes'. What about the 'shards' parameter? That allows searching across multiple cores in the same instance, or shards across multiple instances. There are certainly implications here (like relevance not being consistent across cores / shards), but it works pretty well for us... Thanks!
Bob Sandiford | Lead Software Engineer | SirsiDynix P: 800.288.8020 X6943 | bob.sandif...@sirsidynix.com www.sirsidynix.com
Re: Multicore boosting to only 1 core
No. In fact, there's no way to search over multiple cores at once in Solr at all, even before you get to your boosting question. Your different cores are entirely different Solr indexes; Solr has no built-in way to combine searches across multiple Solr indexes. [Well, sort of it can, with sharding. But sharding is unlikely to be a solution to your problem either, UNLESS your problem is that your Solr index is so big you want to split it across multiple machines for performance. That is the problem sharding is meant to solve. People trying to use it to solve other problems run into trouble.] On 2/14/2011 1:59 PM, Tanner Postert wrote: I have a multicore system and I am looking to boost results by date, but only for 1 core. Is this at all possible? Basically, one core's content is very new and changes all the time, and if I boost everything by date, that core's content will almost always be at the top of the results. So I only want to apply the date boosting to the cores that have older content, so that their more recent results get boosted over the older content.
Re: schema.xml configuration for file names?
You can't just send arbitrary XML to Solr for update, no. You need to send a Solr update request in XML. You can write software that transforms that arbitrary XML into a Solr update request; for simple cases it could even just be XSLT. There are also a variety of other mediator pieces that come with Solr for doing updates; you can send updates in comma-separated-value format, or you can use the Data Import Handler to, in some not-too-complicated cases, embed the translation from your arbitrary XML to Solr documents in your Solr instance itself. But you can't just send arbitrary XML to the Solr update handler, no. No matter what method you use to send documents to Solr, you're going to have to think about what you want your Solr schema to look like -- what fields of what types. And then map your data to it. In Solr, unlike in an RDBMS, what you want your schema to look like has a lot to do with what kinds of queries you will want it to support; it can't just be done based on the nature of the data alone. Jonathan On 2/15/2011 12:45 PM, alan bonnemaison wrote: Erick, I think you put the finger on the problem. Our XML files (we get from our suppliers) do *not* look like that. This is what a typical file looks like: <insert_list>...<result><result outcome="PASS"></result><parameter_list><string_parameter name="SN" value="NOVAL" /><string_parameter name="RECEIVER" value="000907010391" /><string_parameter name="Model" value="R16-500" />...<string_parameter name="WorkCenterID" value="PREP" /><string_parameter name="SiteID" value="CTCA" /><string_parameter name="RouteID" value="ADV" /><string_parameter name="LineID" value="Line5" /></parameter_list><config enable_sfcs_comm="true" enable_param_db_comm="false" force_param_db_update="false" driver_platform="LABVIEW" mode="PROD" driver_revision="2.0"></config></insert_list> Obviously, nothing like <add><doc>...</doc></add> By the way, querying q=*:* returned HTTP error 500 (NullPointerException), which leads me to believe that my index is 100% empty. What I am trying to do cannot be done, correct?
I just don't want to waste anyone's time. Thanks, Alan. On Tue, Feb 15, 2011 at 6:01 AM, Erick Erickson erickerick...@gmail.com wrote: Can we see a small sample of an XML file you're posting? Because it should look something like <add> <doc> <field name="stbmodel">R16-500</field> more fields here. </doc> </add> Take a look at the Solr admin page after you've indexed data to see what's actually in your index; I suspect what's in there isn't what you expect. Try querying q=*:* just for yucks to see what the documents returned look like. I suspect your index doesn't contain anything like what you think, but that's only a guess... Best Erick On Mon, Feb 14, 2011 at 7:15 PM, alan bonnemaison kg6...@gmail.com wrote: Hello! We receive from our suppliers hardware manufacturing data in XML files. On a typical day, we get 25,000 files. That is why I chose to implement Solr. The file names are made of eleven fields separated by tildes, like so: CTCA~PRE~PREP~1010123~ONTDTVP5A~41~P~R16-500~000912239878~20110125~212321.XML Our R&D guys want to be able to search each field of the XML file names (OR operation), but they don't care to search the file contents. Ideally, they would like to query all files where stbmodel equals R16-500, or result is P, or filedate is 20110125... you get the idea.
I defined in schema.xml each data field like so (from left to right -- sorry for the long list): <field name="location" type="textgen" indexed="false" stored="true" multiValued="false"/> <field name="scriptid" type="textgen" indexed="false" stored="true" multiValued="false"/> <field name="slotid" type="textgen" indexed="false" stored="true" multiValued="false"/> <field name="workcenter" type="textgen" indexed="false" stored="false" multiValued="false"/> <field name="workcenterid" type="textgen" indexed="false" stored="false" multiValued="false"/> <field name="result" type="string" indexed="true" stored="true" multiValued="false"/> <field name="computerid" type="textgen" indexed="false" stored="true" multiValued="false"/> <field name="stbmodel" type="textgen" indexed="true" stored="true" multiValued="false"/> <field name="receiver" type="string" indexed="true" stored="true" multiValued="false"/> <field name="filedate" type="textgen" indexed="false" stored="true" multiValued="false"/> <field name="filetime" type="textgen" indexed="false" stored="true" multiValued="false"/> Also, I defined the field receiver as the unique key. But no results are returned by my queries. I made sure to update my index like so: java -jar apache-solr-1.4.1/example/exampledocs/post.jar *.XML I am obviously missing something. Is there a way to configure schema.xml to search for file names? I welcome your input. Al.
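Since the requirement above is field-level search over the file name, the indexing side just has to split each name on the tildes before posting documents. A sketch; the left-to-right field order is an assumption based on the example name and the field list in this thread:

```python
# Split a tilde-separated file name into the schema fields described above.
# The field order is assumed from the thread's example; adjust to taste.
fields = ["location", "scriptid", "slotid", "workcenter", "workcenterid",
          "result", "computerid", "stbmodel", "receiver", "filedate", "filetime"]

name = "CTCA~PRE~PREP~1010123~ONTDTVP5A~41~P~R16-500~000912239878~20110125~212321.XML"
values = name.rsplit(".", 1)[0].split("~")   # drop the .XML extension, then split
doc = dict(zip(fields, values))              # one Solr document per file name

print(doc["stbmodel"], doc["receiver"], doc["filedate"])  # R16-500 000912239878 20110125
```

Each resulting dict can then be posted as a normal Solr add document, which also explains why raw supplier XML alone produced an empty index.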
RE: Concurrent updates/commits
Solr does handle concurrency fine. But there is NOT transaction isolation like you'll get from an RDBMS. All 'pending' changes are (conceptually, anyway) held in a single queue, and any commit will commit ALL of them. There aren't going to be any data corruption issues or anything from concurrent adds (unless there's a bug in Solr, which there isn't supposed to be) -- but there is no kind of transaction or isolation between different concurrent adders. So, sure, everyone can add concurrently -- but any time any of those actors issues a commit, all pending adds are committed. In addition, there are problems with Solr's basic architecture and _too frequent_ commits (whether made by different processes or not, doesn't matter). When a new commit happens, Solr fires up a new index searcher and warms it up on the new version of the index. Until the new index searcher is fully warmed, the old index searcher is still serving queries. Which can also mean that there are, for this period, TWO versions of all your caches in RAM and such. So let's say it takes 5 minutes for the new index to be fully warmed. But if you have commits happening every 1 minute -- then you'll end up with FIVE 'new indexes' being warmed -- meaning potentially 5 times the RAM usage (quickly running into a JVM out-of-memory error), and lots of CPU activity spent warming indexes that will never actually be used (because even though they aren't done being warmed and ready to use, they've already been superseded by a later commit). I don't know of any good way to deal with this except less frequent commits. One way to get less frequent commits is to use Solr replication and 'stage' all your commits in a 'master' index, but only replicate to the 'slave' at a frequency slow enough that the new index is fully warmed before the next commit happens. Some new features in trunk (both Lucene and Solr) for 'near real time' search ameliorate this problem somewhat, depending on the nature of your commits.
Jonathan From: Savvas-Andreas Moysidis [savvas.andreas.moysi...@googlemail.com] Sent: Wednesday, February 09, 2011 10:34 AM To: solr-user@lucene.apache.org Subject: Concurrent updates/commits Hello, This topic has probably been covered before here, but we're still not very clear about how multiple commits work in Solr. We currently have a requirement to make our domain objects searchable immediately after they get updated in the database by some user action. This could potentially cause multiple updates/commits to be fired at Solr, and we are trying to investigate how Solr handles those multiple requests. This thread: http://search-lucene.com/m/0cab31f10Mh/concurrent+commitssubj=commit+concurrency+full+text+search suggests that Solr will handle all of the lower-level details and that Before a *COMMIT* is done, a lock is obtained and it is released after the operation, which in my understanding means that Solr will serialise all update/commit requests? However, the Solr book, in the Commit, Optimise, Rollback section, reads: if more than one Solr client were to submit modifications and commit them at similar times, it is possible for part of one client's set of changes to be committed before that client told Solr to commit which suggests that requests are *not* serialised. Our questions are: - Does Solr handle concurrent requests, or do we need to add synchronisation logic around our code? - If Solr *does* handle concurrent requests, does it serialise each request or does it have some other strategy for processing them? Thanks, - Savvas
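One related knob worth knowing about: the warming pile-up described in this thread can be bounded in solrconfig.xml with maxWarmingSearchers, which caps how many searchers may be warming at once (a commit that would exceed the cap fails instead of silently stacking up RAM and CPU). A sketch; 2 is the value commonly seen in example configs:

```xml
<query>
  <!-- with this cap, runaway commit rates error out early instead of
       accumulating five half-warmed searchers as in the example above -->
  <maxWarmingSearchers>2</maxWarmingSearchers>
</query>
```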
RE: relational db mapping for advanced search
I have no great answer for you; this is to me a generally unanswered question, and it's hard to do this sort of thing in Solr, which I think you understand properly. There ARE some interesting new features in trunk (not 1.4) that may be relevant, although from my perspective none of them provide magic-bullet solutions. But there is a 'join' feature which could be awfully useful with the setup you suggest of having different 'types' of documents all together in the same index. https://issues.apache.org/jira/browse/SOLR-2272 From: Scott Yeadon [scott.yea...@anu.edu.au] Sent: Tuesday, February 08, 2011 4:41 PM To: solr-user@lucene.apache.org Subject: relational db mapping for advanced search Hi, I was just after some advice on how to map some relational metadata to a Solr index. The web application I'm working on is based around people, and the searching around properties of these people. Several properties are more complex - for example, a person's occupations have place, from/to dates, and other descriptive text; texts about a person have authors, sources, and publication dates. Despite the usefulness of facets and the search-based navigation, an advanced search feature is a non-negotiable required feature of the application. An advanced search needs to be able to query a person on any set of attributes (e.g. gender, birth date, death date, place of birth, etc.), including the more complex search criteria described above (occupation, texts). Taking occupation as an example: because occupation has its own metadata and a person could have worked an arbitrary number of occupations throughout their lifetime, I was wondering how/if this information can be denormalised into a single person index document to support such a search. I can't use text concatenation in a multivalued field, as I need to be able to run date-based range queries (e.g. publication dates, occupation dates). And I'm not sure that resorting to multiple repeated fields based on the current limits (e.g.
occ1, occ1startdate, occ1enddate, occ1place, occ2, etc.) is a good approach (although it would work). If there isn't a sensible way to denormalise this, what is the best approach? For example, should I have an occupation document type, a person document type, and a text/source document type, each containing the relevant person id, and (in the advanced search context) run a query against each document type and then use the intersecting set of person ids as the result used by the application for its display/pagination? And if so, how do I ensure I capture all records - for example, if there are 100,000 hits on someone having worked in Australia in 1956, is there any way to ensure all 100,000 are returned in a query (similar to facet.limit=-1), other than specifying an arbitrarily high number in the rows parameter and hoping a query doesn't hit more than 100,000 and thus exclude those above the limit from the intersect processing? Or is there a single-query solution? Any advice/hints welcome. Scott.
RE: prices
Your prices are just dollars and cents? For actual queries, you might consider an int type rather than a float type. Multiply by a hundred to put the value in the index, and multiply the values in your queries by a hundred before sending them. Same for range faceting; just divide by 100 before displaying anything you get back. Fixed-precision values like prices aren't really floats and don't really need floats, and floats sometimes do weird things, as you've noticed. Alternately, if your problem is simply that you want to display 2.0 as 2.00 rather than 2 or 2.0, that is something for you to take care of in your PHP app that does the display. PHP will have some function for formatting numbers and saying with what precision you want to display them. There is no way to keep two trailing zeroes 'in' a float field, because 2.0 and 2 are the same value as 2.00, so they've all got the same internal representation in the float field. There is no way I know of to tell Solr what precision to render floats with in its responses. From: ysee...@gmail.com [ysee...@gmail.com] On Behalf Of Yonik Seeley [yo...@lucidimagination.com] Sent: Friday, February 04, 2011 1:49 PM To: solr-user@lucene.apache.org Subject: Re: prices On Fri, Feb 4, 2011 at 12:56 PM, Dennis Gearon gear...@sbcglobal.net wrote: Using Solr 1.4. I have a price in my schema. Currently it's a tfloat. Somewhere along the way from PHP, JSON, Solr, and back, extra zeroes are getting truncated along with the decimal point for even dollar amounts. So I have two questions, neither of which seemed to be findable with Google. A/ Any way to keep both zeroes going into a float field? (In the analyzer, with XML output, the values are shown with 1 zero) B/ Can strings be used in range queries like a float and work well for prices? You could do a copyField into a stored string field and use the tfloat (or tint and store cents) for range queries, searching, etc., and the string field just for display.
-Yonik http://lucidimagination.com Dennis Gearon Signature Warning It is always a good idea to learn from your own mistakes. It is usually a better idea to learn from others’ mistakes, so you do not have to make them yourself. from 'http://blogs.techrepublic.com.com/security/?p=4501tag=nl.e036' EARTH has a Right To Life, otherwise we all die.
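The cents-as-int advice above boils down to converting at the boundaries and formatting only at display time. A minimal sketch (in Python rather than the poster's PHP, purely for illustration):

```python
# Store prices as integer cents; convert on the way in/out of the index,
# and keep trailing zeroes by formatting in the app layer, not in Solr.
def to_cents(price_str):
    """'2', '2.5', '2.00' -> 200, 250, 200."""
    dollars, _, cents = price_str.partition(".")
    return int(dollars) * 100 + int((cents + "00")[:2])

def display(cents):
    """200 -> '2.00' (keeps the trailing zeroes a float field drops)."""
    return f"{cents // 100}.{cents % 100:02d}"

print(display(to_cents("2")))  # 2.00
```

The same two conversions would sit in the PHP layer in the original poster's setup; Solr itself only ever sees the integer.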
Re: changing schema
It could be related to Tomcat. I've had inconsistent experiences there too. I _thought_ I could delete just the contents of the data/ directory, but at some point I realized that wasn't working, confusing me as to whether I was remembering correctly that deleting just the contents had ever worked. At the moment, on my setup, I definitely need to delete the whole data/ directory. At one point I switched my setup from Jetty to Tomcat, but at about the same point I switched from single core to multi-core too. So it could be a multi-core thing too (which seems somewhat more likely than Jetty vs Tomcat making a difference). Or it could be something completely else that none of us know; I just report my limited observations from experience. :) Jonathan On 2/3/2011 8:17 AM, Erick Erickson wrote: Erik: Is this a Tomcat-specific issue? Because I regularly delete just the data/index directory on my Windows box running Jetty without any problems. (3_x and trunk) Mostly want to know because I just encouraged someone to just delete the index dir based on my experience... Thanks Erick On Tue, Feb 1, 2011 at 12:24 PM, Erik Hatcher erik.hatc...@gmail.com wrote: the trick is, you have to remove the data/ directory, not just the data/index subdirectory. and of course then restart Solr. or delete *:*?commit=true, depending on what's the best fit for your ops. Erik On Feb 1, 2011, at 11:41, Dennis Gearon wrote: I tried removing the index directory once, and Tomcat refused to start up because it didn't have a segments file. - Original Message From: Erick Erickson erickerick...@gmail.com To: solr-user@lucene.apache.org Sent: Tue, February 1, 2011 5:04:51 AM Subject: Re: changing schema That sounds right. You can cheat and just remove <solr_home>/data/index rather than delete *:* though (you should probably do that with the Solr instance stopped). Make sure to remove the index directory itself as well.
Best Erick On Tue, Feb 1, 2011 at 1:27 AM, Dennis Gearon gear...@sbcglobal.net wrote: Anyone got a great little script for changing a schema? i.e., after changing: the database, the view in the database for data import, the data-config.xml file, and the schema.xml file, I BELIEVE that I have to run: a delete command for the whole index (*:*), a full import, and an optimize. This all sound right? Dennis Gearon
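The reindex cycle asked about above (delete everything, full import, optimize) reduces to three HTTP calls. A sketch that just constructs the requests; the base URL and the /dataimport path are assumptions (the default example setup with the DataImportHandler registered):

```python
# Sketch only: the three requests of a delete / full-import / optimize cycle.
# URLs are assumptions based on the default example Solr layout.
base = "http://localhost:8983/solr"

steps = [
    ("POST", base + "/update?commit=true", "<delete><query>*:*</query></delete>"),
    ("GET",  base + "/dataimport?command=full-import", None),
    ("POST", base + "/update", "<optimize/>"),
]
for method, url, body in steps:
    print(method, url, body or "")
```

In practice each step would be a curl or urllib call against a live Solr, with the optimize issued only after the full import reports completion.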
Re: OAI on SOLR already done?
The trick is that you can't just have a generic black-box OAI-PMH provider on top of any Solr index. How would it know where to get the metadata elements it needs, such as title, or last-updated date, etc.? Any given Solr index might not even have these in stored fields -- and a given app might want to look them up from somewhere other than stored fields. If the Solr index does have them in stored fields, and you do want to get them from the stored fields, then it's, I think (famous last words), relatively straightforward code to write: a mapping from Solr stored fields to the metadata elements needed for OAI-PMH, and then simply outputting the XML template with those filled in. I am not aware of anyone who has done this in a reusable/configurable-for-your-Solr tool. You could possibly do it solely using the built-in Solr JSP/XSLT/other-templating-stuff-I-am-not-familiar-with, rather than as an external Solr client app, or it could be an external Solr client app. This is actually a very similar problem to something someone else asked a few days ago: Does anyone have an OpenSearch add-on for Solr? Very, very similar problem, just with a different XML template for output (usually RSS or Atom) instead of OAI-PMH. On 2/2/2011 3:14 PM, Paul Libbrecht wrote: Peter, I'm afraid your service is harvesting, and I am trying to look at a PMH provider service. Your project appeared early in the Google matches. paul On 2 Feb 2011 at 20:46, Péter Király wrote: Hi, I don't know whether it fits your need, but we are building a tool based on Drupal (eXtensible Catalog Drupal Toolkit), which can harvest with OAI-PMH and index the harvested records into Solr. The records are harvested, processed, and stored in MySQL, then we index them into Solr. We created some ways to manipulate the original values before sending them to Solr.
We created it in a modular way, so you can change settings in an admin interface or write your own hooks (special Drupal functions) to tailor the application to your needs. We support only Dublin Core and our own FRBR-like schema (called XC schema), but you can add more schemas. Since this forum is about Solr, and not applications using Solr, if you are interested in this tool, please write me a private message, or visit http://eXtensibleCatalog.org, or the module's page at http://drupal.org/project/xc. Hope this helps, Péter eXtensible Catalog 2011/2/2 Paul Libbrecht p...@hoplahup.net: Hello list, I've come across a few Google matches that indicate that Solr-based servers implement the Open Archives Initiative's Metadata Harvesting Protocol. Is there something made to be re-usable that would be an add-on to Solr? thanks in advance paul
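The stored-fields-to-template mapping suggested earlier in this thread can be sketched in a few lines. This is a minimal illustration only; the field names and the FIELD_MAP are hypothetical, and a real OAI-PMH provider also needs resumption tokens, datestamp handling, sets, and proper XML escaping:

```python
# Map Solr stored fields to Dublin Core elements and fill a record template.
# FIELD_MAP keys are hypothetical stored-field names in some Solr schema.
FIELD_MAP = {"title": "dc:title", "author": "dc:creator", "last_updated": "dc:date"}

def to_dc_record(solr_doc):
    parts = ["<oai_dc:dc>"]
    for solr_field, dc_element in FIELD_MAP.items():
        if solr_field in solr_doc:
            parts.append(f"  <{dc_element}>{solr_doc[solr_field]}</{dc_element}>")
    parts.append("</oai_dc:dc>")
    return "\n".join(parts)

print(to_dc_record({"title": "A Book", "author": "Smith", "last_updated": "2011-02-02"}))
```

The same shape works for the OpenSearch case mentioned above; only the output template changes (RSS/Atom instead of oai_dc).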
Re: OAI on SOLR already done?
On 2/2/2011 5:19 PM, Dennis Gearon wrote: Does something like this work to extract dates, phone numbers, addresses across international formats and languages? Or just in the plain ol' USA? What are you talking about? There is nothing discussed in this thread that does any 'extracting' of dates, phone numbers, or addresses at all, whether in international or domestic formats.
RE: DismaxParser Query
Yes, I think nested queries are the only way to do that, and yes, nested queries like Daniel's example work (I've done it myself). I haven't really tried to understand/demonstrate _exactly_ how the relevance ends up working on the overall master query in such a situation, but it sort of works. (Just note that Daniel's example isn't quite right; I think you need double quotes around the nested _query_, just check the wiki page/blog post on nested queries.) Does eDismax handle parens for order of operation too? If so, eDismax is probably the best/easiest solution, especially if you're trying to parse an incoming query from some OTHER format and translate it to something that can be sent to Solr, which is what I often do. I haven't messed with eDismax myself yet. Does anyone know if there's any easy (easy!) way to get eDismax into Solr 1.4? Any easy way to compile the eDismax query parser on its own so that it works with Solr 1.4, and then just drop it into your local lib/ for use with an existing Solr 1.4? Jonathan From: Daniel Pötzinger [daniel.poetzin...@aoemedia.de] Sent: Thursday, January 27, 2011 9:26 AM To: solr-user@lucene.apache.org Subject: RE: DismaxParser Query It may also be an option to mix the query parsers? Something like this (not tested): q={!lucene}field1:test OR field2:test2 _query_:{!dismax qf=fields}+my dismax -bad So you have the benefits of both the lucene and dismax parsers. -----Original Message----- From: Erick Erickson [mailto:erickerick...@gmail.com] Sent: Thursday, 27 January 2011 15:15 To: solr-user@lucene.apache.org Subject: Re: DismaxParser Query What version of Solr are you using, and could you consider either 3x or applying a patch to 1.4.1? Because eDismax (extended dismax) handles the full Lucene query language and probably works here.
See the Solr JIRA 1553 at https://issues.apache.org/jira/browse/SOLR-1553 Best Erick On Thu, Jan 27, 2011 at 8:32 AM, Isan Fulia isan.fu...@germinait.com wrote: It worked by making mm=0 (it acted as an OR operator), but how to handle this: field1:((keyword1 AND keyword2) OR (keyword3 AND keyword4)) OR field2:((keyword1 AND keyword2) OR (keyword3 AND keyword4)) OR field3:((keyword1 AND keyword2) OR (keyword3 AND keyword4)) On 27 January 2011 17:06, lee carroll lee.a.carr...@googlemail.com wrote: sorry, ignore that - we are on dismax here - look at the mm param in the docs, you can set this to achieve what you need On 27 January 2011 11:34, lee carroll lee.a.carr...@googlemail.com wrote: the default operation can be set in your config to be OR, or on the query, something like q.op=OR On 27 January 2011 11:26, Isan Fulia isan.fu...@germinait.com wrote: but q=keyword1 keyword2 does an AND operation, not OR On 27 January 2011 16:22, lee carroll lee.a.carr...@googlemail.com wrote: use dismax q for the first three fields and a filter query for the 4th and 5th fields, so q=keyword1 keyword2 qf=field1,field2,field3 pf=field1,field2,field3 mm=something sensible for you defType=dismax fq=field4:(keyword3 OR keyword4) AND field5:(keyword5) take a look at the dismax docs for the extra params On 27 January 2011 08:52, Isan Fulia isan.fu...@germinait.com wrote: Hi all, The query for the standard request handler is as follows: field1:(keyword1 OR keyword2) OR field2:(keyword1 OR keyword2) OR field3:(keyword1 OR keyword2) AND field4:(keyword3 OR keyword4) AND field5:(keyword5) How can the same query be written for the dismax request handler? -- Thanks & Regards, Isan Fulia.
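To make the double-quote point from this thread concrete, here is a corrected version of Daniel's mixed-parser query, shown being URL-encoded for a request. The field names are hypothetical; the key detail is the double quotes wrapping the nested {!dismax} query:

```python
from urllib.parse import urlencode

# The nested _query_ needs double quotes around the {!dismax ...} clause,
# as noted in the thread; without them the outer parser misreads it.
q = 'field1:test OR field2:test2 OR _query_:"{!dismax qf=field3}+my dismax -bad"'
params = urlencode({"q": q, "defType": "lucene"})
print("/select?" + params)
```

Sending the printed query string to a live Solr would then run the lucene-parsed clauses and the nested dismax clause together.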
Re: How to edit / compile the SOLR source code
[Btw, this is great, thank you so much to Solr devs for providing simple ant-based compilation, and not making me install specific development tools and/or figure out how to use maven to compile, like certain other java projects. Just make sure ant is installed and 'ant dist', I can do that! I more or less know how to write Java, at least for simple things, but I still have trouble getting the right brew of required Java dev tools working properly to compile some projects!] On 1/26/2011 4:19 PM, Erick Erickson wrote: Sure, at the top level (above src) you should be able to just type ant dist, then look in the dist directory and there should be a solrversion.war Best Erick On Wed, Jan 26, 2011 at 11:43 AM, Anurag anurag.it.jo...@gmail.com wrote: Actually I also want to edit source files of Solr. Does that mean I will have to go into the src directory of Solr and then rebuild using ant? Need I not compile them, or will Ant do the whole compiling as well as updating the jar files? I have the following files in the Solr-1.3.0 directory /home/anurag/apache-solr-1.3.0/build /home/anurag/apache-solr-1.3.0/client /home/anurag/apache-solr-1.3.0/contrib /home/anurag/apache-solr-1.3.0/dist /home/anurag/apache-solr-1.3.0/docs /home/anurag/apache-solr-1.3.0/example /home/anurag/apache-solr-1.3.0/lib /home/anurag/apache-solr-1.3.0/src /home/anurag/apache-solr-1.3.0/build.xml /home/anurag/apache-solr-1.3.0/CHANGES.txt /home/anurag/apache-solr-1.3.0/common-build.xml /home/anurag/apache-solr-1.3.0/KEYS.txt /home/anurag/apache-solr-1.3.0/LICENSE.txt /home/anurag/apache-solr-1.3.0/NOTICE.txt /home/anurag/apache-solr-1.3.0/README.txt and I want to edit the source code to implement my changes. How should I proceed? - Kumar Anurag -- View this message in context: http://lucene.472066.n3.nabble.com/How-to-edit-compile-the-SOLR-source-code-tp477584p2355270.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: in-index representation of tokens
Why does it matter? You can't really get at them unless you store them. I don't know what table per column means; there's nothing in Solr architecture called a table or a column. Although by column you probably mean more or less a Solr field. There is nothing like a table in Solr. Solr is still not an rdbms. On 1/25/2011 12:26 PM, Dennis Gearon wrote: So, the index is a list of tokens per column, right? There's a table per column that lists the analyzed tokens? And the tokens per column are represented as what, system integers? 32/64 bit unsigned ints? Dennis Gearon Signature Warning It is always a good idea to learn from your own mistakes. It is usually a better idea to learn from others’ mistakes, so you do not have to make them yourself. from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036' EARTH has a Right To Life, otherwise we all die.
Re: EdgeNgram Auto suggest - doubles ignore
I haven't figured out any way to achieve that AT ALL without making a separate Solr index just to serve autosuggest queries. At least when you want to auto-suggest on a multi-value field. Someone posted a crazy tricky way to do it with a single-valued field a while ago. If you can/are willing to make a separate Solr index with a schema set up for auto-suggest specifically, it's easy. But from an existing schema, where you want to auto-suggest just based on the values in one field, it's a multi-valued field, and you want to allow matches in the middle of the field -- I don't think there's a way to do it. On 1/25/2011 3:03 PM, johnnyisrael wrote: Hi Eric, What I want here is, let's say I have 3 documents like [pineapple vers apple, milk with apple, apple milk shake] and if I search for apple, it should return only apple milk shake because that term alone starts with the word apple which I typed in. It should not bring the others, and if I type milk it should return only milk with apple. I want output similar to Google auto-suggest. Is there a way to achieve this without encapsulating with double quotes. Thanks, Johnny
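The reason the other documents match can be seen in a toy expansion of what an edge n-gram filter does per whitespace token (illustrative Python, mimicking EdgeNGramFilter behavior, not Solr API): every word's prefixes get indexed, so "milk with apple" matches the typed prefix "a" via its third word.

```python
def edge_ngrams(term, min_gram=1, max_gram=25):
    # EdgeNGramFilter-style expansion: all prefixes of one token.
    return [term[:n] for n in range(min_gram, min(max_gram, len(term)) + 1)]

def index_tokens(field_value):
    # Whitespace-tokenize, then expand each token -- which is why a match
    # can occur in the middle of the field, not only at its start.
    out = []
    for word in field_value.split():
        out.extend(edge_ngrams(word))
    return out
```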
Re: EdgeNgram Auto suggest - doubles ignore
Ah, sorry, I got confused about your requirements; if you just want to match at the beginning of the field, it may be more possible. Using edgegrams or wildcard. If you have a single-valued field. Do you have a single-valued or a multi-valued field? That is, does each document have just one value, or multiple? I still get confused about how to do it with edgegrams, even with a single-valued field, but I think maybe it's possible. _Definitely_ possible, with or without edgegrams, if you are willing/able to make a completely separate Solr index where each term for auto-suggest is a document. Yes. The problem lies in what results are. In general, Solr's results are the documents you have in the Solr index. Thus it makes everything a lot easier to deal with if you have an index where each document in the index is a term for auto-suggest. But that doesn't always meet requirements if you need to auto-suggest within existing fq's and such, and of course it takes more resources to run an additional solr index. On 1/25/2011 5:03 PM, mesenthil wrote: The index contains around 1.5 million documents. As this is used for the autosuggest feature, performance is an important factor. So it looks like, using edgeNgram, it is difficult to achieve the following: Result should return only those terms where the search letter is matching with the first word only. For example, when we type M, it should return Mumford and Sons and not jackson Michael. Jonathan, Is it possible to achieve this when we have a separate index using edgeNgram?
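The field-initial-only behavior mesenthil wants amounts to treating each suggestion as one untokenized value (KeywordTokenizer-style) and matching the typed prefix against the whole string; a sketch (illustrative Python, not Solr API):

```python
def untokenized_prefix_match(field_values, typed):
    # Each suggestion is a single token; only values whose WHOLE string
    # starts with the typed prefix match -- no mid-field hits.
    typed = typed.lower()
    return [v for v in field_values if v.lower().startswith(typed)]
```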
Re: Specifying optional terms with standard (lucene) request handler?
With the 'lucene' query parser? Include q.op=OR and then put a + (mandatory) in front of every term in the 'q' that is NOT optional; the rest will be optional. I think that will do what you want. Jonathan On 1/25/2011 5:07 PM, Daniel Pötzinger wrote: Hi I am searching for a way to specify optional terms in a query (terms that don't need to match, but if they match should influence the scoring). Using the dismax parser a query like this: <str name="mm">2</str> <str name="debugQuery">on</str> <str name="q">+lorem ipsum dolor amet</str> <str name="qf">content</str> <str name="hl.fl"/> <str name="qt">dismax</str> Will be parsed into something like this: <str name="parsedquery_toString">+((+(content:lor) (content:ipsum) (content:dolor) (content:amet))~2) ()</str> Which means that only 2 of the 3 optional terms need to match. How can optional terms be specified using the standard request handler? My concrete requirement is that a certain term should match but another is optional. But if the optional part matches - it should give the document an extra score. Something like :-) <str name="q">content:lorem #optional#content:optionalboostword^10</str> An idea would be to use a function query to boost the document: <str name="q">content:lorem _val_:query({!lucene v='optionalword^20'})</str> Which will result in: <str name="parsedquery_toString">+content:forum +query(content:optionalword^20.0,def=0.0)</str> Is this a good way or are there other suggestions? Thanks for any opinion and tips on this Daniel
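Jonathan's suggestion is just string assembly: with q.op=OR, prefix the mandatory terms with + and leave the optional (score-boosting) terms bare. A sketch (illustrative Python, not Solr API):

```python
def lucene_optional_query(required, optional):
    # '+' marks mandatory terms; with q.op=OR the remaining terms are
    # optional but still raise the score of documents that contain them.
    return " ".join(["+" + t for t in required] + list(optional))
```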
RE: in-index representation of tokens
There aren't any tables involved. There's basically one list (per field) of unique tokens for the entire index, and also, a list for each token of which documents contain that token. Which is efficiently encoded, but I don't know the details of that encoding; maybe someone who does can tell you, or you can look at the lucene source, or get one of the several good books on lucene. These 'lists' are set up so you can efficiently look up a token, and see what documents contain that token. That's basically what lucene does, the purpose of lucene. Oh, and then there's term positions and such too, so not only can you see what documents contain that token but you can do proximity searches and stuff. This all gets into lucene implementation details I am not familiar with though. Why do you want to know? If you have specific concerns about disk space or RAM usage or something and how different schema choices affect it, ask them, and someone can probably tell you more easily than someone can explain the total architecture of lucene in a short listserv message. But, hey, maybe someone other than me can do that too! From: Dennis Gearon [gear...@sbcglobal.net] Sent: Tuesday, January 25, 2011 7:02 PM To: solr-user@lucene.apache.org Subject: Re: in-index representation of tokens I am saying there is a list of tokens that have been parsed (a table of them) for each column? Or one for the whole index? Dennis Gearon Signature Warning It is always a good idea to learn from your own mistakes. It is usually a better idea to learn from others’ mistakes, so you do not have to make them yourself. from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036' EARTH has a Right To Life, otherwise we all die.
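A toy version of the structure described above — a per-field term dictionary with postings and positions — can be sketched like this (illustrative Python only; it ignores all of Lucene's real encoding):

```python
def build_inverted_index(docs):
    # docs: {doc_id: [tokens]}. Returns token -> postings list of
    # (doc_id, position), so you can look up a token and see which
    # documents contain it and where (for proximity searches).
    index = {}
    for doc_id, tokens in docs.items():
        for pos, tok in enumerate(tokens):
            index.setdefault(tok, []).append((doc_id, pos))
    return index
```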
Re: Taxonomy in SOLR
There aren't any great general purpose out of the box ways to handle hierarchical data in Solr. Solr isn't an rdbms. There may be some particular advice on how to set up a particular Solr index to answer particular questions with regard to hierarchical data. I saw a great point made recently comparing rdbms to NoSQL stores, which applies to Solr too even though Solr is NOT a NoSQL store. In an rdbms, you set up your schema thinking only about your _data_, and modelling your data as flexibly as possible. Then once you've done that, you can ask pretty much any well-specified question you want of your data, and get a correct and reasonably performant answer. In Solr, on the other hand, we set up our schemas to answer particular questions. You have to first figure out what kinds of questions you will want to ask Solr, what kinds of queries you'll want to make, and then you can figure out how to structure your data to ask those questions. Some questions are actually very hard to set up Solr to answer -- in general Solr is about setting up your data so whatever question you have can be reduced to asking is token X in field Y. This can be especially tricky in cases where you want to use a single Solr index to answer multiple questions, where the questions are such that you really need to set up your data _differently_ to get Solr to optimally answer each question. Solr is not a general purpose store like an rdbms, where you can set up your schema once in terms of your data and use it to answer nearly any conceivable well-specified question after that. Instead, Solr does things that rdbms can't do quickly or can't do at all. But you lose some things too. On 1/24/2011 3:03 AM, Damien Fontaine wrote: Hi, I am trying Solr and I have one question. In the schema that I set up, there are 10 fields with always the same data (hierarchical taxonomies), but with 4 million documents, disk space and indexing time must be big. I need this field for auto complete.
Is there another way to do this type of operation? Damien
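One common workaround for hierarchical facets — not from this thread, just a sketch under the "is token X in field Y" framing above — is to index each ancestor path of a taxonomy entry as its own token, so drilling into a level becomes an ordinary token filter (illustrative Python, not Solr API):

```python
def path_tokens(path, sep="/"):
    # "Science/Physics/Optics" -> every ancestor path, each indexed as a
    # token, so fq on "Science/Physics" matches all of its descendants.
    parts = path.split(sep)
    return [sep.join(parts[:i]) for i in range(1, len(parts) + 1)]
```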
RE: filter update by IP
My favorite other external firewall'ish technology is just an apache front-end reverse proxying to the Java servlet (such as Solr), with access controls in apache. I haven't actually done it with Solr myself though; my Solr is behind a firewall accessed by trusted apps only. Be careful making your Solr viewable to the world, even behind an other external firewall'ish technology. There are several features in Solr you do NOT want to expose to the world (the ability to change the index in general, which can be done in a variety of ways beyond the /update/csv handler, such as the straight /update handler. Also consider the replication commands -- the example Solr solrconfig.xml, at least, will allow an HTTP request that tells Solr to replicate from an arbitrarily specified 'master', definitely not something you'd want open to the world either! There may be other examples too you might not think of at first.). My impression is that Solr is written assuming it will be safely ensconced behind a firewall and accessed by trusted applications only. If you're not going to do this, you're going to have to be careful to make sure to lock down or remove a lot of things; /update/csv is just barely a start. I don't know if anyone has analyzed and written up secure ways to do this -- it sounds like there would be interest for such since it keeps coming up on the list. Kind of personally curious _why_ it keeps coming up on the list so much. Is everyone trying to go into business vending Solr in the cloud to customers who will write their own apps, or are there some other less obvious (to me) use cases? From: Erik Hatcher [erik.hatc...@gmail.com] Sent: Sunday, January 23, 2011 1:47 PM To: solr-user@lucene.apache.org Subject: Re: filter update by IP No. SolrQueryRequest doesn't (currently) have access to the actual HTTP request coming in. You'll need to do this either with a servlet filter and register it into web.xml or restrict it from some other external firewall'ish technology.
Erik On Jan 23, 2011, at 13:21, Teebo wrote: Hi I would like to restrict access to the /update/csv request handler Is there a ready to use UpdateRequestProcessor for that ? My first idea was to inherit from CSVRequestHandler and to override public void handleRequest(SolrQueryRequest req, SolrQueryResponse rsp) { ... restrict by IP code ... super.handleRequest(req, rsp); } What do you think ? Regards, t.
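What the servlet filter or apache reverse-proxy rule boils down to can be sketched as a tiny decision function (illustrative Python, not servlet API; the paths and allowlist shape are assumptions):

```python
def allow_request(client_ip, path, allowed_ips):
    # Admit /update* requests only from trusted addresses; let read-only
    # requests (e.g. /select) through for everyone behind the proxy.
    if path.startswith("/update"):
        return client_ip in allowed_ips
    return True
```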
RE: api key filtering
If you COULD solve your problem by indexing 'public', or other tokens from a limited vocabulary of document roles, in a field -- then I'd definitely suggest you look into doing that, rather than doing odd things with Solr instead. If the only barrier is not currently having sufficient logic at the indexing stage to do that, then it is going to end up being a lot less of a headache in the long term to simply add a layer at the indexing stage to add that in, than trying to get Solr to do things outside of its, well, 'comfort zone'. Of course, depending on your requirements, it might not be possible to do that, maybe you can't express the semantics in terms of a limited set of roles applied to documents. And then maybe your best option really is sending an up to 2k element list (not exactly the same list every time, presumably) of acceptable documents to Solr with every query, and maybe you can get that to work reasonably. Depending on how many different complete lists of documents you have, maybe there's a way to use Solr caches effectively in that situation, or maybe that's not even necessary since lookup by unique id should be pretty quick anyway, not really sure. But if the semantics are possible, much better to work with Solr rather than against it, it's going to take a lot less tinkering to get Solr to perform well if you can just send an fq=role:public or something, instead of a list of document IDs. You won't need to worry about it, it'll just work, because you know you're having Solr do what it's built to do. Totally worth a bit of work to add a logic layer at the indexing stage. IMO. From: Erick Erickson [erickerick...@gmail.com] Sent: Saturday, January 22, 2011 4:50 PM To: solr-user@lucene.apache.org Subject: Re: api key filtering 1024 is the default number, it can be increased. See maxBooleanClauses in solrconfig.xml This shouldn't be a problem with 2K clauses, but expanding it to tens of thousands is probably a mistake (but test to be sure).
Best Erick On Sat, Jan 22, 2011 at 3:50 PM, Matt Mitchell goodie...@gmail.com wrote: Hey, thanks, I'll definitely have a read. The only problem with this though, is that our api is a thin layer of app-code, with solr only (no db), we index data from our sql db into solr, and push the index off for consumption. The only other idea I had was to send a list of the allowed document ids along with every solr query, but then I'm sure I'd run into a filter query limit. Each key could be associated with up to 2k documents, so that's 2k values in an fq which would probably be too many for lucene (I think its limit is 1024). Matt On Sat, Jan 22, 2011 at 3:40 PM, Dennis Gearon gear...@sbcglobal.net wrote: The only way that you would have that many api keys per record, is if one of them represented 'public', right? 'public' is a ROLE. Your answer is to use RBAC style techniques. Here are some links that I have on the subject. What I'm thinking of doing is: Sorry for formatting, Firefox is freaking out. I cut and pasted these from an email from my sent box. I hope the links came out. Part 1 http://www.xaprb.com/blog/2006/08/16/how-to-build-role-based-access-control-in-sql/ Part2 Role-based access control in SQL, part 2 at Xaprb ACL/RBAC Bookmarks ALL UserRbac - symfony - Trac A Role-Based Access Control (RBAC) system for PHP Appendix C: Task-Field Access Role-based access control in SQL, part 2 at Xaprb PHP Access Control - PHP5 CMS Framework Development | PHP Zone Linux file and directory permissions MySQL :: MySQL 5.0 Reference Manual :: C.5.4.1 How to Reset the Root Password per RECORD/Entity permissions? - symfony users | Google Groups Special Topics: Authentication and Authorization | The Definitive Guide to Yii | Yii Framework att.net Mail (gear...@sbcglobal.net) Solr - User - Modelling Access Control PHP Generic Access Control Lists Row-level Model Access Control for CakePHP « some flot, some jet Row-level Model Access Control for CakePHP « some flot, some jet Yahoo! 
GeoCities: Get a web site with easy-to-use site building tools. Class that acts as a client to a JSON service : JSON « GWT « Java Juozas Kaziukėnas devBlog Re: [symfony-users] Implementing an existing ACL API in symfony php - CakePHP ACL Database Setup: ARO / ACO structure? - Stack Overflow W3C ACL System makeAclTables.sql SchemaWeb - Classes And Properties - ACL Schema Reardon's Ruminations: Spring Security ACL Schema for Oracle trunk/modules/auth/libraries/Khacl.php | Source/SVN | Assembla Acl.php - kohana-mptt - Project Hosting on Google Code Asynchronous JavaScript Technology and XML (Ajax) With the Java Platform Dennis Gearon Signature Warning It is always a good idea to learn from your own mistakes. It is usually a better idea to learn from others’ mistakes, so you do not have to make them yourself.
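Matt's two options — an explicit id list in an fq versus an indexed role token — can be sketched as filter-query builders (illustrative Python, not Solr API; field names are assumptions):

```python
def id_filter_query(ids, field="id"):
    # fq built from an explicit id list -- works, but every id becomes a
    # boolean clause, so large lists hit maxBooleanClauses (default 1024).
    return "{0}:({1})".format(field, " OR ".join(str(i) for i in ids))

def role_filter_query(role, field="role"):
    # The indexed-role alternative Jonathan recommends: one token, one clause.
    return "{0}:{1}".format(field, role)
```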
Re: Which QueryParser to use
On 1/20/2011 1:42 AM, kun xiong wrote: That example string means our query is a BooleanQuery containing BooleanQuerys. I am wondering how to write a complicated BooleanQuery for dismax, like (A or B or C) and (D or E) Or I have to use the Lucene query parser. You can't do it with dismax. You might be able to do it with edismax, which is in Solr trunk/4.0 or as a patch to 1.4. You can also do it, in 1.4, with nested queries with dismax queries nested in a 'lucene' query. But why would you want to? What do you actually want to do? The dismax parser is great for taking user-entered queries and just sending them straight to Solr. Is that why you're interested in it? It's also a convenient way to search a query over multiple fields with different boosts in different fields, or with other useful boosts like phrase boosts and such. Is that why you're interested in it? Or something else? Depending on what you want from it, the easiest solution may be different. Or if you don't want _anything_ from it, and are happy with a straight lucene-style query, then there's no reason to use it; just use the straight 'lucene' query parser, no problem.
Re: Showing facet values in alphabetical order
Are you showing the facets with facet parameters in your request? Then you can ask for the facets to be returned sorted by byte-order with facet.sort=index. Got nothing to do with your schema, let alone your DIH import configuration that you showed us. Just a matter of how you ask Solr for facets. Byte order is not necessarily exactly 'alphabetical' order, if your facets are not 7-bit-ascii and/or if they contain punctuation. If your facet values are just 7-bit ascii characters and spaces, it should basically be alphabetical order. But that's all that Solr offers, as far as I know. On 1/20/2011 12:34 PM, PeterKerk wrote: I want to provide a list of facets to my visitors ordered alphabetically, for example, for the 'features' facet I have: data-config.xml: <entity name="location_feature" query="select featureid from location_features where locationid='${location.id}'"> <entity name="feature" query="select title from features where id = '${location_feature.featureid}' ORDER BY title ASC"> <field name="features" column="title" /> </entity> </entity> schema.xml: <field name="features" type="textTight" indexed="true" stored="true" multiValued="true"/> <field name="features_raw" type="string" indexed="true" stored="true" multiValued="true"/> <copyField source="features" dest="features_raw"/> But this doesn't give me the facets in an alphabetical order. Besides the features facet, I also have some other facets that ALSO need to be shown in alphabetical order. How to approach this?
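Requesting index-sorted facets really is just a matter of query parameters; building them can be sketched like this (illustrative Python; the parameter names are the real Solr ones, the helper itself is not Solr API):

```python
from urllib.parse import urlencode

def facet_params(field):
    # Parameters for alphabetical (index/byte-order) facet values; the
    # resulting string is what you'd append to a /select request.
    return urlencode([("q", "*:*"), ("facet", "true"),
                      ("facet.field", field), ("facet.sort", "index")])
```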
Re: Adding weightage to the facets count
Maybe?: Just keep the 'weightages' in an external store of some kind (rdbms, nosql like mongodb, just a straight text config file that your app loads into a hash internally, whatever), rather than Solr, and have your app look them up for each facet value to be displayed, after your app fetches the facet values from Solr. There's no need to use Solr for this, although there might be various tricky ways to do so if you really wanted to, there's no perfectly straightforward way. On 1/20/2011 12:39 PM, sivaprasad wrote: Hi, I am building a tag cloud for products by using facets. I made tag names facets and I am taking the facet count as the reference to display the tag cloud. Each product has tags with their own weightage. Let us say, for example, prod1 has a tag called “Light Weight” with weightage 20, and prod2 has a tag called “Light Weight” with weightage 100. If I get the facet for “Light Weight”, I will get Light Weight (2); here I need to take the weightage into account, and the result should be Light Weight (120). How can we achieve this? Any ideas are really helpful. Regards, Siva
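The app-side lookup Jonathan suggests could be sketched like this (illustrative Python; the shape of the external weight store — tag mapped to per-document weightages — is an assumption for the example):

```python
def weighted_tag_counts(facet_counts, weights):
    # facet_counts: tag -> document count from Solr (used only to know
    # which tags to display); weights: tag -> per-document weightages kept
    # in an external store, summed app-side for the tag cloud.
    return {tag: sum(weights.get(tag, [])) for tag in facet_counts}
```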
Re: Indexing all permutations of words from the input
Why do you want to do this, what is it meant to accomplish? There might be a better way to accomplish what it is you are trying to do; I can't think of anything (which doesn't mean it doesn't exist) that would require what you're actually trying to do. What sorts of queries do you intend to serve with this setup? I don't believe there is any analyzer that will do exactly what you've specified, included with Solr out of the box. You could definitely write your own analyzer in Java to do it. But I still suspect you may not actually need to construct your index like that to accomplish whatever you are trying to accomplish. The only point I can think of to caring what words are next to what other words is for phrase and proximity searches. However, with what you've specified, phrase and proximity searches wouldn't be at all useful anyway, as EVERY word would be next to every other word, so any phrase or proximity search including any words present at all would match, so might as well not do a phrase and proximity search at all, in which case it should not matter what order or how close together the words are in the index. Why not just use an ordinary Whitespace Tokenizer, and just do ordinary dismax or lucene queries without using phrase or proximity? On 1/20/2011 4:03 PM, Martin Jansen wrote: Hey there, I'm looking for an analyzer configuration for Solr 1.4 that accomplishes the following: Given the input abc xyz foo I would like to add at least the following token combinations to the index: abc abc xyz abc xyz foo abc foo xyz xyz foo foo A WhitespaceTokenizer combined with a ShingleFilter will take me there to some extent, but won't e.g. add abc foo to the index. Is there a way to do this? - Martin
Re: Indexing all permutations of words from the input
Aha, I have no idea if there actually is a better way of achieving that, auto-completion with Solr is always tricky and I personally have not been happy with any of the designs I've seen suggested for it. But I'm also not entirely sure your design will actually work, but neither am I sure it won't! I am thinking maybe for that auto-complete use, you will actually need your field to be NOT tokenized, so you won't want to use the WhiteSpace tokenizer after all (I think!) -- unless maybe there's another filter you can put at the end of the chain that will take all the tokens and join them back together, separated by a single space, as a single token. But I do think you'll need the whole multi-word string to be a single token in order to use terms.prefix how you want. If you can't make ShingleFilter do it though, I don't think there are any built-in analyzers that will do the transformation you want. You could write your own in Java, perhaps based on ShingleFilter -- or it might be easier to have your own software make the transformations you want and then simply send the pre-transformed strings to Solr when indexing. Then you could simply send them to a 'string' type field that won't tokenize. On 1/20/2011 4:40 PM, Martin Jansen wrote: On 20.01.11 22:19, Jonathan Rochkind wrote: On 1/20/2011 4:03 PM, Martin Jansen wrote: I'm looking for an analyzer configuration for Solr 1.4 that accomplishes the following: Given the input abc xyz foo I would like to add at least the following token combinations to the index: abc abc xyz abc xyz foo abc foo xyz xyz foo foo Why do you want to do this, what is it meant to accomplish? There might be a better way to accomplish what it is you are trying to do; I can't think of anything (which doesn't mean it doesn't exist) that would require what you're actually trying to do. What sorts of queries do you intend to serve with this setup? I'm in the process of setting up an index for term suggestion. 
In my use case people should get the suggestion abc foo for the search query abc fo and under the assumption that abc xyz foo has been submitted to the index. My current plan is to use TermsComponent with the terms.prefix= parameter for this, because it seems to be pretty efficient and I get things like correct sorting for free. I assume there is a better way for achieving this then? - Martin
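The full set of order-preserving word combinations Martin lists (7 of them for a 3-word input, including the "abc foo" case ShingleFilter misses) can be generated outside Solr and sent pre-transformed at indexing time, as Jonathan's reply suggests; a sketch (illustrative Python):

```python
from itertools import combinations

def ordered_combinations(text):
    # All non-empty subsequences of the words, original order preserved --
    # e.g. "abc xyz foo" yields "abc foo", which ShingleFilter would skip.
    words = text.split()
    out = []
    for r in range(1, len(words) + 1):
        for combo in combinations(words, r):
            out.append(" ".join(combo))
    return out
```

Each resulting string could then go into an untokenized 'string' field, so terms.prefix matching works against whole values.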
Re: Opensearch Format Support
No, not exactly. In general, people don't expose their Solr API direct to the world -- they front Solr with some software that is exposed to the world. (If you do expose your Solr API directly to the world, you will need to think carefully about security, and make sure you aren't letting anyone in the world do things you don't want them to do to your Solr index, like commit new documents!). It would not be all that hard to write software that searches Solr on the backend via an OpenSearch interface -- an OpenSearch interface is basically just results in Atom format, usually. And then just an OpenSearch Description document that just specifies what your search URL is. You'd have to have things like 'title' or 'last updated' or whatever other fields you want in your Atom result in Solr stored fields, if you wanted to get them purely from Solr -- and you'd have to tell this hypothetical OpenSearch front end what Solr stored fields to use for what elements in the Atom response. So it's not something where some software could just go on top of any Solr index at all and provide a valid Atom or RSS response (which is basically all OpenSearch is). I do not know if anyone else has already written an open source configurable atom/opensearch front-end to Solr, you could try googling around. But it would not be a very difficult task for a programmer familiar with Solr and with OpenSearch/Atom/RSS. Jonathan On 1/20/2011 4:29 PM, Tod wrote: Does Solr support the Opensearch format? If so could someone point me to the correct documentation? Thanks - Tod
Re: Return all contents from collection
I know that this is often a performance problem -- but Erick, I am interested in the 'better solution' you hint at! There are a variety of cases where you want to 'dump' all documents from a collection. One example might be in order to build a Google SiteMap for your app that's fronting your Solr. That's mine at the moment. If anyone can think of a way to do this that doesn't have horrible performance (and bonus points if it doesn't completely mess up caches too by filling them with everything), that would be awesome. Jonathan On 1/18/2011 8:47 PM, Erick Erickson wrote: This is usually a bad idea, but if you really must use q=*:*&start=0&rows=1000000 Assuming that there are fewer than 1,000,000 documents in your index. And if there are more, you won't like the performance anyway. Why do you want to do this? There might be a better solution. Best Erick On Tue, Jan 18, 2011 at 7:58 PM, Dan Baughman da...@hostworks.com wrote: Is there a way I can simply tell the index to return its entire record set? I tried starting and ending with just a * but no dice.
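One mitigation — paging through q=*:* in batches instead of one giant request — can be sketched like this (illustrative Python; `fetch` stands in for whatever client call issues the query with start/rows, and is an assumption, not a real API):

```python
def dump_all(fetch, rows=1000):
    # fetch(start, rows) -> list of docs for that page, [] when exhausted.
    # Walks the whole collection a batch at a time rather than asking for
    # everything in a single enormous response.
    start, docs = 0, []
    while True:
        batch = fetch(start, rows)
        if not batch:
            break
        docs.extend(batch)
        start += rows
    return docs
```

Note that deep paging with large start values has its own cost, so this is a sketch of the tradeoff, not a cure.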
Re: Local param tag voodoo ?
What query are you actually trying to do? There's probably a way to do it, possibly using nested queries -- but not using illegal syntax like some of your examples! If you explain what you want to do, someone may be able to tell you how. From the hints in your last message, I suspect nested queries _might_ be helpful to you. On 1/19/2011 3:46 AM, Xavier SCHEPLER wrote: Ok, I was already at this point. My faceting system uses exactly what is described in this page. I read it from the Solr 1.4 book. Otherwise I wouldn't ask. The problem is that the filter queries don't affect the relevance score of the results, so I want the terms in the main query. From: Markus Jelsma markus.jel...@openindex.io Sent: Tue Jan 18 21:31:52 CET 2011 To: solr-user@lucene.apache.org Subject: Re: Local param tag voodoo ? Hi, You get an error because LocalParams need to be in the beginning of a parameter's value. So no parenthesis first. The second query should not give an error because it's a valid query. Anyway, I assume you're looking for: http://wiki.apache.org/solr/SimpleFacetParameters#Multi-Select_Faceting_and_LocalParams Cheers, Hey, here are my needs: - a query that has tagged and untagged contents - facets that ignore the tagged contents I tried: q=({!tag=toExclude} ignored) taken into account q={tag=toExclude v='ignored'} take into account Both resulted in an error. Is this possible or do I have to try another way? -- All emails sent from the Sciences Po mail system must comply with its conditions of use. To consult them, visit http://www.ressources-numeriques.sciences-po.fr/confidentialite_courriel.htm
Re: unix permission styles for access control
No. There is no built in way to address 'bits' in Solr that I am aware of. Instead you can think about how to transform your data at indexing into individual tokens (rather than bits) in one or more field, such that they are capable of answering your query. Solr works in tokens as the basic unit of operation (mostly, basically), not characters or bytes or bits. On 1/19/2011 9:48 AM, Dennis Gearon wrote: Sorry for repeat, trying to make sure this gets on the newsgroup to 'all'. So 'fieldName.x' is how to address bits? Dennis Gearon Signature Warning It is always a good idea to learn from your own mistakes. It is usually a better idea to learn from others’ mistakes, so you do not have to make them yourself. from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036' EARTH has a Right To Life, otherwise we all die. - Original Message From: Toke Eskildsen t...@statsbiblioteket.dk To: solr-user@lucene.apache.org solr-user@lucene.apache.org Sent: Wed, January 19, 2011 12:23:04 AM Subject: Re: unix permission styles for access control On Wed, 2011-01-19 at 08:15 +0100, Dennis Gearon wrote: I was wondering if the are binary operation filters? Haven't seen any in the book nor was I able to find any using google. So if I had 0600(octal) in a permission field, and I wanted to return any records that 'permission 0400(octal)==TRUE', how would I filter that? Don't you mean permission 0400(octal) == 0400? Anyway, the functionality can be accomplished by extending your index a bit. You could split the permission into user, group and all parts, then use an expanded query. 
If the permission is 0755 it will be indexed as
  user_p:7 group_p:5 all_p:5
If you're searching for something with at least 0650, your query should be expanded to
  (user_p:7 OR user_p:6) AND (group_p:7 OR group_p:5)
Alternatively you could represent the bits explicitly in the index:
  user_p:1 user_p:2 user_p:4 group_p:1 group_p:4 all_p:1 all_p:4
Then a search for 0650 would query with
  user_p:2 AND user_p:4 AND group_p:1 AND group_p:4
Finally you could represent every permission value that is granted (every subset of the bits), still split into parts, with
  user_p:1 user_p:2 user_p:3 user_p:4 user_p:5 user_p:6 user_p:7 group_p:1 group_p:4 group_p:5 all_p:1 all_p:4 all_p:5
The query would then simply be
  user_p:6 AND group_p:5
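Toke's first scheme (split the octal permission into user/group/all digits at index time, and expand an "at least" query into ORs over the digits whose bits include the requested digit) can be sketched as follows. The field names follow his examples; the clause ordering is mine.

```python
# Sketch of the split-and-expand scheme for unix-style permissions.

def index_terms(perm):
    """Index-time terms for an octal permission, e.g. 0o755."""
    u, g, a = (perm >> 6) & 7, (perm >> 3) & 7, perm & 7
    return [f"user_p:{u}", f"group_p:{g}", f"all_p:{a}"]

def query_clause(field, digit):
    """OR together every digit value whose bits are a superset of `digit`."""
    values = [v for v in range(8) if v & digit == digit]
    return "(" + " OR ".join(f"{field}:{v}" for v in values) + ")"

def query(perm):
    """Query for documents granting at least `perm` (zero digits skipped)."""
    u, g, a = (perm >> 6) & 7, (perm >> 3) & 7, perm & 7
    parts = [query_clause(f, d) for f, d in
             [("user_p", u), ("group_p", g), ("all_p", a)] if d]
    return " AND ".join(parts)

print(index_terms(0o755))  # ['user_p:7', 'group_p:5', 'all_p:5']
print(query(0o650))  # (user_p:6 OR user_p:7) AND (group_p:5 OR group_p:7)
```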
Re: unix permission styles for access control
Yep, that's what I'm suggesting as one possible approach to consider; whether it will work or not depends on your specifics. Character length in a token doesn't really matter for Solr performance. It might be less confusing to actually put read update delete own (or whatever 'o' stands for) in a field, such that it will be tokenized so each of those words is a separate token. (Make sure you aren't stemming or using synonyms, heh!). Or instead of separating a single string into tokens, use a multi-valued String field, and put read, delete, etc. in as separate values. That is actually more straightforward and less confusing than tokenizing. Then you can just search for fq=permissions:read or whatever. Again, whether this will actually work for you depends on exactly what your requirements are, but it's something to consider before resorting to weird patches. It will work in any Solr version. The first approach to solving a problem in Solr should be trying to think Can I solve this by setting up my index in such a way that I can ask the questions I want simply by asking if a certain token is in a certain field? Because that's what Solr does, basically: tell you if certain tokens are in certain fields. If you can reduce the problem to that, Solr will handle it easily, simply, and efficiently. Otherwise, you might need weird patches. :) On 1/19/2011 12:45 PM, Dennis Gearon wrote: So, if I used something like r-u-d-o in a field (read, update, delete, others) I could get it tokenized to those four characters, and then search for those in that field. Is that what you're suggesting (thanks, by the way)? An article I read created a 'hybrid' access control system (can't remember if it was ACL or RBAC). It used a primary system like Unix file system 9-bit permissions for the primary permissions normally needed on most objects of any kind, and then flagged if there were any other permissions and any other groups. 
It was very fast for the primary permissions, and fast for the secondary. Dennis Gearon
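Jonathan's multi-valued field suggestion, sketched at indexing time. The mapping of bit flags to the words read/update/delete is an assumption for illustration; adjust it to your own permission model.

```python
# Sketch: store permission names as separate values in a multiValued
# string field, then filter with fq=permissions:read at query time.

FLAGS = {4: "read", 2: "update", 1: "delete"}  # hypothetical bit meanings

def permission_values(bits):
    """Values for a multiValued string field, e.g. 6 -> ['read', 'update']."""
    return [name for flag, name in FLAGS.items() if bits & flag]

doc = {"id": "42", "permissions": permission_values(6)}
fq = "permissions:read"  # filter query: only docs granting read
print(doc["permissions"])  # ['read', 'update']
```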
Re: facet or filter based on user's history
The problem is going to be 'near real time' indexing issues. Solr 1.4 at least does not do a very good job of handling very frequent commits. If you want to add to the user's history in the Solr index every time they click the button, and they click the button a lot, this naturally leads to very frequent commits to Solr (every minute, every second, multiple times a second), and you're going to have RAM and performance problems. I believe there are some things in trunk that make handling this better; I don't know the details, but near real time search is the phrase people use, to google or ask about on this list. Or, if it's acceptable for your requirements, you could record all the I've read this clicks in an external store, and only add them to the Solr index nightly, or even hourly. If you batch em and add em as frequently as you can get away with (every hour sure, every 10 minutes pushing it, every minute, no), you can get around that issue. Or for that matter you could ADD em to Solr but only 'commit' every hour or whatever, but I don't like that strategy, since if Solr crashes or otherwise restarts you pretty much lose those pending documents; better to queue em up in an external store. On 1/19/2011 1:52 PM, Markus Jelsma wrote: Hi, I've never seen Solr's behaviour with a huge amount of values in a multi-valued field, but I think it should work alright. Then you can store a list of user IDs along with each book document and use filter queries to include or exclude the book from the result set. Cheers, Hi, I'm looking for ideas on how to make an efficient facet query on a user's history with respect to the catalog of documents (something like Read document already: yes / no). The catalog is around 100k titles and there are several thousand users. Of course, each user has a different history, many having read fewer than 500 titles, but some heavy users having read perhaps 50k titles. 
Performance is not terribly important right now so all I did was bump up the boolean query limit and put together a big string of document id's that the user has read. The first query is slow but once it's in the query cache it's fine. I would like to find a better way of doing it though. What type of solr plugin would be best suited to helping in this situation? I could make a function plugin that provides something like hasHadBefore() - true/false, but would that be efficient for faceting and filtering? Another idea is a QParserPlugin that looks for a field like hasHadBefore:userid and somehow substitutes in the list of docs. But I'm not sure how a new parser plugin would interact with the existing parser. Can solr use a parser plugin to only handle one field, and leave all the other fields to the default parser? Thanks, Jon
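The "queue clicks externally and batch them into Solr" idea Jonathan describes can be sketched as below. The in-memory list stands in for a durable store (database, redis) so pending events would survive a restart; the flush interval is something you'd tune.

```python
import time

# Sketch of batching click events and flushing them to Solr periodically,
# so each flush is one add + one commit instead of a commit per click.

class ClickBatcher:
    def __init__(self, flush_interval=3600):
        self.pending = []                 # stand-in for a durable queue
        self.flush_interval = flush_interval
        self.last_flush = time.time()

    def record(self, user_id, doc_id):
        self.pending.append((user_id, doc_id))

    def maybe_flush(self, send_to_solr):
        if time.time() - self.last_flush >= self.flush_interval:
            send_to_solr(self.pending)    # one add + commit for the batch
            self.pending = []
            self.last_flush = time.time()

batcher = ClickBatcher(flush_interval=0)  # flush immediately for the demo
batcher.record("u1", "doc9")
batcher.maybe_flush(lambda batch: print(f"committing {len(batch)} events"))
```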
Re: performance during index switch
During commit? A commit (and especially an optimize) can be expensive in terms of both CPU and RAM as your index grows larger, leaving less CPU for querying, and possibly less RAM, which can cause Java GC slowdowns in some cases. A common suggestion is to use Solr replication to separate out a Solr index that you index to, and then replicate to a slave index that actually serves your queries. This should minimize any performance problems on your 'live' Solr while indexing, although there's still something that has to be done for the actual replication, of course. Haven't tried it yet myself. Plan to -- my plan is actually to put them both on the same server (I've only got one), but in separate JVMs, and on a server with enough CPU cores that hopefully the indexing won't steal the CPU the querying needs. On 1/19/2011 2:23 PM, Tri Nguyen wrote: Hi, Are there performance issues during the index switch? As the size of the index gets bigger, does response time slow down? Are there any studies on this? Thanks, Tri
Re: performance during index switch
On 1/19/2011 2:56 PM, Tri Nguyen wrote: Yes, during a commit. I'm planning to do as you suggested, having a master do the indexing and replicating the index to a slave, which leads to my next questions. While the slave replicates the index files from the master, how does it impact performance on the slave? That I am not certain of, because I haven't done it yet myself, but I am optimistic it will be tolerable. As with any commit, when the slave replicates it will temporarily make a second copy of any changed index files (possibly the whole index), it will then set up new searchers on the new copy of the index, it will warm that new index, and then, once warmed, it'll switch live searches over to the new index and delete any old copies of indexes. So you may still need a bunch of 'extra' RAM in the JVM to accommodate that overlap period. You will need some extra disk space. But the actual CPU... I mean, it will take some CPU for the slave to run the new warmers, but it should be tolerable, not very noticeable... I'm hoping. One main benefit of the replication setup is that you can _optimize_ on the master, which will be completely out of the way of the slave. Even with the replication setup, you still can't commit (i.e., pull down changes from master) near real time in 1.4 though; you can't commit so often that a new index is not done warming when a new commit comes in, or your Solr will grind to a halt as it uses too much CPU and RAM. There are various ways people have suggested you can try to work around this, but I haven't been too happy with any of em; I think it's best just not to commit/pull down changes from master that often. Unless you REALLY need to, and are prepared to get into the details of Solr to figure out how to make it work as well as it can.
Re: Search on two core and two schema
Solr can't do that. Two cores are two separate cores; you have to do two separate queries and get two separate result sets. Solr is not an RDBMS. On 1/18/2011 12:24 PM, Damien Fontaine wrote: I want to execute this query :

Schema 1 :
<field name="id" type="string" indexed="true" stored="true" required="true"/>
<field name="title" type="string" indexed="true" stored="true" required="true"/>
<field name="UUID_location" type="string" indexed="true" stored="true" required="true"/>

Schema 2 :
<field name="UUID_location" type="string" indexed="true" stored="true" required="true"/>
<field name="label" type="string" indexed="true" stored="true" required="true"/>
<field name="type" type="string" indexed="true" stored="true" required="true"/>

Query :
select?facet=true&fl=title&q=title:*&facet.field=UUID_location&rows=10&qt=standard

Result :
<?xml version="1.0" encoding="UTF-8"?>
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">0</int>
    <lst name="params">
      <str name="facet">true</str>
      <str name="fl">title</str>
      <str name="q">title:*</str>
      <str name="facet.field">UUID_location</str>
      <str name="qt">standard</str>
    </lst>
  </lst>
  <result name="response" numFound="1889" start="0">
    <doc><str name="title">titre 1</str></doc>
    <doc><str name="title">Titre 2</str></doc>
  </result>
  <lst name="facet_counts">
    <lst name="facet_queries"/>
    <lst name="facet_fields">
      <lst name="UUID_location">
        <int name="Japan">998</int>
        <int name="China">891</int>
      </lst>
    </lst>
    <lst name="facet_dates"/>
  </lst>
</response>

On 18/01/2011 17:55, Stefan Matheis wrote: Okay .. and .. now .. you're trying to do what? perhaps you could give us an example, w/ real data .. sample queries - results. because actually i cannot imagine what you want to achieve, sorry On Tue, Jan 18, 2011 at 5:24 PM, Damien Fontaine dfonta...@rosebud.fr wrote: On my first schema, there is information about a document like title, lead, text etc. and many UUIDs (each UUID is a taxon's ID). My second schema contains my taxonomies with auto-complete and facets. 
On 18/01/2011 17:06, Stefan Matheis wrote: Search on two cores but combine the results afterwards to present them in one group, or what exactly are you trying to do, Damien? On Tue, Jan 18, 2011 at 5:04 PM, Damien Fontaine dfonta...@rosebud.fr wrote: Hi, I would like to make a search on two cores with different schemas. Sample : Schema Core1 - ID - Label - IDTaxon ... Schema Core2 - IDTaxon - Label - Hierarchy ... The schemas are very different, I can't group them. Have you an idea how to realize this search? Thanks, Damien
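Since Solr can't join the two cores, the merge Damien needs has to happen client-side: query each core separately, then stitch the results together on the shared key. A minimal sketch, with the two lists standing in for the two cores' responses and made-up UUID values:

```python
# Sketch of a client-side join across two Solr cores on UUID_location.

docs_core1 = [  # document core: title + taxon reference
    {"title": "titre 1", "UUID_location": "jp-01"},
    {"title": "Titre 2", "UUID_location": "cn-07"},
]
docs_core2 = [  # taxonomy core: taxon label per UUID
    {"UUID_location": "jp-01", "label": "Japan"},
    {"UUID_location": "cn-07", "label": "China"},
]

# Build a lookup from the taxonomy core, then decorate the document hits.
labels = {d["UUID_location"]: d["label"] for d in docs_core2}
merged = [{**d, "label": labels.get(d["UUID_location"])} for d in docs_core1]
print(merged[0])
```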
Re: StopFilterFactory and qf containing some fields that use it and some that do not
It's a known 'issue' in dismax (really an inherent part of dismax's design with no clear way to do anything about it) that qf over fields with different stopword definitions will produce odd results for a query with a stopword. Here's my understanding of what's going on: http://bibwild.wordpress.com/2010/04/14/solr-stop-wordsdismax-gotcha/ On 1/12/2011 6:48 PM, Markus Jelsma wrote: Here's another thread on the subject: http://lucene.472066.n3.nabble.com/Dismax-Minimum-Match-Stopwords-Bug-td493483.html And slightly off topic: you might also want to look at using common grams; they are really useful for phrase queries that contain stopwords. http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.CommonGramsFilterFactory Here is what debug says each of these queries parse to: 1. q=life&defType=edismax&qf=Title ... returns 277,635 results 2. q=the life&defType=edismax&qf=Title ... returns 277,635 results 3. q=life&defType=edismax&qf=Title Contributor ... returns 277,635 4. q=the life&defType=edismax&qf=Title Contributor ... returns 0 results 1. +DisjunctionMaxQuery((Title:life)) 2. +((DisjunctionMaxQuery((Title:life)))~1) 3. +DisjunctionMaxQuery((CTBR_SEARCH:life | Title:life)) 4. +((DisjunctionMaxQuery((Contributor:the)) DisjunctionMaxQuery((Contributor:life | Title:life)))~2) I see what's going on here. Because the is a stop word for Title, it gets removed from the first part of the expression. This means that Contributor is required to contain the. dismax does the same thing too. I guess I should have run debug before asking the mailing list! It looks like the only workarounds I have are to either filter out the stopwords in the client when this happens, or enable stop words for all the fields that are used in qf with stopword-enabled fields. Unless... someone has a better idea?? 
James Dyer E-Commerce Systems Ingram Content Group (615) 213-4311 -Original Message- From: Markus Jelsma [mailto:markus.jel...@openindex.io] Sent: Wednesday, January 12, 2011 4:44 PM To: solr-user@lucene.apache.org Cc: Jayendra Patil Subject: Re: StopFilterFactory and qf containing some fields that use it and some that do not I have used edismax and stopword filters as well, but usually use the fq parameter, e.g. fq=title:the life, and never had any issues. That is because filter queries are not relevant for the mm parameter, which is being used for the main query. Can you turn on debugQuery and check what query is formed for all the combinations you mentioned? Regards, Jayendra On Wed, Jan 12, 2011 at 5:19 PM, Dyer, James james.d...@ingrambook.com wrote: I'm running into a problem with StopFilterFactory in conjunction with (e)dismax queries that have a mix of fields, only some of which use StopFilterFactory. It seems that if even 1 field on the qf parameter does not use StopFilterFactory, then stop words are not removed when searching any fields. Here's an example of what I mean: - I have 2 fields indexed: Title is textStemmed, which includes StopFilterFactory (see below). Contributor is textSimple, which does not include StopFilterFactory (see below). - The is a stop word in stopwords.txt - q=life&defType=edismax&qf=Title ... returns 277,635 results - q=the life&defType=edismax&qf=Title ... returns 277,635 results - q=life&defType=edismax&qf=Title Contributor ... returns 277,635 results - q=the life&defType=edismax&qf=Title Contributor ... returns 0 results It seems as if the stop words are not being stripped from the query because qf contains a field that doesn't use StopFilterFactory. I did testing with combining stemmed fields with non-stemmed fields in qf, and it seems as if stemming gets applied regardless. But stop words do not. Does anyone have ideas on what is going on? Is this a feature or possibly a bug? Any known workarounds? Any advice is appreciated. 
James Dyer E-Commerce Systems Ingram Content Group (615) 213-4311

<fieldType name="textSimple" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<fieldType name="textStemmed" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="0" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0" splitOnNumerics="0" stemEnglishPossessive="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
Re: verifying that an index contains ONLY utf-8
Scanning for only 'valid' UTF-8 is definitely not simple. You can eliminate some obviously invalid UTF-8 by byte ranges, but you can't confirm valid UTF-8 by byte ranges alone. There are some bytes that can only come after or before certain other bytes to be valid UTF-8. There is no good way to do what you're doing; once you've lost track of what encoding something is in, you are reduced to applying heuristics to text strings to guess what encoding it is meant to be. There is no cheap way to do this to an entire Solr index; you're just going to have to fetch every single stored field (indexed fields are pretty much lost to you) and apply heuristic algorithms to it. Keep in mind that Solr really probably shouldn't ever be used as your canonical _store_ of data; Solr isn't a 'store', it's an index. So you really ought to have this stuff stored somewhere else if you want to be able to examine it or modify it like this, and just deal with that somewhere else. This isn't really a Solr question at all, even if you are querying Solr on stored fields to try and guess their char encodings. There are various packages of such heuristic algorithms to guess char encoding; I wouldn't try to write my own. icu4j might include such an algorithm, not sure. On 1/13/2011 1:12 PM, Peter Karich wrote: Take a look also into icu4j, which is one of the contrib projects ... converting on the fly is not supported by Solr but should be relatively easy in Java. Also scanning is relatively simple (accept only a range). Detection too: http://www.mozilla.org/projects/intl/chardet.html We've created an index from a number of different documents that are supplied by third parties. We want the index to only contain UTF-8 encoded characters. I have a couple questions about this: 1) Is there any way to be sure during indexing (by setting something in the solr configuration?) that the documents that we index will always be stored in utf-8? 
Can solr convert documents that need converting on the fly, or can solr reject documents containing illegal characters? 2) Is there a way to scan the existing index to find any string containing non-utf8 characters? Or is there another way that I can discover if any crept into my index?
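For the second question, a full decode attempt is the simplest whole-string validity test when fetching stored fields back out of the index -- as Jonathan notes above, byte-range checks alone can't confirm UTF-8, but a strict decode can. A sketch:

```python
# Sketch: check whether a fetched byte string is valid UTF-8 by trying
# a strict decode; any malformed byte sequence raises UnicodeDecodeError.

def is_valid_utf8(raw: bytes) -> bool:
    try:
        raw.decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False

print(is_valid_utf8("héllo".encode("utf-8")))    # True
print(is_valid_utf8("héllo".encode("latin-1")))  # False: the lone 0xE9 byte
# starts a multi-byte sequence that is never completed
```

Note that passing this check does not prove the text was *meant* to be UTF-8 (some legacy-encoded strings happen to decode), which is why the heuristic detectors mentioned above still matter.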
RE: verifying that an index contains ONLY utf-8
So you're allowed to put the entire original document in a stored field in Solr, but you aren't allowed to stick it in, say, a redis or couchdb too? Ah, bureaucracy. But no reason what you are doing won't work, as you of course already know from doing it. If you actually know the charset of a document when indexing it, you might want to consider putting THAT in a stored field; easier to keep track of the encoding you know than to try and guess it again later. From: Paul [p...@nines.org] Sent: Thursday, January 13, 2011 6:21 PM To: solr-user@lucene.apache.org Subject: Re: verifying that an index contains ONLY utf-8 Thanks for all the responses. CharsetDetector does look promising. Unfortunately, we aren't allowed to keep the original of much of our data, so the Solr index is the only place it exists (to us). I do have a Java app that reindexes, i.e., reads all documents out of one index, does some transform on them, then writes them to a second index. So I already have a place where I see all the data in the index stream by. I wanted to make sure there wasn't some built-in way of doing what I need. I know that it is possible to fool the algorithm, but I'll check whether the string is a possible utf-8 string first and not change it in that case. Then I won't be introducing more errors, and maybe I can detect a large percentage of the non-utf-8 strings. On Thu, Jan 13, 2011 at 4:36 PM, Robert Muir rcm...@gmail.com wrote: it does: http://icu-project.org/apiref/icu4j/com/ibm/icu/text/CharsetDetector.html this takes a sample of the file and makes a guess.
RE: start value in queries zero or one based?
You could have tried it and seen for yourself on any Solr server in your possession in less time than it took to have this thread. And if you don't have a Solr server, then why do you care? But the answer is 0. http://wiki.apache.org/solr/CommonQueryParameters#start The default value is 0. Since the default start is 0, and leaving start out doesn't skip the first item of your result set, that means if you DO want to skip the first item of your result set, start=1 will do it. From: Dennis Gearon [gear...@sbcglobal.net] Sent: Thursday, January 13, 2011 6:04 PM To: solr-user@lucene.apache.org Subject: Re: start value in queries zero or one based? I'm migrating to CTO/CEO status in life due to building a small company. I find I don't have too much time for theory. I work with what is. So, what is it, not what should it be. Dennis Gearon - Original Message From: Walter Underwood wun...@wunderwood.org To: solr-user@lucene.apache.org Sent: Thu, January 13, 2011 1:38:26 PM Subject: Re: start value in queries zero or one based? On Jan 13, 2011, at 1:28 PM, Dennis Gearon wrote: Do I even need a body for this message? ;-) Dennis Gearon Are you asking is it or should it be? If the latter, we can also discuss Emacs and vi. wunder -- Walter Underwood K6WRU
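Since start is a 0-based offset, start/rows map directly onto list slicing; a toy illustration of the paging behavior described above:

```python
# Toy model of Solr's start/rows paging over an ordered result list.

results = ["doc0", "doc1", "doc2", "doc3", "doc4"]

def page(results, start=0, rows=10):
    """Mimics Solr paging: start is a 0-based offset into the ordering."""
    return results[start:start + rows]

print(page(results, start=0, rows=2))  # ['doc0', 'doc1']
print(page(results, start=1, rows=2))  # ['doc1', 'doc2']  (skips the first hit)
```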
Re: pruning search result with search score gradient
Sometimes I've _considered_ trying to do this (but generally decided it wasn't worth it), when I didn't want those documents below the threshold to show up in the facet values. In my application the facet counts are sometimes very pertinent information, and they are sometimes not quite as useful as they could be when they include barely-relevant hits. On 1/12/2011 11:42 AM, Erick Erickson wrote: What's the use-case you're trying to solve? Because if you're still showing results to the user, you're taking information away from them. Where are you expecting to get the list? If you try to return the entire list, you're going to pay the penalty of creating the entire list and transmitting it across the wire, rather than just a page's worth. And if you're paging, the user will do this for you by deciding for herself when she's getting less relevant results. So I don't understand what the value to the end user is that you're trying to provide; perhaps if you elaborate on that I'll have a more useful response. Best, Erick On Tue, Jan 11, 2011 at 3:12 AM, Julien Piquot julien.piq...@arisem.com wrote: Hi everyone, I would like to be able to prune my search results by removing the less relevant documents. I'm thinking about using the search score: I use the search scores of the document set (I assume they are sorted in descending order), normalise them (0 would be the lowest value and 1 the greatest value) and then calculate the gradient of the normalised scores. The documents with a gradient below a threshold value would be rejected. If the scores are linearly decreasing, then no document is rejected. However, if there is a brutal score drop, then the documents below the drop are rejected. The threshold value would still have to be tuned, but I believe it would make a much stronger metric than an absolute search score. What do you think about this approach? Do you see any problems with it? Are there any Solr tools that could help me deal with that? Thanks for your answer. Julien
Re: Improving Solr performance
I see a lot of people using shards to hold different types of documents, and it almost always seems to be a bad solution. Shards are intended for distributing a large index over multiple hosts -- that's it. Not for some kind of federated search over multiple schemas, not for access control. Why not put everything in the same index, without shards, and just use an 'fq' limit to restrict to the specific documents you'd like to search over in a given search? I think that would achieve your goal a lot more simply than shards -- then you use sharding only if and when your index grows to be so large you'd like to distribute it over multiple hosts, and when you do so you choose a shard key that will have more or less equal distribution across shards. Using shards for access control or schema management just leads to headaches. [Apparently Solr could use some highlighted documentation on what shards are really for, as it seems to be a very common issue on this list: someone trying to use them for something else and then inevitably finding problems with that approach.] Jonathan On 1/7/2011 6:48 AM, supersoft wrote: The reason for this distribution is the kind of the documents. In spite of having the same schema structure (and Solr conf), a document belongs to 1 of 5 different kinds. Each kind corresponds to a concrete shard, and due to this, the implemented client tool avoids searching in all the shards when the user selects just one or a few of the kinds. The tool runs a multisharded query of the proper shards. I guess this is a right approach, but correct me if I am wrong. The real problem of this architecture is the correlation between concurrent users and response time: 1 query: n seconds 2 queries: 2*n seconds each query 3 queries: 3*n seconds each query and so on... This is being a real headache because 1 single query has an acceptable response time, but when many users are accessing the server the performance drops sharply.
Re: Tuning StatsComponent
I found StatsComponent to be slow only when I didn't have enough RAM allocated to the JVM. I'm not sure exactly what was causing it, but it was pathologically slow -- and then adding more RAM to the JVM made it incredibly fast. On 1/10/2011 4:58 AM, Gora Mohanty wrote: On Mon, Jan 10, 2011 at 2:28 PM, stockiist...@shopgate.com wrote: Hello. I'm using the StatsComponent to get the sum of amounts, but the StatsComponent is very slow on a huge index of 30 million documents. How can I tune the StatsComponent? Not sure about this problem. The problem is that I have 5 currencies and I need to send a new request for each currency. That makes the Solr search sometimes very slow. =( [...] I guess that you mean the search from the front-end is slow. It is difficult to make a guess without details of your index, and of your queries, but one thing that immediately jumps out is that you could shard the Solr index by currency, and have your front-end direct queries for each currency to the appropriate Solr server. Please do share a description of what all you are indexing, how large your index is, and what kind of queries you are running. I take it that you have already taken a look at http://wiki.apache.org/solr/SolrPerformanceFactors Regards, Gora
Re: Improving Solr performance
On 1/10/2011 5:03 PM, Dennis Gearon wrote: What I seem to see suggested here is to use different cores for the things you suggested: different types of documents, Access Control Lists. I wonder how sharding would work in that scenario? Sharding has nothing to do with that scenario at all. Different cores are essentially _entirely separate_. While it can be convenient to use different cores like this, it means you don't get ANY searches that 'join' over multiple 'kinds' of data in different cores. Solr is not great at handling heterogeneous data like that. Putting it in separate cores is one solution, although then they are entirely separate. If that works, great. Another solution is putting them in the same index, but using mostly different fields, and perhaps having a 'type' field shared amongst all of your 'kinds' of data, and then always querying with an 'fq' for the right 'kind'. Or if the fields they use are entirely different, you don't even need the fq, since a query on a certain field will only match a certain 'kind' of document. Solr is not great at handling complex queries over data with heterogeneous schemata. Solr wants you to flatten all your data into one single set of documents. Sharding is a way of splitting up a single index (multiple cores are _multiple indexes_) amongst several hosts for performance reasons, mostly when you have a very large index. That is it. The end. If you have multiple cores, that's the same as having multiple Solr indexes (which may or may not happen to be on the same machine). Any one or more of those cores could be sharded if you want. This is a separate issue.
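The "one index plus an fq on a shared type field" alternative to per-kind shards or cores amounts to building queries like the sketch below; the field name and kind values are hypothetical.

```python
from urllib.parse import urlencode

# Sketch: restrict a single-index search to selected document kinds
# with a filter query on a shared "type" field.

def search_params(user_query, kinds):
    params = [("q", user_query)]
    if kinds:  # restrict to the kinds the user selected
        params.append(("fq", "type:(" + " OR ".join(kinds) + ")"))
    return urlencode(params)

print(search_params("solar panels", ["report", "article"]))
# q=solar+panels&fq=type%3A%28report+OR+article%29
```

A side benefit over separate shards or cores: identical fq strings hit Solr's filter cache, so a frequently selected kind becomes a cheap cached filter.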
Re: Improving Solr performance
And I don't think I've seen anyone suggest a separate core just for Access Control Lists. I'm not sure what that would get you. Perhaps a separate store that isn't Solr at all, in some cases. On 1/10/2011 5:36 PM, Jonathan Rochkind wrote: Access Control Lists
RE: (FQ) Filter Query Caching Differences with OR and AND?
Disclaimer: I am not actually familiar with the Solr code; all of the below is extrapolation from being pretty familiar with Solr's behavior. Yeah, it would be nice, but it would be a lot harder to code for Solr. Right now, the thing putting entries into and retrieving them from the filter cache doesn't really need to parse the query at all. It just takes the whole query and uses it (effectively) as a cache key. Keep in mind that Solr has pluggable query parsers, and the fq can (quite usefully!) be used with any query parser, not just lucene: lucene, dismax, field, raw, a few others out of the box, and others not officially part of Solr that users might write and use with their Solr. Query parsers can be in use (and work with the filter cache) that didn't even exist when the filter caching logic was written. This is actually a very useful feature -- if there's behavior that's possible with lucene but not supported in a convenient way (or at all) by the Solr API, you can write a query parser to do it yourself if you need to -- and your query parser will plug right in, and all other Solr features (such as filter caching!) will still work fine with it. So to get the filter caching to somehow go inside the query and cache and retrieve parts of it -- it would probably really need to be something each query parser were responsible for -- storing and retrieving elements from the filter cache as part of its ordinary query parsing behavior -- but only when it was inside an fq, not a q, which I'm not sure the query parser even knows right now. Right now I think the query parser doesn't even have to know about the filter cache -- if an fq is retrieved from cache, it doesn't even make it to the query parser. So yeah, it would be useful if separate components of an fq query could be cached separately -- but it would also be a lot more complicated. But I'm sure nobody would mind seeing a patch if you want to figure it out. 
:) From: Em [mailformailingli...@yahoo.de] Sent: Thursday, January 06, 2011 2:36 AM To: solr-user@lucene.apache.org Subject: Re: (FQ) Filter Query Caching Differences with OR and AND? Thank you Jonathan. fq=foo:bar&fq=foo:baz seems to be the better alternative to fq=foo:bar AND foo:baz if foo:bar and foo:baz are often used in different combinations (not always together). However, in most of the use cases I can think of, an fq=foo:bar OR foo:baz behaviour is expected, and it would be nice if this fq would benefit from a cached fq=foo:bar. I can imagine why this is not the case if only one of the two fq-clauses were cached. However, when foo:bar and foo:baz are cached separately, why not benefit from them when an fq=foo:bar OR foo:baz or fq=foo:bar AND foo:baz is requested? Who is responsible for putting fq's in the filterCache? I think one has to modify the logic of that class to benefit from already-cached but recombined filterCache entries. This would perform a little worse than caching the entire foo:bar AND foo:baz BitVector, since you need to reproduce one for that special use case, but I think the usage of the cache is far more efficient if foo:bar and foo:baz occur very frequently but foo:bar AND foo:baz does not. What do you think? Regards Jonathan Rochkind wrote: Each 'fq' clause is its own cache key. 1. fq=foo:bar OR foo:baz = one entry in filter cache 2. fq=foo:bar&fq=foo:baz = two entries in filter cache, will not use cached entry from #1 3. fq=foo:bar = one entry, will use cached entry from #2 4. fq=foo:bar = one entry, will use cached entry from #2. So if you do queries in succession using each of those four fq's in order, you will wind up with 3 entries in the cache. Note that fq=foo:bar OR foo:baz is not semantically identical to fq=foo:bar&fq=foo:baz. Rather, the latter is semantically identical to fq=foo:bar AND foo:baz. 
But fq=foofq=bar will be two cache entries, and fq=foo:bar AND foo:baz will be one cache entry, and the two won't share any cache entries. On 1/5/2011 3:17 PM, Em wrote: Hi, while reading through some information on the list and in the wiki, i found out that something is missing: When I specify a filter queries like this fq=foo:bar OR foo:baz or fq=foo:barfq=foo:baz or fq=foo:bar or fq=foo:baz How many filter query entries will be cached? Two, since there are two filters (foo:bar, foo:baz) or 3, since there are three different combinations (foo:bar OR foo:baz, foo:bar, foo:baz)? Thank you! Jonathan Rochkind wrote: Each 'fq' clause is it's own cache key. 1. fq=foo:bar OR foo:baz = one entry in filter cache 2. fq=foo:barfq=foo:baz = two entries in filter cache, will not use cached entry from #1 3. fq=foo:bar = One entry, will use cached entry from #2 4. fq=foo:bar = One entry, will use cached entry from #2. So if you do queries in succession using each of those four fq's in order, you will wind
Re: searching against unstemmed text
Do you have to do anything special to search against a field in Solr? No, that's what Solr does. Please be more specific about what you are trying to do, what you expect to happen, and what happens instead. If your Solr field is analyzed to stem, then indeed you can only match stemmed tokens, because those are the only tokens that are there. You can create a different Solr field that is not stemmed for wildcard searches if you like, which is perhaps what you're trying to do, but you haven't really told us. On 1/4/2011 10:00 AM, Wodek Siebor wrote: I'm trying to search using the text_rev field, which is by default enabled in the schema.xml, but it doesn't work at all. Do I have to do anything special here? I want to search using wildcards, and searching against the text field works fine, except I can only find results against stemmed text. Thanks, Wlodek
Re: Sub query using SOLR?
Yeah, I don't believe there's any good way to do it in Solr 1.4. You can make two queries: first make your 'sub' query, get back the list of values, then construct the second query where you do {!field v=field_name} val1 OR val2 OR val3 OR valN. Kind of a pain, and there is a maximum number of conditions you can have in there (1024, maybe?). It is oft-requested behavior, and the feature in SOLR-2272 is very exciting to me and I think would meet a lot of needs, but I haven't tried it yet myself. Jonathan On 1/4/2011 2:03 PM, Steven A Rowe wrote: Hi Barani, I haven't tried it myself, but the limited JOIN functionality provided by SOLR-2272 sounds very similar to what you want to do: https://issues.apache.org/jira/browse/SOLR-2272 Steve -Original Message- From: bbarani [mailto:bbar...@gmail.com] Sent: Tuesday, January 04, 2011 1:27 PM To: solr-user@lucene.apache.org Subject: Sub query using SOLR? Hi, I am trying to use a subquery in SOLR; is there a way to implement this using SOLR query syntax? Something like Related_id: IN query(field=ud, q="type:IT AND manager_12:dave") The thing I really want is to use the output of one query as the input of another query. Not sure if it is possible to use the query() function (function query) for my case.. Just want to know if there is a better approach... Thanks, Barani -- View this message in context: http://lucene.472066.n3.nabble.com/Sub-query-using-SOLR-tp2193251p2193251.html Sent from the Solr - User mailing list archive at Nabble.com.
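The two-query workaround can be sketched like this; the field names ('ud', 'related_id') are taken from Barani's hypothetical example, and the HTTP round trips are omitted so only the query construction is shown:

```python
# Emulate a SQL-style sub query in Solr 1.4 with two round trips:
# query 1 collects values, query 2 ORs them together.

def build_join_query(field, values, max_clauses=1024):
    """Build the second query from the values the first query returned.

    Solr's default maxBooleanClauses is 1024, hence the guard
    Jonathan alludes to above.
    """
    if len(values) > max_clauses:
        raise ValueError("too many OR clauses for one query")
    return field + ":(" + " OR ".join(values) + ")"

# Step 1 (not shown): run q=type:IT AND manager_12:dave, collect 'ud' values.
ids = ["u1", "u7", "u42"]

# Step 2: feed those values into the outer query.
print(build_join_query("related_id", ids))  # related_id:(u1 OR u7 OR u42)
```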
Re: Advice on Exact Matching?
There is a hacky kind of thing that Bill Dueber figured out for using multiple fields and dismax to BOOST exact matches, but include all matches in the result set. You have to duplicate your data in a second non-tokenized field. Then you use dismax pf to super-boost matches on the non-tokenized field. Because 'pf' is a phrase search, you don't run into trouble with dismax pre-tokenization on whitespace, even though it's a field that might have internal-token whitespace. (Using a non-tokenized field with dismax qf will basically never match a result with whitespace, unless it's phrase-quoted in the query. But pf works.) Because it is a non-tokenized field, it only matches (and triggers the dismax pf super boost) if it's an exact match. And it works. You CAN normalize your 'exact match' field in field analysis, removing punctuation or normalizing whitespace or whatever, and that works too, doing it both at index and query time analysis. On 1/4/2011 4:28 PM, Chris Hostetter wrote: : I am trying to make sure that when I search for text—regardless of : what that text is—that I get an exact match. I'm *still* getting some : issues, and this last mile is becoming very painful. The solr field, : for which I'm setting this up on, is pasted below my explanation. I : appreciate any help. if you are using a TextField with some analysis components, it's virtually impossible to get exact matches -- where my definition of exact is that the query text is character for character identical to the entire field value indexed. is your definition of exact match different? I assume it must be, since you are using TextField and talk about wanting to deal with whitespace between words. 
So I think you need to explain a little bit better what your indexed data looks like, and what sample queries you expect to match that data (and equally important: what queries should *not* match that data, and what data should *not* match those queries) : If I want to find *all* Solr documents that match : [id]somejunk\hi[/id] then life is instantly hell. 90% of the time when people have problems with exact matches it's because of QueryParser meta characters -- characters like :, [ and whitespace that the QueryParser uses as instructions. You can use the raw QParser to have every character treated as a literal: defType=raw&q=[id]somejunk\hi[/id] -Hoss
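For the "boost exact matches but keep all matches" trick described at the top of this thread, the request parameters look roughly like this ('text' and 'text_exact' are hypothetical field names; the untokenized copy would be set up via copyField):

```python
# Sketch of a dismax request that super-boosts exact matches via pf.
from urllib.parse import urlencode

params = {
    "defType": "dismax",
    "q": "electric guitar",
    "qf": "text",            # tokenized field: ordinary recall
    "pf": "text_exact^100",  # phrase match against the untokenized copy:
                             # only fires when the whole value matches
}
print(urlencode(params))
```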
Re: DIH and UTF-8
I haven't tried it yet, but I _think_ in Rails if you are using the 'mysql2' adapter (now standard with Rails 3) instead of 'mysql', it might handle utf-8 better with fewer gotchas. I think if the underlying mysql database is set to use utf-8, then, at least with the mysql2 adapter, you shouldn't need to set the 'encoding' attribute on the database connection definition. But I could be wrong, and this isn't really about solr anymore of course. On 12/29/2010 9:48 AM, Mark wrote: Sure thing. In my database.yml I was missing the encoding: utf8 option. If one were to add unicode characters within rails (console, web form, etc) the characters would appear to be saved correctly... ie when trying to retrieve them back, everything looked perfect. The characters also appeared correctly using the mysql prompt. However when trying to index or retrieve those characters using JDBC/Solr the characters were mangled. After adding the above utf8 encoding option I was able to correctly save utf8 characters into the database and retrieve them using JDBC/Solr. However when using the mysql client all the characters would show up as all mangled or as ''. This was resolved by running the following query: set names utf8;. On 12/28/10 10:17 PM, Glen Newton wrote: Hi Mark, Could you offer a more technical explanation of the Rails problem, so that if others encounter a similar problem your efforts in finding the issue will be available to them? :-) Thanks, Glen PS. This has wandered somewhat off-topic to this list: apologies thanks for the patience of this list... On Tue, Dec 28, 2010 at 4:15 PM, Mark static.void@gmail.com wrote: It was due to the way I was writing to the DB using our rails application. Everything looked correct, but when retrieving it using the JDBC driver it was all mangled. On 12/27/10 4:38 PM, Glen Newton wrote: Is it possible your browser is not set up to properly display the chinese characters? 
(I am assuming you are looking at things through your browser) Do you have any problems viewing other chinese documents properly in your browser? Using mysql, can you see these characters properly? What happens when you use curl or wget to get a document from solr and look at it using something besides your browser? Yes, I am running out of ideas! :-) -Glen On Mon, Dec 27, 2010 at 7:22 PM, Mark static.void@gmail.com wrote: Just like the user of that thread... I have my database, table, columns and system variables all set but it still doesn't work as expected. Server version: 5.0.67 Source distribution Type 'help;' or '\h' for help. Type '\c' to clear the buffer.

mysql> SHOW VARIABLES LIKE 'collation%';
+----------------------+-----------------+
| Variable_name        | Value           |
+----------------------+-----------------+
| collation_connection | utf8_general_ci |
| collation_database   | utf8_general_ci |
| collation_server     | utf8_general_ci |
+----------------------+-----------------+
3 rows in set (0.00 sec)

mysql> SHOW VARIABLES LIKE 'character_set%';
+--------------------------+----------------------------------------+
| Variable_name            | Value                                  |
+--------------------------+----------------------------------------+
| character_set_client     | utf8                                   |
| character_set_connection | utf8                                   |
| character_set_database   | utf8                                   |
| character_set_filesystem | binary                                 |
| character_set_results    | utf8                                   |
| character_set_server     | utf8                                   |
| character_set_system     | utf8                                   |
| character_sets_dir       | /usr/local/mysql/share/mysql/charsets/ |
+--------------------------+----------------------------------------+
8 rows in set (0.00 sec)

Any other ideas? Thanks On 12/27/10 3:23 PM, Glen Newton wrote:

[client]
default-character-set = utf8

[mysql]
default-character-set = utf8

[mysqld]
character_set_server = utf8
character_set_client = utf8
RE: Solr 1.4.1 stats component count not matching facet count for multi valued field
Interesting, the wiki page on StatsComponent says multi-valued fields may be slow, and may use lots of memory. http://wiki.apache.org/solr/StatsComponent Apparently it should also warn that multi-valued fields may not work at all? I'm going to add that with a link to the JIRA ticket. From: Chris Hostetter [hossman_luc...@fucit.org] Sent: Thursday, December 23, 2010 7:22 PM To: solr-user@lucene.apache.org Subject: Re: Solr 1.4.1 stats component count not matching facet count for multi valued field : I have a facet field called option which may be multi-valued and : a weight field which is single-valued. : : When I use the Solr 1.4.1 stats component with a facet field, i.e. ... : I get conflicting results for the stats count result A JIRA search for solr stats multivalued would have given you... https://issues.apache.org/jira/browse/SOLR-1782 -Hoss
RE: Solr 1.4.1 stats component count not matching facet count for multi valued field
Aha! Thanks, sorry, I'll clarify on my wiki edit. From: Chris Hostetter [hossman_luc...@fucit.org] Sent: Friday, December 24, 2010 12:11 AM To: solr-user@lucene.apache.org Subject: RE: Solr 1.4.1 stats component count not matching facet count for multi valued field : Interesting, the wiki page on StatsComponent says multi-valued fields : may be slow, and may use lots of memory. : http://wiki.apache.org/solr/StatsComponent *stats* over multivalued fields work, but use lots of memory -- that bug only hits you when you compute stats over any field faceted by a multivalued field. -Hoss
RE: solr equiv of : SELECT count(distinct(field)) FROM index WHERE length(field) > 0 AND other_criteria
This won't actually give you the number of distinct facet values, but will give you the number of documents matching your conditions. It's more equivalent to the SQL without the distinct. There is no way in Solr 1.4 to get the number of distinct facet values. I am not sure about the new features in trunk. From: Peter Karich [peat...@yahoo.de] Sent: Wednesday, December 22, 2010 6:10 AM To: solr-user@lucene.apache.org Subject: Re: solr equiv of : SELECT count(distinct(field)) FROM index WHERE length(field) > 0 AND other_criteria

facet=true&facet.field=field          // SELECT count(distinct(field))
fq=field:[* TO *]                     // WHERE length(field) > 0
q=other_criteriaA&fq=other_criteriaB  // AND other_criteria

advantage: you can look into several fields at one time by adding another facet.field disadvantage: you get the counts split by the values of that field fix this via field collapsing / result grouping http://wiki.apache.org/solr/FieldCollapsing or use deduplication: http://wiki.apache.org/solr/Deduplication Regards, Peter. Hi, Is there a way with faceting or field collapsing to do the SQL equivalent of SELECT count(distinct(field)) FROM index WHERE length(field) > 0 AND other_criteria i.e. I'm only interested in the total count, not the individual records and counts. Cheers, Dan -- http://jetwick.com open twitter search
Re: Duplicate values in multiValued field
In my experience, that should work fine. Facetting in 1.4 works fine on multi-valued fields, and a duplicate value in the multi-valued field shouldn't be a problem. On 12/22/2010 2:31 AM, Andy wrote: If I put duplicate values into a multiValued field, would that cause any issues? For example I have a multiValued field Color. Some of my documents have duplicate values for that field, such as: Green, Red, Blue, Green, Green. Would the above (having 3 duplicate Green) be the same as having the duplicated values of: Green, Red, Blue? Or do I need to clean my data and remove duplicate values before indexing? Thanks.
Re: White space in facet values
Another technique, which works great for facet fq's and avoids the need to worry about escaping, is using the field query parser instead: fq={!field f=Product}Electric Guitar Using the field query parser avoids the need for ANY escaping of your value at all, which is convenient in the facetting case -- you still need to URI-escape (ampersands for instance), but you shouldn't need to escape any Solr special characters like parens or double quotes or anything else, if you've made your string suitable for including in a URI. With the field query parser, a lot less to worry about. http://lucene.apache.org/solr/api/org/apache/solr/search/FieldQParserPlugin.html On 12/22/2010 9:53 AM, Dyer, James wrote: The phrase solution works, as does escaping the space with a backslash: fq=Product:Electric\ Guitar ... actually a lot of characters need to be escaped like this (ampersands and parentheses come to mind)... I assume you already have this indexed as string, not text... James Dyer E-Commerce Systems Ingram Content Group (615) 213-4311 -Original Message- From: Andy [mailto:angelf...@yahoo.com] Sent: Wednesday, December 22, 2010 1:11 AM To: solr-user@lucene.apache.org Subject: White space in facet values How do I handle facet values that contain whitespace? Say I have a field Product that I want to facet on. A value for Product could be Electric Guitar. How should I handle the white space in Electric Guitar during indexing? What about when I apply the constraint fq=Product:Electric Guitar?
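To make the escaping point concrete, here's a sketch of building such an fq in Python; only URI escaping is applied, no backslash-escaping of Solr metacharacters:

```python
# The {!field} parser treats everything after the closing brace as one
# literal term, so only URI escaping is needed before putting the fq
# into a request URL.
from urllib.parse import quote

value = "Electric Guitar"
fq = "{!field f=Product}" + value
print(quote(fq))  # %7B%21field%20f%3DProduct%7DElectric%20Guitar
```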
Re: White space in facet values
Huh, does !term in 4.0 mean the same thing as !field in 1.4? What you describe !term doing in 4.0-dev is what I understand !field in 1.4 as doing. On 12/22/2010 10:01 AM, Yonik Seeley wrote: On Wed, Dec 22, 2010 at 9:53 AM, Dyer, James james.d...@ingrambook.com wrote: The phrase solution works, as does escaping the space with a backslash: fq=Product:Electric\ Guitar ... actually a lot of characters need to be escaped like this (ampersands and parentheses come to mind)... One way to avoid escaping is to use the raw or term query parsers: fq={!raw f=Product}Electric Guitar In 4.0-dev, use {!term} since that will work with field types that need to transform the external representation into the internal one (like numeric fields need to do). http://wiki.apache.org/solr/SolrQuerySyntax -Yonik http://www.lucidimagination.com I assume you already have this indexed as string, not text... James Dyer E-Commerce Systems Ingram Content Group (615) 213-4311 -Original Message- From: Andy [mailto:angelf...@yahoo.com] Sent: Wednesday, December 22, 2010 1:11 AM To: solr-user@lucene.apache.org Subject: White space in facet values How do I handle facet values that contain whitespace? Say I have a field Product that I want to facet on. A value for Product could be Electric Guitar. How should I handle the white space in Electric Guitar during indexing? What about when I apply the constraint fq=Product:Electric Guitar?
Re: Solr query to get results based on the word length (letter count)
No good way. At indexing time, I'd just store the number of chars in the title in a field of its own. You can possibly do that solely in schema.xml with clever use of analyzers and copyField. Solr isn't an RDBMS. Best to de-normalize at index time so what you're going to want to query is in the index. On 12/22/2010 10:36 AM, Giri wrote: Hi, I have a Solr index that has thousands of records, the title is one of the Solr fields, and I would like to query for title values that are less than 50 characters long. Is there a way to construct the Solr query to provide results based on the character length? thank you very much!
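The de-normalizing can also happen client-side when documents are posted; a sketch below, where 'title_len' is a hypothetical int field you'd declare in schema.xml:

```python
# Store the title's character count at index time so "titles shorter
# than 50 chars" becomes a cheap range query: fq=title_len:[* TO 49]
def add_title_length(doc):
    out = dict(doc)
    out["title_len"] = len(out.get("title", ""))
    return out

print(add_title_length({"title": "A short title"})["title_len"])  # 13
```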
Re: solr equiv of : SELECT count(distinct(field)) FROM index WHERE length(field) > 0 AND other_criteria
Well, that's true -- you can get the total number of facet values if you ALSO are willing to get back every facet value in the response. If you've got a hundred thousand or so unique facet values, and what you really want is just the _count_ without ALSO getting back a very large response (and waiting for Solr to construct the very large response), then you're out of luck. But if you're willing to get back all the values in the response too, that'll work, true. On 12/22/2010 11:23 AM, Erik Hatcher wrote: On Dec 22, 2010, at 09:21, Jonathan Rochkind wrote: This won't actually give you the number of distinct facet values, but will give you the number of documents matching your conditions. It's more equivalent to SQL without the distinct. There is no way in Solr 1.4 to get the number of distinct facet values. That's not true - the total number of facet values is the distinct number of values in that field. You need to be sure you have facet.limit=-1 (default is 100) to see all values in the response rather than just a page of them, though. Erik
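Combining both points: request facet.limit=-1 (plus fq=field:[* TO *]), then count the values client-side. A sketch of the client-side part, assuming a parsed JSON facet_fields list in Solr's flat [value, count, value, count, ...] form:

```python
# Count distinct facet values from a Solr facet response fragment.
def distinct_count(facet_list):
    # Pair up the flat [value, count, ...] list; skip zero counts,
    # which can appear when facet.mincount isn't set.
    counts = facet_list[1::2]
    return sum(1 for c in counts if c > 0)

print(distinct_count(["red", 10, "green", 3, "blue", 0]))  # 2
```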
Re: full text search in multiple fields
Did you reindex after you changed your analyzers? On 12/22/2010 12:57 PM, PeterKerk wrote: Hi guys, There's one more thing to get this code to work as I need, I just found out... I'm now using q=title_search:hort*&defType=lucene as iorixxx suggested. It works well, BUT this query doesn't find results if the title in the DB is Hortus supremus. I tried adding some tokenizers and filters to solve what I think is a casing issue, but no luck... below is my code... what am I missing here? Thanks again!

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- in this example, we will only use synonyms at query time
    <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
    -->
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_dutch.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_dutch.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

<field name="title" type="text_ws" indexed="true" stored="true"/>
<field name="title_search" type="text" indexed="true" stored="true"/>
<copyField source="title" dest="title_search"/>
Re: Case Insensitive sorting while preserving case during faceted search
Hoss, I think the use case being asked about is specifically doing a facet.sort, though, for cases where you actually do want to sort facet values with facet.sort, not sort records -- while still presenting the facet values with original case, but sorting them case-insensitively. The solutions offered at those URLs don't address this. Because I'm pretty sure there isn't really any good solution for this: Solr just won't do that, just how it goes. On 12/21/2010 2:33 PM, Chris Hostetter wrote: : I am trying to do a facet search and sort the facet values too. ... : Then I followed the sample example schema.xml, created a copyField of type ... : fieldType name=alphaOnlySort class=solr.TextField : sortMissingLast=true omitNorms=true ... : But the sorted facet values don't have their case preserved anymore. : : How can I get around this? Did you look at how/why/when alphaOnlySort is used in the example? The FAQ entry you referred to addresses almost the exact same scenario of wanting to search/sort on the same data... http://wiki.apache.org/solr/FAQ#Why_Isn.27t_Sorting_Working_on_my_Text_Fields.3F ...the simplest thing to do is to use copyField to index a second version of your field using the StrField class. So have one version of your field using StrField that you facet on, and copyField that to another version (using TextField and KeywordTokenizer) that you sort on. -Hoss
RE: OutOfMemory GC: GC overhead limit exceeded - Why isn't WeakHashMap getting collected?
But the entirety of the old indexes (no longer on disk) wasn't cached in memory, right? Or is it? Maybe this is me not understanding lucene enough. I thought that portions of the index were cached in memory, but that sometimes the index reader still has to go to disk to get things that aren't currently in caches. If this is true (tell me if it's not!), we have an index reader that was based on indexes that... are no longer on disk. But the index reader is still open. What happens when it has to go to disk for info? And the second replication will trigger a commit even if there are in fact no new files to be transferred over to the slave, because there have been no changes since the prior sync with failed commit? From: Upayavira [...@odoko.co.uk] Sent: Tuesday, December 14, 2010 2:23 AM To: solr-user@lucene.apache.org Subject: RE: OutOfMemory GC: GC overhead limit exceeded - Why isn't WeakHashMap getting collected? The second commit will bring in all changes, from both syncs. Think of the sync part as a glorified rsync of files on disk. So the files will have been copied to disk, but the in-memory index on the slave will not have noticed that those files have changed. The commit is intended to remedy that - it causes a new index reader to be created, based upon the new on-disk files, which will include updates from both syncs. Upayavira On Mon, 13 Dec 2010 23:11 -0500, Jonathan Rochkind rochk...@jhu.edu wrote: Sorry, I guess I don't understand the details of replication enough. So the slave tries to replicate. It pulls down the new index files. It tries to do a commit but fails. But the next commit that does succeed will have all the updates. Since it's a slave, it doesn't get any commits of its own. But then some amount of time later, it does another replication pull. There are at this time maybe no _new_ changes since the last failed replication pull. Does this trigger a commit that will get those previous changes actually added to the slave? 
In the meantime, between commits... are those potentially large pulled new index files sitting around somewhere but not replacing the old slave index files, doubling disk space for those files? Thanks for any clarification. Jonathan From: ysee...@gmail.com [ysee...@gmail.com] On Behalf Of Yonik Seeley [yo...@lucidimagination.com] Sent: Monday, December 13, 2010 10:41 PM To: solr-user@lucene.apache.org Subject: Re: OutOfMemory GC: GC overhead limit exceeded - Why isn't WeakHashMap getting collected? On Mon, Dec 13, 2010 at 9:27 PM, Jonathan Rochkind rochk...@jhu.edu wrote: Yonik, how will maxWarmingSearchers in this scenario affect replication? If a slave is pulling down new indexes so quickly that the warming searchers would ordinarily pile up, but maxWarmingSearchers is set to 1, what happens? Like any other commits, this will limit the number of searchers warming in the background to 1. If a commit is called, and that tries to open a new searcher while another is already warming, it will fail. The next commit that does succeed will have all the updates though. Today, this maxWarmingSearchers check is done after the writer has closed and before a new searcher is opened... so calling commit too often won't affect searching, but it will currently affect indexing speed (since the IndexWriter is constantly being closed/flushed). -Yonik http://www.lucidimagination.com
Re: OutOfMemory GC: GC overhead limit exceeded - Why isn't WeakHashMap getting collected?
Yeah, I understand basically how caches work. What I don't understand is what happens in replication if the new segment files are successfully copied but the actual commit fails due to maxWarmingSearchers. The new files are on disk... but the commit could not succeed and there is NOT a new index reader, because the commit failed. And there is potentially a long gap before a future successful commit. 1. Will the existing index searcher have problems because the files have been changed out from under it? 2. Will a future replication -- at which NO new files are available on the master -- still trigger a future commit on the slave? Maybe these are obvious to everyone but me, because I keep asking this question, and the answer I keep getting is just describing the basics of replication, as if this obviously answers my question. Or maybe the answer isn't obvious or clear to anyone including me, in which case the only way to get an answer is to try and test it myself. A bit complicated to test, at least for my level of knowledge, as I'm not sure exactly what I'd be looking for to answer either of those questions. Jonathan On 12/14/2010 9:53 AM, Upayavira wrote: A Lucene index is made up of segments. Each commit writes a segment. Sometimes, upon commit, some segments are merged together into one, to reduce the overall segment count, as too many segments hinder performance. Upon optimisation, all segments are (typically) merged into a single segment. Replication copies any new segments from the master to the slave, whether they be new segments arriving from a commit, or new segments that are a result of a segment merge. The result is a set of index files on disk that are a clean mirror of the master. Then, when your replication process has finished syncing changed segments, it fires a commit on the slave. This causes Solr to create a new index reader. When the first query comes in, this triggers Solr to populate caches. 
Whoever was unfortunate enough to cause that cache population will see poorer results (we've seen 40s responses rather than 1s). The solution to this is to set up an autowarming query in solrconfig.xml. This query is executed against the new index reader, causing caches to populate from the updated files on disk. Only once that autowarming query has completed will the index reader be made available to Solr for answering search queries. There's some cleverness, whose details I can't remember, specifying how much to keep from the existing caches and how much to build up from the files on disk. If I recall, it is all configured in solrconfig.xml. You ask a good question: whether a commit will be triggered if the sync brought over no new files (i.e. if the previous one did, but this one didn't). I'd imagine that Solr would compare the maximum segment ID on disk with the one in memory to make such a decision, in which case Solr would spot the changes from the previous sync and still work. The best way to be sure is to try it! The simplest way to try it (as I would do it) would be to:
1) switch off post-commit replication
2) post some content to solr
3) commit on the master
4) use rsync to copy the indexes from the master to the slave
5) do another (empty) commit on the master
6) trigger replication via an HTTP request to the slave
7) see if your posted content is available on your slave.
Maybe someone else here can tell you what is actually going on and save you the effort! Does that help you get some understanding of what is going on? Upayavira On Tue, 14 Dec 2010 09:15 -0500, Jonathan Rochkind rochk...@jhu.edu wrote: But the entirety of the old indexes (no longer on disk) wasn't cached in memory, right? Or is it? Maybe this is me not understanding lucene enough. I thought that portions of the index were cached in memory, but that sometimes the index reader still has to go to disk to get things that aren't currently in caches. 
If this is true (tell me if it's not!), we have an index reader that was based on indexes that... are no longer on disk. But the index reader is still open. What happens when it has to go to disk for info? And the second replication will trigger a commit even if there are in fact no new files to be transferred over to the slave, because there have been no changes since the prior sync with failed commit? From: Upayavira [...@odoko.co.uk] Sent: Tuesday, December 14, 2010 2:23 AM To: solr-user@lucene.apache.org Subject: RE: OutOfMemory GC: GC overhead limit exceeded - Why isn't WeakHashMap getting collected? The second commit will bring in all changes, from both syncs. Think of the sync part as a glorified rsync of files on disk. So the files will have been copied to disk, but the in-memory index on the slave will not have noticed that those files have changed. The commit is intended to remedy that - it causes a new index reader to be created, based upon the new on-disk files, which will include updates from both syncs.