Re: EmbeddedSolrServer removed quietly in 7.0?
On Tue, Jan 9, 2018 at 11:37 AM, Shawn Heisey <apa...@elyograg.org> wrote:

> On 1/9/2018 2:42 AM, Robert Krüger wrote:
>> I am looking to upgrade an application that still uses version 4.6.1 to the latest Solr (7.2) but realized that support for EmbeddedSolrServer seems to have vanished with 7.0, yet I could find no mention of it in the release notes, which strikes me as odd for such a disruptive change. Is it somewhere else or has it just gone silently? The class wasn't deprecated in 6.x as far as I can see. What am I missing?
>>
>> I have no problem updating to 6.6.2 for now to delay having to worry about a bigger migration, but I don't want to do that just because I am missing something obvious.
>>
>> Btw. I do not want to start yet another thread about the merits of using an embedded solr server ;-).
>
> It's still there. Here's the 7.2 javadoc:
>
> https://lucene.apache.org/solr/7_2_0/solr-core/org/apache/solr/client/solrj/embedded/EmbeddedSolrServer.html

Duh, thanks so much. I was looking in the solrj javadoc, not solr-core! My bad.

> Since 5.0, it is a descendant of SolrClient, and the SolrServer abstract class was removed in 6.0.
>
> Regarding whether you should use the embedded server or not: For production usage, I would not recommend it, because redundancy is not possible. But there are certain situations (mostly dev/testing) where it's absolutely the right solution.

We're using it in a classic embedded situation in a desktop app where redundancy is not an issue and are super-happy with it.

> I am not expecting EmbeddedSolrServer to be removed. It is used extensively in Solr tests and by many users in the wild.

That is very good to know. Thanks again!
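For reference, since the class lives in solr-core and has been a SolrClient since 5.0, constructing and using it in 7.x looks roughly like this (a minimal sketch; the solr home path and core name are placeholders):

import java.nio.file.Paths;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.embedded.EmbeddedSolrServer;
import org.apache.solr.common.SolrInputDocument;

// "solr-home" must contain solr.xml and the core's conf directory;
// "core1" is the default core name used for requests
SolrClient client = new EmbeddedSolrServer(Paths.get("solr-home"), "core1");
SolrInputDocument doc = new SolrInputDocument();
doc.addField("id", "1");
client.add(doc);     // same SolrClient API as with HttpSolrClient
client.commit();
client.close();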
Re: Efficiency of integer storage/use
Thanks everyone, for your answers. I will probably make a simple parametric test, pumping a Solr index full of those integers with very limited range and then sorting by vector distances, to see what the performance characteristics are.

On Sun, Oct 18, 2015 at 9:08 PM, Mikhail Khludnev <mkhlud...@griddynamics.com> wrote:

> Robert,
> From what I know, both the inverted index and docValues compress content a lot, and even stored fields are compressed too. So I think you have a good chance of experimenting successfully. You might need to tweak the schema to disable storing unnecessary info in the index.
>
> On Sat, Oct 17, 2015 at 1:15 AM, Robert Krüger <krue...@lesspain.de> wrote:
>> Thanks for the feedback.
>>
>> What I am trying to do is to "abuse" integers to store 8-bit (or even lower) values of metrics I use for content-based image/video search (such as statistical values regarding color distribution) and then implement similarity calculations based on formulas using vector distances. The index can become large (tens of millions of documents, each with say 50-100 integers describing the image metrics). I am looking at using a part of those metrics for selecting a subset of images using range queries and then more for sorting the result set by relevance.
>>
>> I was first looking at implementing those metrics as binary fields (see other posting) and then using a custom function for the distance calculation, but so far I got the impression that that approach is not supported really well by Solr. Base64 en-/decoding would kill performance, and implementing a custom field type with all that is probably required for that to work properly is currently beyond my Solr knowledge. Besides, using built-in Solr features makes it easier to fine-tune/experiment with different approaches, because I can just play around with different queries and see what works best, without adjusting a custom function each time.
>>
>> I hope that provides a better picture of what I am trying to achieve.
>>
>> Best,
>>
>> Robert
>>
>> On Fri, Oct 16, 2015 at 4:50 PM, Erick Erickson <erickerick...@gmail.com> wrote:
>>> Under the covers, Lucene stores ints in a packed format, so I'd just count on that for a first pass.
>>>
>>> What is "a lot of integer values"? Hundreds of millions? Billions? Trillions?
>>>
>>> Unless you give us some indication of scale, it's hard to say anything helpful. But unless you have some evidence that you're going to blow out memory, I'd just ignore the "wasted" bits. Especially if you can use docValues; that option holds much of the underlying data in MMapDirectory, which uses swappable OS memory.
>>>
>>> Best,
>>> Erick
>>>
>>> On Fri, Oct 16, 2015 at 1:53 AM, Robert Krüger <krue...@lesspain.de> wrote:
>>>> Hi,
>>>>
>>>> I have a data model where I would store and index a lot of integer values with a very restricted range (e.g. 0-255), so theoretically the 32 bits of Solr's integer fields are complete overkill. I want to be able to do things like vector distance calculations on those fields. Should I worry about the "wasted" bits or will Solr compress/organize the index in a way that compensates for this if there are only 256 (or even fewer) distinct values?
>>>>
>>>> Any recommendations on how my fields should be defined to make things like numeric functions work as fast as technically possible?
>>>>
>>>> Thanks in advance,
>>>>
>>>> Robert
>>
>> --
>> Robert Krüger
>> Managing Partner
>> Lesspain GmbH & Co. KG
>>
>> www.lesspain-software.com
>
> --
> Sincerely yours
> Mikhail Khludnev
> Principal Engineer,
> Grid Dynamics
>
> <http://www.griddynamics.com>
> <mkhlud...@griddynamics.com>

--
Robert Krüger
Managing Partner
Lesspain GmbH & Co. KG

www.lesspain-software.com
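Following up on the docValues hint for the archives: a field for one of these 0-255 metrics could be declared along these lines (a sketch; the field name is made up, and "int" is assumed to be the usual TrieIntField type from the stock schema):

<!-- docValues stores the per-document values in a compact column-oriented
     form, which is what sorting and function queries read -->
<field name="metric_01" type="int" indexed="true" stored="false" docValues="true"/>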
Re: Efficiency of integer storage/use
Thanks for the feedback.

What I am trying to do is to "abuse" integers to store 8-bit (or even lower) values of metrics I use for content-based image/video search (such as statistical values regarding color distribution) and then implement similarity calculations based on formulas using vector distances. The index can become large (tens of millions of documents, each with say 50-100 integers describing the image metrics). I am looking at using a part of those metrics for selecting a subset of images using range queries and then more for sorting the result set by relevance.

I was first looking at implementing those metrics as binary fields (see other posting) and then using a custom function for the distance calculation, but so far I got the impression that that approach is not supported really well by Solr. Base64 en-/decoding would kill performance, and implementing a custom field type with all that is probably required for that to work properly is currently beyond my Solr knowledge. Besides, using built-in Solr features makes it easier to fine-tune/experiment with different approaches, because I can just play around with different queries and see what works best, without adjusting a custom function each time.

I hope that provides a better picture of what I am trying to achieve.

Best,

Robert

On Fri, Oct 16, 2015 at 4:50 PM, Erick Erickson <erickerick...@gmail.com> wrote:

> Under the covers, Lucene stores ints in a packed format, so I'd just count on that for a first pass.
>
> What is "a lot of integer values"? Hundreds of millions? Billions? Trillions?
>
> Unless you give us some indication of scale, it's hard to say anything helpful. But unless you have some evidence that you're going to blow out memory, I'd just ignore the "wasted" bits. Especially if you can use docValues; that option holds much of the underlying data in MMapDirectory, which uses swappable OS memory....
>
> Best,
> Erick
>
> On Fri, Oct 16, 2015 at 1:53 AM, Robert Krüger <krue...@lesspain.de> wrote:
>> Hi,
>>
>> I have a data model where I would store and index a lot of integer values with a very restricted range (e.g. 0-255), so theoretically the 32 bits of Solr's integer fields are complete overkill. I want to be able to do things like vector distance calculations on those fields. Should I worry about the "wasted" bits or will Solr compress/organize the index in a way that compensates for this if there are only 256 (or even fewer) distinct values?
>>
>> Any recommendations on how my fields should be defined to make things like numeric functions work as fast as technically possible?
>>
>> Thanks in advance,
>>
>> Robert

--
Robert Krüger
Managing Partner
Lesspain GmbH & Co. KG

www.lesspain-software.com
Efficiency of integer storage/use
Hi,

I have a data model where I would store and index a lot of integer values with a very restricted range (e.g. 0-255), so theoretically the 32 bits of Solr's integer fields are complete overkill. I want to be able to do things like vector distance calculations on those fields. Should I worry about the "wasted" bits or will Solr compress/organize the index in a way that compensates for this if there are only 256 (or even fewer) distinct values?

Any recommendations on how my fields should be defined to make things like numeric functions work as fast as technically possible?

Thanks in advance,

Robert
Problem with custom function on binary content
Hi,

I am trying to implement a custom function that evaluates fields stored as type "binary" (BinaryField). I have my ValueSourceParser set up, I can retrieve the arguments of my function correctly, and I have a reference to the SchemaField. Now I am a bit stuck on how I can retrieve the binary content from a ValueSource.

Let's make this easier with an example. Let's say I have a field named "fingerprint" defined as binary and I want to implement a custom function "norm" that computes an integer from that binary content that can be used in sorting or whatever. The field is declared like this:

<field name="fingerprint" type="binary" indexed="true" stored="true"/>

I use "norm(fingerprint)" as my sorting expression and it arrives fine in my ValueSourceParser. To obtain a value source to use for the field value I tried two things:

1) parse the value source using parseValueSource()
2) parse the field name using parseId() and then something like:

SchemaField schemaField = fp.getReq().getSchema().getField(arg1);
ValueSource fieldValueSource = schemaField.getType().getValueSource(schemaField, fp);

I thought I would just use that value source in my custom FingerprintNormValueSource to retrieve the byte array via the method org.apache.lucene.queries.function.FunctionValues#objectVal, however, that always returns null for my documents. Now I noticed that the ValueSource I get for the field is an instance of org.apache.solr.schema.StrFieldSource, where objectVal is implemented as strVal, which obviously will never return my fingerprint byte array.

Do I have to encode my byte arrays as strings or is there a direct way to retrieve the binary data for use in my custom ValueSource?

Thanks a lot,

Robert
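For completeness, the parser side currently looks roughly as follows (simplified; FingerprintNormValueSource is my own class and the actual norm computation is omitted):

import org.apache.lucene.queries.function.ValueSource;
import org.apache.solr.schema.SchemaField;
import org.apache.solr.search.FunctionQParser;
import org.apache.solr.search.SyntaxError;
import org.apache.solr.search.ValueSourceParser;

public class NormValueSourceParser extends ValueSourceParser {
  @Override
  public ValueSource parse(FunctionQParser fp) throws SyntaxError {
    // variant 2) from above: resolve the field via the schema
    String fieldName = fp.parseId();
    SchemaField schemaField = fp.getReq().getSchema().getField(fieldName);
    ValueSource fieldValueSource =
        schemaField.getType().getValueSource(schemaField, fp);
    // the wrapper calls objectVal() per document, which is where
    // the StrFieldSource returns null instead of the byte array
    return new FingerprintNormValueSource(fieldValueSource);
  }
}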
Initializing core takes very long at times
Hi,

for months/years, I have been experiencing occasional very long (30s+) hangs when programmatically initializing a Solr container in Java. The application has worked for years in production with this setup without any problems apart from this. The code I have is this here:

public void initContainer(File solrConfig) throws Exception {
    logger.debug("initializing solr container with config {}", solrConfig);
    Preconditions.checkNotNull(solrConfig);
    Stopwatch stopwatch = Stopwatch.createStarted();
    container = CoreContainer.createAndLoad(solrConfig.getParentFile().getAbsolutePath(), solrConfig);
    containerInitialized = true;
    logger.debug("initializing solr container took {}", stopwatch);
    if (stopwatch.elapsed(TimeUnit.MILLISECONDS) > 1000) {
        logger.warn("initializing solr container took very long ({})", stopwatch);
    }
}

So it is obviously the createAndLoad call. I posted about this a long time ago and people suggested checking for uncommitted soft commits, but now I realized that I had these hangs in a test setup where the index is created from scratch, so that cannot be the problem. Any ideas anyone? My config is rather simple. Is there something wrong with my locking options that might cause this?

<config>
  <indexConfig>
    <lockType>native</lockType>
    <unlockOnStartup>true</unlockOnStartup>
    <ramBufferSizeMB>64</ramBufferSizeMB>
    <maxBufferedDocs>1000</maxBufferedDocs>
  </indexConfig>
  <luceneMatchVersion>LUCENE_43</luceneMatchVersion>
  <updateHandler class="solr.DirectUpdateHandler2">
    <autoCommit>
      <maxDocs>1000</maxDocs>
      <maxTime>1</maxTime>
      <openSearcher>false</openSearcher>
    </autoCommit>
  </updateHandler>
  <requestDispatcher handleSelect="true">
    <requestParsers enableRemoteStreaming="false" multipartUploadLimitInKB="2048" />
  </requestDispatcher>
  <requestHandler name="standard" class="solr.StandardRequestHandler" default="true" />
  <requestHandler name="/update" class="solr.UpdateRequestHandler" />
  <requestHandler name="/admin/" class="org.apache.solr.handler.admin.AdminHandlers" />
  <filterCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="256"/>
  <queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="256"/>
  <documentCache class="solr.LRUCache" size="1" initialSize="512" autowarmCount="0"/>
  <admin>
    <defaultQuery>solr</defaultQuery>
  </admin>
</config>

Thank you in advance,
Robert
Embedded Solr now deprecated?
Hi,

I tried to upgrade my application from solr 4 to 5 and just now realized that embedded use of solr seems to be on the way out. Is that correct or is there just a new API to use for that?

Thanks in advance,
Robert
Re: Initializing core takes very long at times
OK, now that I had a reproducible setup I could debug where it hangs:

public SystemInfoHandler(CoreContainer cc) {
    super();
    this.cc = cc;
    init();
}

private void init() {
    try {
        InetAddress addr = InetAddress.getLocalHost();
        hostname = addr.getCanonicalHostName(); // <-- this is where it hangs
    } catch (UnknownHostException e) {
        //default to null
    }
}

So it depends on my current network setup even for the embedded case. Any idea how I can stop Solr from making that call? InetAddress.getLocalHost() in this case returns some local VPN address and thus the reverse lookup times out after 30 seconds. This actually happens twice, once when initializing the container and again when initializing the core, so in my case a minute per restart. Looking at the code, I don't see how I can work around this other than patching Solr, which I am trying to avoid like hell.

On Wed, Aug 5, 2015 at 1:54 PM, Robert Krüger <krue...@lesspain.de> wrote:

> Hi,
>
> for months/years, I have been experiencing occasional very long (30s+) hangs when programmatically initializing a Solr container in Java. The application has worked for years in production with this setup without any problems apart from this. The code I have is this here:
>
> public void initContainer(File solrConfig) throws Exception {
>     logger.debug("initializing solr container with config {}", solrConfig);
>     Preconditions.checkNotNull(solrConfig);
>     Stopwatch stopwatch = Stopwatch.createStarted();
>     container = CoreContainer.createAndLoad(solrConfig.getParentFile().getAbsolutePath(), solrConfig);
>     containerInitialized = true;
>     logger.debug("initializing solr container took {}", stopwatch);
>     if (stopwatch.elapsed(TimeUnit.MILLISECONDS) > 1000) {
>         logger.warn("initializing solr container took very long ({})", stopwatch);
>     }
> }
>
> So it is obviously the createAndLoad call. I posted about this a long time ago and people suggested checking for uncommitted soft commits, but now I realized that I had these hangs in a test setup where the index is created from scratch, so that cannot be the problem. Any ideas anyone? My config is rather simple. Is there something wrong with my locking options that might cause this?
>
> <config>
>   <indexConfig>
>     <lockType>native</lockType>
>     <unlockOnStartup>true</unlockOnStartup>
>     <ramBufferSizeMB>64</ramBufferSizeMB>
>     <maxBufferedDocs>1000</maxBufferedDocs>
>   </indexConfig>
>   <luceneMatchVersion>LUCENE_43</luceneMatchVersion>
>   <updateHandler class="solr.DirectUpdateHandler2">
>     <autoCommit>
>       <maxDocs>1000</maxDocs>
>       <maxTime>1</maxTime>
>       <openSearcher>false</openSearcher>
>     </autoCommit>
>   </updateHandler>
>   <requestDispatcher handleSelect="true">
>     <requestParsers enableRemoteStreaming="false" multipartUploadLimitInKB="2048" />
>   </requestDispatcher>
>   <requestHandler name="standard" class="solr.StandardRequestHandler" default="true" />
>   <requestHandler name="/update" class="solr.UpdateRequestHandler" />
>   <requestHandler name="/admin/" class="org.apache.solr.handler.admin.AdminHandlers" />
>   <filterCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="256"/>
>   <queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="256"/>
>   <documentCache class="solr.LRUCache" size="1" initialSize="512" autowarmCount="0"/>
>   <admin>
>     <defaultQuery>solr</defaultQuery>
>   </admin>
> </config>
>
> Thank you in advance,
> Robert

--
Robert Krüger
Managing Partner
Lesspain GmbH & Co. KG

www.lesspain-software.com
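For anyone wanting to check whether their machine is affected, this little standalone test reproduces the delay outside of Solr (plain JDK, no Solr classes involved):

import java.net.InetAddress;

public class ReverseLookupTest {
    public static void main(String[] args) throws Exception {
        long start = System.currentTimeMillis();
        InetAddress addr = InetAddress.getLocalHost();
        // same call SystemInfoHandler makes; with a broken reverse DNS
        // setup (e.g. a VPN address) this is the line that blocks
        String name = addr.getCanonicalHostName();
        System.out.println(name + " resolved in "
                + (System.currentTimeMillis() - start) + " ms");
    }
}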
Re: Initializing core takes very long at times
I am shipping Solr as a local search engine with our software, so I have no way of controlling that environment. Many other software packages (RDBMSs, NoSQL engines etc.) work well in such a setup (as does Solr, except for this problem). The problem is that in this case (AFAICS) the host cannot be overridden in any way (by config or system property or whatever), because that handler is coded as it is. It is in no way a natural limitation of the type of software or my use case. But I understand that this is probably not frequently a problem for people, because by far most Solr use is classic server-based use. I may suggest a patch on the devel mailing list.

On Wed, Aug 5, 2015 at 5:42 PM, Shawn Heisey <apa...@elyograg.org> wrote:

> On 8/5/2015 7:56 AM, Robert Krüger wrote:
>> OK, now that I had a reproducible setup I could debug where it hangs:
>>
>> public SystemInfoHandler(CoreContainer cc) {
>>     super();
>>     this.cc = cc;
>>     init();
>> }
>>
>> private void init() {
>>     try {
>>         InetAddress addr = InetAddress.getLocalHost();
>>         hostname = addr.getCanonicalHostName(); // <-- this is where it hangs
>>     } catch (UnknownHostException e) {
>>         //default to null
>>     }
>> }
>>
>> So it depends on my current network setup even for the embedded case. Any idea how I can stop Solr from making that call? InetAddress.getLocalHost() in this case returns some local VPN address and thus the reverse lookup times out after 30 seconds. This actually happens twice, once when initializing the container, again when initializing the core, so in my case a minute per restart and looking at the code, I don't see how I can work around this other than patching Solr, which I am trying to avoid like hell.
>
> Because almost all users are using Solr in a mode where it requires the network, that code cannot be eliminated from Solr. It is critical that your machine's local network is set up completely right when you are running applications that are (normally) network-aware. Generally that means having relevant entries for all interfaces in your hosts file and making sure that the DNS resolver code included with the operating system is not buggy.
>
> If you're dealing with a VPN or something else where the address is acquired from elsewhere, then you need to make sure that the machine has at least two DNS servers configured, that one of them is working, and that the forward and reverse DNS on those servers are completely set up for that interface's IP address. Bugs in the DNS resolver code can complicate this.
>
> Thanks,
> Shawn

--
Robert Krüger
Managing Partner
Lesspain GmbH & Co. KG

www.lesspain-software.com
Re: Embedded Solr now deprecated?
I just saw lots of deprecation warnings in my current code and a method that was removed, which is why I asked.

Regarding the use case, I am embedding it in a desktop application, just as others use Java-based NoSQL or RDBMS engines, and that makes sense architecturally in my case and is just simpler than deploying a separate little Tomcat instance. API-wise, I know it is the same and it would be doable to do it that way. The embedded option is just the logical and simpler choice in terms of delivery, packaging, installation and automated testing in this case. The network option just doesn't add anything here apart from overhead (probably negligible in our case) and complexity. So our use is in production, but in a desktop way, not what people normally think about when they hear "production use".

Thanks everyone for the quick feedback! I am very relieved to hear it is not on its way out and I will look at the API changes more closely and try to get our application running on 5.2.1.

Best regards,
Robert

On Wed, Aug 5, 2015 at 4:34 PM, Erick Erickson <erickerick...@gmail.com> wrote:

> Where did you see that? Maybe I missed something yet again. This is unrelated to whether we ship a WAR, if that's being conflated here.
>
> I rather doubt that embedded is on its way out, although my memory isn't what it used to be. For starters, MapReduceIndexerTool uses it, so it gets regular exercise from that, and anything removing it would require some kind of replacement.
>
> How are you using it that you care? Wondering what alternatives exist...
>
> Best,
> Erick
>
> On Wed, Aug 5, 2015 at 9:09 AM, Robert Krüger <krue...@lesspain.de> wrote:
>> Hi,
>>
>> I tried to upgrade my application from solr 4 to 5 and just now realized that embedded use of solr seems to be on the way out. Is that correct or is there just a new API to use for that?
>>
>> Thanks in advance,
>> Robert

--
Robert Krüger
Managing Partner
Lesspain GmbH & Co. KG

www.lesspain-software.com
Re: Initializing core takes very long at times
I just posted on lucene-dev. I think just replacing getCanonicalHostName with getHostName might do the job. At least that's exactly what Logback does for this purpose:

http://logback.qos.ch/xref/ch/qos/logback/core/util/ContextUtil.html

On Wed, Aug 5, 2015 at 6:57 PM, Erick Erickson <erickerick...@gmail.com> wrote:

> All patches welcome!
>
> On Wed, Aug 5, 2015 at 12:40 PM, Robert Krüger <krue...@lesspain.de> wrote:
>> I am shipping Solr as a local search engine with our software, so I have no way of controlling that environment. Many other software packages (RDBMSs, NoSQL engines etc.) work well in such a setup (as does Solr, except for this problem). The problem is that in this case (AFAICS) the host cannot be overridden in any way (by config or system property or whatever), because that handler is coded as it is. It is in no way a natural limitation of the type of software or my use case. But I understand that this is probably not frequently a problem for people, because by far most Solr use is classic server-based use. I may suggest a patch on the devel mailing list.
>>
>> On Wed, Aug 5, 2015 at 5:42 PM, Shawn Heisey <apa...@elyograg.org> wrote:
>>> On 8/5/2015 7:56 AM, Robert Krüger wrote:
>>>> OK, now that I had a reproducible setup I could debug where it hangs:
>>>>
>>>> private void init() {
>>>>     try {
>>>>         InetAddress addr = InetAddress.getLocalHost();
>>>>         hostname = addr.getCanonicalHostName(); // <-- this is where it hangs
>>>>     } catch (UnknownHostException e) {
>>>>         //default to null
>>>>     }
>>>> }
>>>>
>>>> So it depends on my current network setup even for the embedded case. Any idea how I can stop Solr from making that call? InetAddress.getLocalHost() in this case returns some local VPN address and thus the reverse lookup times out after 30 seconds. This actually happens twice, once when initializing the container, again when initializing the core, so in my case a minute per restart and looking at the code, I don't see how I can work around this other than patching Solr, which I am trying to avoid like hell.
>>>
>>> Because almost all users are using Solr in a mode where it requires the network, that code cannot be eliminated from Solr. It is critical that your machine's local network is set up completely right when you are running applications that are (normally) network-aware. Generally that means having relevant entries for all interfaces in your hosts file and making sure that the DNS resolver code included with the operating system is not buggy.
>>>
>>> If you're dealing with a VPN or something else where the address is acquired from elsewhere, then you need to make sure that the machine has at least two DNS servers configured, that one of them is working, and that the forward and reverse DNS on those servers are completely set up for that interface's IP address. Bugs in the DNS resolver code can complicate this.
>>>
>>> Thanks,
>>> Shawn

--
Robert Krüger
Managing Partner
Lesspain GmbH & Co. KG

www.lesspain-software.com
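The change itself would be tiny; in SystemInfoHandler.init() it would amount to this (untested sketch):

InetAddress addr = InetAddress.getLocalHost();
// getHostName() returns the name getLocalHost() already determined and,
// unlike getCanonicalHostName(), does not force a reverse DNS lookup
hostname = addr.getHostName();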
Re: Example of sorting by custom function
Thanks a lot!

Robert

On Fri, Apr 3, 2015 at 5:34 PM, david.w.smi...@gmail.com <david.w.smi...@gmail.com> wrote:

> ValueSourceParser — yes. You’ll find a ton of them in Solr to get ideas from.
>
> In your example you forgot the “asc” or “desc”.
>
> ~ David Smiley
> Freelance Apache Lucene/Solr Search Consultant/Developer
> http://www.linkedin.com/in/davidwsmiley
>
> On Fri, Apr 3, 2015 at 9:44 AM, Robert Krüger <krue...@lesspain.de> wrote:
>> Hi,
>>
>> I have been looking around on the web for information on sorting by a custom function, but the results are inconclusive to me and some of it seems so old that I suspect it's outdated. What I want to do is the following:
>>
>> I have a field "fingerprint" in my schema that contains binary data (e.g. 64 bytes) that can be used to compute a distance/similarity between two records. Now I want to be able to execute a query and sort its result by the distance of the matching records from a given reference value, so the query would look something like this:
>>
>> q=*:*&sort=my_distance_func(fingerprint, 0xadet54786eguizgig)
>>
>> where
>>
>> my_distance_func is my custom function,
>> fingerprint is the field in my schema (type binary),
>> 0xadet54786eguizgig is a reference value (byte array encoded in whatever way) to which the distance shall be computed, which differs with each query.
>>
>> Is ValueSourceParser the right way to look here? Is there a source code example someone can point me to?
>>
>> Thanks in advance,
>> Robert
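For the archives, with the missing sort direction added, the query from the example would read (function name and reference value being the hypothetical ones from the original mail):

q=*:*&sort=my_distance_func(fingerprint, 0xadet54786eguizgig) asc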
Example of sorting by custom function
Hi,

I have been looking around on the web for information on sorting by a custom function, but the results are inconclusive to me and some of it seems so old that I suspect it's outdated. What I want to do is the following:

I have a field "fingerprint" in my schema that contains binary data (e.g. 64 bytes) that can be used to compute a distance/similarity between two records. Now I want to be able to execute a query and sort its result by the distance of the matching records from a given reference value, so the query would look something like this:

q=*:*&sort=my_distance_func(fingerprint, 0xadet54786eguizgig)

where

my_distance_func is my custom function,
fingerprint is the field in my schema (type binary),
0xadet54786eguizgig is a reference value (byte array encoded in whatever way) to which the distance shall be computed, which differs with each query.

Is ValueSourceParser the right way to look here? Is there a source code example someone can point me to?

Thanks in advance,
Robert
Re: Easiest way to embed solr in a desktop application
Hi Ahmet,

at first glance, I'm not sure. Need to look at it more carefully.

Thanks,
Robert

On Thu, Jan 15, 2015 at 3:53 PM, Ahmet Arslan <iori...@yahoo.com.invalid> wrote:

> Hi Robert,
>
> Never used it myself, but is solr-packager useful in your case?
> http://sourcesense.github.io/solr-packager/
>
> Ahmet
>
> On Thursday, January 15, 2015 4:45 PM, Robert Krüger <krue...@lesspain.de> wrote:
>> Hi Andrea,
>>
>> you are assuming correctly. It is a local, non-distributed index that is only accessed by the containing desktop application. Do you know if there is a possibility to run the Solr admin UI on top of an embedded instance somehow?
>>
>> Thanks a lot,
>> Robert
>>
>> On Thu, Jan 15, 2015 at 3:17 PM, Andrea Gazzarini <a.gazzar...@gmail.com> wrote:
>>> Hi Robert,
>>>
>>> I've used the EmbeddedSolrServer in a scenario like that and I never had problems. I assume you're talking about a standalone application, where the whole index resides locally and you don't need any cluster / cloud / distributed feature.
>>>
>>> I think the usage of EmbeddedSolrServer is discouraged in a (distributed) service scenario, because it is a direct connection to a SolrCore instance...but this is not a problem in the situation you described (as far as I know)
>>>
>>> Best,
>>> Andrea
>>>
>>> On 01/15/2015 03:10 PM, Robert Krüger wrote:
>>>> Hi,
>>>>
>>>> I have been using an embedded instance of solr in my desktop application for a long time and it works fine. At the time when I made that decision (vs. firing up a solr web application within my swing application) I got the impression embedded use is somewhat unsupported and I should expect problems. My first question is: is this still the case now (4 years later), that embedded solr is discouraged?
>>>>
>>>> The one limitation I am running into is that I cannot use the solr admin UI for debugging purposes (mainly for running queries). Is there any other way to do this other than no longer using embedded solr and programmatically firing up a web application (e.g. using jetty)? Should I do the latter anyway?
>>>>
>>>> Any insights/advice greatly appreciated.
>>>>
>>>> Best regards,
>>>> Robert
>>
>> --
>> Robert Krüger
>> Managing Partner
>> Lesspain GmbH & Co. KG
>>
>> www.lesspain-software.com

--
Robert Krüger
Managing Partner
Lesspain GmbH & Co. KG

www.lesspain-software.com
Re: Easiest way to embed solr in a desktop application
I was considering the programmatic Jetty option, but then I read that Solr 5 no longer supports being run with an external servlet container; maybe they still support programmatic Jetty use in some way. At the moment I am using Solr 4.x, so this would work. No idea if this gets messy classloader-wise in any way.

I have been using exactly the approach you described in the past, i.e. I built a really, really simple Swing dialogue to input queries and display results in a table, but was just guessing that the built-in UI was far superior. Maybe I should just live with it for the time being.

On Thu, Jan 15, 2015 at 3:56 PM, Erik Hatcher <erik.hatc...@gmail.com> wrote:

> It’d certainly be easiest to just embed Jetty into your application. You don’t need to have Jetty as a separate process, you could launch it through its friendly Java API, configured to use solr.war.
>
> If all you needed was to make HTTP(-like) queries to Solr instead of the full admin UI, your application could stick to using EmbeddedSolrServer and also provide a UI that takes in a Solr query string (or builds one up) and then sends it to the embedded Solr and displays the result.
>
> Erik
>
> On Jan 15, 2015, at 9:44 AM, Robert Krüger <krue...@lesspain.de> wrote:
>> Hi Andrea,
>>
>> you are assuming correctly. It is a local, non-distributed index that is only accessed by the containing desktop application. Do you know if there is a possibility to run the Solr admin UI on top of an embedded instance somehow?
>>
>> Thanks a lot,
>> Robert
>>
>> On Thu, Jan 15, 2015 at 3:17 PM, Andrea Gazzarini <a.gazzar...@gmail.com> wrote:
>>> Hi Robert,
>>>
>>> I've used the EmbeddedSolrServer in a scenario like that and I never had problems. I assume you're talking about a standalone application, where the whole index resides locally and you don't need any cluster / cloud / distributed feature.
>>>
>>> I think the usage of EmbeddedSolrServer is discouraged in a (distributed) service scenario, because it is a direct connection to a SolrCore instance...but this is not a problem in the situation you described (as far as I know)
>>>
>>> Best,
>>> Andrea
>>>
>>> On 01/15/2015 03:10 PM, Robert Krüger wrote:
>>>> Hi,
>>>>
>>>> I have been using an embedded instance of solr in my desktop application for a long time and it works fine. At the time when I made that decision (vs. firing up a solr web application within my swing application) I got the impression embedded use is somewhat unsupported and I should expect problems. My first question is: is this still the case now (4 years later), that embedded solr is discouraged?
>>>>
>>>> The one limitation I am running into is that I cannot use the solr admin UI for debugging purposes (mainly for running queries). Is there any other way to do this other than no longer using embedded solr and programmatically firing up a web application (e.g. using jetty)? Should I do the latter anyway?
>>>>
>>>> Any insights/advice greatly appreciated.
>>>>
>>>> Best regards,
>>>> Robert
>>
>> --
>> Robert Krüger
>> Managing Partner
>> Lesspain GmbH & Co. KG
>>
>> www.lesspain-software.com
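For reference, the programmatic Jetty variant Erik describes would look roughly like this with Jetty's embedding API (an untested sketch; the war path and port are placeholders, and solr.solr.home would still need to point at the existing index):

import org.eclipse.jetty.server.Server;
import org.eclipse.jetty.webapp.WebAppContext;

public class SolrWebAppLauncher {
    public static void main(String[] args) throws Exception {
        // System.setProperty("solr.solr.home", "path/to/solr-home"); // placeholder
        Server server = new Server(8983);        // placeholder port
        WebAppContext ctx = new WebAppContext();
        ctx.setWar("path/to/solr.war");          // placeholder path to the Solr war
        ctx.setContextPath("/solr");
        server.setHandler(ctx);
        server.start();                          // admin UI then reachable under /solr
        server.join();
    }
}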
Re: Easiest way to embed solr in a desktop application
Hi Andrea,

you are assuming correctly. It is a local, non-distributed index that is only accessed by the containing desktop application. Do you know if there is a possibility to run the Solr admin UI on top of an embedded instance somehow?

Thanks a lot,
Robert

On Thu, Jan 15, 2015 at 3:17 PM, Andrea Gazzarini <a.gazzar...@gmail.com> wrote:

> Hi Robert,
>
> I've used the EmbeddedSolrServer in a scenario like that and I never had problems. I assume you're talking about a standalone application, where the whole index resides locally and you don't need any cluster / cloud / distributed feature.
>
> I think the usage of EmbeddedSolrServer is discouraged in a (distributed) service scenario, because it is a direct connection to a SolrCore instance...but this is not a problem in the situation you described (as far as I know)
>
> Best,
> Andrea
>
> On 01/15/2015 03:10 PM, Robert Krüger wrote:
>> Hi,
>>
>> I have been using an embedded instance of solr in my desktop application for a long time and it works fine. At the time when I made that decision (vs. firing up a solr web application within my swing application) I got the impression embedded use is somewhat unsupported and I should expect problems. My first question is: is this still the case now (4 years later), that embedded solr is discouraged?
>>
>> The one limitation I am running into is that I cannot use the solr admin UI for debugging purposes (mainly for running queries). Is there any other way to do this other than no longer using embedded solr and programmatically firing up a web application (e.g. using jetty)? Should I do the latter anyway?
>>
>> Any insights/advice greatly appreciated.
>>
>> Best regards,
>> Robert

--
Robert Krüger
Managing Partner
Lesspain GmbH & Co. KG

www.lesspain-software.com
Easiest way to embed solr in a desktop application
Hi,

I have been using an embedded instance of solr in my desktop application for a long time and it works fine. At the time when I made that decision (vs. firing up a solr web application within my swing application) I got the impression embedded use is somewhat unsupported and I should expect problems. My first question is: is this still the case now (4 years later), that embedded solr is discouraged?

The one limitation I am running into is that I cannot use the solr admin UI for debugging purposes (mainly for running queries). Is there any other way to do this other than no longer using embedded solr and programmatically firing up a web application (e.g. using jetty)? Should I do the latter anyway?

Any insights/advice greatly appreciated.

Best regards,
Robert
Re: Performance/scaling with custom function queries
Thanks for the info. I will look at that.

On Wed, Jun 11, 2014 at 3:47 PM, Joel Bernstein <joels...@gmail.com> wrote:

> In Solr 4.9 there is a feature called RankQueries that allows you to plug in your own ranking collector. So, if you wanted to write a ranking/sorting collector that used a thread per segment, you could cleanly plug it in.
>
> Joel Bernstein
> Search Engineer at Heliosearch
>
> On Wed, Jun 11, 2014 at 9:39 AM, david.w.smi...@gmail.com <david.w.smi...@gmail.com> wrote:
>> On Wed, Jun 11, 2014 at 7:46 AM, Robert Krüger <krue...@lesspain.de> wrote:
>>> Or will I have to set up distributed search to achieve that?
>>
>> Yes — you have to shard it to achieve that. The shards could be on the same node. There were some discussions this year in JIRA about being able to do thread-per-segment but it’s not quite there yet. FWIW I think it would be a nice option for some use-cases (like yours).
>>
>> ~ David Smiley
>> Freelance Apache Lucene/Solr Search Consultant/Developer
>> http://www.linkedin.com/in/davidwsmiley

--
Robert Krüger
Managing Partner
Lesspain GmbH & Co. KG

www.lesspain-software.com
Re: Performance/scaling with custom function queries
Would Solr use multithreading to process the records of a function query as described above? In my scenario concurrent searches are not the issue; rather, the speed of one query will be the optimization target. Or will I have to set up distributed search to achieve that?

Thanks,
Robert

On Tue, Jun 10, 2014 at 10:11 AM, Robert Krüger <krue...@lesspain.de> wrote:

> Great, I was hoping for that. In my case I will have to deal with the worst-case scenario, i.e. all documents matching the query, because the only criterion is the fingerprint and the result of the distance/similarity function, which will have to be executed for every document. However, I am dealing with a scenario where there will not be many concurrent users.
>
> Thank you.
>
> On Mon, Jun 9, 2014 at 1:57 AM, Joel Bernstein <joels...@gmail.com> wrote:
>> You only need to have fast access to the fingerprint field, so only that field needs to be in memory. You'll want to review how Lucene DocValues and FieldCache work.
>>
>> Sorting is done with a PriorityQueue, so only the top N docs are kept in memory. You'll only need to access the fingerprint field values for documents that match the query, so it won't be a full table scan unless all the docs match the query.
>>
>> Sounds like an interesting project. Please keep us posted.
>>
>> Joel Bernstein
>> Search Engineer at Heliosearch
>>
>> On Sun, Jun 8, 2014 at 6:17 AM, Robert Krüger <krue...@lesspain.de> wrote:
>>> Hi,
>>>
>>> let's say I have an index that contains a field of type BinaryField called "fingerprint" that stores a few (let's say 100) bytes that are some kind of digital fingerprint-like thing. Let's say I want to perform queries on that field to achieve sorting or filtering based on a kind of custom distance function customDistance, i.e. I input a reference fingerprint and Solr returns either all documents sorted by customDistance(referenceFingerprint, documentFingerprint) or uses that in an frange expression for filtering.
>>>
>>> I have read http://wiki.apache.org/solr/SolrPerformanceFactors and I do understand that using function queries with a custom function is definitely an expensive thing, as it will result in what is called a full table scan in the SQL world, i.e. data from all documents needs to be touched to select the correct documents or sort by a function's result.
>>>
>>> Given all that, and provided I have to use a custom function for my needs, I would like to know a few more details about Solr architecture to understand what I have to look out for. I will have potentially millions of records. Does the data contained in other index fields play a role, when it comes to RAM usage, if I only use the fingerprint field for sorting and searching? I am hoping to calculate that my RAM should be able to accommodate the fingerprint data of all available documents for the queries to be fast, but not the fingerprint data and all other indexed or stored data.
>>>
>>> Example: My fingerprint data needs 100 bytes per document, my other indexed field data needs 900 bytes per document. Will I need 100MB or 1GB to fit all data that is needed to process one query in memory?
>>>
>>> Are there other things to be aware of?
>>>
>>> Thanks,
>>> Robert

--
Robert Krüger
Managing Partner
Lesspain GmbH & Co. KG

www.lesspain-software.com
Re: Performance/scaling with custom function queries
Great, I was hoping for that. In my case I will have to deal with the worst-case scenario, i.e. all documents matching the query, because the only criterion is the fingerprint and the result of the distance/similarity function, which will have to be executed for every document. However, I am dealing with a scenario where there will not be many concurrent users.

Thank you.

On Mon, Jun 9, 2014 at 1:57 AM, Joel Bernstein <joels...@gmail.com> wrote:

> You only need to have fast access to the fingerprint field, so only that field needs to be in memory. You'll want to review how Lucene DocValues and FieldCache work.
>
> Sorting is done with a PriorityQueue, so only the top N docs are kept in memory. You'll only need to access the fingerprint field values for documents that match the query, so it won't be a full table scan unless all the docs match the query.
>
> Sounds like an interesting project. Please keep us posted.
>
> Joel Bernstein
> Search Engineer at Heliosearch
>
> On Sun, Jun 8, 2014 at 6:17 AM, Robert Krüger <krue...@lesspain.de> wrote:
>> Hi,
>>
>> let's say I have an index that contains a field of type BinaryField called "fingerprint" that stores a few (let's say 100) bytes that are some kind of digital fingerprint-like thing. Let's say I want to perform queries on that field to achieve sorting or filtering based on a kind of custom distance function customDistance, i.e. I input a reference fingerprint and Solr returns either all documents sorted by customDistance(referenceFingerprint, documentFingerprint) or uses that in an frange expression for filtering.
>>
>> I have read http://wiki.apache.org/solr/SolrPerformanceFactors and I do understand that using function queries with a custom function is definitely an expensive thing, as it will result in what is called a full table scan in the SQL world, i.e. data from all documents needs to be touched to select the correct documents or sort by a function's result.
>>
>> Given all that, and provided I have to use a custom function for my needs, I would like to know a few more details about Solr architecture to understand what I have to look out for. I will have potentially millions of records. Does the data contained in other index fields play a role, when it comes to RAM usage, if I only use the fingerprint field for sorting and searching? I am hoping to calculate that my RAM should be able to accommodate the fingerprint data of all available documents for the queries to be fast, but not the fingerprint data and all other indexed or stored data.
>>
>> Example: My fingerprint data needs 100 bytes per document, my other indexed field data needs 900 bytes per document. Will I need 100MB or 1GB to fit all data that is needed to process one query in memory?
>>
>> Are there other things to be aware of?
>>
>> Thanks,
>> Robert

--
Robert Krüger
Managing Partner
Lesspain GmbH & Co. KG

www.lesspain-software.com
Performance/scaling with custom function queries
Hi,

let's say I have an index that contains a field of type BinaryField called "fingerprint" that stores a few (let's say 100) bytes that are some kind of digital fingerprint-like thing. Let's say I want to perform queries on that field to achieve sorting or filtering based on a kind of custom distance function customDistance, i.e. I input a reference fingerprint and Solr returns either all documents sorted by customDistance(referenceFingerprint, documentFingerprint) or uses that in an frange expression for filtering.

I have read http://wiki.apache.org/solr/SolrPerformanceFactors and I do understand that using function queries with a custom function is definitely an expensive thing, as it will result in what is called a full table scan in the SQL world, i.e. data from all documents needs to be touched to select the correct documents or sort by a function's result.

Given all that, and provided I have to use a custom function for my needs, I would like to know a few more details about Solr architecture to understand what I have to look out for. I will have potentially millions of records. Does the data contained in other index fields play a role, when it comes to RAM usage, if I only use the fingerprint field for sorting and searching? I am hoping to calculate that my RAM should be able to accommodate the fingerprint data of all available documents for the queries to be fast, but not the fingerprint data and all other indexed or stored data.

Example: My fingerprint data needs 100 bytes per document, my other indexed field data needs 900 bytes per document. Will I need 100MB or 1GB to fit all data that is needed to process one query in memory?

Are there other things to be aware of?

Thanks,
Robert
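For the filtering variant mentioned above, the frange query parser would be the vehicle; a filter query along these lines (customDistance is the hypothetical function from this example, referenceFingerprint stands for the encoded reference value, and l/u bound the allowed distance):

q=*:*&fq={!frange l=0 u=10}customDistance(fingerprint, referenceFingerprint)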
Set up embedded Solr container and cores programmatically to read their configs from the classpath
Hi,

I have an application with an embedded Solr instance (and I want to keep it embedded) and so far I have been setting up my Solr installation programmatically, using folder paths to specify where the specific container or core configs are. I have used the CoreContainer methods createAndLoad and create with File arguments and this works fine.

However, now I want to change this so that all configuration files are loaded from certain locations using the classloader, but I have not been able to get this to work. E.g. I want to have my solr config located in the classpath at my/base/package/solr/conf and the core configs at my/base/package/solr/cores/core1/conf, my/base/package/solr/cores/core2/conf etc. Is this possible at all? Looking through the source code it seems that specifying classpath resources in such a qualified way is not supported, but I may be wrong.

I could get this to work for the container by supplying my own implementation of SolrResourceLoader that allows a base path to be specified for the resources to be loaded (I first thought that would happen already when specifying instanceDir accordingly, but looking at the code it does not; for resources loaded through the classloader, instanceDir is not prepended). However, then I am stuck with the loading of the cores' resources, as the respective code (see org.apache.solr.core.CoreContainer#createFromLocal) instantiates a SolrResourceLoader internally.

Thanks for any help with this (be it a clarification that it is not possible).

Robert
Programmatic instantiation of solr container and cores with config loaded from a jar
Hi,

I use Solr embedded in a desktop app and I want to change it to no longer require the configuration for the container and core to be in the filesystem, but rather have it distributed as part of a jar file. Could someone kindly point me to the right docs? So far my impression is that I need to instantiate CoreContainer with a custom SolrResourceLoader with properties parsed via some other API, but from the javadocs alone I feel a bit lost (why does it have to have an instance directory at all?) and googling did not give me many results.

What would be ideal would be to have something like this (pseudocode with partly imagined names, which hopefully illustrates what I am trying to achieve):

ContainerConfig containerConfig = ContainerConfigParser.parse(/* InputStream from Classloader */);
CoreContainer container = new CoreContainer(containerConfig);
CoreConfig coreConfig = CoreConfigParser.parse(container, /* InputStream from Classloader */);
container.register(name, coreConfig);

Ideally I would like to keep the XML format to reuse my current solr.xml and solrconfig.xml, but that is just a nice-to-have. Does such a way exist and if so, what are the real API classes and calls to use?

Thank you in advance,
Robert
Re: Programmatic instantiation of solr container and cores with config loaded from a jar
Great, thank you!

On Jul 22, 2013 1:35 PM, Alan Woodward <a...@flax.co.uk> wrote:

> Hi Robert,
>
> The upcoming 4.4 release should make this a bit easier (you can check out the release branch now if you like, or wait a few days for the official version). CoreContainer now takes a SolrResourceLoader and a ConfigSolr object as constructor parameters, and you can create a ConfigSolr object from a string representation of solr.xml using the ConfigSolr.fromString() static method.
>
> Alan Woodward
> www.flax.co.uk
>
> On 22 Jul 2013, at 11:41, Robert Krüger wrote:
>> Hi,
>>
>> I use Solr embedded in a desktop app and I want to change it to no longer require the configuration for the container and core to be in the filesystem, but rather have it distributed as part of a jar file. Could someone kindly point me to the right docs? So far my impression is that I need to instantiate CoreContainer with a custom SolrResourceLoader with properties parsed via some other API, but from the javadocs alone I feel a bit lost (why does it have to have an instance directory at all?) and googling did not give me many results.
>>
>> What would be ideal would be to have something like this (pseudocode with partly imagined names, which hopefully illustrates what I am trying to achieve):
>>
>> ContainerConfig containerConfig = ContainerConfigParser.parse(/* InputStream from Classloader */);
>> CoreContainer container = new CoreContainer(containerConfig);
>> CoreConfig coreConfig = CoreConfigParser.parse(container, /* InputStream from Classloader */);
>> container.register(name, coreConfig);
>>
>> Ideally I would like to keep the XML format to reuse my current solr.xml and solrconfig.xml, but that is just a nice-to-have. Does such a way exist and if so, what are the real API classes and calls to use?
>>
>> Thank you in advance,
>> Robert
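Based on Alan's description, the 4.4 usage would look something like the following (a sketch; the resource path, MyApp class and instance dir are placeholders, and the exact fromString() signature may differ between 4.x point releases):

import java.io.InputStream;
import java.util.Scanner;
import org.apache.solr.core.ConfigSolr;
import org.apache.solr.core.CoreContainer;
import org.apache.solr.core.SolrResourceLoader;

// load solr.xml from the classpath instead of the filesystem
InputStream in = MyApp.class.getResourceAsStream("/solr/solr.xml"); // placeholder
String solrXml = new Scanner(in, "UTF-8").useDelimiter("\\A").next();

SolrResourceLoader loader = new SolrResourceLoader("solr-home");     // placeholder dir
ConfigSolr config = ConfigSolr.fromString(loader, solrXml);
CoreContainer container = new CoreContainer(loader, config);
container.load();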
Is Overlapping onDeckSearchers=2 really a problem?
Hi,

I have a desktop application where I am abusing Solr as an embedded database, and I am quite happy with everything. Performance is more than good enough for my use case and Solr's query capabilities match the requirements of my app quite well. However, I get the well-known performance warnings (see subject) in the log whenever I index a lot of documents, although I never experience any performance problems (they might be hidden, though). The properties of my app are:

- I (soft-)commit after every indexed item because I need the changes to be visible immediately
- The commits are serialized
- I do not have any warming queries configured

I have read the FAQ but don't see anything that helps in my case. As I said, I am happy with everything as it is, but the warning makes me a bit nervous (and maybe at some point my customers, when their logs are full of those warnings). What could I do to eliminate it? Can I configure only one searcher to be used, or anything like that?

Thanks for any hints,
Robert
Re: Is Overlapping onDeckSearchers=2 really a problem?
Hi,

On Thu, Jun 27, 2013 at 12:23 PM, Robert Krüger <krue...@lesspain.de> wrote:

> Hi,
>
> I have a desktop application where I am abusing Solr as an embedded database, and I am quite happy with everything. Performance is more than good enough for my use case and Solr's query capabilities match the requirements of my app quite well. However, I get the well-known performance warnings (see subject) in the log whenever I index a lot of documents, although I never experience any performance problems (they might be hidden, though). The properties of my app are:
>
> - I (soft-)commit after every indexed item because I need the changes to be visible immediately
> - The commits are serialized
> - I do not have any warming queries configured
>
> I have read the FAQ but don't see anything that helps in my case. As I said, I am happy with everything as it is, but the warning makes me a bit nervous (and maybe at some point my customers, when their logs are full of those warnings). What could I do to eliminate it? Can I configure only one searcher to be used, or anything like that?
>
> Thanks for any hints,
> Robert

Sometimes forcing oneself to describe a problem is the first step to a solution. I just realized that I also had an autocommit statement in my config with the exact same amount of time that seemed to lie between the warnings. I removed it, because I don't think I really need it, and now the warnings are gone. So it seems it happened whenever my manual commits overlapped with an autocommit, which, of course, was more likely when many commits were issued in sequence.
Re: Is Overlapping onDeckSearchers=2 really a problem?
Shawn,

On Thu, Jun 27, 2013 at 5:03 PM, Shawn Heisey <s...@elyograg.org> wrote:

> On 6/27/2013 5:59 AM, Robert Krüger wrote:
>> Sometimes forcing oneself to describe a problem is the first step to a solution. I just realized that I also had an autocommit statement in my config with the exact same amount of time that seemed to lie between the warnings. I removed it, because I don't think I really need it, and now the warnings are gone. So it seems it happened whenever my manual commits overlapped with an autocommit, which, of course, was more likely when many commits were issued in sequence.
>
> If all you are doing is soft commits, your transaction logs are going to grow out of control.

You are absolutely right. I was shooting myself in the foot with that change.

> http://wiki.apache.org/solr/SolrPerformanceProblems#Slow_startup
>
> My recommendation:
> 1) Remove all commits from your indexing application.
> 2) Configure autoCommit with values similar to that wiki page.
> 3) Configure autoSoftCommit to happen often.
>
> The autoCommit must have openSearcher set to false. For autoSoftCommit, include a maxTime between 1000 and 5000 (milliseconds) and leave maxDocs out.

I did that, but without autoSoftCommit, because I need control over when the commits happen and soft-commit in my application.

Thank you so much,
Robert
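Spelled out in solrconfig.xml, Shawn's recommendation would look roughly like this (the hard-commit maxTime is an illustrative value; the soft-commit maxTime is from the 1000-5000 ms range he suggests):

<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <maxTime>15000</maxTime>            <!-- illustrative; flushes the transaction log -->
    <openSearcher>false</openSearcher>  <!-- hard commits don't open a new searcher -->
  </autoCommit>
  <autoSoftCommit>
    <maxTime>1000</maxTime>             <!-- makes changes visible to searches -->
  </autoSoftCommit>
</updateHandler>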
Re: copyField generates multiple values encountered for non multiValued field
On Wed, Jun 5, 2013 at 9:12 PM, Jack Krupansky <j...@basetechnology.com> wrote:

> Look in the Solr log - the error message should tell you what the multiple values are. For example,
>
> 95484 [qtp2998209-11] ERROR org.apache.solr.core.SolrCore – org.apache.solr.common.SolrException: ERROR: [doc=doc-1] multiple values encountered for non multiValued field content_s: [def, abc]
>
> One of the values should be the value of the field that is the source of the copyField. Maybe the other value will give you a clue as to where it came from.
>
> Check your SolrJ code - maybe you actually do try to initialize a value in the field that is the copyField target.

I see the values in the stack trace:

org.apache.solr.common.SolrException: ERROR: [doc=8f60d040-3462-4b28-998f-fd05a64f1cd8:/] multiple values encountered for non multiValued field name2: [rename, rename]

It is just twice the value of the source field, and I am not referencing that field in my Java code.
Re: copyField generates multiple values encountered for non multiValued field
I don't know what I would have to do to use the atomic update feature, but I am not aware of using it. The way you describe it, it means that the copyField directive does not overwrite the existing field content, and that's an easy explanation of what is happening in my case. Then the second update (which I do manually, i.e. read current state, manipulate fields and then add the document with the same id) will lead to this. That was not so obvious to me from the docs.

Thanks,
Robert

On Thu, Jun 6, 2013 at 12:18 AM, Chris Hostetter <hossman_luc...@fucit.org> wrote:

> : I updated the Index using SolrJ and got the exact same error message
>
> there aren't a lot of specifics provided in this thread, so this may not be applicable, but if you mean you actually used the atomic updates feature to update an existing document, then the problem is that you still have the existing value in your name2 field, as well as another copy of the name field evaluated by copyField after the updates are applied...
>
> http://wiki.apache.org/solr/Atomic_Updates#Stored_Values
>
> -Hoss
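In code, "nulling out" the copyField target before re-adding the document boils down to something like this (field names from my schema; toInputDocument is a hypothetical helper that turns the read document back into a SolrInputDocument):

// read current state, then drop the copyField target so that the
// copyField directive can repopulate it from "name" on re-add
SolrInputDocument doc = toInputDocument(currentDoc); // hypothetical helper
doc.removeField("name2");
doc.setField("name", newName);
server.add(doc);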
Re: copyField generates multiple values encountered for non multiValued field
On Thu, Jun 6, 2013 at 1:52 PM, Jack Krupansky <j...@basetechnology.com> wrote:

>> read current state, manipulate fields and then add the document with the same id)
>
> Ahh... then you have an IMPLICIT reference to the field in your Java code - you explicitly told Solr that you wanted to start with all existing field values. Just because a field is the target of a copyField doesn't make it any different from any other field when reading.
>
> Although, it does beg the question of whether or not this field should be stored or not - that's a data modeling question that only you can resolve. Do queries need to retrieve this field?

You're right. In my concrete use case it does not need to be stored.

> Be sure to null out any values for any fields that are sourced by copy fields. Otherwise, yes, duplicated values would be exactly what you should expect.

Yes, I will do that.

> Is there any reason that you can't simply use atomic update - create a new document with the same document id but with only "set" values for the fields to be changed? There is also "add" for multivalued fields. There isn't great doc for this. Basically, the value for every non-ID field would be a Map object (HashMap) with a "set" key whose value is the new field value. Here's a code fragment for setting one field:
>
> SolrInputDocument doc2 = new SolrInputDocument();
> Map<String, String> fpValue2 = new HashMap<String, String>();
> fpValue2.put("set", fp2);
> doc2.setField("FACTURES_PRODUIT", fpValue2);
>
> You need a separate Map object for each field to be set or added for appending to a multivalued field. And you need a simple (non-Map) value for your ID field.

Thanks for the info! The code is a lot older than Solr 4.0, so that option was not available at the time of its writing. I will check if it makes sense to use that feature. Most likely yes.

Robert
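Completing Jack's fragment for reference, with the ID field and the update call added (imports included; "server" stands for the SolrServer instance, docId for the existing document's ID, and fp2 is the new value as in his fragment):

import java.util.HashMap;
import java.util.Map;
import org.apache.solr.common.SolrInputDocument;

SolrInputDocument doc2 = new SolrInputDocument();
doc2.setField("id", docId);                  // plain (non-Map) value for the ID field
Map<String, String> fpValue2 = new HashMap<String, String>();
fpValue2.put("set", fp2);                    // "set" replaces the stored value atomically
doc2.setField("FACTURES_PRODUIT", fpValue2);
server.add(doc2);
server.commit();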
copyField generates multiple values encountered for non multiValued field
I have the exact same problem as the guy here:

http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201105.mbox/%3C3A2B3E42FCAA4BF496AE625426C5C6E4@Wurstsemmel%3E

AFAICS he did not get an answer. Is this a known issue? What can I do, other than doing in my application what copyField is supposed to do? I am using Solr 4.0.0.

Thanks,
Robert
Re: copyField generates multiple values encountered for non multiValued field
OK, I have two fields defined as follows:

<field name="name" type="string" indexed="true" stored="true" multiValued="false" />
<field name="name2" type="string_ci" indexed="true" stored="true" multiValued="false" />

and this copyField directive:

<copyField source="name" dest="name2"/>

I updated the index using SolrJ and got the exact same error message that is in the subject. However, while waiting for feedback I built a workaround at the application level, and now, reconstructing the original state to be able to answer you, I get different behaviour. What happens now is that the field name2 is populated with multiple values although it is not defined as multiValued (see above). Although this is strange, it is consistent with the earlier problem in that copyField does not seem to overwrite the existing field values. I may be using it incorrectly (it's the first time I am using copyField), but the docs in the wiki did not say anything about an overwrite option.

Cheers, Robert

On Wed, Jun 5, 2013 at 5:16 PM, Jack Krupansky <j...@basetechnology.com> wrote:

> Try describing your own symptom in your own words - because his issue
> related to Solr 1.4. I mean, where exactly are you setting
> allowDuplicates=false?? And why do you think it has anything to do with
> adding documents to Solr? Solr 1.4 did not have atomic update, so sending
> the exact same document twice would not result in a change in the index
> (unless you had a date field with a value of NOW.) Copy field only uses
> values from the current document.
>
> -- Jack Krupansky
>
> -----Original Message----- From: Robert Krüger
> Sent: Wednesday, June 05, 2013 10:37 AM
> To: solr-user@lucene.apache.org
> Subject: copyField generates multiple values encountered for non
> multiValued field
>
> I have the exact same problem as the guy here:
> http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201105.mbox/%3C3A2B3E42FCAA4BF496AE625426C5C6E4@Wurstsemmel%3E
> AFAICS he did not get an answer. Is this a known issue? What can I do,
> other than reimplementing in my application what copyField is supposed
> to do? I am using Solr 4.0.0. Thanks, Robert
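A minimal sketch of the read-modify-re-add workaround being discussed here: before re-adding the fetched document, drop the copyField target so copyField can repopulate it. Field names match the schema above; "server", the query and the new value are made up for illustration:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.common.SolrDocument;
    import org.apache.solr.common.SolrInputDocument;

    public class ReAddExample {
        static void readModifyReAdd(SolrServer server, String id) throws Exception {
            SolrQuery q = new SolrQuery("id:" + id);
            SolrDocument current = server.query(q).getResults().get(0);
            // copy the stored fields into a new input document
            SolrInputDocument doc = new SolrInputDocument();
            for (String field : current.getFieldNames()) {
                doc.addField(field, current.getFieldValue(field));
            }
            // name2 is filled by copyField from name; dropping it before the
            // re-add avoids the "multiple values" error on the next copy
            doc.removeField("name2");
            doc.setField("name", "new value"); // the actual modification
            server.add(doc);
            server.commit();
        }
    }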
Re: uniqueKey not enforced
On Tue, Oct 23, 2012 at 2:37 PM, Erick Erickson <erickerick...@gmail.com> wrote:

> From left field: Try looking at your admin/schema browser page for the
> ID in question. That actually gets stuff out of your index (the actual
> indexed terms). See if you have two values for that ID.

I'm running embedded, so I don't have that. However, I have a simple UI for performing queries, and the duplicate records are displayed when issuing a *:* query.

> In which case you _might_ have spaces before or after the value somehow.
> I notice your comment says something about "computed", so... Since String
> types are totally unanalyzed, spaces would count.

No, the way the id is computed cannot lead to leading or trailing whitespace.

> you can also use the TermsComponent to see what's there, see:
> http://wiki.apache.org/solr/TermsComponent

I'll take a look.

> Best
> Erick

Thanks, Robert
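A sketch of the TermsComponent check from SolrJ, which works against an EmbeddedSolrServer as well. It assumes a /terms handler is configured, as in the stock example solrconfig; the prefix argument is made up:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.client.solrj.response.TermsResponse;

    public class TermsCheck {
        static void dumpIdTerms(SolrServer server, String prefix) throws Exception {
            SolrQuery q = new SolrQuery();
            q.set("qt", "/terms");        // route to the TermsComponent handler
            q.set("terms", "true");
            q.set("terms.fl", "id");
            q.set("terms.prefix", prefix);
            QueryResponse rsp = server.query(q);
            TermsResponse terms = rsp.getTermsResponse();
            // a term with frequency > 1 means two documents share that id
            for (TermsResponse.Term t : terms.getTerms("id")) {
                System.out.println(t.getTerm() + " -> " + t.getFrequency());
            }
        }
    }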
Re: uniqueKey not enforced
On Wed, Oct 24, 2012 at 2:03 PM, Erick Erickson <erickerick...@gmail.com> wrote:

> Robert: But you do have an index somewhere, so the alternative for
> looking at it low-level would be to get a copy of Luke and point it at
> your index. Very useful tool.

I will do that next time I have that condition. Unfortunately I didn't back up the index files when it happened. Thanks for the advice!

Robert
uniqueKey not enforced
Hi, I noticed a duplicate entry in my index and I am wondering how that can be, because I have a uniqueKey defined. I have the following in my schema.xml:

<?xml version="1.0" ?>
<schema name="main core" version="1.1">
  <types>
    <fieldtype name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>
    <!-- other field types omitted here ... -->
  </types>
  <fields>
    <!-- general -->
    <!-- id computed as a combination of media id and path -->
    <field name="id" type="string" indexed="true" stored="true" multiValued="false" />
    <!-- other fields omitted here ... -->
  </fields>
  <!-- field to use to determine and enforce document uniqueness. -->
  <uniqueKey>id</uniqueKey>
  <!-- field for the QueryParser to use when an explicit fieldname is absent -->
  <defaultSearchField>name</defaultSearchField>
  <!-- SolrQueryParser configuration: defaultOperator="AND|OR" -->
  <solrQueryParser defaultOperator="OR"/>
</schema>

And now I have two records which both have the value 4b34b883-a9d9-428a-92c3-ba1a69d96a70:/Düsendrögl in the id field. Is it the non-ASCII chars that cause the uniqueness enforcement to fail? I am using Solr 3.6.1. Any ideas what's going on?

Thanks, Robert
Re: uniqueKey not enforced
On Mon, Oct 22, 2012 at 2:08 PM, Jack Krupansky <j...@basetechnology.com> wrote:

> Which release of Solr?

3.6.1

> Is this a single node Solr or distributed or cloud?

single node, actually embedded in an application.

> Is it possible that you added documents with the overwrite=false
> attribute? That would suppress the uniqueness test.

no, I just used SolrServer.add(Collection<SolrInputDocument> docs)

> Is it possible that you added those documents before adding the
> uniqueKey element to your schema, or added uniqueKey but did not restart
> Solr before adding those documents?

no, the element has been there for months, and the index was created from scratch just before the test

> One minor difference from the Solr example schema is that your id field
> does not have required=true. I don't think that should matter (Solr will
> force the uniqueKey field to be required in documents), but I am curious
> how you managed to get an id field different from the Solr example.

so am I ;-). I will add the required attribute, though. It cannot hurt.
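For reference, a sketch of the two add paths Jack is asking about: the default add(), which replaces any existing document with the same uniqueKey, and an update with overwrite=false, which skips the uniqueness check and would produce exactly the symptom in this thread. The parameter usage is an assumption based on the update handler's documented overwrite flag:

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.request.UpdateRequest;
    import org.apache.solr.common.SolrInputDocument;

    public class OverwriteExample {
        static void addTwice(SolrServer server) throws Exception {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "some-id");
            doc.addField("name", "some value");

            // default path: the second add replaces the first document, so
            // only one document with id "some-id" remains after the commit
            server.add(doc);
            server.add(doc);

            // overwrite=false path: duplicates are NOT detected
            UpdateRequest req = new UpdateRequest();
            req.setParam("overwrite", "false");
            req.add(doc);
            req.process(server);

            server.commit();
        }
    }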
Re: uniqueKey not enforced
On Mon, Oct 22, 2012 at 6:01 PM, Jack Krupansky <j...@basetechnology.com> wrote:

> And, are you using UUIDs or providing specific key values?

specific key values
Re: Config parameters to tweak for update performance
On Tue, Oct 16, 2012 at 4:13 PM, Shawn Heisey <s...@elyograg.org> wrote:

> On 10/16/2012 5:38 AM, Robert Krüger wrote:
>> I use Solr embedded in a desktop app and due to the consistency
>> requirements of the application I have to commit rather often. Are there
>> some best practices on how to optimize commit performance via the
>> configuration? I could easily live with slower queries or more memory
>> use, as my index is rather small and queries are more than fast enough
>> for my application.
>
> As I understand it, the new softCommit in Solr 4.0 is designed to address
> this exact issue -- near realtime search. Basically new data is committed
> to RAM, which happens super-fast. On a longer interval, you issue a hard
> commit, which writes the data in RAM to disk. For consistency through
> failures, you may need to have your application keep track of the last
> document successfully hard committed. Solr 4.0 has a new updateLog
> capability that might make this unnecessary; someone with more experience
> will have to confirm or deny that.
>
> Thanks, Shawn

Great info! Thanks! I'll take a look at 4.0 then.
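A minimal SolrJ 4.0 sketch of the pattern Shawn describes - a cheap soft commit after each update for visibility, plus an occasional hard commit for durability. The split into two methods and the timer interval are assumptions:

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class CommitExample {
        static void addVisibleImmediately(SolrServer server, SolrInputDocument doc)
                throws Exception {
            server.add(doc);
            // waitFlush=false, waitSearcher=true, softCommit=true:
            // the document becomes searchable without an expensive fsync
            server.commit(false, true, true);
        }

        static void periodicHardCommit(SolrServer server) throws Exception {
            // call this on a timer (e.g. every 60s) to persist
            // RAM-buffered documents to disk
            server.commit(true, true, false);
        }
    }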
Config parameters to tweak for update performance
Hi, I use solr embedded in a desktop app and due to the consistency requirements of the application I have to commit rather often. Are there some best practices on how to optimize commit performance via the configuration? I could easily live with slower queries or more memory use as my index is rather small and queries are more than fast enough for my application. Thanks in advance, Robert
Re: Can I rely on correct handling of interrupted status of threads?
On Tue, Oct 2, 2012 at 11:48 AM, Robert Krüger <krue...@lesspain.de> wrote:

> Hi, I'm using Solr 3.6.1 in an application embedded directly, i.e. via
> EmbeddedSolrServer, not over an HTTP connection, which works perfectly.
> Our application uses Thread.interrupt() for canceling long-running tasks
> (e.g. through Future.cancel). A while (and a few Solr versions) back a
> colleague of mine implemented a workaround because he said that Solr
> didn't handle the thread's interrupted status correctly, i.e. not setting
> the interrupted status after having caught an InterruptedException or
> rethrowing it, thus killing the information that an interrupt has been
> requested, which breaks libraries relying on that. However, I did not
> find anything up-to-date in mailing list or forum archives on the web.
> Is that still or was it ever the case? What does one have to watch out
> for when interrupting a thread that is doing anything within Solr/Lucene?
> Any advice would be appreciated. Regards, Robert

Just in case anyone else has the same question: you cannot. Thread interruption is not handled properly, so you should not use Solr in code that you plan to interrupt via Thread.interrupt(). You will get things like IOExceptions, so you cannot cleanly tell the difference between real errors and failures caused by interruption. Reading old Jira issues, it looks as if this was a conscious decision because of the amount of work it would have caused in API design changes and the many checked exceptions the caller would have to handle.
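For context, the convention referred to here: a library that catches InterruptedException without rethrowing it is expected to restore the flag, roughly like this (a generic illustration, not Solr code):

    public class InterruptExample {
        // the cooperative pattern: re-assert the interrupt flag so callers
        // (e.g. code using Future.cancel) can still detect the cancellation
        static void waitForWork(Object lock) {
            synchronized (lock) {
                try {
                    lock.wait();
                } catch (InterruptedException e) {
                    // restore the flag instead of swallowing it
                    Thread.currentThread().interrupt();
                }
            }
        }
    }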
Re: Can I rely on correct handling of interrupted status of threads?
On Tue, Oct 2, 2012 at 8:50 PM, Mikhail Khludnev <mkhlud...@griddynamics.com> wrote:

> I remember a bug in EmbeddedSolrServer at 1.4.1 where an exception
> bypassed request closing, which led to a searcher leak and OOM. It was
> fixed about two years ago.

You mean InterruptedException?
Can I rely on correct handling of interrupted status of threads?
Hi, I'm using Solr 3.6.1 in an application embedded directly, i.e. via EmbeddedSolrServer, not over an HTTP connection, which works perfectly. Our application uses Thread.interrupt() for canceling long-running tasks (e.g. through Future.cancel). A while (and a few Solr versions) back a colleague of mine implemented a workaround because he said that Solr didn't handle the thread's interrupted status correctly, i.e. not setting the interrupted status after having caught an InterruptedException or rethrowing it, thus killing the information that an interrupt has been requested, which breaks libraries relying on that. However, I did not find anything up-to-date in mailing list or forum archives on the web. Is that still or was it ever the case? What does one have to watch out for when interrupting a thread that is doing anything within Solr/Lucene? Any advice would be appreciated. Regards, Robert
How to Index and query URLs as fields
Hi, I've run into problems trying to achieve a seemingly simple thing. I'm indexing a bunch of files (local ones and potentially some accessible via other protocols like http or ftp) and have an index field with the URL of the file, e.g. file:/home/foo/bar.pdf. Now I want to perform two simple types of queries on this: retrieve all file records located under a certain path (e.g. file:/home/foo/*) or find the file record for an exact URL. What I naively tried was to index the file URL in a field (fileURL) of type string and simply perform queries like

fileURL:file\:/home/foo/*

and

fileURL:file\:/home/foo/bar.pdf

and neither one returned results. The type is defined as

<fieldtype name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>

and the field as

<field name="fileURL" type="string" indexed="true" stored="true" multiValued="false" />

I am using Solr 1.4.1 and use SolrJ to do the indexing and querying. This seems like a rather basic requirement, and obviously I am doing something wrong. I didn't find anything in the docs or the mailing list archive so far. Any help, hints, or pointers would be appreciated.

Robert
Re: How to Index and query URLs as fields
My mistake. The error turned out to be somewhere else and the described approach seems to work. Sorry for the wasted bandwidth.

On Mar 8, 2011, at 11:06 AM, Robert Krüger wrote:

> Hi, I've run into problems trying to achieve a seemingly simple thing.
> I'm indexing a bunch of files (local ones and potentially some accessible
> via other protocols like http or ftp) and have an index field with the
> URL of the file, e.g. file:/home/foo/bar.pdf. Now I want to perform two
> simple types of queries on this: retrieve all file records located under
> a certain path (e.g. file:/home/foo/*) or find the file record for an
> exact URL. What I naively tried was to index the file URL in a field
> (fileURL) of type string and simply perform queries like
>
> fileURL:file\:/home/foo/*
>
> and
>
> fileURL:file\:/home/foo/bar.pdf
>
> and neither one returned results. The type is defined as
>
> <fieldtype name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>
>
> and the field as
>
> <field name="fileURL" type="string" indexed="true" stored="true" multiValued="false" />
>
> I am using Solr 1.4.1 and use SolrJ to do the indexing and querying.
> This seems like a rather basic requirement, and obviously I am doing
> something wrong. I didn't find anything in the docs or the mailing list
> archive so far. Any help, hints, or pointers would be appreciated.
>
> Robert
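For the record, a sketch of the working approach in SolrJ: escape the URL's query-syntax characters and append the wildcard afterwards, so the * itself stays unescaped. ClientUtils.escapeQueryChars is available in later SolrJ releases; the field name matches the schema above:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.util.ClientUtils;
    import org.apache.solr.common.SolrDocumentList;

    public class UrlQueryExample {
        // exact match on the stored URL
        static SolrDocumentList byExactUrl(SolrServer server, String url)
                throws Exception {
            String q = "fileURL:" + ClientUtils.escapeQueryChars(url);
            return server.query(new SolrQuery(q)).getResults();
        }

        // all records under a path prefix; the trailing * must stay unescaped
        static SolrDocumentList byPathPrefix(SolrServer server, String prefix)
                throws Exception {
            String q = "fileURL:" + ClientUtils.escapeQueryChars(prefix) + "*";
            return server.query(new SolrQuery(q)).getResults();
        }
    }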
Pragmatic more or less high availability option on 2 servers
Hi, I have to set up a Solr cluster with some availability concept (it is allowed to require manual intervention on fault; however, if there is a better way, I'd be interested in recommendations). I have two servers (A and B for the example) at my disposal. What I was thinking about was the following setup with two Solr instances on each server, i.e.

on A)
- One master (active)
- One slave (always active) replicating the index from the master using the mechanisms described in http://wiki.apache.org/solr/SolrReplication

on B)
- One master (active but not used for index updates) replicating the index from the master on A
- One slave (always active) replicating the index from the master on A

I'll write the configs with placeholders for properties for the address of the master server, so I can start the slaves with either the master on A or the one on B as the master to replicate from. When B goes down, nothing happens, as the slave on A is still there to serve queries, as is the master to serve update requests. When A goes down, updates will begin to fail, but queries will still be served from the slave on B. A restart of B with the until-then-inactive master acting as master will bring index update functionality up again.

For simplicity's sake I did not mention any load-balancing or IP address switching. I assume that I have to load-balance the access to the two slaves and make sure that, when the master is switched to B, either B assumes the master's IP address or the clients updating the index are reconfigured (the former probably being simpler).

Now, my questions are:
- Will this work?
- Can this be simplified? I.e., does it make sense to have separate master/slave instances when they are on the same hardware, or is it enough or even better to write the config of the slave on B so that it can be restarted in master mode based on the same index?

Thanks in advance, Robert
Re: Boosting by date when only some records have one
Hi Lance, thanks for your feedback! The problem with your suggestion is that we do not want to exclude the documents without a date but boost them differently, which Chris has provided a solution for (see other posts in this thread).

Cheers, Robert

Norskog, Lance wrote:

> This query: field:[* TO *] gives the entries which have a value for that
> field. Documents with nothing in 'field' will not return. To get the
> opposite set: *:* -field:[* TO *]
>
> Does this help?
>
> Lance
>
> -----Original Message----- From: Robert Krüger [mailto:krue...@signal7.de]
> Sent: Saturday, December 13, 2008 4:22 AM
> To: solr-user@lucene.apache.org
> Subject: Boosting by date when only some records have one
>
> Hi, I'm looking for a way to boost queries by date, but not all documents
> have a date associated with them. My goal is to have something like a
> default boost for documents (e.g. 1.0), with a function for documents
> with dates that distributes the boosts between 1.0 - x and 1.0 + x based
> on a valid date range. So far I have not been able to formulate a query
> that does that. The query I tried returned documents as results which,
> without the boost function, weren't returned, which renders it useless.
> I have criteria in the schema based on which I know whether a document is
> supposed to be boosted by date or not. So in pseudocode, what I want is:
>
> if (document.dateField != null) {
>     boost = somefunction(document.dateField);
> } else {
>     boost = 1;
> }
>
> or (using another field as criterion)
>
> if (document.hasDateField == 1) {
>     boost = somefunction(document.dateField);
> } else {
>     boost = 1;
> }
>
> Any suggestions or sample queries, anyone? Many thanks in advance, Robert
Re: Boosting by date when only some records have one
Hi, thanks a lot! Looks like what I need, except that I cannot use dismax because I need to be able to do prefix queries. I'm new to Solr, so I don't know if there's a way to formulate this in a standard query. If not, I could extend DismaxRequestHandler so it doesn't escape the *s, right?

Robert

Chris Hostetter wrote:

> : if(document.hasDateField == 1){
> :     boost = somefunction(document.dateField);
> : } else{
> :     boost = 1;
> : }
>
> bq = ( ( +hasDateField:true _val_:somefunc(dateField) ) ( -hasDateField:true *:*^1 ) )
>
> That covers the possibility that hasDateField is not set for some docs.
> The query gets simpler if you can concretely know that hasDateField will
> always have a value of true or false...
>
> bq = ( hasDateField:false^1 ( +hasDateField:true _val_:somefunc(dateField) ) )
>
> -Hoss
Is making removal of wildcards configurable planned for DisMaxRequestHandler
Hi, I'm rather new to Solr and for my current projects came to the conclusion that DisMaxRequestHandler is exactly the tool I need, except that it doesn't allow prefix queries. I found a thread in the archive where someone mentioned the idea of making this behaviour configurable (which characters to strip from the query parameter). Is someone working on this, or is my best option currently to implement this behaviour by copying code from DisMaxRequestHandler and modifying the part that strips illegal operators? Thanks in advance, Robert
Re: Boosting by date when only some records have one
Chris Hostetter wrote:

> : thanks a lot! Looks like what I need except that I cannot use dismax
> : because I need to be able to do prefix queries. I'm new to Solr, so I
>
> there's nothing dismax related in that syntax, I just suggested using it
> in a bq param because I assumed that's what you were using.
>
> q = +pre:fix* (hasDateField:false^1 (+hasDateField:true _val_:somefunc(dateField)))
>
> : bq = ( hasDateField:false^1 ( +hasDateField:true _val_:somefunc(dateField) ) )
>
> -Hoss

perfect, thanks!!

Robert
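A sketch of issuing Hoss's combined query from SolrJ. The recip(rord(...)) function stands in for the unspecified somefunc (it is the classic date-boost example from the Solr wiki), and the prefix field name is made up; both are assumptions:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class DateBoostQuery {
        static QueryResponse search(SolrServer server, String prefix) throws Exception {
            // docs without a date get a flat boost; docs with one are
            // boosted by a function of dateField
            String q = "+name:" + prefix + "* "
                     + "(hasDateField:false^1 "
                     + "(+hasDateField:true _val_:\"recip(rord(dateField),1,1000,1000)\"))";
            return server.query(new SolrQuery(q));
        }
    }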
Boosting by date when only some records have one
Hi, I'm looking for a way to boost queries by date, but not all documents have a date associated with them. My goal is to have something like a default boost for documents (e.g. 1.0), with a function for documents with dates that distributes the boosts between 1.0 - x and 1.0 + x based on a valid date range. So far I have not been able to formulate a query that does that. The query I tried returned documents as results which, without the boost function, weren't returned, which renders it useless. I have criteria in the schema based on which I know whether a document is supposed to be boosted by date or not. So in pseudocode, what I want is:

if (document.dateField != null) {
    boost = somefunction(document.dateField);
} else {
    boost = 1;
}

or (using another field as criterion)

if (document.hasDateField == 1) {
    boost = somefunction(document.dateField);
} else {
    boost = 1;
}

Any suggestions or sample queries, anyone? Many thanks in advance, Robert