LIRE Solr plugin updated to 4.10.2 and new demo ...
Hi all! After the initial release I finally came around to updating the content based image retrieval plugin LIRE Solr to the current version, and it has been extended to support more CBIR features. https://bitbucket.org/dermotte/liresolr I also took the liberty to update the web client and the demo installation, so feel free to try it; it's now running on a shared server featuring a million photos. The new features are image-similarity-based re-ranking -- you first search for a tag in a text box and then re-rank the results based on an example image -- and a new way of content based search utilizing the standard handler but scoring with a function (it's slower, but more flexible). Feel free to test it and let me know what you think :) http://demo-itec.uni-klu.ac.at/liredemo/ cheers, Mathias -- Priv.-Doz. Dr. Dipl.-Ing. Mathias Lux Associate Professor at Klagenfurt University, Austria http://tinyurl.com/mlux-itec ... contact and cv
Re: DataImport Handler, writing a new EntityProcessor
Hi! Thanks for all the advice! I finally did it; the most annoying error, which took me the best part of a day to figure out, was that the state variable here had to be reset: https://bitbucket.org/dermotte/liresolr/src/d27878a71c63842cb72b84162b599d99c4408965/src/main/java/net/semanticmetadata/lire/solr/LireEntityProcessor.java?at=master#cl-56 The EntityProcessor is part of this image search plugin if anyone is interested: https://bitbucket.org/dermotte/liresolr/ :) It's always the small things that are hard to find. cheers and thanks, Mathias On Wed, Dec 18, 2013 at 7:26 PM, P Williams williams.tricia.l...@gmail.com wrote: Hi Mathias, I'd recommend testing one thing at a time. See if you can get it to work for one image before you try a directory of images. Also try testing using the solr-testframework in your IDE (I use Eclipse) to debug, rather than your browser/print statements. Hopefully that will give you some more specific knowledge of what's happening around your plugin. I also wrote an EntityProcessor plugin to read from a properties file (https://issues.apache.org/jira/browse/SOLR-3928). Hopefully that'll give you some insight about this kind of Solr plugin and testing them. Cheers, Tricia On Wed, Dec 18, 2013 at 3:03 AM, Mathias Lux m...@itec.uni-klu.ac.at wrote: Hi all! I've got a question regarding writing a new EntityProcessor, in the same sense as the Tika one. My EntityProcessor should analyze jpg images and create document fields to be used with the LIRE Solr plugin (https://bitbucket.org/dermotte/liresolr). Basically I've taken the same approach as the TikaEntityProcessor, but my setup just indexes the first of 1000 images. I'm using a FileListEntityProcessor to get all JPEGs from a directory and then I'm handing them over (see [2]). My code for the EntityProcessor is at [1]. I've tried to use the DataSource as well as the filePath attribute, but it ends up all the same.
However, the FileListEntityProcessor is able to read all the files according to the debug output, but I'm missing the link from the FileListEntityProcessor to the LireEntityProcessor. I'd appreciate any pointer or help :) cheers, Mathias [1] LireEntityProcessor http://pastebin.com/JFajkNtf [2] dataConfig http://pastebin.com/vSHucatJ -- Dr. Mathias Lux Klagenfurt University, Austria http://tinyurl.com/mlux-itec -- PD Dr. Mathias Lux Klagenfurt University, Austria http://tinyurl.com/mlux-itec
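The state-reset bug described above can be illustrated without any Solr dependency. The sketch below is a made-up stand-in (the class and method names are hypothetical, not the actual LireEntityProcessor code): a nested processor that flags itself as done after emitting its row must clear that flag each time the parent hands over a new source, otherwise every file after the first is silently skipped.

```java
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for an entity processor that emits one row per source.
// The bug: if 'done' is never reset between sources, only the first source
// ever produces a row.
class OneRowProcessor {
    private boolean done = false;

    // Called once per parent row; per-source state MUST be reset here.
    void init(boolean resetState) {
        if (resetState) {
            done = false; // the fix: clear the flag for the new source
        }
    }

    Map<String, Object> nextRow(String source) {
        if (done) {
            return null; // signals "no more rows for this source"
        }
        done = true;
        return Map.of("file", source);
    }
}

public class StateResetDemo {
    static int processedCount(boolean resetState) {
        OneRowProcessor p = new OneRowProcessor();
        int count = 0;
        for (String file : List.of("a.jpg", "b.jpg", "c.jpg")) {
            p.init(resetState);
            while (p.nextRow(file) != null) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(processedCount(false)); // buggy: 1
        System.out.println(processedCount(true));  // fixed: 3
    }
}
```

With the reset in place, all three files produce a row; without it, only the first does -- matching the "just the first of 1000 images" symptom in the thread.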
DataImport Handler, writing a new EntityProcessor
Hi all! I've got a question regarding writing a new EntityProcessor, in the same sense as the Tika one. My EntityProcessor should analyze jpg images and create document fields to be used with the LIRE Solr plugin (https://bitbucket.org/dermotte/liresolr). Basically I've taken the same approach as the TikaEntityProcessor, but my setup just indexes the first of 1000 images. I'm using a FileListEntityProcessor to get all JPEGs from a directory and then I'm handing them over (see [2]). My code for the EntityProcessor is at [1]. I've tried to use the DataSource as well as the filePath attribute, but it ends up all the same. However, the FileListEntityProcessor is able to read all the files according to the debug output, but I'm missing the link from the FileListEntityProcessor to the LireEntityProcessor. I'd appreciate any pointer or help :) cheers, Mathias [1] LireEntityProcessor http://pastebin.com/JFajkNtf [2] dataConfig http://pastebin.com/vSHucatJ -- Dr. Mathias Lux Klagenfurt University, Austria http://tinyurl.com/mlux-itec
Re: DataImport Handler, writing a new EntityProcessor
Unfortunately it is the same in non-debug, just the first document. I also print the params to stdout, but it seems only the first one ever arrives at my custom class. I have the feeling that I'm doing something seriously wrong here, based on a complete misunderstanding :) I basically assume that the nested entity processor will be called for each of the rows that come out from its parent. I've read somewhere that the data has to be taken from the data source, and I've implemented that, but it doesn't seem to change anything. cheers, Mathias On Wed, Dec 18, 2013 at 3:05 PM, Dyer, James james.d...@ingramcontent.com wrote: The first thing I would suggest is to try and run it not in debug mode. DIH's debug mode limits the number of documents it will take in, so that might be all that is wrong here. James Dyer Ingram Content Group (615) 213-4311 -Original Message- From: mathias@gmail.com [mailto:mathias@gmail.com] On Behalf Of Mathias Lux Sent: Wednesday, December 18, 2013 4:04 AM To: solr-user@lucene.apache.org Subject: DataImport Handler, writing a new EntityProcessor Hi all! I've got a question regarding writing a new EntityProcessor, in the same sense as the Tika one. My EntityProcessor should analyze jpg images and create document fields to be used with the LIRE Solr plugin (https://bitbucket.org/dermotte/liresolr). Basically I've taken the same approach as the TikaEntityProcessor, but my setup just indexes the first of 1000 images. I'm using a FileListEntityProcessor to get all JPEGs from a directory and then I'm handing them over (see [2]). My code for the EntityProcessor is at [1]. I've tried to use the DataSource as well as the filePath attribute, but it ends up all the same. However, the FileListEntityProcessor is able to read all the files according to the debug output, but I'm missing the link from the FileListEntityProcessor to the LireEntityProcessor.
I'd appreciate any pointer or help :) cheers, Mathias [1] LireEntityProcessor http://pastebin.com/JFajkNtf [2] dataConfig http://pastebin.com/vSHucatJ -- Dr. Mathias Lux Klagenfurt University, Austria http://tinyurl.com/mlux-itec -- PD Dr. Mathias Lux Klagenfurt University, Austria http://tinyurl.com/mlux-itec
Re: Query result caching with custom functions
Hi Joel, I just tested with custom equals and hashCode ... what I basically did is that I created a string object based on all the function values and used this for equals (with an instanceof check) and for the hash method. The result was quite the same as before; all results are cached unless I set the queryResultCache size to 0 in solrconfig.xml. cheers, Mathias On Thu, Oct 24, 2013 at 4:51 PM, Joel Bernstein joels...@gmail.com wrote: Mathias, I'd have to do a close review of the function sort code to be sure, but I suspect if you implement the equals() method on the ValueSource it should solve your caching issue. Also implement hashCode(). Joel On Thu, Oct 24, 2013 at 10:35 AM, Shawn Heisey s...@elyograg.org wrote: On 10/24/2013 5:35 AM, Mathias Lux wrote: I've written a custom function, which is able to provide a distance based on some DocValues to re-sort result lists. This basically works great, but we've got the problem that if I don't change the query, but the function parameters, Solr delivers a cached result without re-ordering. I turned off caching and see there, problem solved. But of course this is not an avenue I want to pursue further, as it doesn't make sense for a productive system. Do you have any ideas (beyond fake query modification and turning off caching) to counteract? btw. I'm using Solr 4.4 (so if you are aware of the issue and it has been resolved in 4.5, I'll port it :) The code I'm using is at https://bitbucket.org/dermotte/liresolr I suspect that the queryResultCache is not paying attention to the fact that parameters for your plugin have changed. This probably means that your plugin must somehow inform the cache check code that something HAS changed. How you actually do this is a mystery to me because it involves parts of the code that are beyond my understanding, but it MIGHT involve making sure that parameters related to your code are saved as part of the entry that goes into the cache. Thanks, Shawn -- PD Dr.
Mathias Lux Associate Professor, Klagenfurt University, Austria http://tinyurl.com/mlux-itec
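Joel's advice about equals()/hashCode() comes down to how a result cache keys its entries: if the objects making up the cache key compare equal even though a function argument changed, a stale entry is returned. A plain-Java sketch of exactly that failure mode (all class names here are made up; this is not Solr's actual cache-key code):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical cache key for a function query. BROKEN version: equals/hashCode
// ignore the function argument, so queries differing only in the argument
// collide in the cache -- the behavior described in the thread.
class BrokenKey {
    final String functionName;
    final String argument; // e.g. the base64-encoded reference image
    BrokenKey(String functionName, String argument) {
        this.functionName = functionName;
        this.argument = argument;
    }
    @Override public boolean equals(Object o) {
        return o instanceof BrokenKey && ((BrokenKey) o).functionName.equals(functionName);
    }
    @Override public int hashCode() { return functionName.hashCode(); }
}

// FIXED version: the argument participates in equality.
class FixedKey extends BrokenKey {
    FixedKey(String functionName, String argument) { super(functionName, argument); }
    @Override public boolean equals(Object o) {
        return o instanceof FixedKey
            && ((FixedKey) o).functionName.equals(functionName)
            && ((FixedKey) o).argument.equals(argument);
    }
    @Override public int hashCode() { return functionName.hashCode() * 31 + argument.hashCode(); }
}

public class CacheKeyDemo {
    static String lookup(Map<Object, String> cache, Object key, String freshResult) {
        return cache.computeIfAbsent(key, k -> freshResult);
    }
    public static void main(String[] args) {
        Map<Object, String> cache = new HashMap<>();
        lookup(cache, new BrokenKey("lirefunc", "imgA"), "ranked for imgA");
        String stale = lookup(cache, new BrokenKey("lirefunc", "imgB"), "ranked for imgB");
        System.out.println(stale); // stale: "ranked for imgA"

        cache.clear();
        lookup(cache, new FixedKey("lirefunc", "imgA"), "ranked for imgA");
        String fresh = lookup(cache, new FixedKey("lirefunc", "imgB"), "ranked for imgB");
        System.out.println(fresh); // fresh: "ranked for imgB"
    }
}
```

The takeaway: whatever string or fields drive the distance computation must be part of both equals() and hashCode(), or changed arguments will hit the old cached entry.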
Query result caching with custom functions
Hi all! Got a question on the Solr cache :) I've written a custom function, which is able to provide a distance based on some DocValues to re-sort result lists. This basically works great, but we've got the problem that if I don't change the query, but the function parameters, Solr delivers a cached result without re-ordering. I turned off caching and see there, problem solved. But of course this is not an avenue I want to pursue further, as it doesn't make sense for a productive system. Do you have any ideas (beyond fake query modification and turning off caching) to counteract? btw. I'm using Solr 4.4 (so if you are aware of the issue and it has been resolved in 4.5, I'll port it :) The code I'm using is at https://bitbucket.org/dermotte/liresolr regards, Mathias -- Dr. Mathias Lux Assistant Professor, Klagenfurt University, Austria http://tinyurl.com/mlux-itec
Re: Query result caching with custom functions
That's a possibility, I'll try that and report on the effects. Thanks, Mathias On 24.10.2013 16:52, Joel Bernstein joels...@gmail.com wrote: Mathias, I'd have to do a close review of the function sort code to be sure, but I suspect if you implement the equals() method on the ValueSource it should solve your caching issue. Also implement hashCode(). Joel On Thu, Oct 24, 2013 at 10:35 AM, Shawn Heisey s...@elyograg.org wrote: On 10/24/2013 5:35 AM, Mathias Lux wrote: I've written a custom function, which is able to provide a distance based on some DocValues to re-sort result lists. This basically works great, but we've got the problem that if I don't change the query, but the function parameters, Solr delivers a cached result without re-ordering. I turned off caching and see there, problem solved. But of course this is not an avenue I want to pursue further, as it doesn't make sense for a productive system. Do you have any ideas (beyond fake query modification and turning off caching) to counteract? btw. I'm using Solr 4.4 (so if you are aware of the issue and it has been resolved in 4.5, I'll port it :) The code I'm using is at https://bitbucket.org/dermotte/liresolr I suspect that the queryResultCache is not paying attention to the fact that parameters for your plugin have changed. This probably means that your plugin must somehow inform the cache check code that something HAS changed. How you actually do this is a mystery to me because it involves parts of the code that are beyond my understanding, but it MIGHT involve making sure that parameters related to your code are saved as part of the entry that goes into the cache. Thanks, Shawn
Re: Re-Ranking results based on DocValues with custom function.
Got it! Just sharing ... and maybe for inclusion in the Java API docs of ValueSource :) For sorting one needs to implement the method public double doubleVal(int) of the class ValueSource; then it works like a charm. cheers, Mathias On Tue, Sep 17, 2013 at 6:28 PM, Chris Hostetter hossman_luc...@fucit.org wrote: : It basically allows for searching for text (which is associated to an : image) in an index and then getting the distance to a sample image : (base64 encoded byte[] array) based on one of five different low level : content based features stored as DocValues. very cool. : So there's one little tiny question I still have ;) When I'm trying to : do a sort I'm getting : : msg: sort param could not be parsed as a query, and is not a field : that exists in the index: : lirefunc(cl_hi,FQY5DhMYDg0ODg0PEBEPDg4ODg8QEgsgEBAQEBAgEBAQEBA=), : : for the call http://localhost:9000/solr/lire/select?q=*%3A*sort=lirefunc(cl_hi%2CFQY5DhMYDg0ODg0PEBEPDg4ODg8QEgsgEBAQEBAgEBAQEBA%3D)+ascfl=id%2Ctitle%2Clirefunc(cl_hi%2CFQY5DhMYDg0ODg0PEBEPDg4ODg8QEgsgEBAQEBAgEBAQEBA%3D)wt=jsonindent=true Hmmm... I think the crux of the issue is your string literal. Function parsing tries to make life easy for you by not requiring string literals to be quoted unless they conflict with other function names or field names etc. On top of that, the sort parsing code is kind of heuristic-based (because it has to account for both functions or field names or wildcards, followed by other sort clauses, etc...), so in that context special characters like '=' in your base64 string literal might be confusing the heuristics. Can you try to quote the string literal and see if that works? For example, when I try using strdist with your base64 string in a sort param using the example configs I get the same error... http://localhost:8983/solr/select?q=*:*sort=strdist%28name,FQY5DhMYDg0ODg0PEBEPDg4ODg8QEgsgEBAQEBAgEBAQEBA=,jw%29+asc but if I quote the string literal it works fine...
http://localhost:8983/solr/select?q=*:*sort=strdist%28name,%27FQY5DhMYDg0ODg0PEBEPDg4ODg8QEgsgEBAQEBAgEBAQEBA=%27,jw%29+asc -Hoss -- Dr. Mathias Lux Assistant Professor, Klagenfurt University, Austria http://tinyurl.com/mlux-itec
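When the request URL is built programmatically, Hoss's fix (single-quoting the base64 literal) and the URL escaping can be wrapped in one small helper. A sketch assuming nothing beyond what the thread shows -- the lirefunc name and its arguments come from the posts above:

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class SortParamDemo {
    // Build a sort parameter with the string literal single-quoted, then
    // percent-encode the whole value for use in a URL query string.
    static String sortParam(String function, String field, String base64Literal) {
        String raw = function + "(" + field + ",'" + base64Literal + "') asc";
        return URLEncoder.encode(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String sort = sortParam("lirefunc", "cl_hi",
                "FQY5DhMYDg0ODg0PEBEPDg4ODg8QEgsgEBAQEBAgEBAQEBA=");
        System.out.println("sort=" + sort);
        // The quotes keep the sort parser from tripping over '=' and ',',
        // and the percent-encoding keeps the URL itself valid.
    }
}
```

The quoted form corresponds to `sort=lirefunc(cl_hi,'...base64...') asc`, which is what made the strdist example above work.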
Re: Re-Ranking results based on DocValues with custom function.
Hi! Thanks for the directions! I got it up and running with a custom ValueSourceParser: http://pastebin.com/cz1rJn4A and a custom ValueSource: http://pastebin.com/j8mhA8e0 It basically allows for searching for text (which is associated to an image) in an index and then getting the distance to a sample image (base64 encoded byte[] array) based on one of five different low level content based features stored as DocValues. A sample result is here: http://pastebin.com/V7kL3DJh So there's one little tiny question I still have ;) When I'm trying to do a sort I'm getting msg: sort param could not be parsed as a query, and is not a field that exists in the index: lirefunc(cl_hi,FQY5DhMYDg0ODg0PEBEPDg4ODg8QEgsgEBAQEBAgEBAQEBA=), for the call http://localhost:9000/solr/lire/select?q=*%3A*sort=lirefunc(cl_hi%2CFQY5DhMYDg0ODg0PEBEPDg4ODg8QEgsgEBAQEBAgEBAQEBA%3D)+ascfl=id%2Ctitle%2Clirefunc(cl_hi%2CFQY5DhMYDg0ODg0PEBEPDg4ODg8QEgsgEBAQEBAgEBAQEBA%3D)wt=jsonindent=true cheers, Mathias On Tue, Sep 17, 2013 at 1:01 AM, Chris Hostetter hossman_luc...@fucit.org wrote: : dissimilarity functions). What I want to do is to search using common : text search and then (optionally) re-rank using some custom function : like : : http://localhost:8983/solr/select?q=*:*sort=myCustomFunction(var1) asc Can you describe what you want your custom function to look like? It may already be possible using the existing functions provided out of the box - you just need to combine them to build up the math expression... https://wiki.apache.org/solr/FunctionQuery ...if you really want to write your own, just implement ValueSourceParser and register it in solrconfig.xml... https://wiki.apache.org/solr/SolrPlugins#ValueSourceParser : I've seen that there are hooks in solrconfig.xml, but I did not find : an example or some documentation.
I'd be most grateful if anyone could : either point me to one or give me a hint for another way to go :) When writing a custom plugin like this, the best thing to do is look at the existing examples of that plugin. Almost all of the built-in ValueSourceParsers are really trivial, and can be found in tiny anonymous classes right inside ValueSourceParser.java... For example, the function to divide the results of two other functions:

addParser("div", new ValueSourceParser() {
  @Override
  public ValueSource parse(FunctionQParser fp) throws SyntaxError {
    ValueSource a = fp.parseValueSource();
    ValueSource b = fp.parseValueSource();
    return new DivFloatFunction(a, b);
  }
});

...or, if you were trying to bundle that up in your own plugin jar and register it in solrconfig.xml, you might write it something like:

public class DivideValueSourceParser extends ValueSourceParser {
  public DivideValueSourceParser() { }
  public ValueSource parse(FunctionQParser fp) throws SyntaxError {
    ValueSource a = fp.parseValueSource();
    ValueSource b = fp.parseValueSource();
    return new DivFloatFunction(a, b);
  }
}

and then register it as:

<valueSourceParser name="div" class="com.you.DivideValueSourceParser" />

Depending on your needs, you may also want to write a custom ValueSource implementation (i.e. instead of DivFloatFunction above), in which case, again, the best examples to look at are all of the existing ValueSource functions... https://lucene.apache.org/core/4_4_0/queries/org/apache/lucene/queries/function/ValueSource.html -Hoss -- Dr. Mathias Lux Assistant Professor, Klagenfurt University, Austria http://tinyurl.com/mlux-itec
Re: Scoring by document size
As the IDF values for A, B and C are minimal (couldn't get any worse than being in every document), the major part of your score most likely comes from the coord(..) part of scoring - which basically computes the overlap of the query and the document. If you want document length to have a stronger influence, you can extend and override the Similarity implementation. You might take a look at http://lucene.apache.org/core/4_4_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html cheers, Mathias On Tue, Sep 17, 2013 at 1:59 PM, Upayavira u...@odoko.co.uk wrote: Have you used debugQuery=true, or fl=*,[explain], or those various functions? It is possible to ask Solr to tell you how it calculated the score, which will enable you to see what is going on in each case. You can probably work it out for yourself then, I suspect. Upayavira On Tue, Sep 17, 2013, at 08:40 AM, blopez wrote: Hi all, I have some doubts about the Solr scoring function. I'm using all default configuration, but I'm facing a weird issue with the retrieved scores. In the schema, I'm going to focus on the only field I'm interested in.
Its definition is:

<fieldType name="text" class="solr.TextField" sortMissingLast="true" omitNorms="false">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
  </analyzer>
</fieldType>
<field name="myField" type="text" indexed="true" stored="true" required="false"/>

(omitNorms=false; otherwise the document size is not taken into account in the final score) Then, I index some documents with the following text in the 'myField' field:

doc1 = A B C
doc2 = A B C D
doc3 = A B C D E
doc4 = A B C D E F
doc5 = A B C D E F G H
doc6 = A B C D E F G H I

Finally, I perform the query 'myField:(A B C)' in order to recover all the documents, but with different scoring (doc1 is more similar to the query than doc2, which is more similar than doc3, ...). All the documents are retrieved (OK), but the scores are like this:

doc1 = 2,590214
doc2 = 2,590214
doc3 = 2,266437
doc4 = 1,94266
doc5 = 1,94266
doc6 = 1,618884

So in conclusion, as you can see the score goes down, but not the way I'd like. Doc1 is getting the same score as Doc2, even though Doc1 matches 3/3 tokens and Doc2 matches 3/4 tokens. Is this the normal Solr behaviour? Is there any way to get my expected behaviour? Thanks a lot, Borja. -- View this message in context: http://lucene.472066.n3.nabble.com/Scoring-by-document-size-tp4090523.html Sent from the Solr - User mailing list archive at Nabble.com. -- Dr. Mathias Lux Assistant Professor, Klagenfurt University, Austria http://tinyurl.com/mlux-itec
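One detail worth adding to the thread above: with omitNorms=false, Lucene folds the length norm (1/sqrt(numTerms)) into a single byte per document, and that encoding is lossy, so documents of slightly different lengths can end up with identical stored norms -- which likely explains the doc1/doc2 and doc4/doc5 ties in the scores above. A simplified sketch of the effect (this keeps only 3 mantissa bits of a float; Lucene's actual single-byte format also truncates the exponent, so the exact collision pairs differ):

```java
public class NormQuantizationDemo {
    // Keep only the top 3 mantissa bits of a float, discarding the rest --
    // a simplified stand-in for Lucene's lossy one-byte norm encoding.
    static float quantize(float f) {
        int bits = Float.floatToIntBits(f);
        bits &= ~((1 << 20) - 1); // zero the low 20 of 23 mantissa bits
        return Float.intBitsToFloat(bits);
    }

    public static void main(String[] args) {
        // Distinct length norms can collapse to the same stored value:
        System.out.println(quantize(0.51f)); // 0.5
        System.out.println(quantize(0.55f)); // 0.5
        System.out.println(quantize(0.51f) == quantize(0.55f)); // true
    }
}
```

Once two documents' norms round to the same byte, their length difference contributes nothing to the score -- only TF, IDF and coord remain, and those are identical for doc1 and doc2 here.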
Re-Ranking results based on DocValues with custom function.
Hi! I have quite a large index with a lot of text and some binary data in the documents (numeric vectors of arbitrary size with associated dissimilarity functions). What I want to do is to search using common text search and then (optionally) re-rank using some custom function like http://localhost:8983/solr/select?q=*:*sort=myCustomFunction(var1) asc I've seen that there are hooks in solrconfig.xml, but I did not find an example or some documentation. I'd be most grateful if anyone could either point me to one or give me a hint for another way to go :) Btw. using just the DocValues for search is handled by a custom RequestHandler, which works great, but with text as the main search feature and my DocValues for re-ranking, I'd rather just add a function for sorting and use the current, stable and well-performing request handler. cheers, Mathias ps. a demo of the current system is available at: http://demo-itec.uni-klu.ac.at/liredemo/ -- Dr. Mathias Lux Assistant Professor, Klagenfurt University, Austria http://tinyurl.com/mlux-itec
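The two-stage idea in this post (text search first, then optional re-ranking by a vector distance) can be sketched without any Solr classes. The field layout and the L1 distance below are illustrative choices, not the actual LIRE code:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

public class RerankDemo {
    // L1 distance between two unsigned-byte feature vectors.
    static int l1(byte[] a, byte[] b) {
        int d = 0;
        for (int i = 0; i < a.length; i++) {
            d += Math.abs((a[i] & 0xFF) - (b[i] & 0xFF));
        }
        return d;
    }

    // Reorder candidate doc ids by distance of their vector to the query vector.
    static List<String> rerank(List<String> candidates,
                               Map<String, byte[]> vectors, byte[] query) {
        List<String> out = new ArrayList<>(candidates);
        out.sort(Comparator.comparingInt((String id) -> l1(vectors.get(id), query)));
        return out;
    }

    public static void main(String[] args) {
        // Vectors stand in for the per-document DocValues payloads.
        Map<String, byte[]> vectors = Map.of(
            "doc1", new byte[]{10, 10, 10},
            "doc2", new byte[]{0, 0, 0},
            "doc3", new byte[]{9, 9, 9});
        byte[] query = {10, 10, 10};
        System.out.println(rerank(List.of("doc1", "doc2", "doc3"), vectors, query));
        // [doc1, doc3, doc2]
    }
}
```

In the Solr setup this corresponds to: the text query produces the candidate list, and the custom function (or request handler) plays the role of rerank().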
Is there a way to store binary data (byte[]) in DocValues?
Hi! I'm basically searching for a method to put byte[] data into Lucene DocValues of type BINARY (see [1]). Currently only primitives and Strings are supported according to [1]. I know that this can be done with a custom update handler, but I'd like to avoid that. cheers, Mathias [1] http://wiki.apache.org/solr/DocValues -- Dr. Mathias Lux Assistant Professor, Klagenfurt University, Austria http://tinyurl.com/mlux-itec
Re: Is there a way to store binary data (byte[]) in DocValues?
Hi! That's what I'm doing currently, but it ends up in StoredField implementations, which create an overhead on decompression I want to avoid. cheers, Mathias On Mon, Aug 12, 2013 at 3:11 PM, Raymond Wiker rwi...@gmail.com wrote: base64-encode the binary data? That will give you strings, at the expense of some storage overhead. On Mon, Aug 12, 2013 at 2:38 PM, Mathias Lux m...@itec.uni-klu.ac.atwrote: Hi! I'm basically searching for a method to put byte[] data into Lucene DocValues of type BINARY (see [1]). Currently only primitives and Strings are supported according to [1]. I know that this can be done with a custom update handler, but I'd like to avoid that. cheers, Mathias [1] http://wiki.apache.org/solr/DocValues -- Dr. Mathias Lux Assistant Professor, Klagenfurt University, Austria http://tinyurl.com/mlux-itec -- Dr. Mathias Lux Assistant Professor, Klagenfurt University, Austria http://tinyurl.com/mlux-itec
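For reference, Raymond's workaround needs no extra library on Java 8+; java.util.Base64 does the round trip, with the usual 4-characters-per-3-bytes storage overhead he mentions:

```java
import java.util.Arrays;
import java.util.Base64;

public class Base64Demo {
    public static void main(String[] args) {
        byte[] feature = {1, 2, 3, (byte) 250};       // e.g. an image descriptor
        String encoded = Base64.getEncoder().encodeToString(feature);
        byte[] decoded = Base64.getDecoder().decode(encoded);
        System.out.println(encoded);
        System.out.println(Arrays.equals(feature, decoded)); // true
        // ~33% larger in the stored field: 4 output chars per 3 input bytes,
        // plus the decode cost on every read -- the overhead Mathias wants to avoid.
    }
}
```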
Re: Is there a way to store binary data (byte[]) in DocValues?
Hi Robert, I'm basically mis-using Solr for content based image search. So I have indexed fields (hashes) for candidate selection, i.e. 1,500 candidate results retrieved with the IndexSearcher by hashes, which I then have to re-rank based on numeric vectors I'm storing in byte[] arrays. I had an implementation where this was based on the binary field, but reading from an index with a lot of small stored fields is not a good idea with the current compression approach (I've already discussed this in the Lucene user group :) BINARY is the thing for me to go for; as you said, there's nothing, just the values. Another reason for not using the SORTED_SET and SORTED implementations is that Solr currently works with Strings there, and I want to have a small memory footprint for millions of images ... which does not go well with immutables. However, I now already have a solution, which I just wanted to post here when I saw your answer. Basically I copied the source from the BinaryField and changed it to a BinaryDocValuesField (see line 68 at http://pastebin.com/dscPTwhr). This works out well for indexing when you adapt the schema to use this class:

[...]
<!-- ColorLayout -->
<field name="cl_ha" type="text_ws" indexed="true" stored="false" required="false"/>
<field name="cl_hi" type="binaryDV" indexed="false" stored="true" required="false"/>
[...]
<fieldtype name="binaryDV" class="net.semanticmetadata.lire.solr.BinaryDocValuesField"/>
[...]

I then have a custom request handler that does the search for me: first based on the hashes (field cl_ha, treated as whitespace delimited terms) and then re-ranking the first 1,500 results based on the DocValues. Now it works rather fast; a demo with 1M images is available at http://demo-itec.uni-klu.ac.at/liredemo/ .. hash based search time is still not optimal, but that's an issue of the distribution of terms, which is not optimal for this kind of index (find the runtime separated into search and re-rank at the end of the page).
I'll put the whole (open, GPL-ed) source online at the end of September (as a module of LIRE), after some stress tests, documentation and further bug fixing. cheers, Mathias On Mon, Aug 12, 2013 at 4:51 PM, Robert Muir rcm...@gmail.com wrote: On Mon, Aug 12, 2013 at 8:38 AM, Mathias Lux m...@itec.uni-klu.ac.at wrote: Hi! I'm basically searching for a method to put byte[] data into Lucene DocValues of type BINARY (see [1]). Currently only primitives and Strings are supported according to [1]. I know that this can be done with a custom update handler, but I'd like to avoid that. Can you describe a little bit what kind of operations you want to do with it? I don't really know how BinaryField is typically used, but maybe it could support this option. On the other hand, adding it to BinaryField might not buy you much without some additional stuff, depending upon what you need to do. Like if you really want to do sort/facet on the thing, SORTED(SET) would probably be a better implementation: it doesn't care that the values are binary. BINARY, SORTED, and SORTED_SET actually all take byte[]; the difference is: * SORTED: deduplicates/compresses the unique byte[]'s and gives each document an ordinal number that reflects sort order (for sorting/faceting/grouping/etc) * SORTED_SET: similar, except each document has a set (which can be empty) of ordinal numbers (e.g. for faceting multivalued fields) * BINARY: just stores the byte[] for each document (no deduplication, no compression, no ordinals, nothing). So for sorting/faceting, BINARY is generally not very efficient unless there is something custom going on: for example, Lucene's faceting package stores the values elsewhere in a separate taxonomy index, so it uses this type just to encode a delta-compressed ordinal list for each document. For scoring factors/function queries, encoding the values inside NUMERIC(s) [up to 64 bits each] might still be best on average: the compression applied here is surprisingly efficient. -- Dr.
Mathias Lux Assistant Professor, Klagenfurt University, Austria http://tinyurl.com/mlux-itec
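The two-stage retrieval described above (hash terms for candidate selection, vectors for re-ranking) can be sketched end to end in plain Java. The bucketing below is made up and far cruder than LIRE's actual hashing; it only shows the shape of stage one:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class CandidateSelectionDemo {
    // Illustrative only: bucket each vector component into a coarse "hash term".
    // Real LIRE hashing is locality-sensitive; this just demonstrates the idea
    // of turning a vector into indexable whitespace-delimited terms.
    static Set<String> hashTerms(byte[] vector) {
        Set<String> terms = new HashSet<>();
        for (int i = 0; i < vector.length; i++) {
            terms.add(i + ":" + ((vector[i] & 0xFF) / 64)); // 4 buckets per dim
        }
        return terms;
    }

    // Stage 1: keep only docs sharing at least one hash term with the query.
    // (In Solr this is the term query against the cl_ha-style field.)
    static List<String> candidates(Map<String, byte[]> docs, byte[] query) {
        Set<String> q = hashTerms(query);
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, byte[]> e : docs.entrySet()) {
            Set<String> t = hashTerms(e.getValue());
            t.retainAll(q);
            if (!t.isEmpty()) out.add(e.getKey());
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, byte[]> docs = new LinkedHashMap<>();
        docs.put("near", new byte[]{10, 10});                 // same buckets as query
        docs.put("far", new byte[]{(byte) 200, (byte) 200});  // no shared bucket
        System.out.println(candidates(docs, new byte[]{12, 12})); // [near]
    }
}
```

Stage two -- the exact distance over the surviving candidates' DocValues vectors -- is what keeps the candidate set small enough that reading 1,500 vectors per query stays cheap.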
DocValues for byte[] ... or a common codec for selected fields
Hi all! First of all: Solr is an amazing project. Big thanks to the community! I really appreciate the stability, and especially the pre-configured jetty example ;) And now for the question: I'm currently on my way to writing a RequestHandler for Solr that deals with content based image search (using Lire, https://code.google.com/p/lire/). In general everything is running fine, but ... as soon as I hit a virtual border, say 1.5 million images or a certain index size around 2GB, I'm experiencing performance drops. I know from my experience and some profiling with Lucene that this can be caused by the compression of stored fields. I'm currently using binary fields to store byte[] objects, which are used after a hash based search for re-ranking. So based on the hashes a term query is issued in the request handler, then the 500-3000 documents (candidate results) are read from the index and the byte[] data is used to re-rank the candidate results. My question is now: until now I have only found ways to add single byte values as DocValues to the index, but not whole binary fields. Do you have any idea where to start if I want to put my binary fields into DocValues? cheers, Mathias -- Dr. Mathias Lux Assistant Professor, Klagenfurt University, Austria http://tinyurl.com/mlux-itec
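The access pattern that makes many small stored fields painful here is easiest to see next to its alternative: binary DocValues amount to column-stride storage, where each document's vector is a fixed-offset slice of one flat per-segment array, so reading a few thousand candidates involves no per-document decompression. A toy illustration of that layout (not Lucene code; the fixed dimension is an assumption for the sketch):

```java
import java.util.Arrays;

public class ColumnStrideDemo {
    // All vectors live in one flat array; doc i's vector is a fixed-offset slice.
    static final int DIM = 4;

    static byte[] vectorFor(byte[] column, int docId) {
        return Arrays.copyOfRange(column, docId * DIM, (docId + 1) * DIM);
    }

    public static void main(String[] args) {
        byte[] column = new byte[3 * DIM];
        Arrays.fill(column, 0 * DIM, 1 * DIM, (byte) 1); // doc 0
        Arrays.fill(column, 1 * DIM, 2 * DIM, (byte) 2); // doc 1
        Arrays.fill(column, 2 * DIM, 3 * DIM, (byte) 3); // doc 2
        System.out.println(Arrays.toString(vectorFor(column, 1))); // [2, 2, 2, 2]
    }
}
```

Contrast this with stored fields, where fetching each candidate document means locating and decompressing a stored-document block -- exactly the per-document cost that shows up once the index outgrows the OS cache.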