LIRE Solr plugin updated to 4.10.2 and new demo ...

2014-12-17 Thread Mathias Lux
Hi all!

After the initial release I finally got around to updating the content
based image retrieval plugin LIRE Solr to the current version; it has
also been extended to support more CBIR features.

https://bitbucket.org/dermotte/liresolr

I also took the liberty of updating the web client and the demo
installation, so feel free to try it; it's now running on a shared
server featuring a million photos. The new features are image
similarity based re-ranking -- you first search for a tag in a text box
and then re-rank the results based on an example image -- and a new way
of content based search using the standard handler but scoring with a
function (it's slower, but more flexible).
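
For instance, such a function-based re-rank via the standard handler
looks roughly like this (an illustrative, untested call; the field name
and base64 feature string are taken from the older examples further
down in this archive):

http://localhost:8983/solr/lire/select?q=*:*&sort=lirefunc(cl_hi,"FQY5DhMYDg0ODg0PEBEPDg4ODg8QEgsgEBAQEBAgEBAQEBA=")+asc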

Feel free to test it and let me know what you think :)

http://demo-itec.uni-klu.ac.at/liredemo/

cheers,
  Mathias

-- 
Priv.-Doz. Dr. Dipl.-Ing. Mathias Lux
Associate Professor at Klagenfurt University, Austria
http://tinyurl.com/mlux-itec ... contact and cv


Re: DataImport Handler, writing a new EntityProcessor

2013-12-19 Thread Mathias Lux
Hi!

Thanks for all the advice! I finally did it. The most annoying error,
which took the best part of a day to figure out, was that the state
variable here had to be reset:
https://bitbucket.org/dermotte/liresolr/src/d27878a71c63842cb72b84162b599d99c4408965/src/main/java/net/semanticmetadata/lire/solr/LireEntityProcessor.java?at=master#cl-56
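
In essence the fix boils down to the following pattern (a minimal
sketch, not the actual liresolr source; DIH re-initializes the
processor for every row of the parent entity, so per-file state has to
be reset in init()):

import java.util.HashMap;
import java.util.Map;

import org.apache.solr.handler.dataimport.Context;
import org.apache.solr.handler.dataimport.EntityProcessorBase;

public class LireEntityProcessor extends EntityProcessorBase {
  // true once the single row for the current file has been emitted
  private boolean done = false;

  @Override
  public void init(Context context) {
    super.init(context);
    done = false;  // the line I forgot: reset for each file the parent hands over
  }

  @Override
  public Map<String, Object> nextRow() {
    if (done) return null;  // no more rows for this entity
    done = true;
    Map<String, Object> row = new HashMap<String, Object>();
    // ... read the image from the data source and add the feature fields ...
    return row;
  }
}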

The EntityProcessor is part of this image search plugin if anyone is
interested: https://bitbucket.org/dermotte/liresolr/

:) It's always the small things that are hard to find

cheers and thanks, Mathias

On Wed, Dec 18, 2013 at 7:26 PM, P Williams
williams.tricia.l...@gmail.com wrote:
 Hi Mathias,

 I'd recommend testing one thing at a time.  See if you can get it to work
 for one image before you try a directory of images.  Also try testing with
 the solr-testframework in your IDE (I use Eclipse) to debug, rather than
 via your browser/print statements.  Hopefully that will give you some more
 specific knowledge of what's happening around your plugin.

 I also wrote an EntityProcessor plugin to read from a properties file
 (https://issues.apache.org/jira/browse/SOLR-3928).  Hopefully that'll give
 you some insight into this kind of Solr plugin and how to test them.

 Cheers,
 Tricia




 On Wed, Dec 18, 2013 at 3:03 AM, Mathias Lux m...@itec.uni-klu.ac.at wrote:

 Hi all!

 I've got a question about writing a new EntityProcessor, along the same
 lines as the Tika one. My EntityProcessor should analyze jpg images and
 create document fields to be used with the LIRE Solr plugin
 (https://bitbucket.org/dermotte/liresolr). Basically I've taken the same
 approach as the TikaEntityProcessor, but my setup indexes only the first
 of 1000 images. I'm using a FileListEntityProcessor to get all JPEGs from
 a directory and then I'm handing them over (see [2]). My code for the
 EntityProcessor is at [1]. I've tried using the DataSource as well as the
 filePath attribute, but it ends up the same either way. However, the
 FileListEntityProcessor is able to read all the files according to the
 debug output, but I'm missing the link from the FileListEntityProcessor
 to the LireEntityProcessor.

 I'd appreciate any pointer or help :)

 cheers,
   Mathias

 [1] LireEntityProcessor http://pastebin.com/JFajkNtf
 [2] dataConfig http://pastebin.com/vSHucatJ

 --
 Dr. Mathias Lux
 Klagenfurt University, Austria
 http://tinyurl.com/mlux-itec




-- 
PD Dr. Mathias Lux
Klagenfurt University, Austria
http://tinyurl.com/mlux-itec


DataImport Handler, writing a new EntityProcessor

2013-12-18 Thread Mathias Lux
Hi all!

I've got a question about writing a new EntityProcessor, along the same
lines as the Tika one. My EntityProcessor should analyze jpg images and
create document fields to be used with the LIRE Solr plugin
(https://bitbucket.org/dermotte/liresolr). Basically I've taken the same
approach as the TikaEntityProcessor, but my setup indexes only the first
of 1000 images. I'm using a FileListEntityProcessor to get all JPEGs from
a directory and then I'm handing them over (see [2]). My code for the
EntityProcessor is at [1]. I've tried using the DataSource as well as the
filePath attribute, but it ends up the same either way. However, the
FileListEntityProcessor is able to read all the files according to the
debug output, but I'm missing the link from the FileListEntityProcessor
to the LireEntityProcessor.

I'd appreciate any pointer or help :)

cheers,
  Mathias

[1] LireEntityProcessor http://pastebin.com/JFajkNtf
[2] dataConfig http://pastebin.com/vSHucatJ

-- 
Dr. Mathias Lux
Klagenfurt University, Austria
http://tinyurl.com/mlux-itec


Re: DataImport Handler, writing a new EntityProcessor

2013-12-18 Thread Mathias Lux
Unfortunately it is the same outside of debug mode: just the first
document. I also print the params to stdout, but it seems only the
first one ever arrives at my custom class. I have the feeling that I'm
doing something seriously wrong here, based on a complete
misunderstanding :) I basically assume that the nested entity processor
will be called for each of the rows that come out of its parent. I've
read somewhere that the data has to be taken from the data source, and
I've implemented that, but it doesn't seem to change anything.
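
For reference, the nested setup I mean looks roughly like this (a
shortened sketch of the dataConfig from the pastebin in my first mail;
paths and field names are illustrative):

<dataConfig>
  <dataSource name="bin" type="BinFileDataSource"/>
  <document>
    <entity name="files" processor="FileListEntityProcessor"
            baseDir="/path/to/images" fileName=".*jpg"
            recursive="true" rootEntity="false" dataSource="null">
      <entity name="image" dataSource="bin"
              processor="net.semanticmetadata.lire.solr.LireEntityProcessor"
              url="${files.fileAbsolutePath}">
        <field column="id" name="id"/>
        <field column="cl_ha" name="cl_ha"/>
        <field column="cl_hi" name="cl_hi"/>
      </entity>
    </entity>
  </document>
</dataConfig>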

cheers,
Mathias

On Wed, Dec 18, 2013 at 3:05 PM, Dyer, James
james.d...@ingramcontent.com wrote:
 The first thing I would suggest is to try running it outside of debug mode.
 DIH's debug mode limits the number of documents it will take in, so that
 might be all that is wrong here.

 James Dyer
 Ingram Content Group
 (615) 213-4311


 -Original Message-
 From: mathias@gmail.com [mailto:mathias@gmail.com] On Behalf Of 
 Mathias Lux
 Sent: Wednesday, December 18, 2013 4:04 AM
 To: solr-user@lucene.apache.org
 Subject: DataImport Handler, writing a new EntityProcessor

 Hi all!

 I've got a question about writing a new EntityProcessor, along the same
 lines as the Tika one. My EntityProcessor should analyze jpg images and
 create document fields to be used with the LIRE Solr plugin
 (https://bitbucket.org/dermotte/liresolr). Basically I've taken the same
 approach as the TikaEntityProcessor, but my setup indexes only the first
 of 1000 images. I'm using a FileListEntityProcessor to get all JPEGs from
 a directory and then I'm handing them over (see [2]). My code for the
 EntityProcessor is at [1]. I've tried using the DataSource as well as the
 filePath attribute, but it ends up the same either way. However, the
 FileListEntityProcessor is able to read all the files according to the
 debug output, but I'm missing the link from the FileListEntityProcessor
 to the LireEntityProcessor.

 I'd appreciate any pointer or help :)

 cheers,
   Mathias

 [1] LireEntityProcessor http://pastebin.com/JFajkNtf
 [2] dataConfig http://pastebin.com/vSHucatJ

 --
 Dr. Mathias Lux
 Klagenfurt University, Austria
 http://tinyurl.com/mlux-itec




-- 
PD Dr. Mathias Lux
Klagenfurt University, Austria
http://tinyurl.com/mlux-itec


Re: Query result caching with custom functions

2013-11-24 Thread Mathias Lux
Hi Joel,

I just tested with custom equals and hashCode ... what I basically did
is create a string object based on all the function values and use it
for equals (with an instanceof check) and for the hash method.

The result was much the same as before: all results are cached unless I
set the queryResultCache size to 0 in solrconfig.xml.
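
Concretely, the attempt looks like this (a sketch; class and field
names abbreviated, where signature is the string built from the field
name and all function parameters):

@Override
public boolean equals(Object o) {
  if (!(o instanceof LireValueSource)) return false;
  // signature encodes the field name plus all function parameters,
  // e.g. the base64 feature vector
  return signature.equals(((LireValueSource) o).signature);
}

@Override
public int hashCode() {
  return signature.hashCode();
}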

cheers,
Mathias

On Thu, Oct 24, 2013 at 4:51 PM, Joel Bernstein joels...@gmail.com wrote:
 Mathias,

 I'd have to do a close review of the function sort code to be sure, but I
 suspect if you implement the equals() method on the ValueSource it should
 solve your caching issue. Also implement hashCode().

 Joel


 On Thu, Oct 24, 2013 at 10:35 AM, Shawn Heisey s...@elyograg.org wrote:

 On 10/24/2013 5:35 AM, Mathias Lux wrote:
  I've written a custom function, which is able to provide a distance
  based on some DocValues to re-sort result lists. This basically works
  great, but we've got the problem that if I don't change the query, but
  the function parameters, Solr delivers a cached result without
  re-ordering. I turned off caching and, lo and behold, problem solved. But
  of course this is not an avenue I want to pursue further, as it doesn't
  make sense for a production system.
 
  Do you have any ideas (beyond fake query modification and turning off
  caching) to counteract?
 
  btw. I'm using Solr 4.4 (so if you are aware of the issue and it has
  been resolved in 4.5 I'll port it :) The code I'm using is at
  https://bitbucket.org/dermotte/liresolr

 I suspect that the queryResultCache is not paying attention to the fact
 that parameters for your plugin have changed.  This probably means that
 your plugin must somehow inform the cache check code that something
 HAS changed.

 How you actually do this is a mystery to me because it involves parts of
 the code that are beyond my understanding, but it MIGHT involve making
 sure that parameters related to your code are saved as part of the entry
 that goes into the cache.

 Thanks,
 Shawn





-- 
PD Dr. Mathias Lux
Associate Professor, Klagenfurt University, Austria
http://tinyurl.com/mlux-itec


Query result caching with custom functions

2013-10-24 Thread Mathias Lux
Hi all!

Got a question on the Solr cache :)

I've written a custom function which provides a distance based on some
DocValues to re-sort result lists. This basically works great, but
we've got the problem that if I change only the function parameters,
not the query, Solr delivers a cached result without re-ordering. I
turned off caching and, lo and behold, problem solved. But of course
this is not an avenue I want to pursue further, as it doesn't make
sense for a production system.

Do you have any ideas (beyond fake query modification and turning off
caching) to counteract?

btw. I'm using Solr 4.4 (so if you are aware of the issue and it has
been resolved in 4.5 I'll port it :) The code I'm using is at
https://bitbucket.org/dermotte/liresolr

regards,
Mathias

-- 
Dr. Mathias Lux
Assistant Professor, Klagenfurt University, Austria
http://tinyurl.com/mlux-itec


Re: Query result caching with custom functions

2013-10-24 Thread Mathias Lux
That's a possibility,  I'll try that and report on the effects.  Thanks,
Mathias
On 24.10.2013 at 16:52, Joel Bernstein joels...@gmail.com wrote:

 Mathias,

 I'd have to do a close review of the function sort code to be sure, but I
 suspect if you implement the equals() method on the ValueSource it should
 solve your caching issue. Also implement hashCode().

 Joel


 On Thu, Oct 24, 2013 at 10:35 AM, Shawn Heisey s...@elyograg.org wrote:

  On 10/24/2013 5:35 AM, Mathias Lux wrote:
   I've written a custom function, which is able to provide a distance
   based on some DocValues to re-sort result lists. This basically works
   great, but we've got the problem that if I don't change the query, but
   the function parameters, Solr delivers a cached result without
   re-ordering. I turned off caching and, lo and behold, problem solved. But
   of course this is not an avenue I want to pursue further, as it doesn't
   make sense for a production system.
  
   Do you have any ideas (beyond fake query modification and turning off
   caching) to counteract?
  
   btw. I'm using Solr 4.4 (so if you are aware of the issue and it has
   been resolved in 4.5 I'll port it :) The code I'm using is at
   https://bitbucket.org/dermotte/liresolr
 
  I suspect that the queryResultCache is not paying attention to the fact
  that parameters for your plugin have changed.  This probably means that
  your plugin must somehow inform the cache check code that something
  HAS changed.
 
  How you actually do this is a mystery to me because it involves parts of
  the code that are beyond my understanding, but it MIGHT involve making
  sure that parameters related to your code are saved as part of the entry
  that goes into the cache.
 
  Thanks,
  Shawn
 
 



Re: Re-Ranking results based on DocValues with custom function.

2013-09-18 Thread Mathias Lux
Got it! Just sharing ... and maybe for inclusion in the Java API docs
of ValueSource :)

For sorting, one needs to implement the method

public double doubleVal(int doc)

on the FunctionValues returned by the ValueSource; then it works like a
charm.
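
In code that looks roughly like the following (a condensed sketch
against the Lucene 4.x function API; the distance computation is a
stand-in for the actual LIRE dissimilarity functions):

import java.io.IOException;
import java.util.Arrays;
import java.util.Map;

import org.apache.lucene.index.AtomicReaderContext;
import org.apache.lucene.index.BinaryDocValues;
import org.apache.lucene.queries.function.FunctionValues;
import org.apache.lucene.queries.function.ValueSource;
import org.apache.lucene.queries.function.docvalues.DoubleDocValues;
import org.apache.lucene.util.BytesRef;

public class LireValueSource extends ValueSource {
  private final String field;          // e.g. "cl_hi"
  private final byte[] queryFeature;   // decoded from the base64 parameter

  public LireValueSource(String field, byte[] queryFeature) {
    this.field = field;
    this.queryFeature = queryFeature;
  }

  @Override
  public FunctionValues getValues(Map context, AtomicReaderContext readerContext) throws IOException {
    // null-handling omitted for brevity
    final BinaryDocValues dv = readerContext.reader().getBinaryDocValues(field);
    return new DoubleDocValues(this) {
      @Override
      public double doubleVal(int doc) {
        BytesRef ref = new BytesRef();
        dv.get(doc, ref);
        return distance(queryFeature, ref);  // smaller = more similar
      }
    };
  }

  // stand-in for the LIRE dissimilarity function
  private static double distance(byte[] q, BytesRef stored) {
    double d = 0;
    for (int i = 0; i < q.length && i < stored.length; i++) {
      int diff = q[i] - stored.bytes[stored.offset + i];
      d += diff * diff;
    }
    return Math.sqrt(d);
  }

  @Override
  public String description() { return "lirefunc"; }

  @Override
  public boolean equals(Object o) {
    return o instanceof LireValueSource
        && field.equals(((LireValueSource) o).field)
        && Arrays.equals(queryFeature, ((LireValueSource) o).queryFeature);
  }

  @Override
  public int hashCode() {
    return field.hashCode() * 31 + Arrays.hashCode(queryFeature);
  }
}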

cheers,
  Mathias

On Tue, Sep 17, 2013 at 6:28 PM, Chris Hostetter
hossman_luc...@fucit.org wrote:

 : It basically allows for searching for text (which is associated to an
 : image) in an index and then getting the distance to a sample image
 : (base64 encoded byte[] array) based on one of five different low level
 : content based features stored as DocValues.

 very cool.

 : So there's one tiny question I still have ;) When I'm trying to
 : do a sort I'm getting
 :
 : msg: sort param could not be parsed as a query, and is not a field
 : that exists in the index:
 : lirefunc(cl_hi,FQY5DhMYDg0ODg0PEBEPDg4ODg8QEgsgEBAQEBAgEBAQEBA=),
 :
 : for the call
 http://localhost:9000/solr/lire/select?q=*%3A*&sort=lirefunc(cl_hi%2CFQY5DhMYDg0ODg0PEBEPDg4ODg8QEgsgEBAQEBAgEBAQEBA%3D)+asc&fl=id%2Ctitle%2Clirefunc(cl_hi%2CFQY5DhMYDg0ODg0PEBEPDg4ODg8QEgsgEBAQEBAgEBAQEBA%3D)&wt=json&indent=true

 Hmmm...

 i think the crux of the issue is your string literal.  function parsing
 tries to make life easy for you by not requiring string literals to be
 quoted unless they conflict with other function names or field names
 etc...  on top of that, the sort parsing code is kind of heuristic based
 (because it has to account for both functions or field names or wildcards,
 followed by other sort clauses, etc...) so in that context the special
 characters like '=' in your base64 string literal might be confusing the
 heuristics.

 can you try quoting the string literal and see if that works?

 For example, when i try using strdist with your base64 string in a sort
 param using the example configs i get the same error...

 http://localhost:8983/solr/select?q=*:*&sort=strdist%28name,FQY5DhMYDg0ODg0PEBEPDg4ODg8QEgsgEBAQEBAgEBAQEBA=,jw%29+asc

 but if i quote the string literal it works fine...

 http://localhost:8983/solr/select?q=*:*&sort=strdist%28name,%27FQY5DhMYDg0ODg0PEBEPDg4ODg8QEgsgEBAQEBAgEBAQEBA=%27,jw%29+asc



 -Hoss



-- 
Dr. Mathias Lux
Assistant Professor, Klagenfurt University, Austria
http://tinyurl.com/mlux-itec


Re: Re-Ranking results based on DocValues with custom function.

2013-09-17 Thread Mathias Lux
Hi!

Thanks for the directions! I got it up and running with a custom
ValueSourceParser: http://pastebin.com/cz1rJn4A and a custom
ValueSource: http://pastebin.com/j8mhA8e0

It basically allows searching for text (which is associated with an
image) in an index and then getting the distance to a sample image (a
base64 encoded byte[] array) based on one of five different low level
content based features stored as DocValues.

A sample result is here: http://pastebin.com/V7kL3DJh

So there's one tiny question I still have ;) When I'm trying to do a
sort I'm getting

msg: sort param could not be parsed as a query, and is not a field
that exists in the index:
lirefunc(cl_hi,FQY5DhMYDg0ODg0PEBEPDg4ODg8QEgsgEBAQEBAgEBAQEBA=),

for the call
http://localhost:9000/solr/lire/select?q=*%3A*&sort=lirefunc(cl_hi%2CFQY5DhMYDg0ODg0PEBEPDg4ODg8QEgsgEBAQEBAgEBAQEBA%3D)+asc&fl=id%2Ctitle%2Clirefunc(cl_hi%2CFQY5DhMYDg0ODg0PEBEPDg4ODg8QEgsgEBAQEBAgEBAQEBA%3D)&wt=json&indent=true

cheers,
  Mathias

On Tue, Sep 17, 2013 at 1:01 AM, Chris Hostetter
hossman_luc...@fucit.org wrote:
 : dissimilarity functions). What I want to do is to search using common
 : text search and then (optionally) re-rank using some custom function
 : like
 :
 : http://localhost:8983/solr/select?q=*:*&sort=myCustomFunction(var1) asc

 can you describe what you want your custom function to look like? it may
 already be possible using the existing functions provided out of the box -
 just need to combine them to build up the match expression...

 https://wiki.apache.org/solr/FunctionQuery

 ...if you really want to write your own, just implement ValueSourceParser
 and register it in solrconfig.xml...

 https://wiki.apache.org/solr/SolrPlugins#ValueSourceParser

 : I've seen that there are hooks in solrconfig.xml, but I did not find
 : an example or some documentation. I'd be most grateful if anyone could
 : either point me to one or give me a hint for another way to go :)

 when writing a custom plugin like this, the best thing to do is look at
 the existing examples of that plugin.  almost all of the built-in
 ValueSourceParsers are really trivial, and can be found in tiny anonymous
 classes right inside ValueSourceParser.java...

 For example, the function to divide the results of two other functions...

 addParser("div", new ValueSourceParser() {
   @Override
   public ValueSource parse(FunctionQParser fp) throws SyntaxError {
     ValueSource a = fp.parseValueSource();
     ValueSource b = fp.parseValueSource();
     return new DivFloatFunction(a, b);
   }
 });

 ..or, if you were trying to bundle that up in your own plugin jar and
 register it in solrconfig.xml, you might write it something like...

 public class DivideValueSourceParser extends ValueSourceParser {
   public DivideValueSourceParser() { }
   @Override
   public ValueSource parse(FunctionQParser fp) throws SyntaxError {
     ValueSource a = fp.parseValueSource();
     ValueSource b = fp.parseValueSource();
     return new DivFloatFunction(a, b);
   }
 }

 and then register it as...

 <valueSourceParser name="div" class="com.you.DivideValueSourceParser" />


 depending on your needs, you may also want to write a custom ValueSource
 implementation (ie: instead of DivFloatFunction above) in which case,
 again, the best examples to look at are all of the existing ValueSource
 functions...

 https://lucene.apache.org/core/4_4_0/queries/org/apache/lucene/queries/function/ValueSource.html


 -Hoss



-- 
Dr. Mathias Lux
Assistant Professor, Klagenfurt University, Austria
http://tinyurl.com/mlux-itec


Re: Scoring by document size

2013-09-17 Thread Mathias Lux
As the IDF values for A, B and C are minimal (they can't get any lower,
since the terms occur in every document), the major part of your score
most likely comes from the coord(..) part of scoring, which basically
computes the overlap of the query and the document. If you want a
stronger influence you can extend and override the Similarity
implementation. You might take a look at
http://lucene.apache.org/core/4_4_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html
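
For illustration, plugging in a custom Similarity looks like this (a
minimal sketch for Lucene/Solr 4.x; squaring the coord factor is just
one way to emphasize query overlap, and which method to override
depends on where you want the stronger influence):

import org.apache.lucene.search.similarities.DefaultSimilarity;

public class OverlapBoostSimilarity extends DefaultSimilarity {
  @Override
  public float coord(int overlap, int maxOverlap) {
    // default is overlap / maxOverlap; squaring it penalizes documents
    // that match only part of the query more strongly
    float c = (float) overlap / maxOverlap;
    return c * c;
  }
}

registered globally in schema.xml with something like

<similarity class="com.example.OverlapBoostSimilarity"/>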

cheers,
  Mathias

On Tue, Sep 17, 2013 at 1:59 PM, Upayavira u...@odoko.co.uk wrote:
 Have you used debugQuery=true, or fl=*,[explain], or those various
 functions? It is possible to ask Solr to tell you how it calculated the
 score, which will enable you to see what is going on in each case. You
 can probably work it out for yourself then I suspect.

 Upayavira

 On Tue, Sep 17, 2013, at 08:40 AM, blopez wrote:
 Hi all,

 I have some doubts about the Solr scoring function. I'm using the default
 configuration throughout, but I'm facing a weird issue with the retrieved
 scores.

 In the schema, I'm going to focus on the only field I'm interested in.
 Its definition is:

 <fieldType name="text" class="solr.TextField" sortMissingLast="true"
            omitNorms="false">
   <analyzer type="index">
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.ASCIIFoldingFilterFactory"/>
   </analyzer>
   <analyzer type="query">
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.ASCIIFoldingFilterFactory"/>
   </analyzer>
 </fieldType>

 <field name="myField" type="text" indexed="true" stored="true"
        required="false" />

 (omitNorms="false"; otherwise the document length is not taken into
 account in the final score)

 Then, I index some documents, with the following text in the 'myField'
 field:

 doc1 = A B C
 doc2 = A B C D
 doc3 = A B C D E
 doc4 = A B C D E F
 doc5 = A B C D E F G H
 doc6 = A B C D E F G H I

 Finally, I perform the query myField:(A B C) in order to retrieve all
 the documents, but with different scores (doc1 is more similar to the
 query than doc2, which is more similar than doc3, ...).

 All the documents are retrieved (OK), but the scores are like this:

 doc1 = 2,590214
 doc2 = 2,590214
 doc3 = 2,266437
 doc4 = 1,94266
 doc5 = 1,94266
 doc6 = 1,618884

 So in conclusion, as you can see, the score goes down, but not the way
 I'd like: doc1 gets the same score as doc2, even though doc1 matches 3/3
 tokens and doc2 matches only 3/4 tokens.

 Is this the normal Solr behaviour? Is there any way to get my expected
 behaviour?

 Thanks a lot,
 Borja.






-- 
Dr. Mathias Lux
Assistant Professor, Klagenfurt University, Austria
http://tinyurl.com/mlux-itec


Re-Ranking results based on DocValues with custom function.

2013-09-16 Thread Mathias Lux
Hi!

I have quite a large index with a lot of text and some binary data in
the documents (numeric vectors of arbitrary size with associated
dissimilarity functions). What I want to do is search using common text
search and then (optionally) re-rank using some custom function like

http://localhost:8983/solr/select?q=*:*&sort=myCustomFunction(var1) asc

I've seen that there are hooks in solrconfig.xml, but I did not find
an example or some documentation. I'd be most grateful if anyone could
either point me to one or give me a hint for another way to go :)

Btw., search using just the DocValues is handled by a custom
RequestHandler, which works great; but with text as the main search
feature and my DocValues only for re-ranking, I'd rather just add a
function for sorting and keep using the current, stable and well
performing request handler.

cheers,
Mathias

ps. a demo of the current system is available at:
http://demo-itec.uni-klu.ac.at/liredemo/

-- 
Dr. Mathias Lux
Assistant Professor, Klagenfurt University, Austria
http://tinyurl.com/mlux-itec


Is there a way to store binary data (byte[]) in DocValues?

2013-08-12 Thread Mathias Lux
Hi!

I'm basically searching for a method to put byte[] data into Lucene
DocValues of type BINARY (see [1]). Currently only primitives and
Strings are supported according to [1].

I know that this can be done with a custom update handler, but I'd
like to avoid that.

cheers,
Mathias

[1] http://wiki.apache.org/solr/DocValues

-- 
Dr. Mathias Lux
Assistant Professor, Klagenfurt University, Austria
http://tinyurl.com/mlux-itec


Re: Is there a way to store binary data (byte[]) in DocValues?

2013-08-12 Thread Mathias Lux
Hi!

That's what I'm doing currently, but it ends up in StoredField
implementations, which incur a decompression overhead I want to
avoid.

cheers,
Mathias

On Mon, Aug 12, 2013 at 3:11 PM, Raymond Wiker rwi...@gmail.com wrote:
 base64-encode the binary data? That will give you strings, at the expense
 of some storage overhead.


 On Mon, Aug 12, 2013 at 2:38 PM, Mathias Lux m...@itec.uni-klu.ac.at wrote:

 Hi!

 I'm basically searching for a method to put byte[] data into Lucene
 DocValues of type BINARY (see [1]). Currently only primitives and
 Strings are supported according to [1].

 I know that this can be done with a custom update handler, but I'd
 like to avoid that.

 cheers,
 Mathias

 [1] http://wiki.apache.org/solr/DocValues

 --
 Dr. Mathias Lux
 Assistant Professor, Klagenfurt University, Austria
 http://tinyurl.com/mlux-itec




-- 
Dr. Mathias Lux
Assistant Professor, Klagenfurt University, Austria
http://tinyurl.com/mlux-itec


Re: Is there a way to store binary data (byte[]) in DocValues?

2013-08-12 Thread Mathias Lux
Hi Robert,

I'm basically mis-using Solr for content based image search. I have
indexed fields (hashes) for candidate selection, i.e. 1,500 candidate
results retrieved with the IndexSearcher by hashes, which I then have
to re-rank based on numeric vectors stored in byte[] arrays. I had an
implementation where this was based on the binary field, but reading
from an index with a lot of small stored fields is not a good idea
with the current compression approach (I've already discussed this on
the Lucene user list :) BINARY is the thing for me to go for; as you
said, there's nothing there, just the values.

Another reason for not using the SORTED_SET and SORTED implementations
is that Solr currently works with Strings there, and I want a small
memory footprint for millions of images ... which does not go well
with immutables.

However, I already have a solution, which I just wanted to post here
when I saw your answer. Basically I copied the source of BinaryField
and changed it to produce a BinaryDocValuesField (see line 68 at
http://pastebin.com/dscPTwhr). This works out well for indexing once
you adapt the schema to use this class:

[...]
<!-- ColorLayout -->
<field name="cl_ha" type="text_ws" indexed="true" stored="false"
       required="false"/>
<field name="cl_hi" type="binaryDV" indexed="false" stored="true"
       required="false"/>
[...]
<fieldtype name="binaryDV"
           class="net.semanticmetadata.lire.solr.BinaryDocValuesField"/>
[...]
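
If you don't want to copy the whole class, roughly the same effect can
be had by extending Solr's BinaryField and overriding createField (a
sketch of the idea, not the actual liresolr source):

import org.apache.lucene.index.IndexableField;
import org.apache.lucene.util.BytesRef;
import org.apache.solr.schema.BinaryField;
import org.apache.solr.schema.SchemaField;

public class BinaryDocValuesField extends BinaryField {
  @Override
  public IndexableField createField(SchemaField field, Object val, float boost) {
    // let BinaryField handle the ByteBuffer/byte[]/base64 input ...
    IndexableField stored = super.createField(field, val, boost);
    // ... then emit the bytes as BINARY doc values instead of a stored field
    return new org.apache.lucene.document.BinaryDocValuesField(
        field.getName(), BytesRef.deepCopyOf(stored.binaryValue()));
  }
}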

I then have a custom request handler that does the search for me:
first based on the hashes (field cl_ha, treated as whitespace
delimited terms), then re-ranking the first 1,500 results based on the
DocValues.
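
The re-ranking step is conceptually just the following (schematic,
Lucene 4.x API; searcher, hashQuery, queryFeature and distance() are
assumed from context, with distance() standing in for the LIRE
dissimilarity function):

TopDocs candidates = searcher.search(hashQuery, 1500);
BinaryDocValues feat =
    MultiDocValues.getBinaryValues(searcher.getIndexReader(), "cl_hi");
List<ScoreDoc> rescored = new ArrayList<ScoreDoc>();
BytesRef buf = new BytesRef();
for (ScoreDoc sd : candidates.scoreDocs) {
  feat.get(sd.doc, buf);  // fill buf with this doc's feature vector
  rescored.add(new ScoreDoc(sd.doc, (float) distance(queryFeature, buf)));
}
// ascending distance = best match first
Collections.sort(rescored, new Comparator<ScoreDoc>() {
  public int compare(ScoreDoc a, ScoreDoc b) {
    return Float.compare(a.score, b.score);
  }
});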

Now it works rather fast; a demo with 1M images is available at
http://demo-itec.uni-klu.ac.at/liredemo/ .. hash based search time is
still not optimal, but that's an issue of the distribution of terms,
which is not optimal for this kind of index (see the runtime split
into search & re-rank at the end of the page).

I'll put the whole (open, GPL-ed) source online at the end of
September (as a module of LIRE), after some stress tests,
documentation and further bug fixing.

cheers,
  Mathias

On Mon, Aug 12, 2013 at 4:51 PM, Robert Muir rcm...@gmail.com wrote:
 On Mon, Aug 12, 2013 at 8:38 AM, Mathias Lux m...@itec.uni-klu.ac.at wrote:
 Hi!

 I'm basically searching for a method to put byte[] data into Lucene
 DocValues of type BINARY (see [1]). Currently only primitives and
 Strings are supported according to [1].

 I know that this can be done with a custom update handler, but I'd
 like to avoid that.


 Can you describe a little bit what kind of operations you want to do with it?
 I don't really know how BinaryField is typically used, but maybe it
 could support this option. On the other hand adding it to BinaryField
 might not buy you much without some additional stuff depending upon
 what you need to do. Like if you really want to do sort/facet on the
 thing, SORTED(SET) would probably be a better implementation: it
 doesn't care that the values are binary.

 BINARY, SORTED, and SORTED_SET actually all take byte[]: the difference is:
 * SORTED: deduplicates/compresses the unique byte[]'s and gives each
 document an ordinal number that reflects sort order (for
 sorting/faceting/grouping/etc)
 * SORTED_SET: similar, except each document has a set (which can be
 empty), of ordinal numbers (e.g. for faceting multivalued fields)
 * BINARY: just stores the byte[] for each document (no deduplication,
 no compression, no ordinals, nothing).

 So for sorting/faceting: BINARY is generally not very efficient unless
 there is something custom going on: for example lucene's faceting
 package stores the values elsewhere in a separate taxonomy index, so
 it uses this type just to encode a delta-compressed ordinal list for
 each document.

 For scoring factors/function queries: encoding the values inside
 NUMERIC(s) [up to 64 bits each] might still be best on average: the
 compression applied here is surprisingly efficient.



-- 
Dr. Mathias Lux
Assistant Professor, Klagenfurt University, Austria
http://tinyurl.com/mlux-itec


DocValues for byte[] ... or a common codec for selected fields

2013-08-08 Thread Mathias Lux
Hi all!

First of all: Solr is an amazing project. Big thanks to the community!
I really appreciate the stability, and especially the pre-configured
jetty example ;)

And now for the question: I'm currently on my way to writing a
RequestHandler for Solr that deals with content based image search
(using Lire https://code.google.com/p/lire/). In general everything is
running fine, but ...

As soon as I hit a virtual border, say 1.5 million images or an index
size of around 2GB, I'm experiencing performance drops. I know from my
experience with Lucene and some profiling that this can be caused by
the compression of stored fields. I'm currently using binary fields to
store byte[] objects, which are used after a hash based search for
re-ranking: based on the hashes, a term query is issued in the request
handler, then the 500-3000 candidate documents are read from the index
and the byte[] data is used to re-rank them.

My question now is: so far I've only found ways to add single byte
values as DocValues to the index, but not whole binary fields. Do you
have any idea where to start if I want to put my binary fields into
DocValues?

cheers,
  Mathias

-- 
Dr. Mathias Lux
Assistant Professor, Klagenfurt University, Austria
http://tinyurl.com/mlux-itec