Solr: Images, Docs and Binary data

2011-04-06 Thread Ezequiel Calderara
Hello everyone, I need to know if anyone has used Solr for indexing and
storing images (up to 16 MB) or binary docs.

How does Solr behave with this type of document? How does it affect performance?

Thanks, everyone

-- 
__
Ezequiel.

Http://www.ironicnet.com


Re: Solr: Images, Docs and Binary data

2011-04-06 Thread Ezequiel Calderara
Another question that may be easier to answer: how can I store binary
data? Is there an example schema?



Re: Solr: Images, Docs and Binary data

2011-04-06 Thread Ryan McKinley
You can store binary data using a binary field type -- then you need
to send the data base64 encoded.

I would strongly recommend against storing large binary files in Solr
-- unless you really don't care about performance. The file system
is a good alternative that springs to mind.
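
As a rough illustration of the base64 step mentioned above, here is a minimal Python sketch. The field name `content_bin` and the document id are made up for the example, and how the document is actually posted to Solr (XML, SolrJ, etc.) is left out; the thread only says that a binary field expects base64-encoded data.

```python
import base64


def encode_binary_field(raw: bytes) -> str:
    """Return the base64 text form of raw bytes for use in a text-based request."""
    return base64.b64encode(raw).decode("ascii")


def decode_binary_field(encoded: str) -> bytes:
    """Reverse of encode_binary_field: turn the base64 text back into bytes."""
    return base64.b64decode(encoded)


# The eight signature bytes that start every PNG file, used as sample data.
image_bytes = bytes([0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A])

# Hypothetical document: "id" and "content_bin" are illustrative field names.
doc = {"id": "img-1", "content_bin": encode_binary_field(image_bytes)}
```

Decoding `doc["content_bin"]` with `decode_binary_field` should give back the original bytes unchanged.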

ryan





Re: Solr: Images, Docs and Binary data

2011-04-06 Thread Jonathan Rochkind
I put binary data in an ordinary Solr stored field; it doesn't need any
special schema.


I have run into trouble making sure the data is not corrupted on the way
in during indexing, depending on exactly which path is used to index
(SolrJ, SolrJ with EmbeddedSolr, DIH, etc.), as well as on settings in
the container (e.g. Jetty or Tomcat) used to house Solr. But I think
it's possible to get it working no matter what the path; if you run
into trouble, someone may be able to help you.


My binary data is not very large, though (generally under 1 MB).

However, in general, _indexing_ large data should be fine, although it
will create a larger index, which can require more RAM, or be slower,
etc. But that's generally just a function of the total size of the
index, or really the total number of unique terms; it doesn't matter
whether the docs they come from are big or small.


_Storing_ large fields can sometimes be a problem; Lucene/Solr are
really optimized as an index, not a key/value store. Some people choose
to _store_ their large objects in some external store (RDBMS, NoSQL
key/value, whatever), and have the client application look up the
objects itself by primary key/unique ID, after the PKs/UIDs are
retrieved from Solr. Use Solr for what it's good at, indexing, and use
something else that is good at storing for the large objects.
But other people sometimes store large objects directly in Solr without
problems; it can depend on the exact nature of your index and use.
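
The external-store pattern described above could look roughly like the following Python sketch. Everything in it is an assumption for illustration: `FileBlobStore` is a made-up helper, the ids are invented, and the list `ids_from_search` merely stands in for the unique ids a real Solr query would return.

```python
import os
import tempfile


class FileBlobStore:
    """Tiny file-system store keyed by a document's unique id (illustrative only)."""

    def __init__(self, root):
        self.root = root
        os.makedirs(root, exist_ok=True)

    def _path(self, uid):
        # Hex-encode the id so that arbitrary ids map to safe file names.
        return os.path.join(self.root, uid.encode("utf-8").hex())

    def put(self, uid, data):
        with open(self._path(uid), "wb") as f:
            f.write(data)

    def get(self, uid):
        with open(self._path(uid), "rb") as f:
            return f.read()


def fetch_objects(matching_ids, store):
    """Given the ids returned by a search, load each large object from the store."""
    return {uid: store.get(uid) for uid in matching_ids}


store = FileBlobStore(tempfile.mkdtemp())
store.put("doc-1", b"\x00\x01 large image bytes")
store.put("doc-2", b"another binary document")

# Stand-in for the unique ids that a Solr query would return.
ids_from_search = ["doc-2"]
objects = fetch_objects(ids_from_search, store)
```

The point of the split is that the index only has to hold the searchable fields plus the id, while the large bytes live wherever they are cheapest to serve.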



Re: Solr: Images, Docs and Binary data

2011-04-06 Thread Jonathan Rochkind

Ha, there's a binary field type?!

I've stored binary data in an ordinary String field type, and it's
worked. But there were some headaches getting it to work; it might have
been smoother if I had realized there was actually a binary field type.


But wait, I'm talking about a Solr 'stored field', not about indexing. I
didn't try to index my binary data, just store it for later retrieval
(knowing this can sometimes be a performance problem, doing it anyway
with relatively small data, and getting away with it). Does the field
type even affect the _stored values_ in a Solr field?



Re: Solr: Images, Docs and Binary data

2011-04-06 Thread Ezequiel Calderara
Hi, your answers were really helpful.

I was thinking of putting the base64-encoded file into a string field, but
I was a little worried about Solr trying to stem it, vectorize it, or
something like that.

I saw this in the example schema.xml:
<!--Binary data type. The data should be sent/retrieved in as Base64
encoded Strings -->
<fieldtype name="binary" class="solr.BinaryField"/>

Does anyone know of any storage for images that performs well, other than
the file system?

Thanks




Re: Solr: Images, Docs and Binary data

2011-04-06 Thread Markus Jelsma

> Ha, there's a binary field type?!
>
> I've stored binary data in an ordinary String field type, and it's
> worked.  But there were some headaches to get it to work, might have
> been smoother if I had realized there was actually a binary field type.

How? You can't just embed control characters in an XML body. They need to be
at least encoded so as not to write tabs, deletes, backspaces and whatever
garbage; base64 in Solr's case.
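
To make the point about control characters concrete, here is a small Python sketch. It uses Python's standard-library XML parser, not Solr's, so treating the two as equivalent is an assumption; the field name `data` and the sample bytes are invented. It checks whether an XML body parses when the payload is embedded raw versus base64-encoded.

```python
import base64
import xml.etree.ElementTree as ET

# Sample payload containing a backspace (0x08) and a record separator (0x1E).
raw = b"\x08\x1e binary payload"


def parses(xml_text):
    """Return True if xml_text is well-formed according to the stdlib parser."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Payload dropped into the XML body as-is (control characters included).
raw_doc = '<field name="data">' + raw.decode("latin-1") + "</field>"

# Same payload, base64-encoded first, so only plain ASCII reaches the XML body.
b64_text = base64.b64encode(raw).decode("ascii")
b64_doc = '<field name="data">' + b64_text + "</field>"

raw_ok = parses(raw_doc)
b64_ok = parses(b64_doc)
```

With this parser, the raw variant is rejected while the base64 variant parses, and decoding `b64_text` gives back the original bytes.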
 
> But wait I'm talking about Solr 'stored field', not about indexing. I
> didn't try to index my binary data, just store it for later retrieval
> (knowing this can sometimes be a performance problem, doing it anyway
> with relatively small data, got away with it).  Does the field type even
> affect the _stored values_ in a Solr field?

Solr decodes the data and stores it. It re-encodes the data when writing a
response.

 

Re: Solr: Images, Docs and Binary data

2011-04-06 Thread Markus Jelsma

> Hi, your answers were really helpful
>
> I was thinking of putting the base64 encoded file into a string field. But
> was a little worried about solr trying to stem it or vectorize or those
> stuff.

String field types are not analyzed, so Solr won't brutalize your data. Still,
it is better to use BinaryField.

> Seen in the example of the schema.xml:
> <!--Binary data type. The data should be sent/retrieved in as Base64
> encoded Strings -->
> <fieldtype name="binary" class="solr.BinaryField"/>
>
> Anyone knows any storage for images that performs well, other than FS ?

CouchDB can deliver file attachments over HTTP. The data needs to be sent
encoded (of course).
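
For what "sent encoded" might look like with CouchDB, here is a hedged Python sketch that only builds a request body and sends nothing. It assumes CouchDB's inline attachment format as I understand it (an `_attachments` object whose entries carry `content_type` and base64 `data`); the document id, title and file name are invented, and the details should be checked against the CouchDB documentation before relying on them.

```python
import base64
import json

# PNG signature bytes used as stand-in image content.
image_bytes = bytes([0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A])

# Hypothetical document with one inline attachment (names are illustrative).
couch_doc = {
    "_id": "img-1",
    "title": "Sample image",
    "_attachments": {
        "photo.png": {
            "content_type": "image/png",
            # Inline attachment data is carried as base64 text in the JSON body.
            "data": base64.b64encode(image_bytes).decode("ascii"),
        }
    },
}

# JSON text that a client would send as the request body.
body = json.dumps(couch_doc)
```

Reading the attachment back out of `body` and base64-decoding its `data` value should reproduce `image_bytes`.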

 

Re: Solr: Images, Docs and Binary data

2011-04-06 Thread Jonathan Rochkind

On 4/6/2011 2:39 PM, Markus Jelsma wrote:

>> Ha, there's a binary field type?!
>>
>> I've stored binary data in an ordinary String field type, and it's
>> worked.  But there were some headaches to get it to work, might have
>> been smoother if I had realized there was actually a binary field type.
>
> How? You can't just embed control characters in an XML body. They need to be
> at least encoded so as not to write tabs, deletes, backspaces and whatever
> garbage; base64 in Solr's case.


In my case, it was SolrJ with BinaryUpdateHandler, I think. That code was
actually written by someone else, a while ago.


However I managed to do it at indexing time -- ultimately getting it into
a String-type stored field -- my binary data comes back not UUEncoded,
but XML-escaped, i.e.:


&#30;

This works for me because my binary data is actually MOSTLY ASCII (so
this isn't as terribly inefficient as it could be), but it has some
control characters in it that need to be preserved. And nearly any
library you use for consuming XML responses will properly un-escape
things like &#30; when reading.


Re: Solr: Images, Docs and Binary data

2011-04-06 Thread Adam Estrada
Well...by default there is a pretty decent schema that you can use as a
template in the example project that builds with Solr. Tika is the library
that does the actual content extraction so it would be a good idea to try
the example project out first.

Adam



Re: Solr: Images, Docs and Binary data

2011-04-06 Thread Stefan Matheis

Ezequiel,

On 06.04.2011 20:38, Ezequiel Calderara wrote:

> Anyone knows any storage for images that performs well, other than FS ?


You may want to have a look at http://www.danga.com/mogilefs/ :)

Regards
Stefan


Re: Solr: Images, Docs and Binary data

2011-04-06 Thread Ezequiel Calderara
On Wed, Apr 6, 2011 at 15:31 PM, Adam Estrada estrada.adam.gro...@gmail.com wrote:

> Well...by default there is a pretty decent schema that you can use as a
> template in the example project that builds with Solr. Tika is the library
> that does the actual content extraction so it would be a good idea to try
> the example project out first.


I wanted to know how the size of large fields affects performance.

But I wasn't sure how to design the schema.


On Wed, Apr 6, 2011 at 4:23 PM, Stefan Matheis matheis.ste...@googlemail.com wrote:

> Ezequiel,
>
> On 06.04.2011 20:38, Ezequiel Calderara wrote:
>
>> Anyone knows any storage for images that performs well, other than FS ?
>
> you may have a look on http://www.danga.com/mogilefs/ ? :)
>
> Regards
> Stefan


Stefan, we looked at MogileFS, and also CouchDB and MongoDB.
AFAIR (As Far as I Read :P), MogileFS runs on *nix OSes, while we are using
Microsoft as the OS. (Yeah, we are the open source evangelists in our
company :P)

For the moment, we will start using Solr for storing and indexing (some
info at least) images and docs. We have yet to see what the needs are in
terms of scalability before choosing between the options.

Thanks all...
If you have more info, send it :)



Re: Solr: Images, Docs and Binary data

2011-04-06 Thread Markus Jelsma

> I wanted to know how large field's size affects performance.

If you use replication, then it has a huge impact on performance, as the data
gets sent over the network. It's also a memory hog, so there's less memory
available and more garbage collection. Indexing and merging are slower because
of the additional bytes being copied. If there's a lot of binary data,
performance is important, and disk space is not plentiful, then you shouldn't
store it in the index; the index size can double during optimizing.
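
A back-of-the-envelope estimate can make these costs easier to reason about. The Python sketch below is only arithmetic under stated assumptions: the document count and average size are invented, it counts only the stored binary bytes (ignoring the rest of the index and any compression), it uses the 4/3 size growth of base64 for data sent to Solr, and it takes the "can double during optimizing" remark above as a worst-case factor of two.

```python
def estimate_storage(num_docs, avg_binary_bytes):
    """Rough size estimates for storing binary payloads in the index."""
    # Raw bytes held in stored fields (other index structures are ignored).
    stored = num_docs * avg_binary_bytes
    # base64 turns every 3 bytes into 4 characters (padding ignored here),
    # so this approximates what is sent to Solr at indexing time.
    base64_wire = stored * 4 // 3
    # Worst case while optimizing: old and new copies exist side by side.
    peak_disk = stored * 2
    return {
        "stored_bytes": stored,
        "base64_wire_bytes": base64_wire,
        "peak_disk_bytes": peak_disk,
    }


# Hypothetical workload: 10,000 documents of about 3 MB each.
est = estimate_storage(num_docs=10_000, avg_binary_bytes=3_000_000)
```

For this made-up workload the stored payload alone is 30 GB, so replication and optimize headroom would have to be planned around that figure rather than around the size of the searchable text.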

 