Re: DIH Blob data

2014-11-14 Thread Anurag Sharma
bq: We routinely store images and pdfs in Solr. There *is* a benefit, since
you don't need to manage another storage system, you don't have to worry
about Solr getting out of sync with the other system, you can use Solr
replication for all your assets, etc.

Does the same hold for large blobs such as images, audio, and video as well?
Tika supports multiple file formats (http://tika.apache.org/1.5/formats.html),
but I am not sure how good the Solr/Tika combination is. Storing PDFs and other
docs in Solr could be useful: Tika can extract metadata from the docs and
make them discoverable.

Considering all the above cases, there should also be support for a File
field type in Solr, like the other types (Date, Float, Int, Long, String, etc.), but
it looks like there are only two file-related types (
http://svn.apache.org/repos/asf/lucene/dev/trunk/solr/core/src/java/org/apache/solr/schema/)
and both are about external file storage.

   - ExternalFileField.java
     http://svn.apache.org/repos/asf/lucene/dev/trunk/solr/core/src/java/org/apache/solr/schema/ExternalFileField.java
   - ExternalFileFieldReloader.java
     http://svn.apache.org/repos/asf/lucene/dev/trunk/solr/core/src/java/org/apache/solr/schema/ExternalFileFieldReloader.java

What type can be used in schema when storing the files internally?




Re: DIH Blob data

2014-11-14 Thread Erick Erickson
Just skimming, so maybe I misinterpreted.

ExternalFileField and ExternalFileFieldReloader
refer to storing values for each doc in an external file; they have
nothing to do with storing _files_.

The usual pattern is to have Solr store just enough data to have the
system-of-record return the actual file, rather than have Solr
actually store the file. Solr isn't really built for this, and while some
people do this, it usually is a poor design, if for no other reason than
that as segments merge, the data gets copied again and again and again
to no good purpose.
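
To put a rough number on the "copied again and again" concern, here is a toy simulation. It is only a sketch under stated assumptions: every flush produces one unit-size segment, and whenever `merge_factor` segments of the same level exist they are rewritten into one larger segment. This is not Lucene's actual merge policy, and `simulate_merges`, `write_amplification`, and `merge_factor` are names made up for the illustration.

```python
def simulate_merges(num_flushes, merge_factor=10):
    """Return total units of stored data written under a toy level-based merge model.

    Assumption (not Lucene's real policy): each flush writes one unit-size
    segment, and merge_factor segments of one level merge into one segment
    of the next level, rewriting all of their data.
    """
    segments_per_level = {}
    written = 0
    for _ in range(num_flushes):
        written += 1  # the initial flush writes the data once
        segments_per_level[0] = segments_per_level.get(0, 0) + 1
        level = 0
        while segments_per_level.get(level, 0) >= merge_factor:
            segments_per_level[level] -= merge_factor
            # the merged segment holds merge_factor ** (level + 1) units
            written += merge_factor ** (level + 1)
            segments_per_level[level + 1] = segments_per_level.get(level + 1, 0) + 1
            level += 1
    return written


def write_amplification(num_flushes, merge_factor=10):
    """Average number of times each unit of stored data is written."""
    return simulate_merges(num_flushes, merge_factor) / num_flushes


if __name__ == "__main__":
    for flushes in (10, 100, 1000):
        print(flushes, write_amplification(flushes))
```

In this model the number of rewrites grows only with the logarithm of the index size, so whether it matters depends on how large the stored bytes are relative to everything else the indexer does.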

Best,
Erick



Re: DIH Blob data

2014-11-14 Thread Michael Sokolov

There is a binary type

-Mike
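
For the question of which schema type can hold file bytes internally: Solr ships a `solr.BinaryField` class, and its values are sent Base64-encoded in update requests. The sketch below only builds such an update document; the field name `content_bin` and the helper `make_binary_doc` are assumptions for illustration, and the schema is assumed to declare that field with a type backed by `solr.BinaryField`.

```python
import base64
import json


def make_binary_doc(doc_id, payload, field="content_bin"):
    """Build a Solr update document whose binary field carries Base64 text.

    `field` is a hypothetical field name; it must map to a BinaryField
    type in the schema for Solr to decode it back into bytes.
    """
    encoded = base64.b64encode(payload).decode("ascii")
    return {"id": doc_id, field: encoded}


if __name__ == "__main__":
    doc = make_binary_doc("doc-1", b"%PDF-1.4 example bytes")
    # This JSON body could be posted to an update handler by any HTTP client.
    print(json.dumps([doc]))
```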







Re: DIH Blob data

2014-11-14 Thread Michael Sokolov


On 11/14/2014 01:43 PM, Erick Erickson wrote:

 Just skimming, so maybe I misinterpreted.

 ExternalFileField and ExternalFileFieldReloader
 refer to storing values for each doc in an external file; they have
 nothing to do with storing _files_.

 The usual pattern is to have Solr store just enough data to have the
 system-of-record return the actual file, rather than have Solr
 actually store the file. Solr isn't really built for this, and while some
 people do this, it usually is a poor design, if for no other reason than
 that as segments merge, the data gets copied again and again and again
 to no good purpose.

I was worried about this, and spent a bunch of time working on a custom
codec that would store files externally (to avoid the merge penalty), 
while still living inside the Solr/Lucene ecosystem. It was a lot of 
complicated work, and after a while I thought I'd better do some careful 
performance measurements to make sure it was worthwhile.  What I found 
was that the merge cost was not very high relative to other indexing 
costs we were paying (indexing large full text documents with fairly 
complex analysis, but nothing unusual). So I don't think this particular 
performance argument against storage in Solr/Lucene is telling, at least 
for many ratios of stored doc size to indexed tokens size. It's also 
worth mentioning that my test involved reindexing every document once 
(basically a query-level replication of an existing index), so perhaps 
the amount of merging was less than it might be in other cases.


I can see that there might be other reasons to store documents 
elsewhere, but in my experience, with our use case, it actually works 
pretty well to store them in Lucene indexes.  Consider, for example, 
that if you are highlighting, you are probably already storing the full 
text of each document anyway. In our case we also need to store a 
marked-up version of the full text (so we can highlight an html view of 
a document as well as deliver plain text snippets), so the incremental 
cost of storing pdfs was not crushing.  Of course these could all be 
stored externally, too. Maybe we'll try that and get massive performance 
increases :)


-Mike


Re: DIH Blob data

2014-11-14 Thread Erick Erickson
Right, a more nuanced comment involves what _type_ of docs you're
storing, and what the ratio of searchable-to-overall size is. Consider
an image. The searchable data may be 0.01% of the file size. Or even
worse, a movie.
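
As a worked version of that ratio, the sketch below computes the searchable share of a stored asset. The asset names and byte counts are made-up illustrative numbers, not measurements, and `searchable_fraction` is a name invented for the example.

```python
def searchable_fraction(searchable_bytes, stored_bytes):
    """Fraction of the stored bytes that contributes searchable content."""
    if stored_bytes <= 0:
        raise ValueError("stored_bytes must be positive")
    return searchable_bytes / stored_bytes


if __name__ == "__main__":
    # Hypothetical assets: (searchable bytes, total stored bytes)
    assets = {
        "text-heavy pdf": (400_000, 2_000_000),
        "photo with a short caption": (500, 5_000_000),
    }
    for name, (searchable, stored) in assets.items():
        print(f"{name}: {searchable_fraction(searchable, stored):.4%}")
```

A 5 MB image with 500 bytes of searchable metadata lands at the 0.01% mentioned above, while a text-heavy PDF can sit several orders of magnitude higher.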

As always, it depends. I guess that personally I'm not a fan of
using Solr as a file store when you have to be prepared to re-index
from scratch sometime _anyway_ (IMO), in which case you often might as
well serve the data from the system-of-record since it's there anyway.
IOW, I need to be convinced the use-case really merits it. And the
particular use-case may very well mean it's a fine solution.

So if the use-case merits it, storing files in Solr is fine; I just
wonder about it when it comes to docs with lots of non-searchable bytes and
relatively few searchable bytes.

Best,
Erick



Re: DIH Blob data

2014-11-14 Thread Anurag Sharma
Thanks, Michael and Erick, for the succinct responses.







Re: DIH Blob data

2014-11-12 Thread stockii
I had a similar problem and didn't find any way to use the fields in a JSON
blob for a filter ... not with DIH.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/DIH-Blob-data-tp4168896p4168925.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: DIH Blob data

2014-11-12 Thread Anurag Sharma
A BLOB is a non-searchable field, so there is no benefit in storing it in
Solr. Any external key-value store can be used to store the blob, and a
reference to the blob can be stored as a string field in Solr.
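
A minimal sketch of that reference pattern, assuming an in-memory dictionary as a stand-in for the external key-value store. The class `BlobStore`, the helper `build_solr_doc`, and the field name `blob_ref` are all hypothetical names chosen for the illustration.

```python
import hashlib


class BlobStore:
    """In-memory stand-in for an external key-value store (an assumption)."""

    def __init__(self):
        self._data = {}

    def put(self, payload):
        # Content hash as the key, so identical blobs share one entry.
        key = hashlib.sha256(payload).hexdigest()
        self._data[key] = payload
        return key

    def get(self, key):
        return self._data[key]


def build_solr_doc(doc_id, payload, store, searchable_fields):
    """Store the blob externally and keep only a string reference in the doc."""
    doc = {"id": doc_id, "blob_ref": store.put(payload)}
    doc.update(searchable_fields)
    return doc


if __name__ == "__main__":
    store = BlobStore()
    doc = build_solr_doc("doc-1", b"raw image bytes", store, {"title": "Logo"})
    print(doc)                          # what would be indexed in Solr
    print(store.get(doc["blob_ref"]))   # how the blob is fetched at display time
```

The trade-off raised later in the thread still applies: the index and the store now have to be kept in sync.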




Re: DIH Blob data

2014-11-12 Thread Michael Sokolov
We routinely store images and pdfs in Solr. There *is* a benefit, since 
you don't need to manage another storage system, you don't have to worry 
about Solr getting out of sync with the other system, you can use Solr 
replication for all your assets, etc.


I don't use DIH, so personally I don't care whether it handles blobs, 
but it does seem like a natural extension for a system that indexes data 
from SQL in Solr.


-Mike







Re: DIH Blob data

2014-11-12 Thread Jeon Woosung
How about this?

First, define a field for the filter query. It should be multivalued.

Second, implement a transformer that extracts the dynamic JSON fields and puts
them into that Solr field.

For example,

<field name="terms" type="string" multiValued="true"/>

Data : {a:1,b:2,c:3}

You can split the data into a:1, b:2, c:3, and put them into terms.

And then you can use a filter query like fq=terms:"a:1" (quoted, so that the
colon inside the value is not parsed as a field separator).
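
The splitting step can be sketched outside of DIH as follows. This is an illustration only, assuming the blob holds valid JSON with quoted keys (for example {"a":1,"b":2,"c":3}) and a flat structure; `json_to_terms` and `terms_filter` are hypothetical helper names, not part of Solr or DIH.

```python
import json


def json_to_terms(blob):
    """Flatten a flat JSON object into 'key:value' strings for a multivalued field."""
    data = json.loads(blob)
    return [f"{key}:{value}" for key, value in data.items()]


def terms_filter(field, key, value):
    """Build an fq clause; the term is quoted because it contains a colon."""
    term = f"{key}:{value}".replace("\\", "\\\\").replace('"', '\\"')
    return f'{field}:"{term}"'


if __name__ == "__main__":
    print(json_to_terms('{"a":1,"b":2,"c":3}'))
    print(terms_filter("terms", "a", 1))
```

Inside DIH the same logic would live in a custom transformer; only the string handling is shown here.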





DIH Blob data

2014-11-11 Thread Rahul
I am trying to index JSON data stored in a blob column in a database.
The JSON is stored in the database as {a:1,b:2,c:3}.

I want to search based on fields later, like fq=a:1.
The fields a, b, c are dynamic and can be anything, based on data posted by
users.

What is the correct way to index data based on dynamic fields in Solr and
search them later based on those fields?

-- 

Rahul Ranjan