Re: DIH Blob data
bq: We routinely store images and pdfs in Solr. There *is* a benefit, since you don't need to manage another storage system, you don't have to worry about Solr getting out of sync with the other system, you can use Solr replication for all your assets, etc.

Does the same hold for large blobs such as images, audio, and video? Tika supports many file formats (http://tika.apache.org/1.5/formats.html), but I'm not sure how well the Solr/Tika combination works. Storing PDFs and other documents in Solr could be useful: Tika can extract metadata from the documents and make them discoverable.

Considering all of the above, there should also be a File field type in Solr alongside types such as Date, Float, Int, Long, and String. It looks like there are only two file-related classes in the schema package (http://svn.apache.org/repos/asf/lucene/dev/trunk/solr/core/src/java/org/apache/solr/schema/), and both are for external file storage:

- ExternalFileField.java http://svn.apache.org/repos/asf/lucene/dev/trunk/solr/core/src/java/org/apache/solr/schema/ExternalFileField.java
- ExternalFileFieldReloader.java http://svn.apache.org/repos/asf/lucene/dev/trunk/solr/core/src/java/org/apache/solr/schema/ExternalFileFieldReloader.java

What type can be used in the schema when storing files internally?
Re: DIH Blob data
Just skimming, so maybe I misinterpreted. ExternalFileField and ExternalFileFieldReloader refer to storing values for each doc in an external file; they have nothing to do with storing _files_.

The usual pattern is to have Solr store just enough data for the system-of-record to return the actual file, rather than have Solr store the file itself. Solr isn't really built for this, and while some people do it, it is usually a poor design, if for no other reason than that as segments merge, the data gets copied again and again to no good purpose.

Best, Erick

On Fri, Nov 14, 2014 at 12:21 PM, Anurag Sharma anura...@gmail.com wrote: [...] looks like there are only two file types [...] and both re external file storage. [...] What type can be used in schema when storing the files internally?
Re: DIH Blob data
There is a binary type.

-Mike

On 11/14/2014 12:21 PM, Anurag Sharma wrote: [...] What type can be used in schema when storing the files internally?
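For anyone trying the binary type, here is a minimal client-side sketch. It assumes the schema declares a field named `content_bin` with a binary field type; the field name, the document id, and the sample bytes are all hypothetical. Solr's binary field takes its value as a Base64-encoded string, so the client encodes the raw bytes before building the document.

```python
import base64
import json


def make_binary_doc(doc_id, raw_bytes, field="content_bin"):
    """Build a Solr-style document dict with the payload Base64-encoded.

    `field` is a hypothetical field name, assumed to be declared with a
    binary field type in the schema.
    """
    encoded = base64.b64encode(raw_bytes).decode("ascii")
    return {"id": doc_id, field: encoded}


# Stand-in bytes; in practice these would come from open(path, "rb").read().
doc = make_binary_doc("doc-1", b"%PDF-1.4 sample")

# JSON body that could be posted to an update handler.
payload = json.dumps([doc])
```

A binary field like this only stores the bytes; anything that should be searchable still has to go into separate indexed fields.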
Re: DIH Blob data
On 11/14/2014 01:43 PM, Erick Erickson wrote: Just skimming, so maybe I misinterpreted. ExternalFileField and ExternalFileFieldReloader refer to storing values for each doc in an external file, they have nothing to do with storing _files_. The usual pattern is to have Solr store just enough data to have the system-of-record return the actual file rather than have Solr actually store the file. Solr isn't really built for this and while some people do this it usually is a poor design if for no other reason than as segments merge, the data gets copied again and again and again to no good purpose.

I was worried about this, and spent a bunch of time working on a custom codec that would store files externally (to avoid the merge penalty), while still living inside the Solr/Lucene ecosystem. It was a lot of complicated work, and after a while I thought I'd better do some careful performance measurements to make sure it was worthwhile.

What I found was that the merge cost was not very high relative to other indexing costs we were paying (indexing large full text documents with fairly complex analysis, but nothing unusual). So I don't think this particular performance argument against storage in Solr/Lucene is telling, at least for many ratios of stored doc size to indexed tokens size. It's also worth mentioning that my test involved reindexing every document once (basically a query-level replication of an existing index), so perhaps the amount of merging was less than it might be in other cases.

I can see that there might be other reasons to store documents elsewhere, but in my experience, with our use case, it actually works pretty well to store them in Lucene indexes. Consider, for example, that if you are highlighting, you are probably already storing the full text of each document anyway. In our case we also need to store a marked-up version of the full text (so we can highlight an html view of a document as well as deliver plain text snippets), so the incremental cost of storing pdfs was not crushing.

Of course these could all be stored externally, too. Maybe we'll try that and get massive performance increases :)

-Mike
Re: DIH Blob data
Right, a more nuanced comment involves what _type_ of docs you're storing and what the ratio of searchable-to-overall size is. Consider an image: the searchable data may be 0.01% of the file size. Or, even worse, a movie.

As always, it depends. Personally, I'm not a fan of using Solr as a file store when you have to be prepared to re-index from scratch sometime _anyway_ (IMO), in which case you often might as well serve the data from the system-of-record, since it's there anyway. IOW, I need to be convinced the use-case really merits it, and a particular use-case may very well mean it's a fine solution.

So if the use-case merits it, storing files in Solr is fine. I just wonder about docs with lots of non-searchable bytes and relatively few searchable bytes.

Best, Erick

On Fri, Nov 14, 2014 at 2:02 PM, Michael Sokolov msoko...@safaribooksonline.com wrote: [...] So I don't think this particular performance argument against storage in Solr/Lucene is telling, at least for many ratios of stored doc size to indexed tokens size. [...]
Re: DIH Blob data
Thanks, Michael and Erick, for the succinct responses.

On Sat, Nov 15, 2014 at 12:13 AM, Michael Sokolov msoko...@safaribooksonline.com wrote: There is a binary type -Mike
Re: DIH Blob data
I had a similar problem and didn't find any way to use the fields in a JSON blob for a filter ... not with DIH.

--
View this message in context: http://lucene.472066.n3.nabble.com/DIH-Blob-data-tp4168896p4168925.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: DIH Blob data
A BLOB is a non-searchable field, so there is no benefit in storing it in Solr. Any external key-value store can hold the blob, and a reference to the blob can be stored as a string field in Solr.

On Wed, Nov 12, 2014 at 5:56 PM, stockii stock.jo...@googlemail.com wrote: I had a similar problem and didnt find any solution to use the fields in JSON Blob for a filter ... Not with DIH.
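The pattern described here can be sketched roughly as follows. This is only an illustration: a plain dict stands in for the external key-value store, and the names `store_blob`, `make_index_doc`, and the `blob_ref` field are hypothetical, not part of Solr or of any particular store.

```python
import hashlib


def store_blob(kv_store, raw_bytes):
    """Store the blob under a content-derived key and return that key."""
    key = hashlib.sha256(raw_bytes).hexdigest()
    kv_store[key] = raw_bytes
    return key


def make_index_doc(doc_id, title, blob_key):
    """Build the document to index: searchable metadata plus the blob reference."""
    return {"id": doc_id, "title": title, "blob_ref": blob_key}


kv_store = {}  # stand-in for an external key-value store
key = store_blob(kv_store, b"\x89PNG fake image bytes")
doc = make_index_doc("img-1", "Company logo", key)

# At display time, resolve the reference from the search hit back to the blob.
blob = kv_store[doc["blob_ref"]]
```

Using a content hash as the key is just one option; any stable identifier that the store understands would serve as the string reference.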
Re: DIH Blob data
We routinely store images and pdfs in Solr. There *is* a benefit, since you don't need to manage another storage system, you don't have to worry about Solr getting out of sync with the other system, you can use Solr replication for all your assets, etc.

I don't use DIH, so personally I don't care whether it handles blobs, but it does seem like a natural extension for a system that indexes data from SQL in Solr.

-Mike

On 11/12/2014 01:31 PM, Anurag Sharma wrote: BLOB is non-searchable field so there is no benefit of storing it into Solr. Any external key-value store can be used to store the blob and reference of this blob can be stored as a string field in Solr.
Re: DIH Blob data
How about this?

First, define a field for the filter query. It should be multivalued.

Second, implement a transformer that extracts the dynamic JSON fields and puts them into that Solr field.

For example:

<field name="terms" type="string" multiValued="true"/>

Data: {a:1,b:2,c:3}

You can split the data into a:1, b:2, c:3 and put them into terms. Then you can use a filter query like fq=terms:"a:1" (the value is quoted because the colon is a special character in the query syntax).

On Nov 13, 2014, at 3:59 AM, Michael Sokolov msoko...@safaribooksonline.com wrote: We routinely store images and pdfs in Solr. There *is* a benefit, since you don't need to manage another storage system, you don't have to worry about Solr getting out of sync with the other system, you can use Solr replication for all your assets, etc. I don't use DIH, so personally I don't care whether it handles blobs, but it does seem like a natural extension for a system that indexes data from SQL in Solr. -Mike
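The splitting step suggested above could look roughly like this on the client side. It is a sketch, not a DIH transformer: it assumes the blob holds valid, flat JSON (keys in double quotes), and the function names are hypothetical. The filter value is wrapped in quotes because a bare colon inside the term would otherwise be read as query syntax.

```python
import json


def json_to_terms(blob_text):
    """Split a flat JSON object into "key:value" strings for a multivalued field."""
    data = json.loads(blob_text)
    return [f"{key}:{value}" for key, value in data.items()]


def terms_filter(key, value, field="terms"):
    """Build an fq value; the term is quoted because ':' is a query-syntax character.

    Values that themselves contain double quotes would need extra escaping,
    which this sketch does not handle.
    """
    return f'{field}:"{key}:{value}"'


terms = json_to_terms('{"a": 1, "b": 2, "c": 3}')
fq = terms_filter("a", 1)
```

The resulting list goes into the multivalued `terms` field of the document, and `fq` is what gets passed as the filter query parameter.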
DIH Blob data
I am trying to index JSON data stored in a BLOB column in a database. The JSON is stored in the database as {a:1,b:2,c:3}. I want to search on its fields later, for example fq=a:1. The fields a, b, c are dynamic and can be anything, depending on the data posted by users. What is the correct way to index data based on dynamic fields in Solr and search on those fields later?

--
Rahul Ranjan
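One way the question could be approached, sketched under assumptions that are not stated anywhere in this thread: the schema declares a dynamicField pattern such as `*_s` mapped to a string type, and the blob holds valid, flat JSON. Each JSON key is then mapped to its own field by appending the suffix. The function name and the `_s` suffix are hypothetical choices for illustration.

```python
import json


def json_to_dynamic_fields(doc_id, blob_text, suffix="_s"):
    """Map each key of a flat JSON object to a suffixed field name.

    Assumes the schema has a dynamicField pattern matching `*` + suffix;
    values are stored as strings to keep the sketch simple.
    """
    doc = {"id": doc_id}
    for key, value in json.loads(blob_text).items():
        doc[f"{key}{suffix}"] = str(value)
    return doc


doc = json_to_dynamic_fields("row-1", '{"a": 1, "b": 2, "c": 3}')
```

With documents shaped like this, a filter on a single key would take the form fq=a_s:1. Whether this or a single multivalued field works better likely depends on how many distinct keys users end up posting.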