Re: Moving From Oracle Text Search To Solr
Besides the other notes here, I agree you'll hit OOM if you try to read all the rows into memory at once, but I'm absolutely sure you can read them N at a time instead. Not that I could tell you how, mind you.

You're on your way...

Erick

On Tue, Mar 16, 2010 at 4:13 PM, Neil Chaudhuri <nchaudh...@potomacfusion.com> wrote:

> Certainly I could use some basic SQL count(*) queries to achieve faceted results, but I am not sure of the flexibility, extensibility, or scalability of that approach. And from what I have read, Oracle Text doesn't do faceting out of the box.
>
> Each document is a few MB, and there will be millions of them. I suppose it depends on how I index them. I am pretty sure my current approach of using Hibernate to load all rows, constructing Solr POJO's from them, and then passing the POJO's to the embedded server would lead to an OOM error. I should probably look into the other options.
>
> Thanks.
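A minimal sketch of the "read them N at a time" idea follows. It is not Hibernate or SolrJ code: Python's built-in sqlite3 stands in for Oracle, and the `docs` table, its columns, and the `send_batch` callback (where a real indexer would hand documents to Solr) are all hypothetical names chosen for illustration. The point is only that memory use is bounded by the batch size rather than by the table size.

```python
import sqlite3
from typing import Callable, Dict, Iterator, List, Tuple

Row = Tuple[int, str]


def iter_batches(cursor: sqlite3.Cursor, batch_size: int) -> Iterator[List[Row]]:
    """Yield lists of at most batch_size rows until the cursor is exhausted."""
    while True:
        rows = cursor.fetchmany(batch_size)
        if not rows:
            return
        yield rows


def index_in_batches(
    conn: sqlite3.Connection,
    batch_size: int,
    send_batch: Callable[[List[Dict[str, object]]], None],
) -> int:
    """Read the hypothetical docs table in batches and pass each batch on.

    Only one batch of rows is held in memory at any time.
    """
    cursor = conn.execute("SELECT id, body FROM docs ORDER BY id")
    total = 0
    for rows in iter_batches(cursor, batch_size):
        docs = [{"id": doc_id, "content": body} for doc_id, body in rows]
        send_batch(docs)  # hypothetical hook: a real indexer would send these to Solr
        total += len(docs)
    return total


def demo() -> Tuple[int, List[int]]:
    """Index 7 toy rows in batches of 3 and record the size of each batch."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, body TEXT)")
    conn.executemany(
        "INSERT INTO docs VALUES (?, ?)",
        [(i, f"document {i}") for i in range(1, 8)],
    )
    batch_sizes: List[int] = []
    total = index_in_batches(conn, 3, lambda docs: batch_sizes.append(len(docs)))
    conn.close()
    return total, batch_sizes
```

With Hibernate the same shape would presumably be reached by paging the query or scrolling through results and clearing the session between batches; that mapping is an assumption about the poster's setup, not something stated in the thread.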
Re: Moving From Oracle Text Search To Solr
The DataImportHandler has tools for this. It will fetch rows from Oracle and allow you to unpack columns as XML with XPaths.

http://wiki.apache.org/solr/DataImportHandler
http://wiki.apache.org/solr/DataImportHandler#Usage_with_RDBMS
http://wiki.apache.org/solr/DataImportHandler#XPathEntityProcessor

On Tue, Mar 16, 2010 at 2:25 PM, Neil Chaudhuri wrote:

> That is a great article, David.
>
> For the moment, I am trying an all-Solr approach, but I have run into a small problem. The documents are stored as XML CLOB's using Oracle's OPAQUE object. Is there any facility to unpack this into the actual text? Or must I execute that in the SQL query?
>
> Thanks.

--
Lance Norskog
goks...@gmail.com
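To make "unpack columns as XML with XPaths" concrete, here is a small sketch of what that unpacking amounts to once the XML text of a column is in hand. It uses Python's xml.etree.ElementTree rather than the DataImportHandler itself, and the element names (`doc`, `meta`, `subject`, `body`, `para`) are invented for illustration, since the thread does not show the real document schema.

```python
import xml.etree.ElementTree as ET
from typing import Dict

# Hypothetical document layout; the real schema is not given in the thread.
SAMPLE = """<doc>
  <meta><subject>Civil War records</subject></meta>
  <body>
    <section><para>First <b>bold</b> paragraph.</para></section>
    <para>Second paragraph.</para>
  </body>
</doc>"""


def unpack_document(xml_text: str) -> Dict[str, str]:
    """Pull plain-text fields out of one XML document using path expressions."""
    root = ET.fromstring(xml_text)
    # A single-valued field addressed by a fixed path.
    subject = root.findtext("./meta/subject", default="")
    # Every <para> at any depth below <body>, in document order;
    # itertext() flattens nested inline markup such as <b>.
    paragraphs = [
        "".join(para.itertext()).strip()
        for para in root.findall("./body//para")
    ]
    return {"subject": subject, "content": "\n".join(paragraphs)}
```

The resulting dictionary has the shape of a flat Solr document: one key per field, with the markup stripped from the content.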
RE: Moving From Oracle Text Search To Solr
That is a great article, David.

For the moment, I am trying an all-Solr approach, but I have run into a small problem. The documents are stored as XML CLOB's using Oracle's OPAQUE object. Is there any facility to unpack this into the actual text? Or must I execute that in the SQL query?

Thanks.

-----Original Message-----
From: Smiley, David W. [mailto:dsmi...@mitre.org]
Sent: Tuesday, March 16, 2010 4:45 PM
To: solr-user@lucene.apache.org
Subject: Re: Moving From Oracle Text Search To Solr

If you do stay with Oracle, please report back to the list how that went. In order to get decent filtering and faceting performance, I believe you will need to use "bitmapped indexes", which Oracle and some other databases support.

You may want to check out my article on this subject:
http://www.packtpub.com/article/text-search-your-database-or-solr

~ David Smiley
Author: http://www.packtpub.com/solr-1-4-enterprise-search-server/
Re: Moving From Oracle Text Search To Solr
If you do stay with Oracle, please report back to the list how that went. In order to get decent filtering and faceting performance, I believe you will need to use "bitmapped indexes", which Oracle and some other databases support.

You may want to check out my article on this subject:
http://www.packtpub.com/article/text-search-your-database-or-solr

~ David Smiley
Author: http://www.packtpub.com/solr-1-4-enterprise-search-server/

On Mar 16, 2010, at 4:13 PM, Neil Chaudhuri wrote:

> Certainly I could use some basic SQL count(*) queries to achieve faceted results, but I am not sure of the flexibility, extensibility, or scalability of that approach. And from what I have read, Oracle Text doesn't do faceting out of the box.
>
> Each document is a few MB, and there will be millions of them. I suppose it depends on how I index them. I am pretty sure my current approach of using Hibernate to load all rows, constructing Solr POJO's from them, and then passing the POJO's to the embedded server would lead to an OOM error. I should probably look into the other options.
>
> Thanks.
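For readers unfamiliar with why bitmapped indexes help filtering and faceting, the toy below shows the general idea. It is a conceptual sketch only, not Oracle's implementation: each distinct column value gets a bitmap with one bit per row, a filter is a bitwise AND, and a facet count is the number of set bits. The class name, function name, and sample rows are all made up for illustration.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Tuple


class BitmapIndex:
    """One bitmap per distinct value; bit i is set when row i has that value."""

    def __init__(self) -> None:
        # Python ints are arbitrary precision, so they can serve as bitmaps here.
        self.bitmaps: Dict[Hashable, int] = defaultdict(int)

    def add(self, row_number: int, value: Hashable) -> None:
        self.bitmaps[value] |= 1 << row_number

    def rows_with(self, value: Hashable) -> int:
        """Return the bitmap of rows holding this value (0 if the value is unseen)."""
        return self.bitmaps.get(value, 0)


def bitmap_facet_counts(index: BitmapIndex, matching_rows: int) -> Dict[Hashable, int]:
    """Count, per value, how many of the matching rows carry that value."""
    counts: Dict[Hashable, int] = {}
    for value, bitmap in index.bitmaps.items():
        hits = bin(bitmap & matching_rows).count("1")  # AND, then count set bits
        if hits:
            counts[value] = hits
    return counts


def demo() -> Dict[Hashable, int]:
    """Filter toy rows to year 2010, then facet the survivors by subject."""
    rows: List[Tuple[str, int]] = [
        ("history", 2009),
        ("history", 2010),
        ("science", 2010),
        ("history", 2010),
        ("science", 2009),
    ]
    by_subject, by_year = BitmapIndex(), BitmapIndex()
    for row_number, (subject, year) in enumerate(rows):
        by_subject.add(row_number, subject)
        by_year.add(row_number, year)
    return bitmap_facet_counts(by_subject, by_year.rows_with(2010))
```

Both the filter and the counts are computed without touching the rows themselves, which is the property that makes this kind of index attractive for low-cardinality columns.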
Re: Moving From Oracle Text Search To Solr
I've also indexed a concatenation of 50k journal articles (making a single document of several hundred MB of text) and it did not give me an OOM.

-glen

On 16 March 2010 15:57, Erick Erickson wrote:

> Why do you think you'd hit OOM errors? How big is "very large"? I've indexed, as a single document, a 26-volume encyclopedia of civil war records.
>
> Although, as much as I like the technology, if I could get away without using two technologies, I would. Are you completely sure you can't get what you want with clever Oracle querying?
>
> Best
> Erick
RE: Moving From Oracle Text Search To Solr
Certainly I could use some basic SQL count(*) queries to achieve faceted results, but I am not sure of the flexibility, extensibility, or scalability of that approach. And from what I have read, Oracle Text doesn't do faceting out of the box.

Each document is a few MB, and there will be millions of them. I suppose it depends on how I index them. I am pretty sure my current approach of using Hibernate to load all rows, constructing Solr POJO's from them, and then passing the POJO's to the embedded server would lead to an OOM error. I should probably look into the other options.

Thanks.

-----Original Message-----
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Tuesday, March 16, 2010 3:58 PM
To: solr-user@lucene.apache.org
Subject: Re: Moving From Oracle Text Search To Solr

Why do you think you'd hit OOM errors? How big is "very large"? I've indexed, as a single document, a 26-volume encyclopedia of civil war records.

Although, as much as I like the technology, if I could get away without using two technologies, I would. Are you completely sure you can't get what you want with clever Oracle querying?

Best
Erick
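The "basic SQL count(*) queries" approach to faceting can be sketched as below. Python's built-in sqlite3 stands in for Oracle, and the `docs` table with its `subject` and `year` columns is a hypothetical schema; the search condition is a plain column filter here, whereas the real application would use an Oracle Text predicate.

```python
import sqlite3
from typing import List, Sequence, Tuple

FACETABLE_COLUMNS = {"subject", "year"}  # hypothetical facet fields


def sql_facet_counts(
    conn: sqlite3.Connection,
    facet_column: str,
    where_sql: str = "",
    params: Sequence[object] = (),
) -> List[Tuple[object, int]]:
    """Return (value, count) pairs for one facet field over the matching rows."""
    # Column names cannot be bound as parameters, so check them against a whitelist.
    if facet_column not in FACETABLE_COLUMNS:
        raise ValueError(f"not a facetable column: {facet_column}")
    sql = f"SELECT {facet_column}, COUNT(*) FROM docs"
    if where_sql:
        sql += f" WHERE {where_sql}"
    sql += f" GROUP BY {facet_column} ORDER BY COUNT(*) DESC, {facet_column}"
    return conn.execute(sql, params).fetchall()


def make_demo_db() -> sqlite3.Connection:
    """Build a five-row in-memory table to run the facet queries against."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, subject TEXT, year INTEGER)")
    conn.executemany(
        "INSERT INTO docs (subject, year) VALUES (?, ?)",
        [
            ("history", 2009),
            ("history", 2010),
            ("science", 2010),
            ("history", 2010),
            ("science", 2009),
        ],
    )
    return conn
```

Note that this needs one GROUP BY query per facet field, re-run for every search, which is one way to read the scalability concern raised above.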
Re: Moving From Oracle Text Search To Solr
Why do you think you'd hit OOM errors? How big is "very large"? I've indexed, as a single document, a 26-volume encyclopedia of civil war records.

Although, as much as I like the technology, if I could get away without using two technologies, I would. Are you completely sure you can't get what you want with clever Oracle querying?

Best
Erick

On Tue, Mar 16, 2010 at 3:20 PM, Neil Chaudhuri <nchaudh...@potomacfusion.com> wrote:

> I am working on an application that currently hits a database containing millions of very large documents. I use Oracle Text Search at the moment, and things work fine. However, there is a request for faceting capability, and Solr seems like a technology I should look at. Suffice to say I am new to Solr, but at the moment I see two approaches, each with drawbacks:
>
> 1) Have Solr index document metadata (id, subject, date). Then use Oracle Text to do a content search based on criteria. Finally, query the Solr index for all documents whose id's match the set of id's returned by Oracle Text. That strikes me as an unmanageable Boolean query (e.g. id:4 OR id:33432323 OR ...).
>
> 2) Remove Oracle Text from the equation and use Solr to query document content based on search criteria. The indexing process though will almost certainly encounter an OutOfMemoryError given the number and size of documents.
>
> I am using the embedded server and Solr Java APIs to do the indexing and querying.
>
> I would welcome your thoughts on the best way to approach this situation. Please let me know if I should provide additional information.
>
> Thanks.
Moving From Oracle Text Search To Solr
I am working on an application that currently hits a database containing millions of very large documents. I use Oracle Text Search at the moment, and things work fine. However, there is a request for faceting capability, and Solr seems like a technology I should look at. Suffice to say I am new to Solr, but at the moment I see two approaches, each with drawbacks:

1) Have Solr index document metadata (id, subject, date). Then use Oracle Text to do a content search based on criteria. Finally, query the Solr index for all documents whose id's match the set of id's returned by Oracle Text. That strikes me as an unmanageable Boolean query (e.g. id:4 OR id:33432323 OR ...).

2) Remove Oracle Text from the equation and use Solr to query document content based on search criteria. The indexing process though will almost certainly encounter an OutOfMemoryError given the number and size of documents.

I am using the embedded server and Solr Java APIs to do the indexing and querying.

I would welcome your thoughts on the best way to approach this situation. Please let me know if I should provide additional information.

Thanks.
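To see why the Boolean query in approach 1 gets unwieldy, here is a sketch of building it. Solr limits the number of clauses in a Boolean query through the `maxBooleanClauses` setting in solrconfig.xml (1024 in the stock configuration), so a long id list has to be split into several queries. The function name and the default chunk size are illustrative choices, not part of any Solr API.

```python
from typing import Iterable, Iterator, List


def id_queries(ids: Iterable[int], max_clauses: int = 1024) -> Iterator[str]:
    """Yield Solr query strings of the form 'id:4 OR id:7', each within the clause limit."""
    if max_clauses < 1:
        raise ValueError("max_clauses must be at least 1")
    chunk: List[str] = []
    for doc_id in ids:
        chunk.append(f"id:{doc_id}")
        if len(chunk) == max_clauses:
            yield " OR ".join(chunk)
            chunk = []
    if chunk:  # emit the final, possibly shorter, chunk
        yield " OR ".join(chunk)
```

For example, `list(id_queries([4, 33432323, 7], max_clauses=2))` gives two query strings. With millions of matching ids this means thousands of round trips, and any facet counts would have to be merged across the chunks by the caller, which supports the poster's worry that the approach is unmanageable.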