Re: Moving From Oracle Text Search To Solr

2010-03-16 Thread Erick Erickson
Besides the other notes here, I agree you'll hit OOM if you try to
read all the rows into memory at once, but I'm absolutely sure you
can read them N at a time instead. Not that I could tell you how, mind
you.
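
For what it's worth, "N at a time" could look roughly like the sketch below. It is plain Java with no Solr or Hibernate calls: BatchFeeder and feedInBatches are made-up names, and the sink stands in for whatever actually sends a batch to the indexer. It assumes the rows arrive through an iterator that does not itself hold every row in memory (for example a scrolling cursor).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical helper for illustration; not part of Solr or Hibernate.
final class BatchFeeder {

    // Drains `rows` in chunks of at most `batchSize` items and hands each
    // chunk to `sink`. Returns the number of chunks delivered.
    static <T> int feedInBatches(Iterator<T> rows, int batchSize, Consumer<List<T>> sink) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        int delivered = 0;
        List<T> batch = new ArrayList<>(batchSize);
        while (rows.hasNext()) {
            batch.add(rows.next());
            if (batch.size() == batchSize) {
                sink.accept(batch);
                delivered++;
                // Start a fresh list so the previous chunk can be garbage collected.
                batch = new ArrayList<>(batchSize);
            }
        }
        if (!batch.isEmpty()) {
            // Final, partially filled chunk.
            sink.accept(batch);
            delivered++;
        }
        return delivered;
    }

    public static void main(String[] args) {
        // Stand-in data; in practice the iterator would wrap a database cursor.
        List<String> rows = Arrays.asList("doc1", "doc2", "doc3", "doc4", "doc5");
        int n = feedInBatches(rows.iterator(), 2,
                chunk -> System.out.println("indexing " + chunk));
        System.out.println(n + " batches");
    }
}
```

Only one batch is referenced at a time, so the memory needed is bounded by the batch size times the document size rather than by the size of the whole table.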

You're on your way...
Erick

On Tue, Mar 16, 2010 at 4:13 PM, Neil Chaudhuri <
nchaudh...@potomacfusion.com> wrote:

> Certainly I could use some basic SQL count(*) queries to achieve faceted
> results, but I am not sure of the flexibility, extensibility, or scalability
> of that approach. And from what I have read, Oracle Text doesn't do faceting
> out of the box.
>
> Each document is a few MB, and there will be millions of them. I suppose it
> depends on how I index them. I am pretty sure my current approach of using
> Hibernate to load all rows, constructing Solr POJO's from them, and then
> passing the POJO's to the embedded server would lead to a OOM error. I
> should probably look into the other options.
>
> Thanks.
>
>
> -----Original Message-----
> From: Erick Erickson [mailto:erickerick...@gmail.com]
> Sent: Tuesday, March 16, 2010 3:58 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Moving From Oracle Text Search To Solr
>
> Why do you think you'd hit OOM errors? How big is "very large"? I've
> indexed, as a single document, a 26 volume encyclopedia of civil war
> records..
>
> Although as much as I like the technology, if I could get away without using
> two technologies, I would. Are you completely sure you can't get what you
> want with clever Oracle querying?
>
> Best
> Erick
>
> On Tue, Mar 16, 2010 at 3:20 PM, Neil Chaudhuri <
> nchaudh...@potomacfusion.com> wrote:
>
> > I am working on an application that currently hits a database containing
> > millions of very large documents. I use Oracle Text Search at the moment,
> > and things work fine. However, there is a request for faceting capability,
> > and Solr seems like a technology I should look at. Suffice to say I am new
> > to Solr, but at the moment I see two approaches-each with drawbacks:
> >
> >
> > 1)  Have Solr index document metadata (id, subject, date). Then Use
> > Oracle Text to do a content search based on criteria. Finally, query the
> > Solr index for all documents whose id's match the set of id's returned by
> > Oracle Text. That strikes me as an unmanageable Boolean query.  (e.g.
> > id:4ORid:33432323OR...).
> >
> > 2)  Remove Oracle Text from the equation and use Solr to query document
> > content based on search criteria. The indexing process though will almost
> > certainly encounter an OutOfMemoryError given the number and size of
> > documents.
> >
> >
> >
> > I am using the embedded server and Solr Java APIs to do the indexing and
> > querying.
> >
> >
> >
> > I would welcome your thoughts on the best way to approach this situation.
> > Please let me know if I should provide additional information.
> >
> >
> >
> > Thanks.
> >
>


Re: Moving From Oracle Text Search To Solr

2010-03-16 Thread Lance Norskog
The DataImportHandler has tools for this. It will fetch rows from
Oracle and allow you to unpack XML columns with XPath expressions.

http://wiki.apache.org/solr/DataImportHandler
http://wiki.apache.org/solr/DataImportHandler#Usage_with_RDBMS
http://wiki.apache.org/solr/DataImportHandler#XPathEntityProcessor
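
To illustrate what "unpacking with XPath" means, here is a minimal sketch using only the JDK's javax.xml.xpath package. It is not the DataImportHandler itself and does not show its configuration. The class name ClobXmlText, the element names (record, body) and the sample XML are all made up, and the sketch assumes the CLOB has already been read into a String.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration; the real work in Solr would be
// done by the DataImportHandler's own processors.
final class ClobXmlText {

    // Evaluates `expression` against `xml` and returns the string value
    // of the first matching node (empty string if nothing matches).
    static String extract(String xml, String expression) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            return xpath.evaluate(expression, new InputSource(new StringReader(xml)));
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("cannot evaluate " + expression, e);
        }
    }

    public static void main(String[] args) {
        // Made-up document shape; the real CLOB layout is not known here.
        String xml = "<record><id>4</id><body>Hello Solr</body></record>";
        System.out.println(extract(xml, "/record/body"));
    }
}
```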

On Tue, Mar 16, 2010 at 2:25 PM, Neil Chaudhuri
 wrote:
> That is a great article, David.
>
> For the moment, I am trying an all-Solr approach, but I have run into a small 
> problem. The documents are stored as XML CLOB's using Oracle's OPAQUE object. 
> Is there any facility to unpack this into the actual text? Or must I execute 
> that in the SQL query?
>
> Thanks.
>
>
> -----Original Message-----
> From: Smiley, David W. [mailto:dsmi...@mitre.org]
> Sent: Tuesday, March 16, 2010 4:45 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Moving From Oracle Text Search To Solr
>
> If you do stay with Oracle, please report back to the list how that went.  In 
> order to get decent filtering and faceting performance, I believe you will 
> need to use "bitmapped indexes" which Oracle and some other databases support.
>
> You may want to check out my article on this subject: 
> http://www.packtpub.com/article/text-search-your-database-or-solr
>
> ~ David Smiley
> Author: http://www.packtpub.com/solr-1-4-enterprise-search-server/
>
>
> On Mar 16, 2010, at 4:13 PM, Neil Chaudhuri wrote:
>
>> Certainly I could use some basic SQL count(*) queries to achieve faceted 
>> results, but I am not sure of the flexibility, extensibility, or scalability 
>> of that approach. And from what I have read, Oracle Text doesn't do faceting 
>> out of the box.
>>
>> Each document is a few MB, and there will be millions of them. I suppose it 
>> depends on how I index them. I am pretty sure my current approach of using 
>> Hibernate to load all rows, constructing Solr POJO's from them, and then 
>> passing the POJO's to the embedded server would lead to a OOM error. I 
>> should probably look into the other options.
>>
>> Thanks.
>>
>>
>> -----Original Message-----
>> From: Erick Erickson [mailto:erickerick...@gmail.com]
>> Sent: Tuesday, March 16, 2010 3:58 PM
>> To: solr-user@lucene.apache.org
>> Subject: Re: Moving From Oracle Text Search To Solr
>>
>> Why do you think you'd hit OOM errors? How big is "very large"? I've
>> indexed, as a single document, a 26 volume encyclopedia of civil war
>> records..
>>
>> Although as much as I like the technology, if I could get away without using
>> two technologies, I would. Are you completely sure you can't get what you
>> want with clever Oracle querying?
>>
>> Best
>> Erick
>>
>> On Tue, Mar 16, 2010 at 3:20 PM, Neil Chaudhuri <
>> nchaudh...@potomacfusion.com> wrote:
>>
>>> I am working on an application that currently hits a database containing
>>> millions of very large documents. I use Oracle Text Search at the moment,
>>> and things work fine. However, there is a request for faceting capability,
>>> and Solr seems like a technology I should look at. Suffice to say I am new
>>> to Solr, but at the moment I see two approaches-each with drawbacks:
>>>
>>>
>>> 1)      Have Solr index document metadata (id, subject, date). Then Use
>>> Oracle Text to do a content search based on criteria. Finally, query the
>>> Solr index for all documents whose id's match the set of id's returned by
>>> Oracle Text. That strikes me as an unmanageable Boolean query.  (e.g.
>>> id:4ORid:33432323OR...).
>>>
>>> 2)      Remove Oracle Text from the equation and use Solr to query document
>>> content based on search criteria. The indexing process though will almost
>>> certainly encounter an OutOfMemoryError given the number and size of
>>> documents.
>>>
>>>
>>>
>>> I am using the embedded server and Solr Java APIs to do the indexing and
>>> querying.
>>>
>>>
>>>
>>> I would welcome your thoughts on the best way to approach this situation.
>>> Please let me know if I should provide additional information.
>>>
>>>
>>>
>>> Thanks.
>>>
>



-- 
Lance Norskog
goks...@gmail.com


RE: Moving From Oracle Text Search To Solr

2010-03-16 Thread Neil Chaudhuri
That is a great article, David. 

For the moment, I am trying an all-Solr approach, but I have run into a small 
problem. The documents are stored as XML CLOB's using Oracle's OPAQUE object. 
Is there any facility to unpack this into the actual text? Or must I execute 
that in the SQL query?

Thanks.


-----Original Message-----
From: Smiley, David W. [mailto:dsmi...@mitre.org] 
Sent: Tuesday, March 16, 2010 4:45 PM
To: solr-user@lucene.apache.org
Subject: Re: Moving From Oracle Text Search To Solr

If you do stay with Oracle, please report back to the list how that went.  In 
order to get decent filtering and faceting performance, I believe you will need 
to use "bitmapped indexes" which Oracle and some other databases support.

You may want to check out my article on this subject: 
http://www.packtpub.com/article/text-search-your-database-or-solr

~ David Smiley
Author: http://www.packtpub.com/solr-1-4-enterprise-search-server/


On Mar 16, 2010, at 4:13 PM, Neil Chaudhuri wrote:

> Certainly I could use some basic SQL count(*) queries to achieve faceted 
> results, but I am not sure of the flexibility, extensibility, or scalability 
> of that approach. And from what I have read, Oracle Text doesn't do faceting 
> out of the box.
> 
> Each document is a few MB, and there will be millions of them. I suppose it 
> depends on how I index them. I am pretty sure my current approach of using 
> Hibernate to load all rows, constructing Solr POJO's from them, and then 
> passing the POJO's to the embedded server would lead to a OOM error. I should 
> probably look into the other options.
> 
> Thanks.
> 
> 
> -----Original Message-----
> From: Erick Erickson [mailto:erickerick...@gmail.com] 
> Sent: Tuesday, March 16, 2010 3:58 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Moving From Oracle Text Search To Solr
> 
> Why do you think you'd hit OOM errors? How big is "very large"? I've
> indexed, as a single document, a 26 volume encyclopedia of civil war
> records..
> 
> Although as much as I like the technology, if I could get away without using
> two technologies, I would. Are you completely sure you can't get what you
> want with clever Oracle querying?
> 
> Best
> Erick
> 
> On Tue, Mar 16, 2010 at 3:20 PM, Neil Chaudhuri <
> nchaudh...@potomacfusion.com> wrote:
> 
>> I am working on an application that currently hits a database containing
>> millions of very large documents. I use Oracle Text Search at the moment,
>> and things work fine. However, there is a request for faceting capability,
>> and Solr seems like a technology I should look at. Suffice to say I am new
>> to Solr, but at the moment I see two approaches-each with drawbacks:
>> 
>> 
>> 1)  Have Solr index document metadata (id, subject, date). Then Use
>> Oracle Text to do a content search based on criteria. Finally, query the
>> Solr index for all documents whose id's match the set of id's returned by
>> Oracle Text. That strikes me as an unmanageable Boolean query.  (e.g.
>> id:4ORid:33432323OR...).
>> 
>> 2)  Remove Oracle Text from the equation and use Solr to query document
>> content based on search criteria. The indexing process though will almost
>> certainly encounter an OutOfMemoryError given the number and size of
>> documents.
>> 
>> 
>> 
>> I am using the embedded server and Solr Java APIs to do the indexing and
>> querying.
>> 
>> 
>> 
>> I would welcome your thoughts on the best way to approach this situation.
>> Please let me know if I should provide additional information.
>> 
>> 
>> 
>> Thanks.
>> 






Re: Moving From Oracle Text Search To Solr

2010-03-16 Thread Smiley, David W.
If you do stay with Oracle, please report back to the list how that went.  In 
order to get decent filtering and faceting performance, I believe you will need 
to use "bitmapped indexes" which Oracle and some other databases support.
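
As a rough picture of why bitmaps help with filtering and faceting, the sketch below uses java.util.BitSet: one bitmap per field value, with bit i set when row i has that value. Combining filters is then a bitwise AND and a facet count is the number of set bits. This only illustrates the idea and is not how Oracle implements its bitmap indexes. The class name BitmapFilter and the sample row numbers are made up.

```java
import java.util.BitSet;

// Hypothetical illustration of the bitmap idea; not Oracle's implementation.
final class BitmapFilter {

    // Builds a bitmap with one bit set per matching row number.
    static BitSet bitmapOf(int... rowNumbers) {
        BitSet bits = new BitSet();
        for (int row : rowNumbers) {
            bits.set(row);
        }
        return bits;
    }

    // Counts the rows present in both bitmaps without modifying either input.
    static int countBoth(BitSet a, BitSet b) {
        BitSet both = (BitSet) a.clone();
        both.and(b);
        return both.cardinality();
    }

    public static void main(String[] args) {
        BitSet subjectHistory = bitmapOf(0, 2, 3, 5); // rows whose subject is "history"
        BitSet year2009 = bitmapOf(2, 5, 7);          // rows dated 2009
        System.out.println(countBoth(subjectHistory, year2009));
    }
}
```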

You may want to check out my article on this subject: 
http://www.packtpub.com/article/text-search-your-database-or-solr

~ David Smiley
Author: http://www.packtpub.com/solr-1-4-enterprise-search-server/


On Mar 16, 2010, at 4:13 PM, Neil Chaudhuri wrote:

> Certainly I could use some basic SQL count(*) queries to achieve faceted 
> results, but I am not sure of the flexibility, extensibility, or scalability 
> of that approach. And from what I have read, Oracle Text doesn't do faceting 
> out of the box.
> 
> Each document is a few MB, and there will be millions of them. I suppose it 
> depends on how I index them. I am pretty sure my current approach of using 
> Hibernate to load all rows, constructing Solr POJO's from them, and then 
> passing the POJO's to the embedded server would lead to a OOM error. I should 
> probably look into the other options.
> 
> Thanks.
> 
> 
> -----Original Message-----
> From: Erick Erickson [mailto:erickerick...@gmail.com] 
> Sent: Tuesday, March 16, 2010 3:58 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Moving From Oracle Text Search To Solr
> 
> Why do you think you'd hit OOM errors? How big is "very large"? I've
> indexed, as a single document, a 26 volume encyclopedia of civil war
> records..
> 
> Although as much as I like the technology, if I could get away without using
> two technologies, I would. Are you completely sure you can't get what you
> want with clever Oracle querying?
> 
> Best
> Erick
> 
> On Tue, Mar 16, 2010 at 3:20 PM, Neil Chaudhuri <
> nchaudh...@potomacfusion.com> wrote:
> 
>> I am working on an application that currently hits a database containing
>> millions of very large documents. I use Oracle Text Search at the moment,
>> and things work fine. However, there is a request for faceting capability,
>> and Solr seems like a technology I should look at. Suffice to say I am new
>> to Solr, but at the moment I see two approaches-each with drawbacks:
>> 
>> 
>> 1)  Have Solr index document metadata (id, subject, date). Then Use
>> Oracle Text to do a content search based on criteria. Finally, query the
>> Solr index for all documents whose id's match the set of id's returned by
>> Oracle Text. That strikes me as an unmanageable Boolean query.  (e.g.
>> id:4ORid:33432323OR...).
>> 
>> 2)  Remove Oracle Text from the equation and use Solr to query document
>> content based on search criteria. The indexing process though will almost
>> certainly encounter an OutOfMemoryError given the number and size of
>> documents.
>> 
>> 
>> 
>> I am using the embedded server and Solr Java APIs to do the indexing and
>> querying.
>> 
>> 
>> 
>> I would welcome your thoughts on the best way to approach this situation.
>> Please let me know if I should provide additional information.
>> 
>> 
>> 
>> Thanks.
>> 






Re: Moving From Oracle Text Search To Solr

2010-03-16 Thread Glen Newton
I've also indexed a concatenation of 50k journal articles (making a
single document of several hundred MB of text) and it did not give me
an OOM.

-glen


On 16 March 2010 15:57, Erick Erickson  wrote:
> Why do you think you'd hit OOM errors? How big is "very large"? I've
> indexed, as a single document, a 26 volume encyclopedia of civil war
> records..
>
> Although as much as I like the technology, if I could get away without using
> two technologies, I would. Are you completely sure you can't get what you
> want with clever Oracle querying?
>
> Best
> Erick
>
> On Tue, Mar 16, 2010 at 3:20 PM, Neil Chaudhuri <
> nchaudh...@potomacfusion.com> wrote:
>
>> I am working on an application that currently hits a database containing
>> millions of very large documents. I use Oracle Text Search at the moment,
>> and things work fine. However, there is a request for faceting capability,
>> and Solr seems like a technology I should look at. Suffice to say I am new
>> to Solr, but at the moment I see two approaches-each with drawbacks:
>>
>>
>> 1)      Have Solr index document metadata (id, subject, date). Then Use
>> Oracle Text to do a content search based on criteria. Finally, query the
>> Solr index for all documents whose id's match the set of id's returned by
>> Oracle Text. That strikes me as an unmanageable Boolean query.  (e.g.
>> id:4ORid:33432323OR...).
>>
>> 2)      Remove Oracle Text from the equation and use Solr to query document
>> content based on search criteria. The indexing process though will almost
>> certainly encounter an OutOfMemoryError given the number and size of
>> documents.
>>
>>
>>
>> I am using the embedded server and Solr Java APIs to do the indexing and
>> querying.
>>
>>
>>
>> I would welcome your thoughts on the best way to approach this situation.
>> Please let me know if I should provide additional information.
>>
>>
>>
>> Thanks.
>>
>





RE: Moving From Oracle Text Search To Solr

2010-03-16 Thread Neil Chaudhuri
Certainly I could use some basic SQL count(*) queries to achieve faceted 
results, but I am not sure of the flexibility, extensibility, or scalability of 
that approach. And from what I have read, Oracle Text doesn't do faceting out 
of the box.
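
To be concrete about what I mean by faceted results: for the current result set, a count of documents per value of a field, which is what something like SELECT subject, COUNT(*) ... GROUP BY subject would return (the table and column are only examples). The same computation as an in-memory Java sketch, with a made-up class name FacetCounts and made-up subject values:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical illustration of a facet count; equivalent in spirit to
// a COUNT(*) ... GROUP BY over the matching rows.
final class FacetCounts {

    // Returns value -> number of occurrences, sorted by value.
    static Map<String, Long> countBy(List<String> values) {
        return values.stream()
                .collect(Collectors.groupingBy(
                        Function.identity(), TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        // Made-up subject values for the documents matching some search.
        List<String> subjects = Arrays.asList("history", "law", "history", "medicine");
        System.out.println(countBy(subjects));
    }
}
```

Doing this once per facet field for every search is the part whose scalability I am unsure about.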

Each document is a few MB, and there will be millions of them. I suppose it 
depends on how I index them. I am pretty sure my current approach of using 
Hibernate to load all rows, constructing Solr POJO's from them, and then 
passing the POJO's to the embedded server would lead to an OOM error. I should 
probably look into the other options.

Thanks.


-----Original Message-----
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: Tuesday, March 16, 2010 3:58 PM
To: solr-user@lucene.apache.org
Subject: Re: Moving From Oracle Text Search To Solr

Why do you think you'd hit OOM errors? How big is "very large"? I've
indexed, as a single document, a 26 volume encyclopedia of civil war
records..

Although as much as I like the technology, if I could get away without using
two technologies, I would. Are you completely sure you can't get what you
want with clever Oracle querying?

Best
Erick

On Tue, Mar 16, 2010 at 3:20 PM, Neil Chaudhuri <
nchaudh...@potomacfusion.com> wrote:

> I am working on an application that currently hits a database containing
> millions of very large documents. I use Oracle Text Search at the moment,
> and things work fine. However, there is a request for faceting capability,
> and Solr seems like a technology I should look at. Suffice to say I am new
> to Solr, but at the moment I see two approaches-each with drawbacks:
>
>
> 1)  Have Solr index document metadata (id, subject, date). Then Use
> Oracle Text to do a content search based on criteria. Finally, query the
> Solr index for all documents whose id's match the set of id's returned by
> Oracle Text. That strikes me as an unmanageable Boolean query.  (e.g.
> id:4ORid:33432323OR...).
>
> 2)  Remove Oracle Text from the equation and use Solr to query document
> content based on search criteria. The indexing process though will almost
> certainly encounter an OutOfMemoryError given the number and size of
> documents.
>
>
>
> I am using the embedded server and Solr Java APIs to do the indexing and
> querying.
>
>
>
> I would welcome your thoughts on the best way to approach this situation.
> Please let me know if I should provide additional information.
>
>
>
> Thanks.
>


Re: Moving From Oracle Text Search To Solr

2010-03-16 Thread Erick Erickson
Why do you think you'd hit OOM errors? How big is "very large"? I've
indexed, as a single document, a 26 volume encyclopedia of civil war
records.

Although as much as I like the technology, if I could get away without using
two technologies, I would. Are you completely sure you can't get what you
want with clever Oracle querying?

Best
Erick

On Tue, Mar 16, 2010 at 3:20 PM, Neil Chaudhuri <
nchaudh...@potomacfusion.com> wrote:

> I am working on an application that currently hits a database containing
> millions of very large documents. I use Oracle Text Search at the moment,
> and things work fine. However, there is a request for faceting capability,
> and Solr seems like a technology I should look at. Suffice to say I am new
> to Solr, but at the moment I see two approaches-each with drawbacks:
>
>
> 1)  Have Solr index document metadata (id, subject, date). Then Use
> Oracle Text to do a content search based on criteria. Finally, query the
> Solr index for all documents whose id's match the set of id's returned by
> Oracle Text. That strikes me as an unmanageable Boolean query.  (e.g.
> id:4ORid:33432323OR...).
>
> 2)  Remove Oracle Text from the equation and use Solr to query document
> content based on search criteria. The indexing process though will almost
> certainly encounter an OutOfMemoryError given the number and size of
> documents.
>
>
>
> I am using the embedded server and Solr Java APIs to do the indexing and
> querying.
>
>
>
> I would welcome your thoughts on the best way to approach this situation.
> Please let me know if I should provide additional information.
>
>
>
> Thanks.
>


Moving From Oracle Text Search To Solr

2010-03-16 Thread Neil Chaudhuri
I am working on an application that currently hits a database containing 
millions of very large documents. I use Oracle Text Search at the moment, and 
things work fine. However, there is a request for faceting capability, and Solr 
seems like a technology I should look at. Suffice to say I am new to Solr, but 
at the moment I see two approaches, each with drawbacks:


1)  Have Solr index document metadata (id, subject, date). Then use Oracle 
Text to do a content search based on criteria. Finally, query the Solr index 
for all documents whose ids match the set of ids returned by Oracle Text. 
That strikes me as an unmanageable Boolean query (e.g. 
id:4 OR id:33432323 OR ...).

2)  Remove Oracle Text from the equation and use Solr to query document 
content based on search criteria. The indexing process though will almost 
certainly encounter an OutOfMemoryError given the number and size of documents.
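
To show what the Boolean query in approach 1 would involve, here is a rough sketch that turns a list of ids into one or more query strings of the form id:(4 OR 33432323). It splits the list into chunks because, as far as I can tell, Solr caps the number of clauses in a Boolean query (the maxBooleanClauses setting, 1024 by default). IdQueryBuilder and buildQueries are names I made up, and the field name id comes from my metadata list above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.StringJoiner;

// Hypothetical helper for illustration; it only builds query strings and
// does not talk to Solr.
final class IdQueryBuilder {

    // Builds queries such as "id:(4 OR 33432323)", each holding at most
    // `maxClauses` ids, so no single query exceeds the clause limit.
    static List<String> buildQueries(List<Long> ids, int maxClauses) {
        if (maxClauses <= 0) {
            throw new IllegalArgumentException("maxClauses must be positive");
        }
        List<String> queries = new ArrayList<>();
        for (int start = 0; start < ids.size(); start += maxClauses) {
            int end = Math.min(start + maxClauses, ids.size());
            StringJoiner joiner = new StringJoiner(" OR ", "id:(", ")");
            for (Long id : ids.subList(start, end)) {
                joiner.add(String.valueOf(id));
            }
            queries.add(joiner.toString());
        }
        return queries;
    }

    public static void main(String[] args) {
        List<Long> idsFromOracleText = Arrays.asList(4L, 33432323L, 7L);
        for (String query : buildQueries(idsFromOracleText, 2)) {
            System.out.println(query);
        }
    }
}
```

With millions of documents a single content search could return enough ids to need many such queries, which is why I call this unmanageable.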



I am using the embedded server and Solr Java APIs to do the indexing and 
querying.



I would welcome your thoughts on the best way to approach this situation. 
Please let me know if I should provide additional information.



Thanks.