Out of Memory

2010-03-23 Thread Neil Chaudhuri
I am using the DataImportHandler to index literally millions of documents in an 
Oracle database. Not surprisingly, I got the following after a few hours:

java.sql.SQLException: ORA-04030: out of process memory when trying to allocate 
4032 bytes (kolaGetRfcHeap,kghsseg: kolaslCreateCtx)

Has anyone come across this? What are the ways around this, if any?

Thanks.
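[Editorial note: ORA-04030 is the Oracle server process running out of memory, not the Solr JVM, so the first stop is usually the DBA (PGA/process memory limits). On the Solr side, though, lowering the JDBC fetch size in db-data-config.xml keeps DIH from dragging large row buffers through the session. A hedged sketch; the driver URL, credentials, and batchSize value are placeholders, not from this thread:]

```xml
<!-- Sketch of a DIH dataSource tuned to stream rows in small fetches.
     batchSize maps to the JDBC fetch size in DIH's JdbcDataSource;
     URL and credentials below are placeholders. -->
<dataSource type="JdbcDataSource"
            driver="oracle.jdbc.OracleDriver"
            url="jdbc:oracle:thin:@//dbhost:1521/ORCL"
            user="solr_user"
            password="..."
            batchSize="500" />
```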


XPath Processing Applied to Clob

2010-03-17 Thread Neil Chaudhuri
I am using the DataImportHandler to index 3 fields in a table: an id, a date, 
and the text of a document. This is an Oracle database, and the document is an 
XML document stored as Oracle's xmltype data type. Since this is nothing more 
than a fancy CLOB, I am using the ClobTransformer to extract the actual XML. 
However, I don't want to index/store all the XML but instead just the XML 
within a set of tags. The XPath itself is trivial, but it seems like the 
XPathEntityProcessor only works for XML file content rather than the output of 
a Transformer.

Here is what I currently have that fails:


<document>
  <entity name="doc"
          query="SELECT d.EFFECTIVE_DT, d.ARCHIVE_ID, d.XML.getClobVal() AS TEXT FROM DOC d"
          transformer="ClobTransformer">
    <field column="EFFECTIVE_DT" name="effectiveDate" />
    <field column="ARCHIVE_ID" name="id" />
    <field column="TEXT" name="text" clob="true" />
    <entity name="text" processor="XPathEntityProcessor"
            forEach="/MESSAGE" url="${doc.text}">
      <field column="body" xpath="//BODY" />
    </entity>
  </entity>
</document>


Is there an easy way to do this without writing my own custom transformer?

Thanks.


RE: XPath Processing Applied to Clob

2010-03-17 Thread Neil Chaudhuri
Incidentally, I tried adding this:

<dataSource name="f" type="FieldReaderDataSource" />
<document>
  <entity dataSource="f" processor="XPathEntityProcessor"
          dataField="d.text" forEach="/MESSAGE">
    <field column="body" xpath="//BODY" />
  </entity>
</document>

But this didn't seem to change anything.

Any insight is appreciated.

Thanks.
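[Editorial note: the usual missing pieces with FieldReaderDataSource are that the XPath sub-entity has to be nested inside the SQL entity, and dataField has to reference the parent entity's field as DIH names it (entityName.fieldName). A sketch combining the two configs from this thread; whether "doc.text" resolves as written depends on the DIH version, so treat the details as assumptions:]

```xml
<!-- Sketch: read the ClobTransformer output of the parent "doc" entity
     through FieldReaderDataSource. dataField="doc.text" refers to the
     parent entity's "text" field, not the SQL column alias. -->
<dataSource name="f" type="FieldReaderDataSource" />
<document>
  <entity name="doc"
          query="SELECT d.EFFECTIVE_DT, d.ARCHIVE_ID, d.XML.getClobVal() AS TEXT FROM DOC d"
          transformer="ClobTransformer">
    <field column="EFFECTIVE_DT" name="effectiveDate" />
    <field column="ARCHIVE_ID" name="id" />
    <field column="TEXT" name="text" clob="true" />
    <entity name="body" dataSource="f" processor="XPathEntityProcessor"
            dataField="doc.text" forEach="/MESSAGE">
      <field column="body" xpath="//BODY" />
    </entity>
  </entity>
</document>
```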





Moving From Oracle Text Search To Solr

2010-03-16 Thread Neil Chaudhuri
I am working on an application that currently hits a database containing 
millions of very large documents. I use Oracle Text Search at the moment, and 
things work fine. However, there is a request for faceting capability, and Solr 
seems like a technology I should look at. Suffice to say I am new to Solr, but 
at the moment I see two approaches, each with drawbacks:


1)  Have Solr index document metadata (id, subject, date). Then use Oracle 
Text to do a content search based on criteria. Finally, query the Solr index 
for all documents whose id's match the set of id's returned by Oracle Text. 
That strikes me as an unmanageable Boolean query (e.g. id:4 OR id:33432323 OR ...).

2)  Remove Oracle Text from the equation and use Solr to query document 
content based on search criteria. The indexing process though will almost 
certainly encounter an OutOfMemoryError given the number and size of documents.
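[Editorial note: if approach 1 were pursued anyway, the giant Boolean query could at least be split into batches of IDs OR'd inside one field clause, keeping each query under Solr's maxBooleanClauses limit. A sketch only; the field name "id" and the batch size are taken from the example above, and real code would run each returned string as a separate Solr query and merge the hits:]

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: turn a large ID set from Oracle Text into several smaller
// Solr query strings instead of one enormous Boolean query.
public class IdQueryBatcher {

    // Build one query string per batch of IDs, e.g. "id:(4 OR 33432323)".
    public static List<String> batchedIdQueries(List<String> ids, int batchSize) {
        List<String> queries = new ArrayList<>();
        for (int i = 0; i < ids.size(); i += batchSize) {
            List<String> chunk = ids.subList(i, Math.min(i + batchSize, ids.size()));
            queries.add("id:(" + String.join(" OR ", chunk) + ")");
        }
        return queries;
    }

    public static void main(String[] args) {
        for (String q : batchedIdQueries(List.of("4", "33432323", "7"), 2)) {
            System.out.println(q);
        }
    }
}
```

Even batched, unioning result sets client-side stays clumsy for millions of IDs, which is part of why the all-Solr route (approach 2) tends to scale better.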



I am using the embedded server and Solr Java APIs to do the indexing and 
querying.
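[Editorial note: on the OOM worry with approach 2, the usual remedy is to never hold the whole table in memory: page rows out of the database and flush each page to the embedded server before fetching the next. A sketch under stated assumptions; fetchPage and indexBatch are hypothetical stand-ins for Hibernate paging and EmbeddedSolrServer.addBeans(...) plus a periodic commit:]

```java
import java.util.List;
import java.util.function.BiFunction;
import java.util.function.Consumer;

// Sketch of batched indexing: only `pageSize` rows/POJOs are ever held
// in memory at once. fetchPage(offset, pageSize) returns an empty list
// when the table is exhausted; indexBatch would call addBeans + commit.
public class BatchedIndexer {

    public static <T> int indexAll(BiFunction<Integer, Integer, List<T>> fetchPage,
                                   Consumer<List<T>> indexBatch,
                                   int pageSize) {
        int offset = 0;
        int total = 0;
        while (true) {
            List<T> page = fetchPage.apply(offset, pageSize);
            if (page.isEmpty()) {
                break;                 // no more rows
            }
            indexBatch.accept(page);   // e.g. server.addBeans(page); server.commit();
            total += page.size();
            offset += pageSize;
        }
        return total;
    }

    public static void main(String[] args) {
        List<Integer> rows = List.of(1, 2, 3, 4, 5);
        int indexed = indexAll(
                (off, size) -> rows.subList(Math.min(off, rows.size()),
                                            Math.min(off + size, rows.size())),
                batch -> System.out.println("indexing batch " + batch),
                2);
        System.out.println("indexed " + indexed + " rows");
    }
}
```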



I would welcome your thoughts on the best way to approach this situation. 
Please let me know if I should provide additional information.



Thanks.


RE: Moving From Oracle Text Search To Solr

2010-03-16 Thread Neil Chaudhuri
That is a great article, David. 

For the moment, I am trying an all-Solr approach, but I have run into a small 
problem. The documents are stored as XML CLOB's using Oracle's OPAQUE object. 
Is there any facility to unpack this into the actual text? Or must I execute 
that in the SQL query?

Thanks.


-Original Message-
From: Smiley, David W. [mailto:dsmi...@mitre.org] 
Sent: Tuesday, March 16, 2010 4:45 PM
To: solr-user@lucene.apache.org
Subject: Re: Moving From Oracle Text Search To Solr

If you do stay with Oracle, please report back to the list how that went.  In 
order to get decent filtering and faceting performance, I believe you will need 
to use bitmapped indexes which Oracle and some other databases support.

You may want to check out my article on this subject: 
http://www.packtpub.com/article/text-search-your-database-or-solr

~ David Smiley
Author: http://www.packtpub.com/solr-1-4-enterprise-search-server/


On Mar 16, 2010, at 4:13 PM, Neil Chaudhuri wrote:

 Certainly I could use some basic SQL count(*) queries to achieve faceted 
 results, but I am not sure of the flexibility, extensibility, or scalability 
 of that approach. And from what I have read, Oracle Text doesn't do faceting 
 out of the box.
 
 Each document is a few MB, and there will be millions of them. I suppose it 
 depends on how I index them. I am pretty sure my current approach of using 
 Hibernate to load all rows, constructing Solr POJO's from them, and then 
 passing the POJO's to the embedded server would lead to an OOM error. I should 
 probably look into the other options.
 
 Thanks.
 
 
 -Original Message-
 From: Erick Erickson [mailto:erickerick...@gmail.com] 
 Sent: Tuesday, March 16, 2010 3:58 PM
 To: solr-user@lucene.apache.org
 Subject: Re: Moving From Oracle Text Search To Solr
 
 Why do you think you'd hit OOM errors? How big is very large? I've
 indexed, as a single document, a 26 volume encyclopedia of civil war
 records.
 
 Although as much as I like the technology, if I could get away without using
 two technologies, I would. Are you completely sure you can't get what you
 want with clever Oracle querying?
 
 Best
 Erick
 






Indexing CLOB Column in Oracle

2010-03-16 Thread Neil Chaudhuri
Since my original thread was straying to a new topic, I thought it made sense 
to create a new thread of discussion.

I am using the DataImportHandler to index 3 fields in a table: an id, a date, 
and the text of a document. This is an Oracle database, and the document is an 
XML document stored as Oracle's xmltype data type, which is an instance of 
oracle.sql.OPAQUE. Still, it is nothing more than a fancy clob.

So in my db-data-config, I have the following:

<document>
  <entity name="doc" query="SELECT d.EFFECTIVE_DT, d.ARCHIVE_ID FROM DOC d">
    <field column="EFFECTIVE_DT" name="effectiveDate" />
    <field column="ARCHIVE_ID" name="id" />
    <entity name="text"
            query="SELECT d.XML FROM DOC d WHERE d.ARCHIVE_ID = '${doc.ARCHIVE_ID}'"
            transformer="ClobTransformer">
      <field column="XML" name="text" clob="true" sourceColName="XML" />
    </entity>
  </entity>
</document>

Meanwhile, I have this in schema.xml:

<field name="text" type="text_ws" indexed="true" stored="true"
       multiValued="true" omitNorms="false" termVectors="true" />

However, when I take a look at my indexes with Luke, I find that the items 
labeled text simply say oracle.sql.OPAQUE and a bunch of numbers; in other 
words, the result of OPAQUE.toString().

Can you give me some insight into where I am going wrong?

Thanks.
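[Editorial note: judging from the XPath thread above, the likely culprit is that the raw d.XML column reaches the JDBC driver as oracle.sql.OPAQUE, so ClobTransformer never sees a Clob and falls back to toString(). Selecting d.XML.getClobVal() in the SQL, as the earlier thread does, should make the driver return a genuine CLOB. A sketch of the inner entity under that assumption:]

```xml
<!-- Sketch: unwrap the xmltype in SQL so the driver returns a real CLOB
     that ClobTransformer can convert to a string. -->
<entity name="text"
        query="SELECT d.XML.getClobVal() AS XML FROM DOC d WHERE d.ARCHIVE_ID = '${doc.ARCHIVE_ID}'"
        transformer="ClobTransformer">
  <field column="XML" name="text" clob="true" />
</entity>
```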