Out of Memory
I am using the DataImportHandler to index literally millions of documents in an Oracle database. Not surprisingly, I got the following after a few hours:

  java.sql.SQLException: ORA-04030: out of process memory when trying to allocate 4032 bytes (kolaGetRfcHeap,kghsseg: kolaslCreateCtx)

Has anyone come across this? What are the ways around this, if any? Thanks.
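ORA-04030 is raised by the Oracle server process (PGA memory) rather than by the Solr JVM, so the fix may ultimately be on the database side, but one DIH knob worth ruling out first is the JDBC fetch size: the batchSize attribute on JdbcDataSource is passed to the driver as the fetch size and limits how many rows are buffered per round trip. A minimal sketch, with connection details as placeholders:

  <dataConfig>
    <!-- Sketch only: driver URL, user, and password are placeholders.
         batchSize is handed to the JDBC driver as the fetch size. -->
    <dataSource type="JdbcDataSource"
                driver="oracle.jdbc.OracleDriver"
                url="jdbc:oracle:thin:@//dbhost:1521/SERVICE"
                user="solr" password="..."
                batchSize="500" />
    <!-- document/entity definitions unchanged -->
  </dataConfig>

If the error persists with a small fetch size, it is more likely the per-row CLOB/XMLType conversion accumulating memory in the Oracle session than anything Solr is holding on to.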
XPath Processing Applied to Clob
I am using the DataImportHandler to index 3 fields in a table: an id, a date, and the text of a document. This is an Oracle database, and the document is an XML document stored as Oracle's xmltype data type. Since this is nothing more than a fancy CLOB, I am using the ClobTransformer to extract the actual XML. However, I don't want to index/store all the XML but instead just the XML within a set of tags. The XPath itself is trivial, but it seems like the XPathEntityProcessor only works on XML file content rather than on the output of a Transformer. Here is what I currently have that fails:

  <document>
    <entity name="doc"
            query="SELECT d.EFFECTIVE_DT, d.ARCHIVE_ID, d.XML.getClobVal() AS TEXT FROM DOC d"
            transformer="ClobTransformer">
      <field column="EFFECTIVE_DT" name="effectiveDate" />
      <field column="ARCHIVE_ID" name="id" />
      <field column="TEXT" name="text" clob="true" />
      <entity name="text"
              processor="XPathEntityProcessor"
              forEach="/MESSAGE"
              url="${doc.text}">
        <field column="body" xpath="//BODY" />
      </entity>
    </entity>
  </document>

Is there an easy way to do this without writing my own custom transformer? Thanks.
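One alternative worth considering, assuming your Oracle version still supports the EXTRACT SQL function on XMLTYPE, is to do the XPath on the database side so DIH only ever sees the extracted CLOB and no XPathEntityProcessor child entity is needed. A sketch; the BODY_TEXT alias is an assumption, not from the original post, and '//BODY/text()' could be used instead if only the text content (without the BODY tags) is wanted:

  <document>
    <entity name="doc"
            transformer="ClobTransformer"
            query="SELECT d.EFFECTIVE_DT, d.ARCHIVE_ID, EXTRACT(d.XML, '//BODY').getClobVal() AS BODY_TEXT FROM DOC d">
      <field column="EFFECTIVE_DT" name="effectiveDate" />
      <field column="ARCHIVE_ID" name="id" />
      <!-- BODY_TEXT already contains only the extracted fragment. -->
      <field column="BODY_TEXT" name="text" clob="true" />
    </entity>
  </document>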
RE: XPath Processing Applied to Clob
Incidentally, I tried adding this:

  <datasource name="f" type="FieldReaderDataSource" />
  <document>
    <entity dataSource="f"
            processor="XPathEntityProcessor"
            dataField="d.text"
            forEach="/MESSAGE">
      <field column="body" xpath="//BODY" />
    </entity>
  </document>

But this didn't seem to change anything. Any insight is appreciated. Thanks.
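For comparison, the pattern the DataImportHandler documentation describes for running XPathEntityProcessor over a database column nests the XPath entity inside the SQL entity that produces the text, and dataField names the parent entity and its field (e.g. doc.text) rather than the SQL table alias; the url="${doc.text}" form of the first attempt presumably fails because url expects a location, not the content itself. A sketch along those lines, reusing the names from the earlier config; whether dataField should reference the raw column (doc.TEXT) or the transformed field (doc.text) may need experimenting:

  <dataSource name="f" type="FieldReaderDataSource" />
  <document>
    <entity name="doc"
            query="SELECT d.EFFECTIVE_DT, d.ARCHIVE_ID, d.XML.getClobVal() AS TEXT FROM DOC d"
            transformer="ClobTransformer">
      <field column="EFFECTIVE_DT" name="effectiveDate" />
      <field column="ARCHIVE_ID" name="id" />
      <field column="TEXT" name="text" clob="true" />
      <!-- Child entity streams the parent's text field through the XPath processor. -->
      <entity name="body"
              dataSource="f"
              processor="XPathEntityProcessor"
              dataField="doc.text"
              forEach="/MESSAGE">
        <field column="body" xpath="//BODY" />
      </entity>
    </entity>
  </document>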
Moving From Oracle Text Search To Solr
I am working on an application that currently hits a database containing millions of very large documents. I use Oracle Text Search at the moment, and things work fine. However, there is a request for faceting capability, and Solr seems like a technology I should look at. Suffice to say I am new to Solr, but at the moment I see two approaches, each with drawbacks:

1) Have Solr index document metadata (id, subject, date). Then use Oracle Text to do a content search based on criteria. Finally, query the Solr index for all documents whose ids match the set of ids returned by Oracle Text. That strikes me as an unmanageable Boolean query (e.g. id:4 OR id:33432323 OR ...).

2) Remove Oracle Text from the equation and use Solr to query document content based on search criteria. The indexing process, though, will almost certainly encounter an OutOfMemoryError given the number and size of documents.

I am using the embedded server and the Solr Java APIs to do the indexing and querying. I would welcome your thoughts on the best way to approach this situation. Please let me know if I should provide additional information. Thanks.
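If option 2 pans out, the Solr side only needs the metadata fields plus the document text, and faceting comes for free on any indexed, non-tokenized field. A hedged schema.xml sketch; the field and type names here are illustrative and assume the stock Solr 1.4 example schema:

  <!-- Facet-friendly metadata: string/date types are not tokenized. -->
  <field name="id"            type="string" indexed="true" stored="true" />
  <field name="subject"       type="string" indexed="true" stored="true" />
  <field name="effectiveDate" type="date"   indexed="true" stored="true" />
  <!-- Full text: indexed for search; stored="false" keeps the index smaller
       when the documents are a few MB each. -->
  <field name="text"          type="text"   indexed="true" stored="false" />

Faceting would then just be a query parameter (facet=true&facet.field=subject), with no extra indexing work beyond what the content search already requires.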
RE: Moving From Oracle Text Search To Solr
That is a great article, David. For the moment, I am trying an all-Solr approach, but I have run into a small problem. The documents are stored as XML CLOBs using Oracle's OPAQUE object. Is there any facility to unpack this into the actual text? Or must I do that in the SQL query? Thanks.

-----Original Message-----
From: Smiley, David W. [mailto:dsmi...@mitre.org]
Sent: Tuesday, March 16, 2010 4:45 PM
To: solr-user@lucene.apache.org
Subject: Re: Moving From Oracle Text Search To Solr

If you do stay with Oracle, please report back to the list how that went. In order to get decent filtering and faceting performance, I believe you will need to use bitmapped indexes, which Oracle and some other databases support. You may want to check out my article on this subject: http://www.packtpub.com/article/text-search-your-database-or-solr
~ David Smiley
Author: http://www.packtpub.com/solr-1-4-enterprise-search-server/

On Mar 16, 2010, at 4:13 PM, Neil Chaudhuri wrote:

Certainly I could use some basic SQL count(*) queries to achieve faceted results, but I am not sure of the flexibility, extensibility, or scalability of that approach. And from what I have read, Oracle Text doesn't do faceting out of the box. Each document is a few MB, and there will be millions of them. I suppose it depends on how I index them. I am pretty sure my current approach of using Hibernate to load all rows, constructing Solr POJOs from them, and then passing the POJOs to the embedded server would lead to an OOM error. I should probably look into the other options. Thanks.

-----Original Message-----
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Tuesday, March 16, 2010 3:58 PM
To: solr-user@lucene.apache.org
Subject: Re: Moving From Oracle Text Search To Solr

Why do you think you'd hit OOM errors? How big is "very large"? I've indexed, as a single document, a 26-volume encyclopedia of civil war records. Although as much as I like the technology, if I could get away without using two technologies, I would. Are you completely sure you can't get what you want with clever Oracle querying?
Best
Erick
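For what it's worth, the pattern used in the later DataImportHandler messages in this archive is to do the unpacking in the SQL itself with XMLType's getClobVal() and then let ClobTransformer turn the resulting CLOB into a string; a minimal sketch, with table and column names assumed to match the rest of the thread:

  <entity name="doc"
          transformer="ClobTransformer"
          query="SELECT d.ARCHIVE_ID, d.XML.getClobVal() AS TEXT FROM DOC d">
    <field column="ARCHIVE_ID" name="id" />
    <!-- clob="true" tells ClobTransformer to read the CLOB into a String. -->
    <field column="TEXT" name="text" clob="true" />
  </entity>

The same conversion would apply outside DIH as well: the driver hands back an oracle.sql.OPAQUE for a raw XMLTYPE column, so converting to a CLOB (or VARCHAR2) in the query is the simplest way to get plain text out.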
Indexing CLOB Column in Oracle
Since my original thread was straying to a new topic, I thought it made sense to create a new thread of discussion. I am using the DataImportHandler to index 3 fields in a table: an id, a date, and the text of a document. This is an Oracle database, and the document is an XML document stored as Oracle's xmltype data type, which is an instance of oracle.sql.OPAQUE. Still, it is nothing more than a fancy clob. So in my db-data-config, I have the following:

  <document>
    <entity name="doc" query="SELECT d.EFFECTIVE_DT, d.ARCHIVE_ID FROM DOC d">
      <field column="EFFECTIVE_DT" name="effectiveDate" />
      <field column="ARCHIVE_ID" name="id" />
      <entity name="text"
              query="SELECT d.XML FROM DOC d WHERE d.ARCHIVE_ID = '${doc.ARCHIVE_ID}'"
              transformer="ClobTransformer">
        <field column="XML" name="text" clob="true" sourceColName="XML" />
      </entity>
    </entity>
  </document>

Meanwhile, I have this in schema.xml:

  <field name="text" type="text_ws" indexed="true" stored="true" multiValued="true" omitNorms="false" termVectors="true" />

However, when I take a look at my indexes with Luke, I find that the items labeled "text" simply say oracle.sql.OPAQUE and a bunch of numbers; in other words, the output of OPAQUE.toString(). Can you give me some insight into where I am going wrong? Thanks.
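One thing that stands out: the child entity selects the raw XMLTYPE column, so the JDBC driver hands DIH an oracle.sql.OPAQUE rather than a java.sql.Clob, ClobTransformer has nothing to convert, and what ends up indexed is just the object's toString(). A sketch of the child entity rewritten to convert to a CLOB on the Oracle side first, as in the earlier XPath message; the XML_TEXT alias is an assumption:

  <entity name="text"
          query="SELECT d.XML.getClobVal() AS XML_TEXT FROM DOC d WHERE d.ARCHIVE_ID = '${doc.ARCHIVE_ID}'"
          transformer="ClobTransformer">
    <!-- Now the column really is a CLOB, so clob="true" can do its work. -->
    <field column="XML_TEXT" name="text" clob="true" />
  </entity>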