Re: Problem with suggest search

2010-03-16 Thread David Rühr

Thank you.

This works well as a workaround. Yesterday I got the tip to look for a broken
solrconfig.xml, and that was right.

When uploading our files, the solrconfig.xml was LOST ;-)

Is it possible to start Java in debug mode to get more information?

David

On 16.03.2010 02:02, Tom Hill wrote:

You need a query string with the standard request handler. (dismax has
q.alt)

Try q=*:*, if you are trying to get facets for all documents.

And yes, a friendlier error message would be a good thing.

Tom

On Mon, Mar 15, 2010 at 9:03 AM, David Rühr d...@marketing-factory.de wrote:

Hi List.

We have two Servers dev and live.
Dev is not our problem, but on live we see the following error with the
facet.prefix parameter - when there is no q param - for the suggest search:

HTTP Status 500 - null java.lang.NullPointerException at
java.io.StringReader.&lt;init&gt;(StringReader.java:54) at
org.apache.lucene.queryParser.QueryParser.parse(QueryParser.java:197) at
org.apache.solr.search.LuceneQParser.parse(LuceneQParserPlugin.java:78) at
org.apache.solr.search.QParser.getQuery(QParser.java:137) at
org.apache.solr.handler.component.QueryComponent.prepare(QueryComponent.java:85)
at
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:174)
at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1313) at
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:341)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:244)
at
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
at
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
at
org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
at
org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
at
org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128)
at
org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
at
org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
at
org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:293)
at
org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:849)
at
org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:583)
at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:454)
at java.lang.Thread.run(Thread.java:811)

The query looks like:
facet=on&facet.mincount=1&facet.limit=10&json.nl=map&wt=json&rows=0&version=1.2&omitHeader=true&fl=content&start=0&q=&facet.prefix=mate&facet.field=content&fq=group:0+OR+group:-2+OR+group:1+OR+group:11+-group:-1&fq=language:0
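
Per Tom's reply above, the fix amounts to sending a match-all query when no user
query is wanted; only the q parameter changes, e.g.:

  q=*:*&facet.prefix=mate&facet.field=content

with the remaining parameters as in the request above.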

When we add the q param, e.g. q=material, there is no error.
Has anyone seen the same error, or can anyone help?

Thanks to all.
David



Kind regards,

David Rühr
PHP Programmierer

--
Marketing Factory Consulting GmbH*   mailto:d...@marketing-factory.de
Stephanienstraße 36  *  Tel.: +49 211-361176-58
D-40211 Düsseldorf, Germany  *  Fax:  +49 211-361176-99
Amtsgericht Düsseldorf HRB 53971  *  http://www.marketing-factory.de/

Geschäftsführer: Peter Faisst  |  Katja Faisst  |  Karoline Steinfatt  |
Christoph Allefeld  |  Markus M. Kimmel



Re: AutoSuggest

2010-03-16 Thread Suram



Shalin Shekhar Mangar wrote:
 
 On Sat, Mar 13, 2010 at 9:30 AM, Suram reactive...@yahoo.com wrote:
 

 Erick Erickson wrote:
 
  Did you commit your changes?
 
  Erick
 
  On Fri, Mar 12, 2010 at 7:38 AM, Suram reactive...@yahoo.com wrote:
 
 
  I can set my index fields for auto-suggestion, but sometimes a newly indexed
  field is not found for auto-suggestion and index search.
  --
  View this message in context:
  http://old.nabble.com/AutoSuggest-tp27874542p27874542.html
  Sent from the Solr - User mailing list archive at Nabble.com.
 
 
 
 

 Yes, obviously I committed the changes, but it won't suggest.


 How are you trying to do the auto-suggest? Paste your field's and type's
 schema definition as well as the Solr URL you are hitting.
 
 -- 
 Regards,
 Shalin Shekhar Mangar.
 
 

Hi shalin,

 here attached my schema.xml 
http://old.nabble.com/file/p27916777/schema.xml schema.xml 

The query I am hitting is
http://localhost:8080/solr/core0/terms?terms.fl=name&terms.prefix=b&omitHeader=true
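
For reference, the /terms URL above assumes a TermsComponent handler is
registered in solrconfig.xml; a minimal Solr 1.4-style sketch (names chosen to
match the URL, adjust to your config):

  <searchComponent name="terms" class="solr.TermsComponent"/>

  <requestHandler name="/terms" class="solr.SearchHandler">
    <lst name="defaults">
      <bool name="terms">true</bool>
    </lst>
    <arr name="components">
      <str>terms</str>
    </arr>
  </requestHandler>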

-- 
View this message in context: 
http://old.nabble.com/AutoSuggest-tp27874542p27916777.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: How to get Term Positions?

2010-03-16 Thread Grant Ingersoll
If you're going to spend time mucking w/ TermPositions, you should just spend
your time working with SpanQuery, as that is what I understand you to be asking
about.  AIUI, you want to be able to get at the positions in the document where
the query matched.  This is exactly what a SpanQuery and its derivatives do.
They do all the work that you would otherwise have to do yourself with the
TermPositions class.
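
As a rough illustration of the SpanQuery route, a minimal Lucene 3.0-style
sketch (the field and term are hypothetical; assumes an already-open
IndexReader):

  import java.io.IOException;
  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.index.Term;
  import org.apache.lucene.search.spans.SpanTermQuery;
  import org.apache.lucene.search.spans.Spans;

  public class SpanPositions {
      // Prints every position at which "solr" occurs in the "content" field.
      public static void dump(IndexReader reader) throws IOException {
          SpanTermQuery query = new SpanTermQuery(new Term("content", "solr"));
          Spans spans = query.getSpans(reader);
          while (spans.next()) {
              System.out.println("doc=" + spans.doc()
                  + " start=" + spans.start() + " end=" + spans.end());
          }
      }
  }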


On Mar 12, 2010, at 6:38 PM, MitchK wrote:

 
 Thank you both for your responses.
 
 However, I am not familiar enough with Solr, and even less with Lucene. So, at
 the moment, I have no real idea of what payloads are (I can't even translate
 this word...).
 The manual says something about metadata - but there is nothing said about
 what metadata they mean.
 I think that - given my limited experience with Lucene and Solr - it
 would be a better idea to first read some material like Lucene in Action
 before trying to customize (or contribute to) Lucene/Solr at such a level.
 
 Are the tickets currently being worked on? It seems like there was no more time
 to do so.
 
 Last but not least: I want to add something productive to my question:
 The paper that maybe describes the solution for my problem... 
 
 http://lucene.apache.org/java/3_0_1/fileformats.html#Positions
 
 To quote:
 PositionDelta is, if payloads are disabled for the term's field, the
 difference between the position of the current occurrence in the document
 and the previous occurrence (or zero, if this is the first occurrence in
 this document). 
 
 If I could retrieve the given information, this would be great - even if it
 forces me to iterate over the document where the term occurs. Lucene's
 TermPositions class seems to be a good place to start, doesn't it? What do
 you think? [1]
 
 Integrating some Lucene-based work into Solr is another question... I think one
 needs a map showing which class is usually called by
 which class, but that is really another topic :).
 
 [1]
 http://lucene.apache.org/java/3_0_1/api/all/org/apache/lucene/store/instantiated/InstantiatedTermPositions.html
 
 Thank you!
 - Mitch
 -- 
 View this message in context: 
 http://old.nabble.com/How-to-get-Term-Positions--tp27880551p27884130.html
 Sent from the Solr - User mailing list archive at Nabble.com.
 

--
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem using Solr/Lucene: 
http://www.lucidimagination.com/search



Re: Spatial search in Solr 1.5

2010-03-16 Thread Grant Ingersoll

On Mar 15, 2010, at 11:36 AM, Jean-Sebastien Vachon wrote:

 Hi All,
 
 I'm trying to figure out how to perform spatial searches using Solr 1.5 (from 
 the trunk).
 
 Is the support for spatial search built-in?

Almost.  The main thing missing right now is filtering.  There are still ways to
do spatial filtering, but it isn't complete yet.  In the meantime, range queries
and/or frange might help.
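
For example, a filter along these lines (borrowing the store field and
coordinates from the query below; the bounds are hypothetical) would restrict
results to documents within a distance range:

  fq={!frange l=0 u=5}dist(2, store, vector(34.0232,-81.0664))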

 because none of the patches I tried could be applied to the source tree.
 If this is the case, can someone tell me how to configure it?

http://wiki.apache.org/solr/SpatialSearch has most of the docs, but they aren't 
complete yet.

Here's what I would do:
Check out latest Solr
Build the example: ant clean example
Start the example: cd example; java -jar start.jar
Rebuild the index: cd exampledocs; java -jar post.jar *.xml
Run a query:  http://localhost:8983/solr/select/?q=_val_:recip(dist(2,store,vector(34.0232,-81.0664)),1,1,0)&fl=*,score
// Note, I just updated this; it used to be point instead of vector and that
was wrong.

Next, have a look at the docs in exampledocs and specifically the store field, 
which contains the location.  Then go check out the schema for that field.

HTH,
Grant

--
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem using Solr/Lucene: 
http://www.lucidimagination.com/search



solr.WordDelimiterFilterFactory problem with hyphenated terms?

2010-03-16 Thread Demian Katz
This is my first post on this list -- apologies if this has been discussed 
before; I didn't come upon anything exactly equivalent in searching the 
archives via Google.

I'm using Solr 1.4 as part of the VuFind application, and I just noticed that 
searches for hyphenated terms are failing in strange ways.  I strongly suspect 
it has something to do with the solr.WordDelimiterFilterFactory filter, but I'm 
not exactly sure what.

The problem is that I have a record with the title "Love customs in
eighteenth-century Spain".  Depending on how I search for this, I get successes
or failures in a seemingly unpredictable pattern.

Demonstration queries below were tested using the direct Solr administration 
tool, just to eliminate any VuFind-related factors from the equation while 
debugging.

Queries that work:
title:(Love customs in eighteenth century Spain)        // no hyphen, no phrases
title:("Love customs in eighteenth-century Spain")      // phrase search on whole title, with hyphen

Queries that fail:
title:(Love customs in eighteenth-century Spain)        // hyphen, no phrases
title:("Love customs in eighteenth century Spain")      // phrase search on whole title, without hyphen
title:(Love customs in "eighteenth-century" Spain)      // hyphenated word as phrase
title:(Love customs in "eighteenth century" Spain)      // hyphenated word as phrase, hyphen removed

Here is VuFind's text field type definition:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="1" catenateNumbers="1"
            catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"
            protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    <filter class="schema.UnicodeNormalizationFilterFactory"
            version="icu4j" composed="false" remove_diacritics="true"
            remove_modifiers="true" fold="true"/>
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="1" catenateNumbers="1"
            catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"
            protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    <filter class="schema.UnicodeNormalizationFilterFactory"
            version="icu4j" composed="false" remove_diacritics="true"
            remove_modifiers="true" fold="true"/>
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
  </analyzer>
</fieldType>

I did notice that the text field type in VuFind's schema has
catenateWords and catenateNumbers turned on in both the index and query
analyzer chains.  It is my understanding that these options should be disabled
for the query chain and only enabled for the index chain.  However, this may be
a red herring -- I have already tried changing this setting, and it didn't
change the success/failure pattern described above.  I have also played with
the preserveOriginal setting without apparent effect.
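
For reference, the commonly recommended query-side variant simply turns the
catenate options off (a sketch of just that filter line, keeping the rest of the
chain as above; as noted, changing this alone did not alter the pattern here):

  <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
          generateNumberParts="1" catenateWords="0" catenateNumbers="0"
          catenateAll="0" splitOnCaseChange="1"/>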

From playing with the Field Analysis tool, I notice that there is a gap in the 
term position sequence after analysis...  but I'm not sure if this is 
significant.

Has anybody else run into this sort of problem?  Any ideas on a fix?

thanks,
Demian



DIH request parameters

2010-03-16 Thread Lukas Kahwe Smith
Hi,

According to the wiki it's possible to pass parameters to the DIH:
http://wiki.apache.org/solr/DataImportHandler#Accessing_request_parameters

I assume they are just being replaced via simple string replacement, which is
exactly what I need. Can they also be used in all places, even in attributes (for
example, to pass in the password)?

Furthermore, is there some way to define default values for these request
parameters in case no value is passed in?

regards,
Lukas Kahwe Smith
m...@pooteeweet.org





SQL and $deleteDocById

2010-03-16 Thread Lukas Kahwe Smith
Hi,

I am trying to use $deleteDocById to delete rows based on an SQL query in my
db-data-config.xml. The following tag is a top-level tag inside the document tag.

<entity name="company_del" query="SELECT e.id AS `$deleteDocById` FROM deletedentity AS e"/>

However, it seems like it's only fetching the rows; it's not actually issuing any
index deletes.

regards,
Lukas Kahwe Smith
m...@pooteeweet.org





Re: PDF extraction leads to reversed words

2010-03-16 Thread Abdelhamid ABID
Hi again,
I just came back from trying version 1.5-dev from the Solr trunk.
After applying the patch you provided and adding icu4j-3_8_1 to the classpath,
the results are much better than before.
Words and text are no longer reversed and are displayed correctly, except for
some PDF files whose text parts Solr displays in a strange
manner, especially when Arabic and Latin occur in the same paragraph. I'll
check into this again.



On Tue, Mar 9, 2010 at 4:13 PM, Robert Muir rcm...@gmail.com wrote:

 On Tue, Mar 9, 2010 at 10:10 AM, Abdelhamid  ABID aeh.a...@gmail.com
 wrote:
  neither does the 3.8 version change anything!
 

 the patch (https://issues.apache.org/jira/browse/SOLR-1813) can only
 work on Solr trunk. It will not work with Solr 1.4.


 Solr 1.4 uses pdfbox-0.7.3.jar, which does not support Arabic.
 Solr trunk uses pdfbox-0.8.0-incubating.jar, which does support
 Arabic, if you also put ICU in the classpath.

 --
 Robert Muir
 rcm...@gmail.com




-- 
Abdelhamid ABID
Software Engineer- J2EE / WEB / ESB MULE


Switching data dir on the fly

2010-03-16 Thread schmax

I generate a Solr index on a Hadoop cluster and I want to copy it from HDFS to
a server running Solr.

I want to copy the index to a different disk than the one the Solr
instance is using, then tell the Solr server to switch from the current data
dir to the location of the copied Hadoop-generated index (without
any search service interruption).

Is this possible? Does anyone have a better solution?

Thanks
-- 
View this message in context: 
http://old.nabble.com/Switching-data-dir-on-the-fly-tp27920425p27920425.html
Sent from the Solr - User mailing list archive at Nabble.com.



Stemming suggestions

2010-03-16 Thread blargy

Most of our documents will be in English, but not all, and we are certainly in
the process of acquiring more international content. Does anyone have any
experience using the different stemmers on languages of unknown
origin? Which ones perform best? Which give the most relevant results? What
are the main advantages of each one? I've heard that the KStemmer is a
less aggressive stemmer and is supposed to perform quite well - will it
work for multiple languages?

Any suggestions would be appreciated. Thanks
 
-- 
View this message in context: 
http://old.nabble.com/Stemming-suggestions-tp27920788p27920788.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: LucidWorks Solr

2010-03-16 Thread Kevin Osborn
I used it mostly for KStemmer, but I also liked the fact that it included about
a dozen or so stable patches since Solr 1.4 was released. We just use the
included WAR in our project, however. We don't use the installer or anything
like that.






From: blargy zman...@hotmail.com
To: solr-user@lucene.apache.org
Sent: Tue, March 16, 2010 11:52:17 AM
Subject: LucidWorks Solr


Has anyone used this?:
http://www.lucidimagination.com/Downloads/LucidWorks-for-Solr

Other than the KStemmer and installer what are the other enhancements that
this download offers? Is it worth using over the default Solr installation?

Thanks

-- 
View this message in context: 
http://old.nabble.com/LucidWorks-Solr-tp27922870p27922870.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: LucidWorks Solr

2010-03-16 Thread AJ Chen
I'm trying it out right now. I hope it will work well out of the box for
indexing/searching a set of documents with frequent updates.
-aj

On Tue, Mar 16, 2010 at 11:52 AM, blargy zman...@hotmail.com wrote:


 Has anyone used this?:
 http://www.lucidimagination.com/Downloads/LucidWorks-for-Solr

 Other than the KStemmer and installer what are the other enhancements
 that
 this download offers? Is it worth using over the default Solr installation?

 Thanks

 --
 View this message in context:
 http://old.nabble.com/LucidWorks-Solr-tp27922870p27922870.html
 Sent from the Solr - User mailing list archive at Nabble.com.




-- 
AJ Chen, PhD
Chair, Semantic Web SIG, sdforum.org
http://web2express.org
twitter @web2express
Palo Alto, CA, USA
650-283-4091
*Building social media monitoring pipeline, and connecting social customers
to CRM*


Re: Stemming suggestions

2010-03-16 Thread Erick Erickson
If you search the mail archive, you'll find many discussions of
multilingual indexing/searching that'll provide you a plethora
of information.

But the synopsis, as I remember it, is that using a single stemmer for
multiple languages is generally a bad idea.

Best
Erick

On Tue, Mar 16, 2010 at 12:19 PM, blargy zman...@hotmail.com wrote:


 Most of our documents will be in English but not all and we are certain in
 the process of acquiring more international content. Does anyone have any
 experience using all of the different stemmers for languages of unknown
 origin? Which ones perform the best? Give the most relevant results? What
 are the main advantages of each one? I've heard that the KStemmer is a
 less-aggressive stemmer and it is supposed to perform quite well will it
 work for multi-languages?

 Any suggestions would be appreciated. Thanks

 --
 View this message in context:
 http://old.nabble.com/Stemming-suggestions-tp27920788p27920788.html
 Sent from the Solr - User mailing list archive at Nabble.com.




Moving From Oracle Text Search To Solr

2010-03-16 Thread Neil Chaudhuri
I am working on an application that currently hits a database containing
millions of very large documents. I use Oracle Text Search at the moment, and
things work fine. However, there is a request for faceting capability, and Solr
seems like a technology I should look at. Suffice to say I am new to Solr, but
at the moment I see two approaches, each with drawbacks:


1)  Have Solr index document metadata (id, subject, date). Then use Oracle
Text to do a content search based on criteria. Finally, query the Solr index
for all documents whose ids match the set of ids returned by Oracle Text.
That strikes me as an unmanageable Boolean query (e.g.
id:4 OR id:33432323 OR ...).

2)  Remove Oracle Text from the equation and use Solr to query document
content based on search criteria. The indexing process, though, will almost
certainly encounter an OutOfMemoryError given the number and size of documents.



I am using the embedded server and Solr Java APIs to do the indexing and 
querying.



I would welcome your thoughts on the best way to approach this situation. 
Please let me know if I should provide additional information.



Thanks.


Re: LucidWorks Solr

2010-03-16 Thread blargy

Kevin,

When you say you just included the WAR, you mean /packs/solr.war, correct?
I see that the KStemmer is nicely packed in there, but I don't see LucidGaze
anywhere. Have you had any experience using it?

So I'm guessing you would suggest using the LucidWorks solr.war over the
apache-solr war just because of the various bug fixes/tests.

As a side question: is there a reason you chose the LucidKStemmer over any
other stemmers (KStemmer, Porter, etc.)? I'm unsure of which stemmer would
work best. Thanks again!


Kevin Osborn-2 wrote:
 
 I used it mostly for KStemmer, but I also liked the fact that it included
 about a dozen or so stable patches since Solr 1.4 was released. We just
 use the included WAR in our project however. We don't use the installer or
 anything like that.
 
 
 
 
 
 
 From: blargy zman...@hotmail.com
 To: solr-user@lucene.apache.org
 Sent: Tue, March 16, 2010 11:52:17 AM
 Subject: LucidWorks Solr
 
 
 Has anyone used this?:
 http://www.lucidimagination.com/Downloads/LucidWorks-for-Solr
 
 Other than the KStemmer and installer what are the other enhancements
 that
 this download offers? Is it worth using over the default Solr
 installation?
 
 Thanks
 
 -- 
 View this message in context:
 http://old.nabble.com/LucidWorks-Solr-tp27922870p27922870.html
 Sent from the Solr - User mailing list archive at Nabble.com.
 
 
   
 

-- 
View this message in context: 
http://old.nabble.com/LucidWorks-Solr-tp27922870p27923359.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Moving From Oracle Text Search To Solr

2010-03-16 Thread Erick Erickson
Why do you think you'd hit OOM errors? How big is very large? I've
indexed, as a single document, a 26-volume encyclopedia of civil war
records.

Although as much as I like the technology, if I could get away without using
two technologies, I would. Are you completely sure you can't get what you
want with clever Oracle querying?

Best
Erick

On Tue, Mar 16, 2010 at 3:20 PM, Neil Chaudhuri 
nchaudh...@potomacfusion.com wrote:

 I am working on an application that currently hits a database containing
 millions of very large documents. I use Oracle Text Search at the moment,
 and things work fine. However, there is a request for faceting capability,
 and Solr seems like a technology I should look at. Suffice to say I am new
 to Solr, but at the moment I see two approaches-each with drawbacks:


 1)  Have Solr index document metadata (id, subject, date). Then Use
 Oracle Text to do a content search based on criteria. Finally, query the
 Solr index for all documents whose id's match the set of id's returned by
 Oracle Text. That strikes me as an unmanageable Boolean query.  (e.g.
 id:4 OR id:33432323 OR ...).

 2)  Remove Oracle Text from the equation and use Solr to query document
 content based on search criteria. The indexing process though will almost
 certainly encounter an OutOfMemoryError given the number and size of
 documents.



 I am using the embedded server and Solr Java APIs to do the indexing and
 querying.



 I would welcome your thoughts on the best way to approach this situation.
 Please let me know if I should provide additional information.



 Thanks.



XML data in solr field

2010-03-16 Thread Nair, Manas
Hello Experts,
 
I need help with this issue of mine. I am unsure if this scenario is possible.
I have a field in my Solr document named inputxml, the value of which is an
XML string as below. This XML structure is within the inputxml field value. I
need help with searching this XML structure, i.e. if I search for Venue, I
should get Radio City Music Hall as the result and not the complete tag like
<Venue value="Radio City Music Hall" />. Is this supported in Solr? If it is,
how can this be implemented?
 
<root>
  <Venue value="Radio City Music Hall" />
  <Link value="http://bit.ly/Rndab" />
  <LinkText value="En savoir +" />
  <Address value="New-York, USA" />
</root>

Any help is appreciated. I do not need the tag name in the result; instead I
need the tag value.
 
Thanks in advance,
Manas Nair


Re: LucidWorks Solr

2010-03-16 Thread Kevin Osborn
For my purposes, the Porter analyzer was overly aggressive with stemming, so
we then moved to KStem. It looks like KStem is no longer being maintained, and
Lucid claimed much better performance with theirs, so I gave that a try and it
seems to be working fine. I didn't do any benchmarks, though.

And I just took the WAR in LucidWorks\dist. I think in the install
instructions there was also a script to apply to the included source code; I
did that as well since I look at the source regularly.

I didn't look at LucidGaze or any of the other Lucid features.

-Kevin





From: blargy zman...@hotmail.com
To: solr-user@lucene.apache.org
Sent: Tue, March 16, 2010 12:31:09 PM
Subject: Re: LucidWorks Solr


Kevin,

When you say you just included the war you mean the /packs/solr.war correct?
I see that the KStemmer is nicely packed in there but I don't see LucidGaze
anywhere. Have you had any experience using this? 

So I'm guessing you would suggest using the LucidWorks solr.war over the
apache-solr-war just because of the various bug-fixes/tests. 

As a side question. Is there a reason you choose the LucidKStemmer over any
other stemmers (KStemmer, Porter, etc)? I'm unsure of which stemmer would
work best. Thanks again!


Kevin Osborn-2 wrote:
 
 I used it mostly for KStemmer, but I also liked the fact that it included
 about a dozen or so stable patches since Solr 1.4 was released. We just
 use the included WAR in our project however. We don't use the installer or
 anything like that.
 
 
 
 
 
 
 From: blargy zman...@hotmail.com
 To: solr-user@lucene.apache.org
 Sent: Tue, March 16, 2010 11:52:17 AM
 Subject: LucidWorks Solr
 
 
 Has anyone used this?:
 http://www.lucidimagination.com/Downloads/LucidWorks-for-Solr
 
 Other than the KStemmer and installer what are the other enhancements
 that
 this download offers? Is it worth using over the default Solr
 installation?
 
 Thanks
 
 -- 
 View this message in context:
 http://old.nabble.com/LucidWorks-Solr-tp27922870p27922870.html
 Sent from the Solr - User mailing list archive at Nabble.com.
 
 
  
 

-- 
View this message in context: 
http://old.nabble.com/LucidWorks-Solr-tp27922870p27923359.html
Sent from the Solr - User mailing list archive at Nabble.com.


  

Re: Moving From Oracle Text Search To Solr

2010-03-16 Thread Glen Newton
I've also indexed a concatenation of 50k journal articles (making a
single document of several hundred MB of text) and it did not give me
an OOM.

-glen


On 16 March 2010 15:57, Erick Erickson erickerick...@gmail.com wrote:
 Why do you think you'd hit OOM errors? How big is very large? I've
 indexed, as a single document, a 26 volume encyclopedia of civil war
 records..

 Although as much as I like the technology, if I could get away without using
 two technologies, I would. Are you completely sure you can't get what you
 want with clever Oracle querying?

 Best
 Erick

 On Tue, Mar 16, 2010 at 3:20 PM, Neil Chaudhuri 
 nchaudh...@potomacfusion.com wrote:

 I am working on an application that currently hits a database containing
 millions of very large documents. I use Oracle Text Search at the moment,
 and things work fine. However, there is a request for faceting capability,
 and Solr seems like a technology I should look at. Suffice to say I am new
 to Solr, but at the moment I see two approaches-each with drawbacks:


 1)      Have Solr index document metadata (id, subject, date). Then Use
 Oracle Text to do a content search based on criteria. Finally, query the
 Solr index for all documents whose id's match the set of id's returned by
 Oracle Text. That strikes me as an unmanageable Boolean query.  (e.g.
  id:4 OR id:33432323 OR ...).

 2)      Remove Oracle Text from the equation and use Solr to query document
 content based on search criteria. The indexing process though will almost
 certainly encounter an OutOfMemoryError given the number and size of
 documents.



 I am using the embedded server and Solr Java APIs to do the indexing and
 querying.



 I would welcome your thoughts on the best way to approach this situation.
 Please let me know if I should provide additional information.



 Thanks.





-- 

-


PDFBox/Tika Performance Issues

2010-03-16 Thread Giovanni Fernandez-Kincade
I've been trying to bulk index about 11 million PDFs, and while profiling our 
Solr instance, I noticed that all of the threads that are processing indexing 
requests are constantly blocking each other during this call:

http-8080-Processor39 [BLOCKED] CPU time: 9:35
java.util.Collections$SynchronizedMap.get(Object)
org.pdfbox.pdmodel.font.PDFont.getAFM()
org.pdfbox.pdmodel.font.PDSimpleFont.getFontHeight(byte[], int, int)
org.pdfbox.util.PDFStreamEngine.showString(byte[])
org.pdfbox.util.operator.ShowTextGlyph.process(PDFOperator, List)
org.pdfbox.util.PDFStreamEngine.processOperator(PDFOperator, List)
org.pdfbox.util.PDFStreamEngine.processSubStream(PDPage, PDResources, COSStream)
org.pdfbox.util.PDFStreamEngine.processStream(PDPage, PDResources, COSStream)
org.pdfbox.util.PDFTextStripper.processPage(PDPage, COSStream)
org.pdfbox.util.PDFTextStripper.processPages(List)
org.pdfbox.util.PDFTextStripper.writeText(PDDocument, Writer)
org.pdfbox.util.PDFTextStripper.getText(PDDocument)
org.apache.tika.parser.pdf.PDF2XHTML.process(PDDocument, ContentHandler, 
Metadata)
org.apache.tika.parser.pdf.PDFParser.parse(InputStream, ContentHandler, 
Metadata)
org.apache.tika.parser.CompositeParser.parse(InputStream, ContentHandler, 
Metadata)
org.apache.tika.parser.AutoDetectParser.parse(InputStream, ContentHandler, 
Metadata)
org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(SolrQueryRequest,
 SolrQueryResponse, ContentStream)
org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(SolrQueryRequest,
 SolrQueryResponse)
org.apache.solr.handler.RequestHandlerBase.handleRequest(SolrQueryRequest, 
SolrQueryResponse)
org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(SolrQueryRequest,
 SolrQueryResponse)
org.apache.solr.core.SolrCore.execute(SolrRequestHandler, SolrQueryRequest, 
SolrQueryResponse)
org.apache.solr.servlet.SolrDispatchFilter.execute(HttpServletRequest, 
SolrRequestHandler, SolrQueryRequest, SolrQueryResponse)
org.apache.solr.servlet.SolrDispatchFilter.doFilter(ServletRequest, 
ServletResponse, FilterChain)
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ServletRequest,
 ServletResponse)
org.apache.catalina.core.ApplicationFilterChain.doFilter(ServletRequest, 
ServletResponse)
org.apache.catalina.core.StandardWrapperValve.invoke(Request, Response)
org.apache.catalina.core.StandardContextValve.invoke(Request, Response)
org.apache.catalina.core.StandardHostValve.invoke(Request, Response)
org.apache.catalina.valves.ErrorReportValve.invoke(Request, Response)
org.apache.catalina.core.StandardEngineValve.invoke(Request, Response)
org.apache.catalina.connector.CoyoteAdapter.service(Request, Response)
org.apache.coyote.http11.Http11Processor.process(InputStream, OutputStream)
org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(TcpConnection,
 Object[])
org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(Socket, TcpConnection, 
Object[])
org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(Object[])
org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run()
java.lang.Thread.run()

Has anyone run into this before? Any ideas on how to reduce the contention?

Thanks,
Gio.


Re: Moving From Oracle Text Search To Solr

2010-03-16 Thread Smiley, David W.
If you do stay with Oracle, please report back to the list how that went.  In 
order to get decent filtering and faceting performance, I believe you will need 
to use bitmapped indexes which Oracle and some other databases support.

You may want to check out my article on this subject: 
http://www.packtpub.com/article/text-search-your-database-or-solr

~ David Smiley
Author: http://www.packtpub.com/solr-1-4-enterprise-search-server/


On Mar 16, 2010, at 4:13 PM, Neil Chaudhuri wrote:

 Certainly I could use some basic SQL count(*) queries to achieve faceted 
 results, but I am not sure of the flexibility, extensibility, or scalability 
 of that approach. And from what I have read, Oracle Text doesn't do faceting 
 out of the box.
 
 Each document is a few MB, and there will be millions of them. I suppose it 
 depends on how I index them. I am pretty sure my current approach of using 
 Hibernate to load all rows, constructing Solr POJO's from them, and then 
 passing the POJO's to the embedded server would lead to an OOM error. I should
 probably look into the other options.
 
 Thanks.
 
 
 -Original Message-
 From: Erick Erickson [mailto:erickerick...@gmail.com] 
 Sent: Tuesday, March 16, 2010 3:58 PM
 To: solr-user@lucene.apache.org
 Subject: Re: Moving From Oracle Text Search To Solr
 
 Why do you think you'd hit OOM errors? How big is very large? I've
 indexed, as a single document, a 26 volume encyclopedia of civil war
 records..
 
 Although as much as I like the technology, if I could get away without using
 two technologies, I would. Are you completely sure you can't get what you
 want with clever Oracle querying?
 
 Best
 Erick
 
 On Tue, Mar 16, 2010 at 3:20 PM, Neil Chaudhuri 
 nchaudh...@potomacfusion.com wrote:
 
 I am working on an application that currently hits a database containing
 millions of very large documents. I use Oracle Text Search at the moment,
 and things work fine. However, there is a request for faceting capability,
 and Solr seems like a technology I should look at. Suffice to say I am new
 to Solr, but at the moment I see two approaches-each with drawbacks:
 
 
 1)  Have Solr index document metadata (id, subject, date). Then Use
 Oracle Text to do a content search based on criteria. Finally, query the
 Solr index for all documents whose id's match the set of id's returned by
 Oracle Text. That strikes me as an unmanageable Boolean query.  (e.g.
  id:4 OR id:33432323 OR ...).
 
 2)  Remove Oracle Text from the equation and use Solr to query document
 content based on search criteria. The indexing process though will almost
 certainly encounter an OutOfMemoryError given the number and size of
 documents.
 
 
 
 I am using the embedded server and Solr Java APIs to do the indexing and
 querying.
 
 
 
 I would welcome your thoughts on the best way to approach this situation.
 Please let me know if I should provide additional information.
 
 
 
 Thanks.
 






Re: XML data in solr field

2010-03-16 Thread Tommy Chheng
 Do you have the option of just importing each XML node as a
field/value when you add the document?


That'll let you do the search easily. If you need to store the raw XML, 
you can use an extra field.
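
For example, the record from the question could be flattened at index time so
each attribute becomes its own field, with the raw XML kept in an extra stored
field (the field names here are made up):

  <add>
    <doc>
      <field name="id">event-1</field>
      <field name="venue">Radio City Music Hall</field>
      <field name="link">http://bit.ly/Rndab</field>
      <field name="linktext">En savoir +</field>
      <field name="address">New-York, USA</field>
    </doc>
  </add>

A search on the venue field then returns "Radio City Music Hall" directly.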


Tommy Chheng
Programmer and UC Irvine Graduate Student
Twitter @tommychheng
http://tommy.chheng.com


On 3/16/10 12:59 PM, Nair, Manas wrote:

Hello Experts,

I need help with this issue of mine. I am unsure if this scenario is possible.
I have a field in my Solr document named inputxml, the value of which is an XML
string as below. This XML structure is within the inputxml field value. I need
help with searching this XML structure, i.e. if I search for Venue, I should get
Radio City Music Hall as the result and not the complete tag like
<Venue value="Radio City Music Hall" />. Is this supported in Solr? If it is,
how can this be implemented?

<root>
  <Venue value="Radio City Music Hall" />
  <Link value="http://bit.ly/Rndab" />
  <LinkText value="En savoir +" />
  <Address value="New-York, USA" />
</root>

Any help is appreciated. I do not need the tag name in the result; instead I
need the tag value.

Thanks in advance,
Manas Nair



Solr RAM Requirements

2010-03-16 Thread KaktuChakarabati

Hey,
I am trying to understand what kind of calculation I should do in order to
come up with a reasonable RAM size for a given Solr machine.

Suppose the index size is 16GB,
and the max heap allocated to the JVM is about 12GB.

The machine I'm trying now has 24GB.
When the machine has been running for a while serving production, I can see in
top that the resident memory taken by the JVM is indeed at 12GB.
Now, on top of this, I assume that if I want the whole index to fit in the
disk cache I need about 12GB + 16GB = 28GB of RAM just for that. Is this kind
of calculation correct, or am I off here?

Are there any other recommendations anyone could make w.r.t. these numbers?

Thanks,
-Chak
-- 
View this message in context: 
http://old.nabble.com/Solr-RAM-Requirements-tp27924551p27924551.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: PDFBox/Tika Performance Issues

2010-03-16 Thread Grant Ingersoll
Hmm, that is an ugly thing in PDFBox.  We should probably take this over to the 
PDFBox project.  How many threads are you indexing with?

FWIW, for that many documents, I might consider using Tika on the client side 
to save on a lot of network traffic.
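
A client-side sketch of that idea, using the Tika 0.6 and SolrJ 1.4 APIs (the
Solr URL, file name, and field names are all hypothetical):

  import java.io.FileInputStream;
  import org.apache.tika.metadata.Metadata;
  import org.apache.tika.parser.AutoDetectParser;
  import org.apache.tika.sax.BodyContentHandler;
  import org.apache.solr.client.solrj.SolrServer;
  import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
  import org.apache.solr.common.SolrInputDocument;

  public class ClientSideIndexer {
      public static void main(String[] args) throws Exception {
          // Extract the text locally with Tika (-1 = no write limit)...
          BodyContentHandler handler = new BodyContentHandler(-1);
          new AutoDetectParser().parse(new FileInputStream("doc.pdf"),
                                       handler, new Metadata());

          // ...then post plain fields to Solr instead of the raw PDF bytes.
          SolrServer server = new CommonsHttpSolrServer("http://localhost:8080/solr");
          SolrInputDocument doc = new SolrInputDocument();
          doc.addField("id", "doc1");
          doc.addField("content", handler.toString());
          server.add(doc);
          server.commit();
      }
  }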

-Grant

On Mar 16, 2010, at 4:37 PM, Giovanni Fernandez-Kincade wrote:

 I've been trying to bulk index about 11 million PDFs, and while profiling our 
 Solr instance, I noticed that all of the threads that are processing indexing 
 requests are constantly blocking each other during this call:
 
 http-8080-Processor39 [BLOCKED] CPU time: 9:35
 java.util.Collections$SynchronizedMap.get(Object)
 org.pdfbox.pdmodel.font.PDFont.getAFM()
 org.pdfbox.pdmodel.font.PDSimpleFont.getFontHeight(byte[], int, int)
 org.pdfbox.util.PDFStreamEngine.showString(byte[])
 org.pdfbox.util.operator.ShowTextGlyph.process(PDFOperator, List)
 org.pdfbox.util.PDFStreamEngine.processOperator(PDFOperator, List)
 org.pdfbox.util.PDFStreamEngine.processSubStream(PDPage, PDResources, 
 COSStream)
 org.pdfbox.util.PDFStreamEngine.processStream(PDPage, PDResources, COSStream)
 org.pdfbox.util.PDFTextStripper.processPage(PDPage, COSStream)
 org.pdfbox.util.PDFTextStripper.processPages(List)
 org.pdfbox.util.PDFTextStripper.writeText(PDDocument, Writer)
 org.pdfbox.util.PDFTextStripper.getText(PDDocument)
 org.apache.tika.parser.pdf.PDF2XHTML.process(PDDocument, ContentHandler, 
 Metadata)
 org.apache.tika.parser.pdf.PDFParser.parse(InputStream, ContentHandler, 
 Metadata)
 org.apache.tika.parser.CompositeParser.parse(InputStream, ContentHandler, 
 Metadata)
 org.apache.tika.parser.AutoDetectParser.parse(InputStream, ContentHandler, 
 Metadata)
 org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(SolrQueryRequest,
  SolrQueryResponse, ContentStream)
 org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(SolrQueryRequest,
  SolrQueryResponse)
 org.apache.solr.handler.RequestHandlerBase.handleRequest(SolrQueryRequest, 
 SolrQueryResponse)
 org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(SolrQueryRequest,
  SolrQueryResponse)
 org.apache.solr.core.SolrCore.execute(SolrRequestHandler, SolrQueryRequest, 
 SolrQueryResponse)
 org.apache.solr.servlet.SolrDispatchFilter.execute(HttpServletRequest, 
 SolrRequestHandler, SolrQueryRequest, SolrQueryResponse)
 org.apache.solr.servlet.SolrDispatchFilter.doFilter(ServletRequest, 
 ServletResponse, FilterChain)
 org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ServletRequest,
  ServletResponse)
 org.apache.catalina.core.ApplicationFilterChain.doFilter(ServletRequest, 
 ServletResponse)
 org.apache.catalina.core.StandardWrapperValve.invoke(Request, Response)
 org.apache.catalina.core.StandardContextValve.invoke(Request, Response)
 org.apache.catalina.core.StandardHostValve.invoke(Request, Response)
 org.apache.catalina.valves.ErrorReportValve.invoke(Request, Response)
 org.apache.catalina.core.StandardEngineValve.invoke(Request, Response)
 org.apache.catalina.connector.CoyoteAdapter.service(Request, Response)
 org.apache.coyote.http11.Http11Processor.process(InputStream, OutputStream)
 org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(TcpConnection,
  Object[])
 org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(Socket, 
 TcpConnection, Object[])
 org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(Object[])
 org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run()
 java.lang.Thread.run()
 
 Has anyone run into this before? Any ideas on how to reduce the contention?
 
 Thanks,
 Gio.

--
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem using Solr/Lucene: 
http://www.lucidimagination.com/search



RE: Moving From Oracle Text Search To Solr

2010-03-16 Thread Neil Chaudhuri
That is a great article, David. 

For the moment, I am trying an all-Solr approach, but I have run into a small
problem. The documents are stored as XML CLOBs using Oracle's OPAQUE object.
Is there any facility to unpack these into the actual text? Or must I execute
that in the SQL query?
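
If the CLOBs end up being pulled through DataImportHandler rather than
Hibernate, DIH ships a ClobTransformer that converts a CLOB column into a plain
string; a minimal sketch (table and column names are hypothetical):

  <entity name="doc" transformer="ClobTransformer"
          query="SELECT id, doc_xml FROM documents">
    <field column="doc_xml" clob="true"/>
  </entity>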

Thanks.


-Original Message-
From: Smiley, David W. [mailto:dsmi...@mitre.org] 
Sent: Tuesday, March 16, 2010 4:45 PM
To: solr-user@lucene.apache.org
Subject: Re: Moving From Oracle Text Search To Solr

If you do stay with Oracle, please report back to the list how that went.  In 
order to get decent filtering and faceting performance, I believe you will need 
to use bitmapped indexes which Oracle and some other databases support.

You may want to check out my article on this subject: 
http://www.packtpub.com/article/text-search-your-database-or-solr

~ David Smiley
Author: http://www.packtpub.com/solr-1-4-enterprise-search-server/


On Mar 16, 2010, at 4:13 PM, Neil Chaudhuri wrote:

 Certainly I could use some basic SQL count(*) queries to achieve faceted 
 results, but I am not sure of the flexibility, extensibility, or scalability 
 of that approach. And from what I have read, Oracle Text doesn't do faceting 
 out of the box.
 
 Each document is a few MB, and there will be millions of them. I suppose it 
 depends on how I index them. I am pretty sure my current approach of using 
 Hibernate to load all rows, constructing Solr POJO's from them, and then 
 passing the POJO's to the embedded server would lead to an OOM error. I should
 probably look into the other options.
 
 Thanks.
 
 
 -Original Message-
 From: Erick Erickson [mailto:erickerick...@gmail.com] 
 Sent: Tuesday, March 16, 2010 3:58 PM
 To: solr-user@lucene.apache.org
 Subject: Re: Moving From Oracle Text Search To Solr
 
 Why do you think you'd hit OOM errors? How big is very large? I've
 indexed, as a single document, a 26 volume encyclopedia of civil war
 records..
 
 Although as much as I like the technology, if I could get away without using
 two technologies, I would. Are you completely sure you can't get what you
 want with clever Oracle querying?
 
 Best
 Erick
 
 On Tue, Mar 16, 2010 at 3:20 PM, Neil Chaudhuri 
 nchaudh...@potomacfusion.com wrote:
 
 I am working on an application that currently hits a database containing
 millions of very large documents. I use Oracle Text Search at the moment,
 and things work fine. However, there is a request for faceting capability,
 and Solr seems like a technology I should look at. Suffice to say I am new
 to Solr, but at the moment I see two approaches-each with drawbacks:
 
 
 1)  Have Solr index document metadata (id, subject, date). Then Use
 Oracle Text to do a content search based on criteria. Finally, query the
 Solr index for all documents whose id's match the set of id's returned by
 Oracle Text. That strikes me as an unmanageable Boolean query.  (e.g.
  id:4 OR id:33432323 OR ...).
 
 2)  Remove Oracle Text from the equation and use Solr to query document
 content based on search criteria. The indexing process though will almost
 certainly encounter an OutOfMemoryError given the number and size of
 documents.
 
 
 
 I am using the embedded server and Solr Java APIs to do the indexing and
 querying.
 
 
 
 I would welcome your thoughts on the best way to approach this situation.
 Please let me know if I should provide additional information.
 
 
 
 Thanks.
 






RE: PDFBox/Tika Performance Issues

2010-03-16 Thread Giovanni Fernandez-Kincade
Originally 16 (the number of CPUs on the machine), but even with 5 threads it's 
not looking so hot. 

-Original Message-
From: Grant Ingersoll [mailto:gsi...@gmail.com] On Behalf Of Grant Ingersoll
Sent: Tuesday, March 16, 2010 5:15 PM
To: solr-user@lucene.apache.org
Subject: Re: PDFBox/Tika Performance Issues

Hmm, that is an ugly thing in PDFBox.  We should probably take this over to the 
PDFBox project.  How many threads are you indexing with?

FWIW, for that many documents, I might consider using Tika on the client side 
to save on a lot of network traffic.

-Grant

On Mar 16, 2010, at 4:37 PM, Giovanni Fernandez-Kincade wrote:

 I've been trying to bulk index about 11 million PDFs, and while profiling our 
 Solr instance, I noticed that all of the threads that are processing indexing 
 requests are constantly blocking each other during this call:
 
 http-8080-Processor39 [BLOCKED] CPU time: 9:35
 java.util.Collections$SynchronizedMap.get(Object)
 org.pdfbox.pdmodel.font.PDFont.getAFM()
 org.pdfbox.pdmodel.font.PDSimpleFont.getFontHeight(byte[], int, int)
 org.pdfbox.util.PDFStreamEngine.showString(byte[])
 org.pdfbox.util.operator.ShowTextGlyph.process(PDFOperator, List)
 org.pdfbox.util.PDFStreamEngine.processOperator(PDFOperator, List)
 org.pdfbox.util.PDFStreamEngine.processSubStream(PDPage, PDResources, 
 COSStream)
 org.pdfbox.util.PDFStreamEngine.processStream(PDPage, PDResources, COSStream)
 org.pdfbox.util.PDFTextStripper.processPage(PDPage, COSStream)
 org.pdfbox.util.PDFTextStripper.processPages(List)
 org.pdfbox.util.PDFTextStripper.writeText(PDDocument, Writer)
 org.pdfbox.util.PDFTextStripper.getText(PDDocument)
 org.apache.tika.parser.pdf.PDF2XHTML.process(PDDocument, ContentHandler, 
 Metadata)
 org.apache.tika.parser.pdf.PDFParser.parse(InputStream, ContentHandler, 
 Metadata)
 org.apache.tika.parser.CompositeParser.parse(InputStream, ContentHandler, 
 Metadata)
 org.apache.tika.parser.AutoDetectParser.parse(InputStream, ContentHandler, 
 Metadata)
 org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(SolrQueryRequest,
  SolrQueryResponse, ContentStream)
 org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(SolrQueryRequest,
  SolrQueryResponse)
 org.apache.solr.handler.RequestHandlerBase.handleRequest(SolrQueryRequest, 
 SolrQueryResponse)
 org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(SolrQueryRequest,
  SolrQueryResponse)
 org.apache.solr.core.SolrCore.execute(SolrRequestHandler, SolrQueryRequest, 
 SolrQueryResponse)
 org.apache.solr.servlet.SolrDispatchFilter.execute(HttpServletRequest, 
 SolrRequestHandler, SolrQueryRequest, SolrQueryResponse)
 org.apache.solr.servlet.SolrDispatchFilter.doFilter(ServletRequest, 
 ServletResponse, FilterChain)
 org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ServletRequest,
  ServletResponse)
 org.apache.catalina.core.ApplicationFilterChain.doFilter(ServletRequest, 
 ServletResponse)
 org.apache.catalina.core.StandardWrapperValve.invoke(Request, Response)
 org.apache.catalina.core.StandardContextValve.invoke(Request, Response)
 org.apache.catalina.core.StandardHostValve.invoke(Request, Response)
 org.apache.catalina.valves.ErrorReportValve.invoke(Request, Response)
 org.apache.catalina.core.StandardEngineValve.invoke(Request, Response)
 org.apache.catalina.connector.CoyoteAdapter.service(Request, Response)
 org.apache.coyote.http11.Http11Processor.process(InputStream, OutputStream)
 org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(TcpConnection,
  Object[])
 org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(Socket, 
 TcpConnection, Object[])
 org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(Object[])
 org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run()
 java.lang.Thread.run()
 
 Has anyone run into this before? Any ideas on how to reduce the contention?
 
 Thanks,
 Gio.

--
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem using Solr/Lucene: 
http://www.lucidimagination.com/search



Re: PDFBox/Tika Performance Issues

2010-03-16 Thread Mattmann, Chris A (388J)
Guys, I think this is an issue with PDFBOX and the version that Tika 0.6 
depends on. Tika 0.7-trunk upgraded to PDFBox 1.0.0 (see [1]), so it may 
include a fix for the problem you're seeing.

See this discussion [2] on how to patch Tika to use the new PDFBox if you can't 
wait for the 0.7 release which should happen soon (hopefully next few weeks).

Cheers,
Chris

[1] http://issues.apache.org/jira/browse/TIKA-380
[2] http://www.mail-archive.com/tika-u...@lucene.apache.org/msg00302.html


On 3/16/10 2:31 PM, Giovanni Fernandez-Kincade 
gfernandez-kinc...@capitaliq.com wrote:

Originally 16 (the number of CPUs on the machine), but even with 5 threads it's 
not looking so hot.

-Original Message-
From: Grant Ingersoll [mailto:gsi...@gmail.com] On Behalf Of Grant Ingersoll
Sent: Tuesday, March 16, 2010 5:15 PM
To: solr-user@lucene.apache.org
Subject: Re: PDFBox/Tika Performance Issues

Hmm, that is an ugly thing in PDFBox.  We should probably take this over to the 
PDFBox project.  How many threads are you indexing with?

FWIW, for that many documents, I might consider using Tika on the client side 
to save on a lot of network traffic.

-Grant

On Mar 16, 2010, at 4:37 PM, Giovanni Fernandez-Kincade wrote:

 I've been trying to bulk index about 11 million PDFs, and while profiling our 
 Solr instance, I noticed that all of the threads that are processing indexing 
 requests are constantly blocking each other during this call:

 http-8080-Processor39 [BLOCKED] CPU time: 9:35
 java.util.Collections$SynchronizedMap.get(Object)
 org.pdfbox.pdmodel.font.PDFont.getAFM()
 org.pdfbox.pdmodel.font.PDSimpleFont.getFontHeight(byte[], int, int)
 org.pdfbox.util.PDFStreamEngine.showString(byte[])
 org.pdfbox.util.operator.ShowTextGlyph.process(PDFOperator, List)
 org.pdfbox.util.PDFStreamEngine.processOperator(PDFOperator, List)
 org.pdfbox.util.PDFStreamEngine.processSubStream(PDPage, PDResources, 
 COSStream)
 org.pdfbox.util.PDFStreamEngine.processStream(PDPage, PDResources, COSStream)
 org.pdfbox.util.PDFTextStripper.processPage(PDPage, COSStream)
 org.pdfbox.util.PDFTextStripper.processPages(List)
 org.pdfbox.util.PDFTextStripper.writeText(PDDocument, Writer)
 org.pdfbox.util.PDFTextStripper.getText(PDDocument)
 org.apache.tika.parser.pdf.PDF2XHTML.process(PDDocument, ContentHandler, 
 Metadata)
 org.apache.tika.parser.pdf.PDFParser.parse(InputStream, ContentHandler, 
 Metadata)
 org.apache.tika.parser.CompositeParser.parse(InputStream, ContentHandler, 
 Metadata)
 org.apache.tika.parser.AutoDetectParser.parse(InputStream, ContentHandler, 
 Metadata)
 org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(SolrQueryRequest,
  SolrQueryResponse, ContentStream)
 org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(SolrQueryRequest,
  SolrQueryResponse)
 org.apache.solr.handler.RequestHandlerBase.handleRequest(SolrQueryRequest, 
 SolrQueryResponse)
 org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(SolrQueryRequest,
  SolrQueryResponse)
 org.apache.solr.core.SolrCore.execute(SolrRequestHandler, SolrQueryRequest, 
 SolrQueryResponse)
 org.apache.solr.servlet.SolrDispatchFilter.execute(HttpServletRequest, 
 SolrRequestHandler, SolrQueryRequest, SolrQueryResponse)
 org.apache.solr.servlet.SolrDispatchFilter.doFilter(ServletRequest, 
 ServletResponse, FilterChain)
 org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ServletRequest,
  ServletResponse)
 org.apache.catalina.core.ApplicationFilterChain.doFilter(ServletRequest, 
 ServletResponse)
 org.apache.catalina.core.StandardWrapperValve.invoke(Request, Response)
 org.apache.catalina.core.StandardContextValve.invoke(Request, Response)
 org.apache.catalina.core.StandardHostValve.invoke(Request, Response)
 org.apache.catalina.valves.ErrorReportValve.invoke(Request, Response)
 org.apache.catalina.core.StandardEngineValve.invoke(Request, Response)
 org.apache.catalina.connector.CoyoteAdapter.service(Request, Response)
 org.apache.coyote.http11.Http11Processor.process(InputStream, OutputStream)
 org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(TcpConnection,
  Object[])
 org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(Socket, 
 TcpConnection, Object[])
 org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(Object[])
 org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run()
 java.lang.Thread.run()

 Has anyone run into this before? Any ideas on how to reduce the contention?

 Thanks,
 Gio.

--
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem using Solr/Lucene: 
http://www.lucidimagination.com/search




++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.mattm...@jpl.nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/

Re: Trouble Implementing Extracting Request Handler

2010-03-16 Thread Lance Norskog
NoClassDefFoundError usually means that the class was found, but it
needs other classes and those were not found. That is, Solr finds the
ExtractingRequestHandler jar but cannot find the Tika jars.

In example/solr/conf/solrconfig.xml, there are several <lib dir="path"/>
elements. These give classpath directories and jar files
to include when loading classes (and resource files). Try adding the
paths for your Tika jars as <lib/> directives.
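
For example (the paths are hypothetical; point them at wherever your Solr Cell
and Tika jars actually live):

  <lib dir="../../contrib/extraction/lib" />
  <lib dir="../../dist" />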

On Mon, Mar 15, 2010 at 9:02 PM, Steve Reichgut sreich...@axtaweb.com wrote:
 Sure. I've attached two docs that have the stack trace and the full list of
 .jar files.

 On 3/15/2010 8:34 PM, Lance Norskog wrote:

 Please post the complete stack trace. Also, it will help if you make a
 full listing of all .jar files in the example/ directory.

 On Mon, Mar 15, 2010 at 7:12 PM, Steve Reichgutsreich...@axtaweb.com
  wrote:


 Thanks Lance. That helped (we are using Solr-1.4). We've run into a
 follow-on error though. It is giving the following error:
 ClassNotFoundException: org.apache.solr.util.plugin.SolrCoreAware

 Did we miss something else in the setup?

 Steve

 Is there something else we haven't copied

 On 3/15/2010 6:12 PM, Lance Norskog wrote:


 This assumes you use the Solr-1.4 release or the Solr-1.5-dev trunk.

 The ExtractingRequestHandler libraries are in contrib/extracting/lib

 You need to make a directory example/solr/lib and copy into it the
 apache-solr-cell jar from dist/ and all of the libraries from
 contrib/extracting/lib. The Wiki page has not been updated for the
 Solr 1.4 release. I just added a TODO to this effect.

 On 3/12/10, Steve Reichgutsreich...@axtaweb.com    wrote:



 Hi Grant,
  Thanks for the feedback. In reading the Wiki, it recommended that you
  copy everything from the example/solr/libs directory into a /libs directory
  in your instance. I went into my example/solr directory and only see two
  directories - bin and conf. There is no libs directory. Where else
  can I get the contents of what should be in libs?

 Steve

 On 3/12/2010 2:15 PM, Grant Ingersoll wrote:



 On Mar 12, 2010, at 2:20 PM, Steve Reichgut wrote:





 Now that I have configured my Solr instance for standard indexing, I
 wanted to start indexing PDF's, MS Doc's, etc. When I tried to test
 it
 with a simple PDF file, I got the following error:

    org.apache.solr.common.SolrException: lazy loading error
    Caused by: org.apache.solr.common.SolrException: Error loading
 class
    'org.apache.solr.handler.extraction.ExtractingRequestHandler'

 Based on the error, it appeared that the problem is caused by certain
 components not being installed or installed correctly. Since I am not
 a
 Java guy, I had my Java person try to install the
 ExtractingRequestHandler to no avail. He had said that he was having
 real
 trouble finding good documentation on how to install and enable this
 handler.

 Could anyone point me to good documentation on how to
 install/troubleshoot this?




 http://wiki.apache.org/solr/ExtractingRequestHandler

 Essentially, you need to make sure the ERH stuff is in Solr/lib before
 starting.

 -Grant



















-- 
Lance Norskog
goks...@gmail.com


Re: DIH request parameters

2010-03-16 Thread Lance Norskog
They are a namespace like other namespaces and are usable in
attributes, just like in the DB query string examples.

As for defaults, you can declare those in the requestHandler
declaration in solrconfig.xml. There are examples of this (search for
defaults) in the wiki page.
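
A sketch of both halves, with hypothetical names: the placeholder goes wherever
the value is needed in db-data-config.xml, and the default lives in the handler
declaration in solrconfig.xml:

  <!-- db-data-config.xml -->
  <dataSource driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost/mydb"
              user="solr"
              password="${dataimporter.request.jdbcpassword}"/>

  <!-- solrconfig.xml -->
  <requestHandler name="/dataimport"
                  class="org.apache.solr.handler.dataimport.DataImportHandler">
    <lst name="defaults">
      <str name="config">db-data-config.xml</str>
      <str name="jdbcpassword">secret</str>
    </lst>
  </requestHandler>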

On Tue, Mar 16, 2010 at 7:05 AM, Lukas Kahwe Smith m...@pooteeweet.org wrote:
 Hi,

 According to the wiki its possible to pass parameters to the DIH:
 http://wiki.apache.org/solr/DataImportHandler#Accessing_request_parameters

 I assume they are just being replaced via simple string replacements, which 
 is exactly what I need. Can they also be used in all places, even attributes (for 
 example to pass in the password)?

 Furthermore is there some way to define default values for these request 
 parameters in case no value is passed in?

 regards,
 Lukas Kahwe Smith
 m...@pooteeweet.org







-- 
Lance Norskog
goks...@gmail.com


RE: PDFBox/Tika Performance Issues

2010-03-16 Thread Giovanni Fernandez-Kincade
I'm pretty unclear on how to patch our Solr instance to use the Tika 0.7 
trunk. This is what I've tried so far (which was really just me guessing):



1. Got the latest version of the trunk code from 
http://svn.apache.org/repos/asf/lucene/tika/trunk

2. Built this using Maven (mvn install)

3. I took the resulting tika-app-0.7-SNAPSHOT.jar, copied it to the /Lib 
folder for my Solr Core, and renamed it to the name of the existing Tika Jar 
(tika-0.3.jar).

4. Then I bounced my servlet server and tried indexing a document. The 
document was successfully indexed, and there were no errors logged as a result, 
but the PDF data does not appear to have been extracted (the field I used for 
map.content had an empty-string as a value).



What's the right approach to perform this patch?





-Original Message-
From: Giovanni Fernandez-Kincade [mailto:gfernandez-kinc...@capitaliq.com]
Sent: Tuesday, March 16, 2010 5:41 PM
To: solr-user@lucene.apache.org
Subject: RE: PDFBox/Tika Performance Issues



Thanks Chris!



I'll try the patch.



-Original Message-

From: Mattmann, Chris A (388J) [mailto:chris.a.mattm...@jpl.nasa.gov]

Sent: Tuesday, March 16, 2010 5:37 PM

To: solr-user@lucene.apache.org

Subject: Re: PDFBox/Tika Performance Issues



Guys, I think this is an issue with PDFBOX and the version that Tika 0.6 
depends on. Tika 0.7-trunk upgraded to PDFBox 1.0.0 (see [1]), so it may 
include a fix for the problem you're seeing.



See this discussion [2] on how to patch Tika to use the new PDFBox if you can't 
wait for the 0.7 release which should happen soon (hopefully next few weeks).



Cheers,

Chris



[1] http://issues.apache.org/jira/browse/TIKA-380

[2] http://www.mail-archive.com/tika-u...@lucene.apache.org/msg00302.html





On 3/16/10 2:31 PM, Giovanni Fernandez-Kincade 
gfernandez-kinc...@capitaliq.com wrote:



Originally 16 (the number of CPUs on the machine), but even with 5 threads it's 
not looking so hot.



-Original Message-

From: Grant Ingersoll [mailto:gsi...@gmail.com] On Behalf Of Grant Ingersoll

Sent: Tuesday, March 16, 2010 5:15 PM

To: solr-user@lucene.apache.org

Subject: Re: PDFBox/Tika Performance Issues



Hmm, that is an ugly thing in PDFBox.  We should probably take this over to the 
PDFBox project.  How many threads are you indexing with?



FWIW, for that many documents, I might consider using Tika on the client side 
to save on a lot of network traffic.



-Grant



On Mar 16, 2010, at 4:37 PM, Giovanni Fernandez-Kincade wrote:



 I've been trying to bulk index about 11 million PDFs, and while profiling our 
 Solr instance, I noticed that all of the threads that are processing indexing 
 requests are constantly blocking each other during this call:



 http-8080-Processor39 [BLOCKED] CPU time: 9:35

 java.util.Collections$SynchronizedMap.get(Object)

 org.pdfbox.pdmodel.font.PDFont.getAFM()

 org.pdfbox.pdmodel.font.PDSimpleFont.getFontHeight(byte[], int, int)

 org.pdfbox.util.PDFStreamEngine.showString(byte[])

 org.pdfbox.util.operator.ShowTextGlyph.process(PDFOperator, List)

 org.pdfbox.util.PDFStreamEngine.processOperator(PDFOperator, List)

 org.pdfbox.util.PDFStreamEngine.processSubStream(PDPage, PDResources, 
 COSStream)

 org.pdfbox.util.PDFStreamEngine.processStream(PDPage, PDResources, COSStream)

 org.pdfbox.util.PDFTextStripper.processPage(PDPage, COSStream)

 org.pdfbox.util.PDFTextStripper.processPages(List)

 org.pdfbox.util.PDFTextStripper.writeText(PDDocument, Writer)

 org.pdfbox.util.PDFTextStripper.getText(PDDocument)

 org.apache.tika.parser.pdf.PDF2XHTML.process(PDDocument, ContentHandler, 
 Metadata)

 org.apache.tika.parser.pdf.PDFParser.parse(InputStream, ContentHandler, 
 Metadata)

 org.apache.tika.parser.CompositeParser.parse(InputStream, ContentHandler, 
 Metadata)

 org.apache.tika.parser.AutoDetectParser.parse(InputStream, ContentHandler, 
 Metadata)

 org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(SolrQueryRequest,
  SolrQueryResponse, ContentStream)

 org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(SolrQueryRequest,
  SolrQueryResponse)

 org.apache.solr.handler.RequestHandlerBase.handleRequest(SolrQueryRequest, 
 SolrQueryResponse)

 org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(SolrQueryRequest,
  SolrQueryResponse)

 org.apache.solr.core.SolrCore.execute(SolrRequestHandler, SolrQueryRequest, 
 SolrQueryResponse)

 org.apache.solr.servlet.SolrDispatchFilter.execute(HttpServletRequest, 
 SolrRequestHandler, SolrQueryRequest, SolrQueryResponse)

 org.apache.solr.servlet.SolrDispatchFilter.doFilter(ServletRequest, 
 ServletResponse, FilterChain)

 org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ServletRequest,
  ServletResponse)

 org.apache.catalina.core.ApplicationFilterChain.doFilter(ServletRequest, 
 ServletResponse)

 org.apache.catalina.core.StandardWrapperValve.invoke(Request, Response)

 

Undefined field price on Dismax query

2010-03-16 Thread Alex Thurlow

Hi guys,
Based on some suggestions, I'm trying to use the dismax query 
type.  I'm getting a weird error though that I think is related to the 
default test data set.


From the query tool (/solr/admin/form.jsp), I put in this:
Statement: artist:test title:test +type:video
query type: dismax

The rest is left as defaults.  I get this error page:
HTTP ERROR: 400
undefined field price

RequestURI=/solr/select

I am running out of the example dir still, but I made my own custom 
schema and deleted the index before inserting my new data.  Am I missing 
something that needs to be cleared?  Query type=standard works fine here.


Thanks,
Alex



Re: Moving From Oracle Text Search To Solr

2010-03-16 Thread Lance Norskog
The DataImportHandler has tools for this. It will fetch rows from
Oracle and allow you to unpack columns as XML with XPaths.

http://wiki.apache.org/solr/DataImportHandler
http://wiki.apache.org/solr/DataImportHandler#Usage_with_RDBMS
http://wiki.apache.org/solr/DataImportHandler#XPathEntityProcessor
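
A minimal sketch of that combination (untested - the table/column names
and the XPaths are invented; FieldReaderDataSource is what lets
XPathEntityProcessor parse the XML coming out of a column returned by
the parent DB entity):

  <dataConfig>
    <dataSource name="db" driver="oracle.jdbc.OracleDriver"
                url="jdbc:oracle:thin:@//dbhost:1521/XE"
                user="solr" password="secret" />
    <dataSource name="field" type="FieldReaderDataSource" />
    <document>
      <entity name="doc" dataSource="db"
              query="SELECT ARCHIVE_ID, XML FROM DOC">
        <field column="ARCHIVE_ID" name="id" />
        <entity name="body" dataSource="field" dataField="doc.XML"
                processor="XPathEntityProcessor" forEach="/document">
          <field column="text" xpath="/document/body" />
        </entity>
      </entity>
    </document>
  </dataConfig>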

On Tue, Mar 16, 2010 at 2:25 PM, Neil Chaudhuri
nchaudh...@potomacfusion.com wrote:
 That is a great article, David.

 For the moment, I am trying an all-Solr approach, but I have run into a small 
 problem. The documents are stored as XML CLOBs using Oracle's OPAQUE object. 
 Is there any facility to unpack this into the actual text? Or must I execute 
 that in the SQL query?

 Thanks.


 -Original Message-
 From: Smiley, David W. [mailto:dsmi...@mitre.org]
 Sent: Tuesday, March 16, 2010 4:45 PM
 To: solr-user@lucene.apache.org
 Subject: Re: Moving From Oracle Text Search To Solr

 If you do stay with Oracle, please report back to the list how that went.  In 
 order to get decent filtering and faceting performance, I believe you will 
 need to use bitmapped indexes which Oracle and some other databases support.

 You may want to check out my article on this subject: 
 http://www.packtpub.com/article/text-search-your-database-or-solr

 ~ David Smiley
 Author: http://www.packtpub.com/solr-1-4-enterprise-search-server/


 On Mar 16, 2010, at 4:13 PM, Neil Chaudhuri wrote:

 Certainly I could use some basic SQL count(*) queries to achieve faceted 
 results, but I am not sure of the flexibility, extensibility, or scalability 
 of that approach. And from what I have read, Oracle Text doesn't do faceting 
 out of the box.

 Each document is a few MB, and there will be millions of them. I suppose it 
 depends on how I index them. I am pretty sure my current approach of using 
 Hibernate to load all rows, constructing Solr POJOs from them, and then 
 passing the POJOs to the embedded server would lead to an OOM error. I 
 should probably look into the other options.

 Thanks.


 -Original Message-
 From: Erick Erickson [mailto:erickerick...@gmail.com]
 Sent: Tuesday, March 16, 2010 3:58 PM
 To: solr-user@lucene.apache.org
 Subject: Re: Moving From Oracle Text Search To Solr

 Why do you think you'd hit OOM errors? How big is very large? I've
 indexed, as a single document, a 26 volume encyclopedia of civil war
 records..

 Although as much as I like the technology, if I could get away without using
 two technologies, I would. Are you completely sure you can't get what you
 want with clever Oracle querying?

 Best
 Erick

 On Tue, Mar 16, 2010 at 3:20 PM, Neil Chaudhuri 
 nchaudh...@potomacfusion.com wrote:

 I am working on an application that currently hits a database containing
 millions of very large documents. I use Oracle Text Search at the moment,
 and things work fine. However, there is a request for faceting capability,
 and Solr seems like a technology I should look at. Suffice to say I am new
 to Solr, but at the moment I see two approaches-each with drawbacks:


 1)      Have Solr index document metadata (id, subject, date). Then Use
 Oracle Text to do a content search based on criteria. Finally, query the
 Solr index for all documents whose id's match the set of id's returned by
 Oracle Text. That strikes me as an unmanageable Boolean query.  (e.g.
 id:4 OR id:33432323 OR ...).

 2)      Remove Oracle Text from the equation and use Solr to query document
 content based on search criteria. The indexing process though will almost
 certainly encounter an OutOfMemoryError given the number and size of
 documents.



 I am using the embedded server and Solr Java APIs to do the indexing and
 querying.



 I would welcome your thoughts on the best way to approach this situation.
 Please let me know if I should provide additional information.



 Thanks.

-- 
Lance Norskog
goks...@gmail.com


Indexing CLOB Column in Oracle

2010-03-16 Thread Neil Chaudhuri
Since my original thread was straying to a new topic, I thought it made sense 
to create a new thread of discussion.

I am using the DataImportHandler to index 3 fields in a table: an id, a date, 
and the text of a document. This is an Oracle database, and the document is an 
XML document stored as Oracle's xmltype data type, which is an instance of 
oracle.sql.OPAQUE. Still, it is nothing more than a fancy clob.

So in my db-data-config, I have the following:

<document>
    <entity name="doc" query="SELECT d.EFFECTIVE_DT, d.ARCHIVE_ID FROM DOC d">
        <field column="EFFECTIVE_DT" name="effectiveDate" />
        <field column="ARCHIVE_ID" name="id" />
        <entity name="text" query="SELECT d.XML FROM DOC d WHERE d.ARCHIVE_ID = '${doc.ARCHIVE_ID}'"
                transformer="ClobTransformer">
            <field column="XML" name="text" clob="true" sourceColName="XML" />
        </entity>
    </entity>
</document>

Meanwhile, I have this in schema.xml:

<field name="text" type="text_ws" indexed="true" stored="true"
       multiValued="true" omitNorms="false" termVectors="true" />

However, when I take a look at my indexes with Luke, I find that the items 
labeled text simply say oracle.sql.OPAQUE and a bunch of numbers - in other 
words, the result of OPAQUE.toString().

Can you give me some insight into where I am going wrong?

Thanks.



Re: Trouble Implementing Extracting Request Handler

2010-03-16 Thread Steve Reichgut

Lance,

I tried that but no luck. Just in case the relative paths were causing a 
problem, I also tried using absolute paths, but neither seemed to help. 
First, I tried adding <lib dir="/path/to/example/solr/lib" /> as the 
full directory so it would hopefully include everything. When that 
didn't work, I tried adding paths directly to the two Tika jar files in 
the lib directory like this:

<lib dir="/path/to/example/solr/lib/tika-core-0.4.jar" /> and
<lib dir="/path/to/example/solr/lib/tika-parsers-0.4.jar" />

Am I including them incorrectly somehow?

Steve

On 3/16/2010 3:38 PM, Lance Norskog wrote:

NoClassDefFoundError usually means that the class was found, but it
needs other classes and those were not found. That is, Solr finds the
ExtractingRequestHandler jar but cannot find the Tika jars.

In example/solr/conf/solrconfig.xml, there are several <lib
dir="path"/> elements. These give classpath directories and jar files
to include when loading classes (and resource files). Try adding the
paths for your Tika jars as <lib/> directives.

On Mon, Mar 15, 2010 at 9:02 PM, Steve Reichgut sreich...@axtaweb.com wrote:
   

Sure. I've attached two docs that have the stack trace and the full list of
.jar files.

On 3/15/2010 8:34 PM, Lance Norskog wrote:
 

Please post the complete stack trace. Also, it will help if you make a
full listing of all .jar files in the example/ directory.

On Mon, Mar 15, 2010 at 7:12 PM, Steve Reichgut sreich...@axtaweb.com
  wrote:

   

Thanks Lance. That helped ( we are using Solr-1.4). We've run into a
follow-on error though. It is giving the following error:
ClassNotFoundException: org.apache.solr.util.plugin.SolrCoreAware

Did we miss something else in the setup?

Steve

Is there something else we haven't copied

On 3/15/2010 6:12 PM, Lance Norskog wrote:

 

This assumes you use the Solr-1.4 release or the Solr-1.5-dev trunk.

The ExtractingRequestHandler libraries are in contrib/extracting/lib

You need to make a directory example/solr/lib and copy into it the
apache-solr-cell jar from dist/ and all of the libraries from
contrib/extracting/lib. The Wiki page has not been updated for the
Solr 1.4 release. I just added a TODO to this effect.

On 3/12/10, Steve Reichgut sreich...@axtaweb.com wrote:


   

Hi Grant,
Thanks for the feedback. In reading the Wiki, it recommended that you
copy everything from example/solr/libs directory into a /libs directory
in your instance. I went into my example/solr directory and only see
two
directories - bin and conf. There is no libs directory. Where
else
can I get the contents of what should be in libs?

Steve

On 3/12/2010 2:15 PM, Grant Ingersoll wrote:


 

On Mar 12, 2010, at 2:20 PM, Steve Reichgut wrote:




   

Now that I have configured my Solr instance for standard indexing, I
wanted to start indexing PDF's, MS Doc's, etc. When I tried to test
it
with a simple PDF file, I got the following error:

org.apache.solr.common.SolrException: lazy loading error
Caused by: org.apache.solr.common.SolrException: Error loading
class
'org.apache.solr.handler.extraction.ExtractingRequestHandler'

Based on the error, it appeared that the problem is caused by certain
components not being installed or installed correctly. Since I am not
a
Java guy, I had my Java person try to install the
ExtractingRequestHandler to no avail. He had said that he was having
real
trouble finding good documentation on how to install and enable this
handler.

Could anyone point me to good documentation on how to
install/troubleshoot this?



 

http://wiki.apache.org/solr/ExtractingRequestHandler

Essentially, you need to make sure the ERH stuff is in Solr/lib before
starting.

-Grant


Re: Indexing CLOB Column in Oracle

2010-03-16 Thread Shawn Heisey
Disclaimer:  My Oracle experience is minuscule at best.  I am also a 
beginner at Solr, so grab yourself the proverbial grain of salt.


I googled a bit on CLOB.  One page I found mentioned setting up a view 
to return the data type you want.  Can you use the functions described 
on these pages in either the Solr query or a view?


http://www.oradev.com/dbms_lob.jsp
http://www.dba-oracle.com/t_dbms_lob.htm
http://www.praetoriate.com/dbms_packages/ddp_dbms_lob.htm

I also was trying to find a way to convert from xmltype directly to a 
string in a query, but that quickly got way over my level of 
understanding.  I saw hints that it is possible, though.
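
For example (untested, and assuming the column really is an XMLType
underneath the OPAQUE wrapper), a view along these lines might hand
Solr a plain CLOB to work with:

  CREATE OR REPLACE VIEW doc_text_v AS
  SELECT d.ARCHIVE_ID,
         d.EFFECTIVE_DT,
         d.XML.getClobVal() AS XML_TEXT  -- XMLType.getClobVal() returns the document as a CLOB
  FROM DOC d;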


Shawn

On 3/16/2010 4:59 PM, Neil Chaudhuri wrote:

Since my original thread was straying to a new topic, I thought it made sense 
to create a new thread of discussion.

I am using the DataImportHandler to index 3 fields in a table: an id, a date, 
and the text of a document. This is an Oracle database, and the document is an 
XML document stored as Oracle's xmltype data type, which is an instance of 
oracle.sql.OPAQUE. Still, it is nothing more than a fancy clob.
   




Re: Solr RAM Requirements

2010-03-16 Thread Peter Sturge
On Tue, Mar 16, 2010 at 9:08 PM, KaktuChakarabati jimmoe...@gmail.com wrote:


 Hey,
 I am trying to understand what kind of calculation I should do in order to
 come up with reasonable RAM size for a given solr machine.

 Suppose the index size is at 16GB.
 The Max heap allocated to JVM is about 12GB.

 The machine I'm trying now has 24GB.
 When the machine is running for a while serving production, I can see in
 top
 that the resident memory taken by the jvm is indeed at 12gb.
 Now, on top of this i should assume that if i want the whole index to fit
 in
 disk cache i need about 12gb+16gb = 28GB of RAM just for that. Is this kind
 of calculation correct or am i off here?


Hmmm..not quite. The idea of the ram usage isn't to simply hold the index in
memory - if you want this use a RAMDirectory.
The memory being used will be a combination of various caches (Lucene and
Solr), index buffers et al., and of course the server itself. The specifics
depend very
much on what your server is doing at any given time - e.g. lots of
concurrent searches, lots of indexing, both etc., and how things are setup
in your solrconfig.xml.

A really excellent resource that's worth looking at regarding all this can
be found here:

http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Scaling-Lucene-and-Solr



 Any other recommendations Anyone could make w.r.t these numbers ?

 Thanks,
 -Chak
 --
 View this message in context:
 http://old.nabble.com/Solr-RAM-Requirements-tp27924551p27924551.html
 Sent from the Solr - User mailing list archive at Nabble.com.




Re: Undefined field price on Dismax query

2010-03-16 Thread Erick Erickson
I suspect your problem is that you still have price defined in
solrconfig.xml for the dismax handler. Look for the section
<requestHandler name="dismax" ...>.

You'll see price defined as one of the default fields for fl and bf.
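
In the stock example solrconfig.xml that section looks roughly like
this (quoting from memory, so the exact boosts may differ) - remove or
replace the price references to match your own schema:

  <requestHandler name="dismax" class="solr.SearchHandler">
    <lst name="defaults">
      <str name="defType">dismax</str>
      <str name="qf">text^0.5 features^1.0 name^1.2 sku^1.5 id^10.0</str>
      <str name="bf">popularity^0.5 recip(price,1,1000,1000)^0.3</str>
      <str name="fl">id,name,price,score</str>
    </lst>
  </requestHandler>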

HTH
Erick

On Tue, Mar 16, 2010 at 6:55 PM, Alex Thurlow a...@blastro.com wrote:

 Hi guys,
Based on some suggestions, I'm trying to use the dismax query type.  I'm
 getting a weird error though that I think is related to the default test
 data set.

 From the query tool (/solr/admin/form.jsp), I put in this:
 Statement: artist:test title:test +type:video
 query type: dismax

 The rest is left as defaults.  I get this error page:
 HTTP ERROR: 400
 undefined field price

 RequestURI=/solr/select

 I am running out of the example dir still, but I made my own custom
 schema and deleted the index before inserting my new data.  Am I missing
 something that needs to be cleared?  Query type=standard works fine here.

 Thanks,
 Alex




Re: Moving From Oracle Text Search To Solr

2010-03-16 Thread Erick Erickson
Besides the other notes here, I agree you'll hit OOM if you try to
read all the rows into memory at once, but I'm absolutely sure you
can read them N at a time instead. Not that I could tell you how, mind
you.
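
Something along these lines might be a starting point (just a sketch -
the names are invented, and the session.clear() is the important part
so Hibernate doesn't accumulate every entity in its first-level cache):

  int batchSize = 500;
  for (int first = 0; ; first += batchSize) {
      List<Doc> docs = session.createQuery("from Doc")
                              .setFirstResult(first)
                              .setMaxResults(batchSize)
                              .list();
      if (docs.isEmpty()) break;
      for (Doc d : docs) {
          // toSolrPojo is a hypothetical mapper from the Hibernate
          // entity to the annotated Solr bean
          solrServer.addBean(toSolrPojo(d));
      }
      session.clear();  // keep Hibernate's memory footprint flat
  }
  solrServer.commit();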

You're on your way...
Erick

On Tue, Mar 16, 2010 at 4:13 PM, Neil Chaudhuri 
nchaudh...@potomacfusion.com wrote:

 Certainly I could use some basic SQL count(*) queries to achieve faceted
 results, but I am not sure of the flexibility, extensibility, or scalability
 of that approach. And from what I have read, Oracle Text doesn't do faceting
 out of the box.

 Each document is a few MB, and there will be millions of them. I suppose it
 depends on how I index them. I am pretty sure my current approach of using
 Hibernate to load all rows, constructing Solr POJOs from them, and then
 passing the POJOs to the embedded server would lead to an OOM error. I
 should probably look into the other options.

 Thanks.


 -Original Message-
 From: Erick Erickson [mailto:erickerick...@gmail.com]
 Sent: Tuesday, March 16, 2010 3:58 PM
 To: solr-user@lucene.apache.org
 Subject: Re: Moving From Oracle Text Search To Solr

 Why do you think you'd hit OOM errors? How big is very large? I've
 indexed, as a single document, a 26 volume encyclopedia of civil war
 records..

 Although as much as I like the technology, if I could get away without
 using
 two technologies, I would. Are you completely sure you can't get what you
 want with clever Oracle querying?

 Best
 Erick

 On Tue, Mar 16, 2010 at 3:20 PM, Neil Chaudhuri 
 nchaudh...@potomacfusion.com wrote:

  I am working on an application that currently hits a database containing
  millions of very large documents. I use Oracle Text Search at the moment,
  and things work fine. However, there is a request for faceting
 capability,
  and Solr seems like a technology I should look at. Suffice to say I am
 new
  to Solr, but at the moment I see two approaches-each with drawbacks:
 
 
  1)  Have Solr index document metadata (id, subject, date). Then Use
  Oracle Text to do a content search based on criteria. Finally, query the
  Solr index for all documents whose id's match the set of id's returned by
  Oracle Text. That strikes me as an unmanageable Boolean query.  (e.g.
  id:4 OR id:33432323 OR ...).
 
  2)  Remove Oracle Text from the equation and use Solr to query
 document
  content based on search criteria. The indexing process though will almost
  certainly encounter an OutOfMemoryError given the number and size of
  documents.
 
 
 
  I am using the embedded server and Solr Java APIs to do the indexing and
  querying.
 
 
 
  I would welcome your thoughts on the best way to approach this situation.
  Please let me know if I should provide additional information.
 
 
 
  Thanks.
 



Re: Undefined field price on Dismax query

2010-03-16 Thread Alex Thurlow
Aha.  That appears to be the issue.  I hadn't realized that the query 
handler had all of those definitions there.


-Alex


On 3/16/2010 6:56 PM, Erick Erickson wrote:

I suspect your problem is that you still have price defined in

solrconfig.xml for the dismax handler. Look for the section
<requestHandler name="dismax" ...>.

You'll see price defined as one of the default fields for fl and bf.

HTH
Erick

On Tue, Mar 16, 2010 at 6:55 PM, Alex Thurlow a...@blastro.com wrote:

   

Hi guys,
Based on some suggestions, I'm trying to use the dismax query type.  I'm
 getting a weird error though that I think is related to the default test
data set.

 From the query tool (/solr/admin/form.jsp), I put in this:
Statement: artist:test title:test +type:video
query type: dismax

The rest is left as defaults.  I get this error page:
HTTP ERROR: 400
undefined field price

RequestURI=/solr/select

I am running out of the example dir still, but I made my own custom
schema and deleted the index before inserting my new data.  Am I missing
something that needs to be cleared?  Query type=standard works fine here.

Thanks,
Alex


 
   


Solr query parser doesn't invoke analyzer for simple term query?

2010-03-16 Thread Teruhiko Kurosaka
It seems that Solr's query parser doesn't pass a single term query
to the Analyzer for the field. For example, if I give it
2001年 (year 2001 in Japanese), the searcher returns 0 hits 
but if I quote them with double-quotes, it returns hits. 
In this experiment, I configured schema.xml so that
the field in question will use the morphological Analyzer 
my company makes that is capable of splitting 2001年  
into two tokens 2001 and 年.  I am guessing that this
Analyzer is called ONLY IF the term is a phrase.
Is my observation correct?

If so, is there any configuration parameter that I can tweak 
to force any query for the text fields be processed by 
the Analyzer?

One might ask why users won't put a space between 2001 and 年.
Well, if they are clearly two separate words, people do that.
But 年 works more like a suffix in this case, and in many
Japanese speakers' minds, 2001年 seems like one token, so
many people won't.  (Remember, Japanese doesn't use spaces
in normal writing.)  Forcing the Analyzer to run would also
be useful for compound-word handling, often desirable
for languages like German.


Teruhiko Kuro Kurosaka
RLP + Lucene & Solr = powerful search for global contents



problem during benchmarking solr query

2010-03-16 Thread KshamaPai

Hi,
Am using autobench to benchmark solr with the query
http://localhost:8983/solr/select/?q=body:hotel AND
_val_:recip(hsin(0.7113258,-1.291311553,lat_rad,lng_rad,30),1,1,0)^100

But if i specify the same in the autobench command as
autobench --file bar1.tsv --high_rate 100 --low_rate 20 --rate_step 20
--host1 localhost --single_host --port1 8983 --num_conn 10 --num_call 10
--uri1 /solr/select/?q=body:hotel AND  
_val_:recip(hsin(0.7113258,-1.291311553,lat_rad,lng_rad,30),1,1,0)^100

it is taking body:hotel as the uri but not the _val_ part, which I think is because
of the space after hotel. Even if I try escaping this in autobench using
'\' it'll give a parse error in Solr.

Can anyone suggest how I can handle this, so that the entire query is
treated as the uri and Solr responds with an appropriate reply?
thank you.
 

-- 
View this message in context: 
http://old.nabble.com/problem-during-benchmarking-solr-query-tp27926801p27926801.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Solr RAM Requirements

2010-03-16 Thread Peter Sturge
There are certainly a number of widely varying opinions on the use of RAM
directory.
Basically, though, if you need the index to be persistent at some point
(i.e. saved across reboots, crashes etc.),
you'll need to write to a disk, so RAM directory becomes somewhat
superfluous in this case.

Generally, good hardware and fast disks are a better bet, since you'll
probably want to have them anyway :-)

From my own experiences with varying types/sizes of indexes, and the general
wisdom gleaned from the experts, the amount of memory required for a given
environment is very much
a 'how long is a piece of string' type of scenario. It depends on so many
factors that it's impractical to come up with an easy 'standardized' formula.

What I've found useful as rough guidance (in addition to the very useful
URL I mentioned earlier), is if your server is doing lots of indexing and
not much searching, you want your os fs cache to have access to a healthy
amount of memory.
If you're doing lots of searching/reading (and particularly faceting),
you'll want a good amount of ram for Solr/Lucene caching (which caches need
what depends on the type of data you're searching).
If you have a server that is doing a lot of both indexing and searching, you
should consider breaking them out using replication and possibly using load
balancers (if you have lots of concurrent querying going on).

It stands to reason that the bigger the index gets, the more memory will
generally be required for working on various aspects of it. When you get
into very large indexes, it becomes more efficient to distribute the
indexing across servers (and replicating those servers), so that no single
machine has huge cache lists to traverse. Again, the 'Scaling Lucene and
Solr' page goes into these scenarios and is well worth studying.



On Wed, Mar 17, 2010 at 12:29 AM, KaktuChakarabati jimmoe...@gmail.com wrote:


 Hey Peter,
 Thanks for your reply.
 My question was mainly about the fact that there seem to be two different
 aspects to Solr RAM usage: in-process and out-of-process.
 By that I mean: yes, I know the many different parameters/caches to do with
 Solr in-process memory usage and the related culprits, however I also
 understand
 that for actual index access (posting lists, positional index etc.), Solr
 mostly delegates the access/caching of this to the OS/disk cache.
 So I guess my question is more about that: namely, what would be a good way
 to calculate an overall RAM requirement profile for a server running Solr?
 Also, I was under the impression benefits from RAMDirectory would be
 minimal
 when disk caches are effective, no?
 And does RAMDirectory work with replication? If so, doesn't it slow it down
 (on each replication, loading up the entire index into RAM at once)?



 Peter Sturge wrote:
 
  On Tue, Mar 16, 2010 at 9:08 PM, KaktuChakarabati
  jimmoe...@gmail.com wrote:
 
 
  Hey,
  I am trying to understand what kind of calculation I should do in order
  to
  come up with reasonable RAM size for a given solr machine.
 
  Suppose the index size is at 16GB.
  The Max heap allocated to JVM is about 12GB.
 
  The machine I'm trying now has 24GB.
  When the machine is running for a while serving production, I can see in
  top
  that the resident memory taken by the jvm is indeed at 12gb.
  Now, on top of this i should assume that if i want the whole index to
 fit
  in
  disk cache i need about 12gb+16gb = 28GB of RAM just for that. Is this
  kind
  of calculation correct or am i off here?
 
 
  Hmmm..not quite. The idea of the ram usage isn't to simply hold the index
  in
  memory - if you want this use a RAMDirectory.
  The memory being used will be a combination of various caches (Lucene and
  Solr), index buffers et al., and of course the server itself. The
  specifics
  depend very
  much on what your server is doing at any given time - e.g. lots of
  concurrent searches, lots of indexing, both etc., and how things are
 setup
  in your solrconfig.xml.
 
  A really excellent resource that's worth looking at regarding all this
 can
  be found here:
 
 
 http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Scaling-Lucene-and-Solr
 
 
 
  Any other recommendations Anyone could make w.r.t these numbers ?
 
  Thanks,
  -Chak
  --
  View this message in context:
  http://old.nabble.com/Solr-RAM-Requirements-tp27924551p27924551.html
  Sent from the Solr - User mailing list archive at Nabble.com.
 
 
 
 

 --
 View this message in context:
 http://old.nabble.com/Solr-RAM-Requirements-tp27924551p27926536.html
 Sent from the Solr - User mailing list archive at Nabble.com.




Stopwords

2010-03-16 Thread blargy

I was reading Scaling Lucene and Solr
(http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Scaling-Lucene-and-Solr/)
and I came across the section StopWords. 

In there it mentioned that it's not recommended to remove stop words at index
time. Why is this the case? Don't all the extraneous stopwords bloat the
index and lead to less relevant results? Can someone please explain this to
me? Thanks
-- 
View this message in context: 
http://old.nabble.com/Stopwords-tp27927028p27927028.html
Sent from the Solr - User mailing list archive at Nabble.com.



APR setup

2010-03-16 Thread blargy

[java] INFO: The APR based Apache Tomcat Native library which allows optimal
performance in production environments was not found on the
java.library.path:
.:/Library/Java/Extensions:/System/Library/Java/Extensions:/usr/lib/java

What the heck is this and why is it recommended for production settings?
Anyone?

-- 
View this message in context: 
http://old.nabble.com/APR-setup-tp27927553p27927553.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Trouble Implementing Extracting Request Handler

2010-03-16 Thread Lance Norskog
org/apache/solr/util/plugin/SolrCoreAware in the stack trace refers to
an interface in the main Solr jar.

I think this means that putting all of the libs in
apache-tomcat-6.0.20/lib is a mistake: the classloader finds
ExtractingRequestHandler in
apache-tomcat-6.0.20/lib/apache-solr-cell-1.4.1-dev.jar, but it then
wants the above interface. The main Solr jar is not available somehow.
Since the solr-cell jar is in multiple places, we don't know exactly
how Tomcat finds it.

I suggest that you go back to a clean, empty Tomcat, and the original
Solr distribution. Copy the solr war file to the right directory in
Tomcat. Get Solr talking to your solr/ directory
(-Dsolr.solr.home=path). Now, check if the lib directives in the
solrconfig.xml are right.
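
Roughly, for example (paths are placeholders for your own layout):

  # start from a clean Tomcat and the original Solr distribution
  cp dist/apache-solr-*.war $CATALINA_HOME/webapps/solr.war
  export JAVA_OPTS="$JAVA_OPTS -Dsolr.solr.home=/path/to/example/solr"
  $CATALINA_HOME/bin/startup.sh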



On Tue, Mar 16, 2010 at 4:19 PM, Steve Reichgut sreich...@axtaweb.com wrote:
 Lance,

 I tried that but no luck. Just in case the relative paths were causing a
 problem, I also tried using absolute paths, but neither seemed to help.
 First, I tried adding <lib dir="/path/to/example/solr/lib" /> as the full
 directory so it would hopefully include everything. When that didn't work, I
 tried adding paths directly to the two Tika jar files in the lib directory
 like this:
 <lib dir="/path/to/example/solr/lib/tika-core-0.4.jar" /> and
 <lib dir="/path/to/example/solr/lib/tika-parsers-0.4.jar" />

 Am I including them incorrectly somehow?

 Steve

 On 3/16/2010 3:38 PM, Lance Norskog wrote:

 NoClassDefFoundError usually means that the class was found, but it
 needs other classes and those were not found. That is, Solr finds the
 ExtractingRequestHandler jar but cannot find the Tika jars.

 In example/solr/conf/solrconfig.xml, there are several <lib
 dir="path"/> elements. These give classpath directories and jar files
 to include when loading classes (and resource files). Try adding the
 paths for your Tika jars as <lib/> directives.

 On Mon, Mar 15, 2010 at 9:02 PM, Steve Reichgut sreich...@axtaweb.com
  wrote:


 Sure. I've attached two docs that have the stack trace and the full list
 of
 .jar files.

 On 3/15/2010 8:34 PM, Lance Norskog wrote:


 Please post the complete stack trace. Also, it will help if you make a
 full listing of all .jar files in the example/ directory.

 On Mon, Mar 15, 2010 at 7:12 PM, Steve Reichgut sreich...@axtaweb.com
  wrote:



 Thanks Lance. That helped ( we are using Solr-1.4). We've run into a
 follow-on error though. It is giving the following error:
 ClassNotFoundException: org.apache.solr.util.plugin.SolrCoreAware

 Did we miss something else in the setup?

 Steve

 Is there something else we haven't copied

 On 3/15/2010 6:12 PM, Lance Norskog wrote:



 This assumes you use the Solr-1.4 release or the Solr-1.5-dev trunk.

 The ExtractingRequestHandler libraries are in contrib/extracting/lib

 You need to make a directory example/solr/lib and copy into it the
 apache-solr-cell jar from dist/ and all of the libraries from
 contrib/extracting/lib. The Wiki page has not been updated for the
 Solr 1.4 release. I just added a TODO to this effect.

 On 3/12/10, Steve Reichgut sreich...@axtaweb.com wrote:




 Hi Grant,
 Thanks for the feedback. In reading the Wiki, it recommended that you
 copy everything from example/solr/libs directory into a /libs
 directory
 in your instance. I went into my example/solr directory and only see
 two
 directories - bin and conf. There is no libs directory. Where
 else
 can I get the contents of what should be in libs?

 Steve

 On 3/12/2010 2:15 PM, Grant Ingersoll wrote:




 On Mar 12, 2010, at 2:20 PM, Steve Reichgut wrote:






 Now that I have configured my Solr instance for standard indexing,
 I
 wanted to start indexing PDF's, MS Doc's, etc. When I tried to test
 it
 with a simple PDF file, I got the following error:

    org.apache.solr.common.SolrException: lazy loading error
    Caused by: org.apache.solr.common.SolrException: Error loading
 class
    'org.apache.solr.handler.extraction.ExtractingRequestHandler'

 Based on the error, it appeared that the problem is caused by
 certain
 components not being installed or installed correctly. Since I am
 not
 a
 Java guy, I had my Java person try to install the
 ExtractingRequestHandler to no avail. He had said that he was
 having
 real
 trouble finding good documentation on how to install and enable
 this
 handler.

 Could anyone point me to good documentation on how to
 install/troubleshoot this?





 http://wiki.apache.org/solr/ExtractingRequestHandler

 Essentially, you need to make sure the ERH stuff is in Solr/lib
 before
 starting.

 -Grant


-- 
Lance Norskog
goks...@gmail.com


spanish solr tutorial

2010-03-16 Thread Juan Pedro Danculovic
Hi all, we translated the Solr tutorial to Spanish due to a client's
request. For all you Spanish speakers/readers out there, you can have a look
at it:

http://www.linebee.com/?p=155

We hope this can expand the usage of the project and lower the language
barrier for non-English speakers.

Thanks

Juan Danculovic
CTO - www.linebee.com


Re: APR setup

2010-03-16 Thread Lance Norskog
That would be a Tomcat question :)

On Tue, Mar 16, 2010 at 8:36 PM, blargy zman...@hotmail.com wrote:

 [java] INFO: The APR based Apache Tomcat Native library which allows optimal
 performance in production environments was not found on the
 java.library.path:
 .:/Library/Java/Extensions:/System/Library/Java/Extensions:/usr/lib/java

 What the heck is this and why is it recommended for production settings?
 Anyone?

 --
 View this message in context: 
 http://old.nabble.com/APR-setup-tp27927553p27927553.html
 Sent from the Solr - User mailing list archive at Nabble.com.





-- 
Lance Norskog
goks...@gmail.com


Re: problem during benchmarking solr query

2010-03-16 Thread Lance Norskog
Use a + sign or %20 for the space. The URL standard uses a plus to mean a space.
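
So the --uri1 argument would become something like (one line, quoted so
the shell doesn't mangle the parentheses and caret):

  --uri1 '/solr/select/?q=body:hotel+AND+_val_:recip(hsin(0.7113258,-1.291311553,lat_rad,lng_rad,30),1,1,0)^100'

or the same thing with %20 in place of each +.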

On Tue, Mar 16, 2010 at 6:06 PM, KshamaPai kshamapai2...@gmail.com wrote:

 Hi,
 Am using autobench to benchmark solr with the query
 http://localhost:8983/solr/select/?q=body:hotel AND
 _val_:recip(hsin(0.7113258,-1.291311553,lat_rad,lng_rad,30),1,1,0)^100

 But if i specify the same in the autobench command as
 autobench --file bar1.tsv --high_rate 100 --low_rate 20 --rate_step 20
 --host1 localhost --single_host --port1 8983 --num_conn 10 --num_call 10
 --uri1 /solr/select/?q=body:hotel AND
 _val_:recip(hsin(0.7113258,-1.291311553,lat_rad,lng_rad,30),1,1,0)^100

 it is taking body:hotel as the uri but not the _val_ part, which I think is because
 of the space after hotel. Even if I try escaping this in autobench using
 '\' it'll give a parse error in Solr.

 Can anyone suggest how I can handle this, so that the entire query is
 treated as the uri and Solr responds with an appropriate reply?
 thank you.


 --
 View this message in context: 
 http://old.nabble.com/problem-during-benchmarking-solr-query-tp27926801p27926801.html
 Sent from the Solr - User mailing list archive at Nabble.com.





-- 
Lance Norskog
goks...@gmail.com


Re: PDFBox/Tika Performance Issues

2010-03-16 Thread Mattmann, Chris A (388J)
Hi Giovanni,

Comments below:

 I'm pretty unclear on how to patch our Solr instance to use the Tika 0.7
 trunk. This is what I've tried so far (which was really just me guessing):
 
 
 
 1. Got the latest version of the trunk code from
 http://svn.apache.org/repos/asf/lucene/tika/trunk
 
 2. Built this using Maven (mvn install)
 

On track so far.

 3. I took the resulting tika-app-0.7-SNAPSHOT.jar, copied it to the /Lib
 folder for my Solr Core, and renamed it to the name of the existing Tika Jar
 (tika-0.3.jar).

I don't think you need to do this (w.r.t. the renaming). I think what you
need to do is to drop:

tika-core-0.7-SNAPSHOT.jar
tika-parsers-0.7-SNAPSHOT.jar

Into your Solr core /lib folder. Also you should make sure to take the
updated PDFBox 1.0.0 jar (you can get this by running
mvn dependency:copy-dependencies in the tika-parsers project, see here:
http://maven.apache.org/plugins/maven-dependency-plugin/copy-dependencies-mojo.html),
along with the rest of the jar deps for tika-parsers, and drop them
in there as well. Then, make sure to remove the existing tika-0.3.jar, as
well as any of the existing parser lib jar files, and replace them with the
new deps.

A bunch of manual labor yes, but you're on the bleeding edge, so c'est la
vie, right? :) The alternative is to wait for Tika 0.7 to be released and
then for Solr to upgrade to it.
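
Spelling the steps out as commands (a sketch - /path/to/solr-core/lib
stands in for wherever your core's lib directory actually is):

  svn co http://svn.apache.org/repos/asf/lucene/tika/trunk tika-trunk
  cd tika-trunk && mvn install
  cd tika-parsers && mvn dependency:copy-dependencies
  cp ../tika-core/target/tika-core-0.7-SNAPSHOT.jar /path/to/solr-core/lib/
  cp target/tika-parsers-0.7-SNAPSHOT.jar /path/to/solr-core/lib/
  cp target/dependency/*.jar /path/to/solr-core/lib/  # includes pdfbox-1.0.0.jar
  # then remove tika-0.3.jar and the old parser dependency jars from that directory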

 
 4. Then I bounced my servlet server and tried indexing a document. The
 document was successfully indexed, and there were no errors logged as a
 result, but the PDF data does not appear to have been extracted (the field I
 used for map.content had an empty-string as a value).

I think this probably has to do with the lib deps. Try what I mentioned above and
let's go from there.

Cheers,
Chris

 -Original Message-
 From: Giovanni Fernandez-Kincade [mailto:gfernandez-kinc...@capitaliq.com]
 Sent: Tuesday, March 16, 2010 5:41 PM
 To: solr-user@lucene.apache.org
 Subject: RE: PDFBox/Tika Performance Issues
 
 
 
 Thanks Chris!
 
 
 
 I'll try the patch.
 
 
 
 -Original Message-
 
 From: Mattmann, Chris A (388J) [mailto:chris.a.mattm...@jpl.nasa.gov]
 
 Sent: Tuesday, March 16, 2010 5:37 PM
 
 To: solr-user@lucene.apache.org
 
 Subject: Re: PDFBox/Tika Performance Issues
 
 
 
 Guys, I think this is an issue with PDFBOX and the version that Tika 0.6
 depends on. Tika 0.7-trunk upgraded to PDFBox 1.0.0 (see [1]), so it may
 include a fix for the problem you're seeing.
 
 
 
 See this discussion [2] on how to patch Tika to use the new PDFBox if you
 can't wait for the 0.7 release which should happen soon (hopefully next few
 weeks).
 
 
 
 Cheers,
 
 Chris
 
 
 
 [1] http://issues.apache.org/jira/browse/TIKA-380
 
 [2] http://www.mail-archive.com/tika-u...@lucene.apache.org/msg00302.html
 
 
 
 
 
 On 3/16/10 2:31 PM, Giovanni Fernandez-Kincade
 gfernandez-kinc...@capitaliq.com wrote:
 
 
 
 Originally 16 (the number of CPUs on the machine), but even with 5 threads
 it's not looking so hot.
 
 
 
 -Original Message-
 
 From: Grant Ingersoll [mailto:gsi...@gmail.com] On Behalf Of Grant Ingersoll
 
 Sent: Tuesday, March 16, 2010 5:15 PM
 
 To: solr-user@lucene.apache.org
 
 Subject: Re: PDFBox/Tika Performance Issues
 
 
 
 Hmm, that is an ugly thing in PDFBox.  We should probably take this over to
 the PDFBox project.  How many threads are you indexing with?
 
 
 
 FWIW, for that many documents, I might consider using Tika on the client side
 to save on a lot of network traffic.
 
 
 
 -Grant
 
 
 
 On Mar 16, 2010, at 4:37 PM, Giovanni Fernandez-Kincade wrote:
 
 
 
 I've been trying to bulk index about 11 million PDFs, and while profiling our
 Solr instance, I noticed that all of the threads that are processing indexing
 requests are constantly blocking each other during this call:
 
 
 
 http-8080-Processor39 [BLOCKED] CPU time: 9:35
 
 java.util.Collections$SynchronizedMap.get(Object)
 
 org.pdfbox.pdmodel.font.PDFont.getAFM()
 
 org.pdfbox.pdmodel.font.PDSimpleFont.getFontHeight(byte[], int, int)
 
 org.pdfbox.util.PDFStreamEngine.showString(byte[])
 
 org.pdfbox.util.operator.ShowTextGlyph.process(PDFOperator, List)
 
 org.pdfbox.util.PDFStreamEngine.processOperator(PDFOperator, List)
 
 org.pdfbox.util.PDFStreamEngine.processSubStream(PDPage, PDResources,
 COSStream)
 
 org.pdfbox.util.PDFStreamEngine.processStream(PDPage, PDResources, COSStream)
 
 org.pdfbox.util.PDFTextStripper.processPage(PDPage, COSStream)
 
 org.pdfbox.util.PDFTextStripper.processPages(List)
 
 org.pdfbox.util.PDFTextStripper.writeText(PDDocument, Writer)
 
 org.pdfbox.util.PDFTextStripper.getText(PDDocument)
 
 org.apache.tika.parser.pdf.PDF2XHTML.process(PDDocument, ContentHandler,
 Metadata)
 
 org.apache.tika.parser.pdf.PDFParser.parse(InputStream, ContentHandler,
 Metadata)
 
 org.apache.tika.parser.CompositeParser.parse(InputStream, ContentHandler,
 Metadata)
 
 org.apache.tika.parser.AutoDetectParser.parse(InputStream, ContentHandler,
 Metadata)
 
 

Re: field length normalization

2010-03-16 Thread Lance Norskog
You need to change your similarity object to be more sensitive at the
short end. Here is a patch that shows how to do this:

http://issues.apache.org/jira/browse/LUCENE-2187

It involves Lucene coding.
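
For Solr 1.4 / Lucene 2.9 the hook is Similarity.lengthNorm(). A
minimal sketch (the exact curve is up to you - this one is just steeper
than the default 1/sqrt(n), so 3- and 4-term titles no longer collapse
to the same byte-encoded norm):

  public class ShortFieldSimilarity
          extends org.apache.lucene.search.DefaultSimilarity {
      @Override
      public float lengthNorm(String fieldName, int numTerms) {
          // 1/n spreads short fields further apart than 1/sqrt(n):
          // n=3 -> 0.333, n=4 -> 0.25, which encode to distinct bytes
          return 1.0f / numTerms;
      }
  }

You would then register it in schema.xml with something like
<similarity class="com.example.ShortFieldSimilarity"/> and reindex.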

On Fri, Mar 12, 2010 at 3:19 AM, muneeb muneeba...@hotmail.com wrote:

  Ah I see.
 Thanks very much Jay for your explanation, it really helped a lot.

 I guess I have to deal with this in some other way, since I am working with
 short titles and I really want short titles to appear at top. Can you
 suggest anything to bring titles with length 3 to appear before titles with
 length 4 (given they have similar scores)?

 Thanks,


 Jay Hill wrote:

 The fieldNorm is computed like this:
   fieldNorm = lengthNorm * documentBoost * documentFieldBoosts

 and the lengthNorm is:
   lengthNorm = 1/(numTermsInField)**.5
 [note that the value is encoded as a single byte, so there is some
 precision loss]

 So the values are not pre-set for the lengthNorm, but for some counts the
 lengthNorm value winds up being the same because of the precision loss. Here
 is a list of lengthNorm values for 1 to 10 term fields:

 # of terms    lengthNorm
    1          1.0
    2         .625
    3         .5
    4         .5
    5         .4375
    6         .375
    7         .375
    8         .3125
    9         .3125
   10         .3125

 That's why, in your example, the lengthNorm for 3 and 4 is the same.

 -Jay
 http://www.lucidimagination.com





 On Thu, Mar 11, 2010 at 9:50 AM, muneeb muneeba...@hotmail.com wrote:



 :
 : Did you reindex after setting omitNorms to false? I'm not sure whether
 or
 : not it is needed, but it makes sense.

 Yes i deleted the old index and reindexed it.
 Just to add another fact: the titles' length is less than 10. I am
 not
 sure if solr has pre-set values for length normalizations, because for
 titles with 3 as well as 4 terms the fieldNorm is coming up as 0.5 (in
 the
 debugQuery section).


 --
 View this message in context:
 http://old.nabble.com/field-length-normalization-tp27862618p27867025.html
 Sent from the Solr - User mailing list archive at Nabble.com.





 --
 View this message in context: 
 http://old.nabble.com/field-length-normalization-tp27862618p27874123.html
 Sent from the Solr - User mailing list archive at Nabble.com.





-- 
Lance Norskog
goks...@gmail.com


Issue in search

2010-03-16 Thread Suram

In Solr, how can I perform AND, OR, and NOT searches while querying the data?
-- 
View this message in context: 
http://old.nabble.com/Issue-in-search-tp27927828p27927828.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Solr RAM Requirements

2010-03-16 Thread Dennis Gearon
Just turn your entire disk into RAM

http://www.hyperossystems.co.uk/

800X faster. Who cares if it swaps to 'disk' then :-)


Dennis Gearon

Signature Warning

EARTH has a Right To Life,
  otherwise we all die.

Read 'Hot, Flat, and Crowded'
Laugh at http://www.yert.com/film.php


--- On Tue, 3/16/10, Peter Sturge peter.stu...@googlemail.com wrote:

 From: Peter Sturge peter.stu...@googlemail.com
 Subject: Re: Solr RAM Requirements
 To: solr-user@lucene.apache.org
 Date: Tuesday, March 16, 2010, 6:25 PM
 There are certainly a number of
 widely varying opinions on the use of RAM
 directory.
 Basically, though, if you need the index to be persistent
 at some point
 (i.e. saved across reboots, crashes etc.),
 you'll need to write to a disk, so RAM directory becomes
 somewhat
 superfluous in this case.
 
 Generally, good hardware and fast disks are a better bet,
 since you'll
 probably want to have them anyway :-)
 
 From my own experiences with varying types/sizes of
 indexes, and the general
 wisdom gleaned from the experts, the amount of memory
 required for a given
 environment is very much
 a 'how long is a piece of string' type of scenario. It
 depends on so many
 factors that it's impractical to come up with an easy
 'standardized' formula.
 
 What I've found useful as a rough guidance (in additon to
 the very useful
 URL I mentioned earlier), is if your server is doing lots
 of indexing and
 not much searching, you want your os fs cache to have
 access to a healthy
 amount of memory.
 If you're doing lots of searching/reading (and particularly
 faceting),
 you'll want a good amount of ram for Solr/Lucene caching
 (which caches need
 what depends on the type of data you're searching).
 If you have a server that is doing a lot of both indexing
 and searching, you
 should consider breaking them out using replication and
 possibly using load
 balancers (if you have lots of concurrent querying going
 on).
 
 It stands to reason that the bigger the index gets, the
 more memory will
 generally be required for working on various aspects of it.
 When you get
 into very large indexes, it becomes more efficient to
 distribute the
 indexing across servers (and replicating those servers), so
 that no single
 machine has huge cache lists to traverse. Again, the
 'Scaling Lucene and
 Solr' page goes into these scenarios and is well worth
 studying.
 
 
 
 On Wed, Mar 17, 2010 at 12:29 AM, KaktuChakarabati jimmoe...@gmail.com wrote:
 
 
  Hey Peter,
  Thanks for your reply.
  My question was mainly about the fact there seems to
 be two different
  aspects to the solr RAM usage: in-process and
 out-process.
  By that I mean, yes i know the many different
 parameters/caches to do with
  solr in-process memory usage and related culprits,
 however I also
  understand
  that as for actual index access (posting list,
 positional index etc), solr
  mostly delegates the access/caching of this to the
 OS/disk cache.
  So I guess my question is more about that: namely,
 what would be a good way
  to calculate an overall ram requirement profile for a
 server running solr?
  Also, I was under the impression benefits from
 RAMDirectory would be
  minimal
  when disk caches are effective no?
  And does RAMDirectory work with replication? if so,
 doesnt it slow it down?
  ( on each replication, load up entire index to RAM at
 once? )
 
 
 
  Peter Sturge wrote:
  
   On Tue, Mar 16, 2010 at 9:08 PM,
 KaktuChakarabati
  jimmoe...@gmail.com wrote:
  
  
   Hey,
   I am trying to understand what kind of
 calculation I should do in order
   to
   come up with reasonable RAM size for a given
 solr machine.
  
   Suppose the index size is at 16GB.
   The Max heap allocated to JVM is about 12GB.
  
   The machine I'm trying now has 24GB.
   When the machine is running for a while
 serving production, I can see in
   top
   that the resident memory taken by the jvm is
 indeed at 12gb.
   Now, on top of this i should assume that if i
 want the whole index to
  fit
   in
   disk cache i need about 12gb+16gb = 28GB of
 RAM just for that. Is this
   kind
   of calculation correct or am i off here?
  
  
   Hmmm..not quite. The idea of the ram usage isn't
 to simply hold the index
   in
   memory - if you want this use a RAMDirectory.
   The memory being used will be a combination of
 various caches (Lucene and
   Solr), index buffers et al., and of course the
 server itself. The
   specifics
   depend very
   much on what your server is doing at any given
 time - e.g. lots of
   concurrent searches, lots of indexing, both etc.,
 and how things are
  setup
   in your solrconfig.xml.
  
   A really excellent resource that's worth looking
 at regarding all this
  can
   be found here:
  
  
  http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Scaling-Lucene-and-Solr
  
  
  
   Any other recommendations Anyone could make
 w.r.t these numbers ?
  
   Thanks,
   -Chak
   --
   View this message in context: