Re: Problem with suggest search
Thank you. This works well as a workaround. Yesterday I got the tip to look for a wrong solrconfig.xml, and that was right: when uploading our files, the solrconfig.xml was LOST ;-) Is it possible to start Java in debug mode for more info? David

On 16.03.2010 02:02, Tom Hill wrote:
You need a query string with the standard request handler. (dismax has q.alt.) Try q=*:* if you are trying to get facets for all documents. And yes, a friendlier error message would be a good thing. Tom

On Mon, Mar 15, 2010 at 9:03 AM, David Rühr d...@marketing-factory.de wrote:
Hi List. We have two servers, dev and live. Dev is not our problem, but on live, when the facet.prefix parameter is used for suggest search without a q param, we see this error:

HTTP Status 500 - null java.lang.NullPointerException
  at java.io.StringReader.<init>(StringReader.java:54)
  at org.apache.lucene.queryParser.QueryParser.parse(QueryParser.java:197)
  at org.apache.solr.search.LuceneQParser.parse(LuceneQParserPlugin.java:78)
  at org.apache.solr.search.QParser.getQuery(QParser.java:137)
  at org.apache.solr.handler.component.QueryComponent.prepare(QueryComponent.java:85)
  at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:174)
  at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
  at org.apache.solr.core.SolrCore.execute(SolrCore.java:1313)
  at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:341)
  at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:244)
  at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
  at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
  at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
  at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
  at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128)
  at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
  at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
  at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:293)
  at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:849)
  at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:583)
  at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:454)
  at java.lang.Thread.run(Thread.java:811)

The query looks like:
facet=on&facet.mincount=1&facet.limit=10&json.nl=map&wt=json&rows=0&version=1.2&omitHeader=true&fl=content&start=0&q=&facet.prefix=mate&facet.field=content&fq=group:0+OR+group:-2+OR+group:1+OR+group:11+-group:-1&fq=language:0

When we add the q param, e.g. q=material, we have no error. Does anyone have the same error, or can anyone help? Thanks to all. David

Kind regards, David Rühr, PHP Programmer -- Marketing Factory Consulting GmbH * mailto:d...@marketing-factory.de Stephanienstraße 36 * Tel.: +49 211-361176-58 D-40211 Düsseldorf, Germany * Fax: +49 211-361176-99 Amtsgericht Düsseldorf HRB 53971 * http://www.marketing-factory.de/ Managing Directors: Peter Faisst | Katja Faisst | Karoline Steinfatt | Christoph Allefeld | Markus M. Kimmel
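One way to avoid the NPE without changing every client is to give the handler a default q in solrconfig.xml. A minimal sketch, assuming a dedicated suggest handler; the handler name and the facet defaults below are illustrative, derived from the query above rather than from the poster's actual config:

  <requestHandler name="/suggest" class="solr.SearchHandler">
    <lst name="defaults">
      <!-- match-all default so requests without q no longer hit the NullPointerException -->
      <str name="q">*:*</str>
      <str name="facet">on</str>
      <str name="facet.field">content</str>
      <str name="facet.mincount">1</str>
      <str name="facet.limit">10</str>
    </lst>
  </requestHandler>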
Re: AutoSuggest
Shalin Shekhar Mangar wrote:
On Sat, Mar 13, 2010 at 9:30 AM, Suram reactive...@yahoo.com wrote:
Erick Erickson wrote: Did you commit your changes? Erick
On Fri, Mar 12, 2010 at 7:38 AM, Suram reactive...@yahoo.com wrote: Can I set my index fields for auto-suggestion? Sometimes a newly indexed field is not found for auto-suggestion and index search.
Yes, obviously I committed the changes, but it won't suggest.
How are you trying to do the auto-suggest? Paste your field's and type's schema definitions as well as the Solr URL you are hitting. -- Regards, Shalin Shekhar Mangar.

Hi Shalin, here is my schema.xml, attached: http://old.nabble.com/file/p27916777/schema.xml
The query I am hitting is: http://localhost:8080/solr/core0/terms?terms.fl=name&terms.prefix=b&omitHeader=true
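For reference, a /terms URL like the one above implies the TermsComponent is wired up in solrconfig.xml along these lines (a sketch of the stock Solr 1.4 wiring, not taken from the attached schema). Note that the TermsComponent only sees the committed index, so newly indexed terms appear only after a commit and a reopened searcher:

  <searchComponent name="termsComponent" class="solr.TermsComponent"/>

  <requestHandler name="/terms" class="solr.SearchHandler">
    <lst name="defaults">
      <bool name="terms">true</bool>
    </lst>
    <arr name="components">
      <str>termsComponent</str>
    </arr>
  </requestHandler>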
Re: How to get Term Positions?
If you're going to spend time mucking with TermPositions, you should just spend your time working with SpanQuery, as that is what I understand you to be asking about. AIUI, you want to be able to get at the positions in the document where the query matched. This is exactly what a SpanQuery and its derivatives do. It does all the work that you would otherwise have to do yourself with the TermPositions class.

On Mar 12, 2010, at 6:38 PM, MitchK wrote:
Thank you both for your responses. However, I am not familiar enough with Solr, and even less with Lucene. So, at the moment, I have no real idea of what payloads are (I can't even translate this word...). The manual says something about metadata, but nothing about what metadata they mean. I think that, looking at my little experience with Lucene and Solr, it would be a better idea to first read some material like Lucene in Action before trying to customize (or contribute to) Lucene/Solr at such a level. Is anyone currently working on the tickets? It seems like there was no more time to do so.

Last but not least, I want to add something productive to my question. The paper that may describe the solution to my problem: http://lucene.apache.org/java/3_0_1/fileformats.html#Positions To quote: "PositionDelta is, if payloads are disabled for the term's field, the difference between the position of the current occurrence in the document and the previous occurrence (or zero, if this is the first occurrence in this document)." If I could retrieve the given information, this would be great, even if it forces me to iterate over the document where the term occurs. Lucene's TermPositions class seems to be a good place to start, doesn't it? What do you think? [1] Integrating some Lucene-based work into Solr is another question... I think one needs a map showing which class is usually called by which class, but that is really another topic :).
[1] http://lucene.apache.org/java/3_0_1/api/all/org/apache/lucene/store/instantiated/InstantiatedTermPositions.html
Thank you! - Mitch

-- Grant Ingersoll http://www.lucidimagination.com/ Search the Lucene ecosystem using Solr/Lucene: http://www.lucidimagination.com/search
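To make the suggestion concrete, here is a minimal Lucene 3.0 sketch of walking match positions with a SpanQuery. The index path, field, and term are illustrative, not from the thread:

  import java.io.File;
  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.index.Term;
  import org.apache.lucene.search.spans.SpanTermQuery;
  import org.apache.lucene.search.spans.Spans;
  import org.apache.lucene.store.FSDirectory;

  public class SpanPositions {
    public static void main(String[] args) throws Exception {
      IndexReader reader = IndexReader.open(FSDirectory.open(new File("/path/to/index")));
      SpanTermQuery query = new SpanTermQuery(new Term("content", "material"));
      Spans spans = query.getSpans(reader);
      while (spans.next()) {
        // doc() is the internal Lucene doc id; start()/end() are token positions
        System.out.println("doc=" + spans.doc()
            + " start=" + spans.start() + " end=" + spans.end());
      }
      reader.close();
    }
  }

SpanNearQuery and the other span types compose the same way and still report start/end positions per match.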
Re: Spatial search in Solr 1.5
On Mar 15, 2010, at 11:36 AM, Jean-Sebastien Vachon wrote:
Hi All, I'm trying to figure out how to perform spatial searches using Solr 1.5 (from the trunk). Is the support for spatial search built-in?

Almost. The main thing missing right now is filtering. There are still ways to do spatial filtering, but it isn't complete yet. In the meantime, range queries and/or frange might help.

because none of the patches I tried could be applied to the source tree. If this is the case, can someone tell me how to configure it?

http://wiki.apache.org/solr/SpatialSearch has most of the docs, but they aren't complete yet. Here's what I would do:
1) Check out latest Solr
2) Build the example: ant clean example
3) Start the example: cd example; java -jar start.jar
4) Rebuild the index: cd exampledocs; java -jar post.jar *.xml
5) Run a query: http://localhost:8983/solr/select/?q=_val_:"recip(dist(2, store, vector(34.0232,-81.0664)),1,1,0)"&fl=*,score // Note, I just updated this; it used to be point instead of vector, and that was wrong.

Next, have a look at the docs in exampledocs, specifically the store field, which contains the location. Then go check out the schema for that field. HTH, Grant

-- Grant Ingersoll http://www.lucidimagination.com/ Search the Lucene ecosystem using Solr/Lucene: http://www.lucidimagination.com/search
solr.WordDelimiterFilterFactory problem with hyphenated terms?
This is my first post on this list -- apologies if this has been discussed before; I didn't come upon anything exactly equivalent in searching the archives via Google. I'm using Solr 1.4 as part of the VuFind application, and I just noticed that searches for hyphenated terms are failing in strange ways. I strongly suspect it has something to do with the solr.WordDelimiterFilterFactory filter, but I'm not exactly sure what. The problem is that I have a record with the title "Love customs in eighteenth-century Spain." Depending on how I search for this, I get successes or failures in a seemingly unpredictable pattern. The demonstration queries below were tested using the direct Solr administration tool, just to eliminate any VuFind-related factors from the equation while debugging.

Queries that work:
title:(Love customs in eighteenth century Spain) // no hyphen, no phrases
title:"Love customs in eighteenth-century Spain" // phrase search on whole title, with hyphen

Queries that fail:
title:(Love customs in eighteenth-century Spain) // hyphen, no phrases
title:"Love customs in eighteenth century Spain" // phrase search on whole title, without hyphen
title:(Love customs in "eighteenth-century" Spain) // hyphenated word as phrase
title:(Love customs in "eighteenth century" Spain) // hyphenated word as phrase, hyphen removed

Here is VuFind's text field type definition:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    <filter class="schema.UnicodeNormalizationFilterFactory" version="icu4j" composed="false" remove_diacritics="true" remove_modifiers="true" fold="true"/>
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    <filter class="schema.UnicodeNormalizationFilterFactory" version="icu4j" composed="false" remove_diacritics="true" remove_modifiers="true" fold="true"/>
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
  </analyzer>
</fieldType>

I did notice that the text field type in VuFind's schema has catenateWords and catenateNumbers turned on in both the index and query analyzer chains. It is my understanding that these options should be disabled for the query chain and only enabled for the index chain. However, this may be a red herring -- I have already tried changing this setting, and it didn't change the success/failure pattern described above. I have also played with the preserveOriginal setting without apparent effect.
From playing with the Field Analysis tool, I notice that there is a gap in the term position sequence after analysis... but I'm not sure if this is significant. Has anybody else run into this sort of problem? Any ideas on a fix? Thanks, Demian
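For anyone comparing configurations: the recommendation the poster alludes to is to catenate only at index time. A sketch of the query-side WordDelimiterFilterFactory with catenation off, mirroring the stock Solr example schema (not a confirmed fix for the position-gap issue above):

  <filter class="solr.WordDelimiterFilterFactory"
          generateWordParts="1" generateNumberParts="1"
          catenateWords="0" catenateNumbers="0"
          catenateAll="0" splitOnCaseChange="1"/>

With catenation on at query time, "eighteenth-century" expands to multiple tokens at overlapping positions, which is one way phrase queries against the indexed positions can stop matching.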
DIH request parameters
Hi, According to the wiki it's possible to pass parameters to the DIH: http://wiki.apache.org/solr/DataImportHandler#Accessing_request_parameters I assume they are just replaced via simple string replacement, which is exactly what I need. Can they also be used in all places, even attributes (for example, to pass in the password)? Furthermore, is there some way to define default values for these request parameters in case no value is passed in? regards, Lukas Kahwe Smith m...@pooteeweet.org
SQL and $deleteDocById
Hi, I am trying to use $deleteDocById to delete rows based on an SQL query in my db-data-config.xml. The following tag is a top-level tag in the document tag:

<entity name="company_del" query="SELECT e.id AS `$deleteDocById` FROM deletedentity AS e"/>

However, it seems like it's only fetching the rows; it's not actually issuing any index deletes. regards, Lukas Kahwe Smith m...@pooteeweet.org
Re: PDF extraction leads to reversed words
Hi again, I just tried the 1.5-dev version from the Solr trunk. After applying the patch you provided and adding icu4j-3_8_1 to the classpath, the results are quite different from before. Now words and texts are not reversed and are displayed correctly, except for some PDF files' text parts that Solr displays in a strange manner, especially when Arabic and Latin are in the same paragraph. I'll check into this again.

On Tue, Mar 9, 2010 at 4:13 PM, Robert Muir rcm...@gmail.com wrote:
On Tue, Mar 9, 2010 at 10:10 AM, Abdelhamid ABID aeh.a...@gmail.com wrote: nor does the 3.8 version change anything!
The patch (https://issues.apache.org/jira/browse/SOLR-1813) can only work on Solr trunk. It will not work with Solr 1.4. Solr 1.4 uses pdfbox-0.7.3.jar, which does not support Arabic. Solr trunk uses pdfbox-0.8.0-incubating.jar, which does support Arabic, if you also put ICU in the classpath. -- Robert Muir rcm...@gmail.com

-- Abdelhamid ABID Software Engineer - J2EE / WEB / ESB MULE
Switching data dir on the fly
I generate a Solr index on a Hadoop cluster and I want to copy it from HDFS to a server running Solr. I wish to copy the index to a different disk than the one the Solr instance is using, then tell the Solr server to switch from the current data dir to the location where I copied the Hadoop-generated index (without any search service interruptions). Is it possible? Does anyone have a better solution? Thanks
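One approach worth sketching: if the server runs multiple cores, register a new core over the copied index and atomically swap it with the live one via the CoreAdmin handler, so searches are never interrupted. The paths and core names below are illustrative, and this assumes your Solr version's CREATE action accepts a dataDir parameter:

  # register a core whose data dir is the freshly copied Hadoop index
  http://localhost:8983/solr/admin/cores?action=CREATE&name=staging&instanceDir=/opt/solr/core0&dataDir=/disk2/hadoop_index

  # atomically exchange it with the serving core
  http://localhost:8983/solr/admin/cores?action=SWAP&core=live&other=staging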
Stemming suggestions
Most of our documents will be in English, but not all, and we are certainly in the process of acquiring more international content. Does anyone have any experience using the different stemmers on languages of unknown origin? Which ones perform the best? Give the most relevant results? What are the main advantages of each one? I've heard that KStemmer is a less aggressive stemmer and is supposed to perform quite well. Will it work for multiple languages? Any suggestions would be appreciated. Thanks
Re: LucidWorks Solr
I used it mostly for KStemmer, but I also liked the fact that it included about a dozen or so stable patches since Solr 1.4 was released. We just use the included WAR in our project, however; we don't use the installer or anything like that.

From: blargy zman...@hotmail.com To: solr-user@lucene.apache.org Sent: Tue, March 16, 2010 11:52:17 AM Subject: LucidWorks Solr
Has anyone used this?: http://www.lucidimagination.com/Downloads/LucidWorks-for-Solr Other than the KStemmer and installer, what are the other enhancements that this download offers? Is it worth using over the default Solr installation? Thanks
Re: LucidWorks Solr
I'm trying it out right now. I hope it will work well out of the box for indexing/searching a set of documents with frequent updates. -aj

On Tue, Mar 16, 2010 at 11:52 AM, blargy zman...@hotmail.com wrote: Has anyone used this?: http://www.lucidimagination.com/Downloads/LucidWorks-for-Solr [...]

-- AJ Chen, PhD Chair, Semantic Web SIG, sdforum.org http://web2express.org twitter @web2express Palo Alto, CA, USA 650-283-4091 *Building social media monitoring pipeline, and connecting social customers to CRM*
Re: Stemming suggestions
If you search the mail archive, you'll find many discussions of multilingual indexing/searching that will provide you a plethora of information. But the synopsis, as I remember it, is that using a single stemmer for multiple languages is generally a bad idea. Best, Erick

On Tue, Mar 16, 2010 at 12:19 PM, blargy zman...@hotmail.com wrote: Most of our documents will be in English, but not all [...]
Moving From Oracle Text Search To Solr
I am working on an application that currently hits a database containing millions of very large documents. I use Oracle Text Search at the moment, and things work fine. However, there is a request for faceting capability, and Solr seems like a technology I should look at. Suffice to say I am new to Solr, but at the moment I see two approaches, each with drawbacks:

1) Have Solr index document metadata (id, subject, date). Then use Oracle Text to do a content search based on criteria. Finally, query the Solr index for all documents whose ids match the set of ids returned by Oracle Text. That strikes me as an unmanageable Boolean query (e.g. id:4 OR id:33432323 OR ...).

2) Remove Oracle Text from the equation and use Solr to query document content based on search criteria. The indexing process, though, will almost certainly encounter an OutOfMemoryError given the number and size of documents.

I am using the embedded server and Solr Java APIs to do the indexing and querying. I would welcome your thoughts on the best way to approach this situation. Please let me know if I should provide additional information. Thanks.
Re: LucidWorks Solr
Kevin, when you say you just included the war, you mean the /packs/solr.war, correct? I see that the KStemmer is nicely packed in there, but I don't see LucidGaze anywhere. Have you had any experience using this? So I'm guessing you would suggest using the LucidWorks solr.war over the apache-solr war just because of the various bug fixes/tests. As a side question: is there a reason you chose the LucidKStemmer over any other stemmers (KStemmer, Porter, etc.)? I'm unsure which stemmer would work best. Thanks again!

Kevin Osborn-2 wrote: I used it mostly for KStemmer [...]
Re: Moving From Oracle Text Search To Solr
Why do you think you'd hit OOM errors? How big is "very large"? I've indexed, as a single document, a 26-volume encyclopedia of civil war records. Although, as much as I like the technology, if I could get away without using two technologies, I would. Are you completely sure you can't get what you want with clever Oracle querying? Best, Erick

On Tue, Mar 16, 2010 at 3:20 PM, Neil Chaudhuri nchaudh...@potomacfusion.com wrote: I am working on an application that currently hits a database containing millions of very large documents. [...]
XML data in solr field
Hello Experts, I need help on this issue of mine. I am unsure if this scenario is possible. I have a field in my Solr document named inputxml, the value of which is an XML string as below. This XML structure is within the inputxml field value. I need help searching this XML structure, i.e. if I search for Venue, I should get "Radio City Music Hall" as the result and not the complete tag like <Venue value="Radio City Music Hall" />. Is this supported in Solr? If it is, how can this be implemented?

<root>
  <Venue value="Radio City Music Hall" />
  <Link value="http://bit.ly/Rndab" />
  <LinkText value="En savoir +" />
  <Address value="New-York, USA" />
</root>

Any help is appreciated. I do not need the tag name in the result; instead I need the tag value. Thanks in advance, Manas Nair
Re: LucidWorks Solr
For my purposes, the Porter analyzer was overly aggressive with stemming, so we then moved to KStem. It looks like this is no longer being maintained, and Lucid claimed much better performance with theirs, so I gave that a try and it seems to be working fine. I didn't do any benchmarks, though. And I just took the war in LucidWorks\dist. I think in the install instructions there was also a script to apply to the included source code; I did that as well since I look at the source regularly. I didn't look at LucidGaze or any of the other Lucid features. -Kevin

From: blargy zman...@hotmail.com To: solr-user@lucene.apache.org Sent: Tue, March 16, 2010 12:31:09 PM Subject: Re: LucidWorks Solr
Kevin, when you say you just included the war [...]
Re: Moving From Oracle Text Search To Solr
I've also indexed a concatenation of 50k journal articles (making a single document of several hundred MB of text) and it did not give me an OOM. -glen

On 16 March 2010 15:57, Erick Erickson erickerick...@gmail.com wrote: Why do you think you'd hit OOM errors? [...]
PDFBox/Tika Performance Issues
I've been trying to bulk index about 11 million PDFs, and while profiling our Solr instance, I noticed that all of the threads that are processing indexing requests are constantly blocking each other during this call:

http-8080-Processor39 [BLOCKED] CPU time: 9:35
java.util.Collections$SynchronizedMap.get(Object)
org.pdfbox.pdmodel.font.PDFont.getAFM()
org.pdfbox.pdmodel.font.PDSimpleFont.getFontHeight(byte[], int, int)
org.pdfbox.util.PDFStreamEngine.showString(byte[])
org.pdfbox.util.operator.ShowTextGlyph.process(PDFOperator, List)
org.pdfbox.util.PDFStreamEngine.processOperator(PDFOperator, List)
org.pdfbox.util.PDFStreamEngine.processSubStream(PDPage, PDResources, COSStream)
org.pdfbox.util.PDFStreamEngine.processStream(PDPage, PDResources, COSStream)
org.pdfbox.util.PDFTextStripper.processPage(PDPage, COSStream)
org.pdfbox.util.PDFTextStripper.processPages(List)
org.pdfbox.util.PDFTextStripper.writeText(PDDocument, Writer)
org.pdfbox.util.PDFTextStripper.getText(PDDocument)
org.apache.tika.parser.pdf.PDF2XHTML.process(PDDocument, ContentHandler, Metadata)
org.apache.tika.parser.pdf.PDFParser.parse(InputStream, ContentHandler, Metadata)
org.apache.tika.parser.CompositeParser.parse(InputStream, ContentHandler, Metadata)
org.apache.tika.parser.AutoDetectParser.parse(InputStream, ContentHandler, Metadata)
org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(SolrQueryRequest, SolrQueryResponse, ContentStream)
org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(SolrQueryRequest, SolrQueryResponse)
org.apache.solr.handler.RequestHandlerBase.handleRequest(SolrQueryRequest, SolrQueryResponse)
org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(SolrQueryRequest, SolrQueryResponse)
org.apache.solr.core.SolrCore.execute(SolrRequestHandler, SolrQueryRequest, SolrQueryResponse)
org.apache.solr.servlet.SolrDispatchFilter.execute(HttpServletRequest, SolrRequestHandler, SolrQueryRequest, SolrQueryResponse)
org.apache.solr.servlet.SolrDispatchFilter.doFilter(ServletRequest, ServletResponse, FilterChain)
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ServletRequest, ServletResponse)
org.apache.catalina.core.ApplicationFilterChain.doFilter(ServletRequest, ServletResponse)
org.apache.catalina.core.StandardWrapperValve.invoke(Request, Response)
org.apache.catalina.core.StandardContextValve.invoke(Request, Response)
org.apache.catalina.core.StandardHostValve.invoke(Request, Response)
org.apache.catalina.valves.ErrorReportValve.invoke(Request, Response)
org.apache.catalina.core.StandardEngineValve.invoke(Request, Response)
org.apache.catalina.connector.CoyoteAdapter.service(Request, Response)
org.apache.coyote.http11.Http11Processor.process(InputStream, OutputStream)
org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(TcpConnection, Object[])
org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(Socket, TcpConnection, Object[])
org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(Object[])
org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run()
java.lang.Thread.run()

Has anyone run into this before? Any ideas on how to reduce the contention? Thanks, Gio.
Re: Moving From Oracle Text Search To Solr
If you do stay with Oracle, please report back to the list how that went. In order to get decent filtering and faceting performance, I believe you will need to use bitmapped indexes, which Oracle and some other databases support. You may want to check out my article on this subject: http://www.packtpub.com/article/text-search-your-database-or-solr ~ David Smiley Author: http://www.packtpub.com/solr-1-4-enterprise-search-server/

On Mar 16, 2010, at 4:13 PM, Neil Chaudhuri wrote:
Certainly I could use some basic SQL count(*) queries to achieve faceted results, but I am not sure of the flexibility, extensibility, or scalability of that approach. And from what I have read, Oracle Text doesn't do faceting out of the box. Each document is a few MB, and there will be millions of them. I suppose it depends on how I index them. I am pretty sure my current approach of using Hibernate to load all rows, constructing Solr POJOs from them, and then passing the POJOs to the embedded server would lead to an OOM error. I should probably look into the other options. Thanks.

-----Original Message----- From: Erick Erickson [mailto:erickerick...@gmail.com] Sent: Tuesday, March 16, 2010 3:58 PM To: solr-user@lucene.apache.org Subject: Re: Moving From Oracle Text Search To Solr [...]
Re: XML data in solr field
Do you have the option of just importing each XML node as a field/value when you add the document? That would let you do the search easily. If you need to store the raw XML, you can use an extra field. Tommy Chheng Programmer and UC Irvine Graduate Student Twitter @tommychheng http://tommy.chheng.com

On 3/16/10 12:59 PM, Nair, Manas wrote: Hello Experts, I need help on this issue of mine. [...]
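A rough SolrJ sketch of that suggestion (Solr 1.4-era API; the field names are made up and would need matching definitions in schema.xml):

  import org.apache.solr.client.solrj.SolrServer;
  import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
  import org.apache.solr.common.SolrInputDocument;

  public class IndexVenue {
    public static void main(String[] args) throws Exception {
      String rawXml = "<root><Venue value=\"Radio City Music Hall\" />...</root>";
      SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
      SolrInputDocument doc = new SolrInputDocument();
      // one searchable field per XML attribute...
      doc.addField("venue", "Radio City Music Hall");
      doc.addField("link", "http://bit.ly/Rndab");
      doc.addField("linktext", "En savoir +");
      doc.addField("address", "New-York, USA");
      // ...plus a stored copy of the original XML for display
      doc.addField("inputxml", rawXml);
      server.add(doc);
      server.commit();
    }
  }

A search like venue:"Radio City Music Hall" then returns the value directly, without the surrounding tag.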
Solr RAM Requirements
Hey, I am trying to understand what kind of calculation I should do in order to come up with a reasonable RAM size for a given Solr machine. Suppose the index size is 16GB. The max heap allocated to the JVM is about 12GB. The machine I'm trying now has 24GB. When the machine has been running for a while serving production, I can see in top that the resident memory taken by the JVM is indeed at 12GB. Now, on top of this, I should assume that if I want the whole index to fit in disk cache I need about 12GB + 16GB = 28GB of RAM just for that. Is this kind of calculation correct, or am I off here? Any other recommendations anyone could make w.r.t. these numbers? Thanks, -Chak
Re: PDFBox/Tika Performance Issues
Hmm, that is an ugly thing in PDFBox. We should probably take this over to the PDFBox project. How many threads are you indexing with? FWIW, for that many documents, I might consider using Tika on the client side to save on a lot of network traffic. -Grant

On Mar 16, 2010, at 4:37 PM, Giovanni Fernandez-Kincade wrote:
I've been trying to bulk index about 11 million PDFs, and while profiling our Solr instance, I noticed that all of the threads that are processing indexing requests are constantly blocking each other during this call: [...]
-- Grant Ingersoll http://www.lucidimagination.com/ Search the Lucene ecosystem using Solr/Lucene: http://www.lucidimagination.com/search
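A hedged sketch of the client-side extraction Grant mentions (Tika and SolrJ APIs of roughly this era; the file path, URL, and field names are illustrative): run the parser in the indexing client and ship only plain text to Solr, instead of posting each PDF to the extracting handler:

  import java.io.FileInputStream;
  import java.io.InputStream;
  import org.apache.solr.client.solrj.SolrServer;
  import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
  import org.apache.solr.common.SolrInputDocument;
  import org.apache.tika.metadata.Metadata;
  import org.apache.tika.parser.AutoDetectParser;
  import org.apache.tika.sax.BodyContentHandler;

  public class ClientSideExtract {
    public static void main(String[] args) throws Exception {
      // parse the PDF locally instead of posting the binary to /update/extract
      InputStream in = new FileInputStream("/data/pdfs/doc1.pdf");
      BodyContentHandler text = new BodyContentHandler(-1); // -1 disables the write limit
      new AutoDetectParser().parse(in, text, new Metadata());
      in.close();

      // ship only the extracted plain text to Solr
      SolrServer server = new CommonsHttpSolrServer("http://localhost:8080/solr");
      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", "doc1");
      doc.addField("content", text.toString());
      server.add(doc);
    }
  }

This also moves the PDFBox lock contention out of the Solr server and spreads it across as many client processes as you care to run.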
RE: Moving From Oracle Text Search To Solr
That is a great article, David. For the moment, I am trying an all-Solr approach, but I have run into a small problem. The documents are stored as XML CLOBs using Oracle's OPAQUE object. Is there any facility to unpack this into the actual text? Or must I execute that in the SQL query? Thanks.

-----Original Message----- From: Smiley, David W. [mailto:dsmi...@mitre.org] Sent: Tuesday, March 16, 2010 4:45 PM To: solr-user@lucene.apache.org Subject: Re: Moving From Oracle Text Search To Solr [...]
RE: PDFBox/Tika Performance Issues
Originally 16 (the number of CPUs on the machine), but even with 5 threads it's not looking so hot.

-----Original Message----- From: Grant Ingersoll [mailto:gsi...@gmail.com] On Behalf Of Grant Ingersoll Sent: Tuesday, March 16, 2010 5:15 PM To: solr-user@lucene.apache.org Subject: Re: PDFBox/Tika Performance Issues
Hmm, that is an ugly thing in PDFBox. We should probably take this over to the PDFBox project. How many threads are you indexing with? FWIW, for that many documents, I might consider using Tika on the client side to save on a lot of network traffic. -Grant

On Mar 16, 2010, at 4:37 PM, Giovanni Fernandez-Kincade wrote: I've been trying to bulk index about 11 million PDFs [...]
Re: PDFBox/Tika Performance Issues
Guys, I think this is an issue with PDFBox and the version that Tika 0.6 depends on. Tika 0.7-trunk upgraded to PDFBox 1.0.0 (see [1]), so it may include a fix for the problem you're seeing. See this discussion [2] on how to patch Tika to use the new PDFBox if you can't wait for the 0.7 release, which should happen soon (hopefully in the next few weeks). Cheers, Chris

[1] http://issues.apache.org/jira/browse/TIKA-380
[2] http://www.mail-archive.com/tika-u...@lucene.apache.org/msg00302.html

On 3/16/10 2:31 PM, Giovanni Fernandez-Kincade gfernandez-kinc...@capitaliq.com wrote: Originally 16 (the number of CPUs on the machine), but even with 5 threads it's not looking so hot. [...]

++ Chris Mattmann, Ph.D. Senior Computer Scientist NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 171-266B, Mailstop: 171-246 Email: chris.mattm...@jpl.nasa.gov WWW: http://sunset.usc.edu/~mattmann/
Re: Trouble Implementing Extracting Request Handler
NoClassDefFoundError usually means that the class was found, but it needs other classes and those were not found. That is, Solr finds the ExtractingRequestHandler jar but cannot find the Tika jars. In example/solr/conf/solrconfig.xml, there are several <lib dir="path"/> elements. These give classpath directories and jar files to include when loading classes (and resource files). Try adding the paths for your Tika jars as <lib/> directives.

On Mon, Mar 15, 2010 at 9:02 PM, Steve Reichgut sreich...@axtaweb.com wrote: Sure. I've attached two docs that have the stack trace and the full list of .jar files.

On 3/15/2010 8:34 PM, Lance Norskog wrote: Please post the complete stack trace. Also, it will help if you make a full listing of all .jar files in the example/ directory.

On Mon, Mar 15, 2010 at 7:12 PM, Steve Reichgut sreich...@axtaweb.com wrote: Thanks Lance. That helped (we are using Solr 1.4). We've run into a follow-on error, though. It is giving the following error: ClassNotFoundException: org.apache.solr.util.plugin.SolrCoreAware Did we miss something else in the setup? Steve

On 3/15/2010 6:12 PM, Lance Norskog wrote: This assumes you use the Solr 1.4 release or the Solr 1.5-dev trunk. The ExtractingRequestHandler libraries are in contrib/extracting/lib. You need to make a directory example/solr/lib and copy into it the apache-solr-cell jar from dist/ and all of the libraries from contrib/extracting/lib. The Wiki page has not been updated for the Solr 1.4 release; I just added a TODO to this effect.

On 3/12/10, Steve Reichgut sreich...@axtaweb.com wrote: Hi Grant, Thanks for the feedback. In reading the Wiki, it recommended that you copy everything from the example/solr/libs directory into a /libs directory in your instance. I went into my example/solr directory and only see two directories - bin and conf. There is no libs directory. Where else can I get the contents of what should be in libs? Steve

On 3/12/2010 2:15 PM, Grant Ingersoll wrote:
On Mar 12, 2010, at 2:20 PM, Steve Reichgut wrote: Now that I have configured my Solr instance for standard indexing, I wanted to start indexing PDFs, MS Docs, etc. When I tried to test it with a simple PDF file, I got the following error: org.apache.solr.common.SolrException: lazy loading error Caused by: org.apache.solr.common.SolrException: Error loading class 'org.apache.solr.handler.extraction.ExtractingRequestHandler' Based on the error, it appeared that the problem is caused by certain components not being installed or installed correctly. Since I am not a Java guy, I had my Java person try to install the ExtractingRequestHandler, to no avail. He said he was having real trouble finding good documentation on how to install and enable this handler. Could anyone point me to good documentation on how to install/troubleshoot this?

http://wiki.apache.org/solr/ExtractingRequestHandler Essentially, you need to make sure the ERH stuff is in Solr/lib before starting. -Grant

-- Lance Norskog goks...@gmail.com
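A sketch of what those directives might look like in example/solr/conf/solrconfig.xml, assuming the stock source-tree layout (adjust the relative paths to wherever the jars actually live in your install):

  <!-- Tika, PDFBox, POI, and friends shipped with the extraction contrib -->
  <lib dir="../../contrib/extraction/lib" />
  <!-- directory holding the apache-solr-cell jar -->
  <lib dir="../../dist/" />

Copying the same jars into example/solr/lib, as Lance describes below, is the alternative that needs no config change, since that directory is picked up automatically.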
Re: DIH request parameters
They are a namespace like other namespaces and are usable in attributes, just like in the DB query string examples. As for defaults, you can declare those in the requestHandler declarations in solrconfig.xml. There are examples of this (search for "defaults") on the wiki page.

On Tue, Mar 16, 2010 at 7:05 AM, Lukas Kahwe Smith m...@pooteeweet.org wrote: Hi, According to the wiki it's possible to pass parameters to the DIH [...]

-- Lance Norskog goks...@gmail.com
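Putting the two pieces together, a sketch (the parameter and connection details are illustrative): reference the request parameter inside an attribute in data-config.xml, and give it a default in the handler's defaults block in solrconfig.xml:

  <!-- data-config.xml: ${dataimporter.request.*} works inside attributes too -->
  <dataSource driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost/db"
              user="solr"
              password="${dataimporter.request.dbPassword}"/>

  <!-- solrconfig.xml: value used when the request does not pass dbPassword -->
  <requestHandler name="/dataimport"
                  class="org.apache.solr.handler.dataimport.DataImportHandler">
    <lst name="defaults">
      <str name="config">data-config.xml</str>
      <str name="dbPassword">changeme</str>
    </lst>
  </requestHandler>

A request like /dataimport?command=full-import&dbPassword=secret then overrides the default.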
RE: PDFBox/Tika Performance Issues
I'm pretty unclear on how to patch the Tika 0.7-trunk on our Solr instance. This is what I've tried so far (which was really just me guessing):
1. Got the latest version of the trunk code from http://svn.apache.org/repos/asf/lucene/tika/trunk
2. Built this using Maven (mvn install)
3. Took the resulting tika-app-0.7-SNAPSHOT.jar, copied it to the lib folder for my Solr core, and renamed it to the name of the existing Tika jar (tika-0.3.jar).
4. Bounced my servlet server and tried indexing a document.

The document was successfully indexed, and there were no errors logged as a result, but the PDF data does not appear to have been extracted (the field I used for map.content had an empty string as a value). What's the right approach to perform this patch?

-----Original Message----- From: Giovanni Fernandez-Kincade [mailto:gfernandez-kinc...@capitaliq.com] Sent: Tuesday, March 16, 2010 5:41 PM To: solr-user@lucene.apache.org Subject: RE: PDFBox/Tika Performance Issues
Thanks Chris! I'll try the patch.

-----Original Message----- From: Mattmann, Chris A (388J) [mailto:chris.a.mattm...@jpl.nasa.gov] Sent: Tuesday, March 16, 2010 5:37 PM To: solr-user@lucene.apache.org Subject: Re: PDFBox/Tika Performance Issues [...]
Undefined field price on Dismax query
Hi guys, Based on some suggestions, I'm trying to use the dismax query type. I'm getting a weird error though that I think is related to the default test data set. From the query tool (/solr/admin/form.jsp), I put in this: Statement: artist:test title:test +type:video query type: dismax The rest is left as defaults. I get this error page: HTTP ERROR: 400 undefined field price RequestURI=/solr/select I am running out of the example dir still, but I made my own custom schema and deleted the index before inserting my new data. Am I missing something that needs to be cleared? Query type=standard works fine here. Thanks, Alex
Re: Moving From Oracle Text Search To Solr
The DataImportHandler has tools for this. It will fetch rows from Oracle and let you unpack XML columns with XPath expressions. http://wiki.apache.org/solr/DataImportHandler http://wiki.apache.org/solr/DataImportHandler#Usage_with_RDBMS http://wiki.apache.org/solr/DataImportHandler#XPathEntityProcessor On Tue, Mar 16, 2010 at 2:25 PM, Neil Chaudhuri nchaudh...@potomacfusion.com wrote: That is a great article, David. For the moment, I am trying an all-Solr approach, but I have run into a small problem. The documents are stored as XML CLOBs using Oracle's OPAQUE object. Is there any facility to unpack this into the actual text? Or must I execute that in the SQL query? Thanks. -Original Message- From: Smiley, David W. [mailto:dsmi...@mitre.org] Sent: Tuesday, March 16, 2010 4:45 PM To: solr-user@lucene.apache.org Subject: Re: Moving From Oracle Text Search To Solr If you do stay with Oracle, please report back to the list how that went. In order to get decent filtering and faceting performance, I believe you will need to use bitmapped indexes, which Oracle and some other databases support. You may want to check out my article on this subject: http://www.packtpub.com/article/text-search-your-database-or-solr ~ David Smiley Author: http://www.packtpub.com/solr-1-4-enterprise-search-server/ On Mar 16, 2010, at 4:13 PM, Neil Chaudhuri wrote: Certainly I could use some basic SQL count(*) queries to achieve faceted results, but I am not sure of the flexibility, extensibility, or scalability of that approach. And from what I have read, Oracle Text doesn't do faceting out of the box. Each document is a few MB, and there will be millions of them. I suppose it depends on how I index them. I am pretty sure my current approach of using Hibernate to load all rows, constructing Solr POJOs from them, and then passing the POJOs to the embedded server would lead to an OOM error. I should probably look into the other options. Thanks. -Original Message- From: Erick Erickson [mailto:erickerick...@gmail.com] Sent: Tuesday, March 16, 2010 3:58 PM To: solr-user@lucene.apache.org Subject: Re: Moving From Oracle Text Search To Solr Why do you think you'd hit OOM errors? How big is very large? I've indexed, as a single document, a 26-volume encyclopedia of civil war records. Although as much as I like the technology, if I could get away without using two technologies, I would. Are you completely sure you can't get what you want with clever Oracle querying? Best Erick On Tue, Mar 16, 2010 at 3:20 PM, Neil Chaudhuri nchaudh...@potomacfusion.com wrote: I am working on an application that currently hits a database containing millions of very large documents. I use Oracle Text Search at the moment, and things work fine. However, there is a request for faceting capability, and Solr seems like a technology I should look at. Suffice to say I am new to Solr, but at the moment I see two approaches, each with drawbacks: 1) Have Solr index document metadata (id, subject, date). Then use Oracle Text to do a content search based on criteria. Finally, query the Solr index for all documents whose ids match the set of ids returned by Oracle Text. That strikes me as an unmanageable Boolean query (e.g. id:4 OR id:33432323 OR ...). 2) Remove Oracle Text from the equation and use Solr to query document content based on search criteria. The indexing process though will almost certainly encounter an OutOfMemoryError given the number and size of documents. I am using the embedded server and Solr Java APIs to do the indexing and querying.
I would welcome your thoughts on the best way to approach this situation. Please let me know if I should provide additional information. Thanks. -- Lance Norskog goks...@gmail.com
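To make the DIH suggestion concrete, a hedged sketch of a data-config for this case; the table and column names are invented, and the JDBC settings will differ per site. DIH streams rows from the JDBC ResultSet rather than materializing them all the way a Hibernate load-everything approach would, which is what sidesteps the OOM concern:

<dataConfig>
  <dataSource type="JdbcDataSource"
              driver="oracle.jdbc.driver.OracleDriver"
              url="jdbc:oracle:thin:@dbhost:1521:ORCL"
              user="scott" password="tiger" />
  <document>
    <entity name="doc"
            query="SELECT ID, SUBJECT, EFFECTIVE_DT, CONTENT FROM DOCUMENTS">
      <field column="ID" name="id" />
      <field column="SUBJECT" name="subject" />
      <field column="EFFECTIVE_DT" name="date" />
      <field column="CONTENT" name="text" />
    </entity>
  </document>
</dataConfig>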
Indexing CLOB Column in Oracle
Since my original thread was straying to a new topic, I thought it made sense to create a new thread of discussion. I am using the DataImportHandler to index 3 fields in a table: an id, a date, and the text of a document. This is an Oracle database, and the document is an XML document stored as Oracle's xmltype data type, which is an instance of oracle.sql.OPAQUE. Still, it is nothing more than a fancy clob. So in my db-data-config, I have the following:

<document>
  <entity name="doc" query="SELECT d.EFFECTIVE_DT, d.ARCHIVE_ID FROM DOC d">
    <field column="EFFECTIVE_DT" name="effectiveDate" />
    <field column="ARCHIVE_ID" name="id" />
    <entity name="text" query="SELECT d.XML FROM DOC d WHERE d.ARCHIVE_ID = '${doc.ARCHIVE_ID}'" transformer="ClobTransformer">
      <field column="XML" name="text" clob="true" sourceColName="XML" />
    </entity>
  </entity>
</document>

Meanwhile, I have this in schema.xml:

<field name="text" type="text_ws" indexed="true" stored="true" multiValued="true" omitNorms="false" termVectors="true" />

However, when I take a look at my indexes with Luke, I find that the items labeled text simply say oracle.sql.OPAQUE and a bunch of numbers - in other words, the output of OPAQUE.toString(). Can you give me some insight into where I am going wrong? Thanks.
Re: Trouble Implementing Extracting Request Handler
Lance, I tried that but no luck. Just in case the relative paths were causing a problem, I also tried using absolute paths, but neither seemed to help. First, I tried adding <lib dir="/path/to/example/solr/lib" /> as the full directory so it would hopefully include everything. When that didn't work, I tried adding paths directly to the two Tika jar files in the lib directory, like this: <lib dir="/path/to/example/solr/lib/tika-core-0.4.jar" /> and <lib dir="/path/to/example/solr/lib/tika-parsers-0.4.jar" /> Am I including them incorrectly somehow? Steve On 3/16/2010 3:38 PM, Lance Norskog wrote: NoClassDefFoundError usually means that the class itself was found, but it needs other classes and those were not found. That is, Solr finds the ExtractingRequestHandler jar but cannot find the Tika jars. In example/solr/conf/solrconfig.xml, there are several <lib dir="path"/> elements. These give classpath directories and jar files to include when loading classes (and resource files). Try adding the paths for your Tika jars as <lib/> directives. On Mon, Mar 15, 2010 at 9:02 PM, Steve Reichgut sreich...@axtaweb.com wrote: Sure. I've attached two docs that have the stack trace and the full list of .jar files. On 3/15/2010 8:34 PM, Lance Norskog wrote: Please post the complete stack trace. Also, it will help if you make a full listing of all .jar files in the example/ directory. On Mon, Mar 15, 2010 at 7:12 PM, Steve Reichgut sreich...@axtaweb.com wrote: Thanks Lance. That helped (we are using Solr-1.4). We've run into a follow-on error though. It is giving the following error: ClassNotFoundException: org.apache.solr.util.plugin.SolrCoreAware Did we miss something else in the setup? Is there something else we haven't copied? Steve On 3/15/2010 6:12 PM, Lance Norskog wrote: This assumes you use the Solr-1.4 release or the Solr-1.5-dev trunk. The ExtractingRequestHandler libraries are in contrib/extraction/lib. You need to make a directory example/solr/lib and copy into it the apache-solr-cell jar from dist/ and all of the libraries from contrib/extraction/lib. The Wiki page has not been updated for the Solr 1.4 release. I just added a TODO to this effect. On 3/12/10, Steve Reichgut sreich...@axtaweb.com wrote: Hi Grant, Thanks for the feedback. In reading the Wiki, it recommended that you copy everything from the example/solr/libs directory into a /libs directory in your instance. I went into my example/solr directory and only see two directories - bin and conf. There is no libs directory. Where else can I get the contents of what should be in libs? Steve On 3/12/2010 2:15 PM, Grant Ingersoll wrote: On Mar 12, 2010, at 2:20 PM, Steve Reichgut wrote: Now that I have configured my Solr instance for standard indexing, I wanted to start indexing PDFs, MS Docs, etc. When I tried to test it with a simple PDF file, I got the following error: org.apache.solr.common.SolrException: lazy loading error Caused by: org.apache.solr.common.SolrException: Error loading class 'org.apache.solr.handler.extraction.ExtractingRequestHandler' Based on the error, it appeared that the problem is caused by certain components not being installed, or not installed correctly. Since I am not a Java guy, I had my Java person try to install the ExtractingRequestHandler, to no avail. He said that he was having real trouble finding good documentation on how to install and enable this handler. Could anyone point me to good documentation on how to install/troubleshoot this?
http://wiki.apache.org/solr/ExtractingRequestHandler Essentially, you need to make sure the ERH stuff is in Solr/lib before starting. -Grant
Re: Indexing CLOB Column in Oracle
Disclaimer: My Oracle experience is minuscule at best. I am also a beginner at Solr, so grab yourself the proverbial grain of salt. I googled a bit on CLOB. One page I found mentioned setting up a view to return the data type you want. Can you use the functions described on these pages in either the Solr query or a view? http://www.oradev.com/dbms_lob.jsp http://www.dba-oracle.com/t_dbms_lob.htm http://www.praetoriate.com/dbms_packages/ddp_dbms_lob.htm I also tried to find a way to convert from xmltype directly to a string in a query, but that quickly got way over my level of understanding. I saw hints that it is possible, though. Shawn On 3/16/2010 4:59 PM, Neil Chaudhuri wrote: Since my original thread was straying to a new topic, I thought it made sense to create a new thread of discussion. I am using the DataImportHandler to index 3 fields in a table: an id, a date, and the text of a document. This is an Oracle database, and the document is an XML document stored as Oracle's xmltype data type, which is an instance of oracle.sql.OPAQUE. Still, it is nothing more than a fancy clob.
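One approach worth trying here, on the assumption that the column really is an Oracle XMLTYPE: have Oracle itself do the conversion in the SELECT via XMLType's getClobVal(), and then let ClobTransformer turn the resulting CLOB into a string. A hedged sketch against the entity from the original post:

<entity name="text"
        query="SELECT d.XML.getClobVal() AS XML FROM DOC d WHERE d.ARCHIVE_ID = '${doc.ARCHIVE_ID}'"
        transformer="ClobTransformer">
  <field column="XML" name="text" clob="true" />
</entity>

With the value arriving as a real CLOB instead of an oracle.sql.OPAQUE, the ClobTransformer should see something it knows how to read.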
Re: Solr RAM Requirements
On Tue, Mar 16, 2010 at 9:08 PM, KaktuChakarabati jimmoe...@gmail.com wrote: Hey, I am trying to understand what kind of calculation I should do in order to come up with a reasonable RAM size for a given solr machine. Suppose the index size is at 16GB. The Max heap allocated to JVM is about 12GB. The machine I'm trying now has 24GB. When the machine is running for a while serving production, I can see in top that the resident memory taken by the jvm is indeed at 12gb. Now, on top of this i should assume that if i want the whole index to fit in disk cache i need about 12gb+16gb = 28GB of RAM just for that. Is this kind of calculation correct or am i off here? Hmmm..not quite. The idea of the ram usage isn't to simply hold the index in memory - if you want this use a RAMDirectory. The memory being used will be a combination of various caches (Lucene and Solr), index buffers et al., and of course the server itself. The specifics depend very much on what your server is doing at any given time - e.g. lots of concurrent searches, lots of indexing, both etc., and how things are set up in your solrconfig.xml. A really excellent resource that's worth looking at regarding all this can be found here: http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Scaling-Lucene-and-Solr Any other recommendations anyone could make w.r.t. these numbers? Thanks, -Chak -- View this message in context: http://old.nabble.com/Solr-RAM-Requirements-tp27924551p27924551.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Undefined field price on Dismax query
I suspect your problem is that you still have price defined in solrconfig.xml for the dismax handler. Look for the section <requestHandler name="dismax" ...>. You'll see price defined as one of the default fields for fl and bf. HTH Erick On Tue, Mar 16, 2010 at 6:55 PM, Alex Thurlow a...@blastro.com wrote: Hi guys, Based on some suggestions, I'm trying to use the dismax query type. I'm getting a weird error though that I think is related to the default test data set. From the query tool (/solr/admin/form.jsp), I put in this: Statement: artist:test title:test +type:video query type: dismax The rest is left as defaults. I get this error page: HTTP ERROR: 400 undefined field price RequestURI=/solr/select I am running out of the example dir still, but I made my own custom schema and deleted the index before inserting my new data. Am I missing something that needs to be cleared? Query type=standard works fine here. Thanks, Alex
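For anyone hitting the same thing: the relevant section of the stock example solrconfig.xml looks roughly like the sketch below (exact field names and boosts vary by release). If your schema has no price field, remove it from bf and fl or swap in fields you actually define:

<requestHandler name="dismax" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <str name="qf">text^0.5 name^1.2 id^10.0</str>
    <str name="bf">ord(popularity)^0.5 recip(rord(price),1,1000,1000)^0.3</str>
    <str name="fl">id,name,price,score</str>
  </lst>
</requestHandler>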
Re: Moving From Oracle Text Search To Solr
Besides the other notes here, I agree you'll hit OOM if you try to read all the rows into memory at once, but I'm absolutely sure you can read them N at a time instead. Not that I could tell you how, mind you. You're on your way... Erick On Tue, Mar 16, 2010 at 4:13 PM, Neil Chaudhuri nchaudh...@potomacfusion.com wrote: Certainly I could use some basic SQL count(*) queries to achieve faceted results, but I am not sure of the flexibility, extensibility, or scalability of that approach. And from what I have read, Oracle Text doesn't do faceting out of the box. Each document is a few MB, and there will be millions of them. I suppose it depends on how I index them. I am pretty sure my current approach of using Hibernate to load all rows, constructing Solr POJOs from them, and then passing the POJOs to the embedded server would lead to an OOM error. I should probably look into the other options. Thanks. -Original Message- From: Erick Erickson [mailto:erickerick...@gmail.com] Sent: Tuesday, March 16, 2010 3:58 PM To: solr-user@lucene.apache.org Subject: Re: Moving From Oracle Text Search To Solr Why do you think you'd hit OOM errors? How big is very large? I've indexed, as a single document, a 26-volume encyclopedia of civil war records. Although as much as I like the technology, if I could get away without using two technologies, I would. Are you completely sure you can't get what you want with clever Oracle querying? Best Erick On Tue, Mar 16, 2010 at 3:20 PM, Neil Chaudhuri nchaudh...@potomacfusion.com wrote: I am working on an application that currently hits a database containing millions of very large documents. I use Oracle Text Search at the moment, and things work fine. However, there is a request for faceting capability, and Solr seems like a technology I should look at. Suffice to say I am new to Solr, but at the moment I see two approaches, each with drawbacks: 1) Have Solr index document metadata (id, subject, date). Then use Oracle Text to do a content search based on criteria. Finally, query the Solr index for all documents whose ids match the set of ids returned by Oracle Text. That strikes me as an unmanageable Boolean query (e.g. id:4 OR id:33432323 OR ...). 2) Remove Oracle Text from the equation and use Solr to query document content based on search criteria. The indexing process though will almost certainly encounter an OutOfMemoryError given the number and size of documents. I am using the embedded server and Solr Java APIs to do the indexing and querying. I would welcome your thoughts on the best way to approach this situation. Please let me know if I should provide additional information. Thanks.
Re: Undefined field price on Dismax query
Aha. That appears to be the issue. I hadn't realized that the query handler had all of those definitions there. -Alex On 3/16/2010 6:56 PM, Erick Erickson wrote: I suspect your problem is that you still have price defined in solrconfig.xml for the dismax handler. Look for the section <requestHandler name="dismax" ...>. You'll see price defined as one of the default fields for fl and bf. HTH Erick On Tue, Mar 16, 2010 at 6:55 PM, Alex Thurlow a...@blastro.com wrote: Hi guys, Based on some suggestions, I'm trying to use the dismax query type. I'm getting a weird error though that I think is related to the default test data set. From the query tool (/solr/admin/form.jsp), I put in this: Statement: artist:test title:test +type:video query type: dismax The rest is left as defaults. I get this error page: HTTP ERROR: 400 undefined field price RequestURI=/solr/select I am running out of the example dir still, but I made my own custom schema and deleted the index before inserting my new data. Am I missing something that needs to be cleared? Query type=standard works fine here. Thanks, Alex
Solr query parser doesn't invoke analyzer for simple term query?
It seems that Solr's query parser doesn't pass a single-term query to the Analyzer for the field. For example, if I give it 2001年 (year 2001 in Japanese), the searcher returns 0 hits, but if I quote it with double-quotes, it returns hits. In this experiment, I configured schema.xml so that the field in question uses the morphological Analyzer my company makes, which is capable of splitting 2001年 into the two tokens 2001 and 年. I am guessing that this Analyzer is called ONLY IF the term is a phrase. Is my observation correct? If so, is there any configuration parameter I can tweak to force any query against the text fields to be processed by the Analyzer? One might ask why users won't put a space between 2001 and 年. Well, if they are clearly two separate words, people do that. But 年 works more like a suffix in this case, and in many Japanese speakers' minds, 2001年 seems like one token, so many people won't. (Remember, Japanese don't use spaces in normal writing.) Forcing use of the Analyzer would also be useful for the compound word handling often desirable for languages like German. Teruhiko Kuro Kurosaka RLP + Lucene Solr = powerful search for global contents
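For context, the schema.xml wiring being described would look something like the sketch below; the analyzer class name is hypothetical, standing in for the poster's commercial morphological analyzer:

<fieldType name="text_ja" class="solr.TextField">
  <analyzer class="com.example.analysis.JapaneseMorphAnalyzer" />
</fieldType>
<field name="body" type="text_ja" indexed="true" stored="true" />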
problem during benchmarking solr query
Hi, I am using autobench to benchmark Solr with the query http://localhost:8983/solr/select/?q=body:hotel AND _val_:recip(hsin(0.7113258,-1.291311553,lat_rad,lng_rad,30),1,1,0)^100 But if I specify the same in the autobench command as autobench --file bar1.tsv --high_rate 100 --low_rate 20 --rate_step 20 --host1 localhost --single_host --port1 8983 --num_conn 10 --num_call 10 --uri1 /solr/select/?q=body:hotel AND _val_:recip(hsin(0.7113258,-1.291311553,lat_rad,lng_rad,30),1,1,0)^100 it takes body:hotel as the URI but not the _val_ part, which I think is because of the space after hotel. Even if I try escaping this in autobench using '\', it gives a parse error in Solr. Can anyone suggest how to handle this, so that the entire query is treated as the URI and Solr responds appropriately? thank you. -- View this message in context: http://old.nabble.com/problem-during-benchmarking-solr-query-tp27926801p27926801.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solr RAM Requirements
There are certainly a number of widely varying opinions on the use of RAM directory. Basically, though, if you need the index to be persistent at some point (i.e. saved across reboots, crashes etc.), you'll need to write to a disk, so RAM directory becomes somewhat superfluous in this case. Generally, good hardware and fast disks are a better bet, since you'll probably want to have them anyway :-) From my own experiences with varying types/sizes of indexes, and the general wisdom gleaned from the experts, the amount of memory required for a given environment is very much a 'how long is a piece of string' type of scenario. It depends on so many factors that it's impractical to come up with an easy 'standardized' formula. What I've found useful as rough guidance (in addition to the very useful URL I mentioned earlier) is: if your server is doing lots of indexing and not much searching, you want your OS fs cache to have access to a healthy amount of memory. If you're doing lots of searching/reading (and particularly faceting), you'll want a good amount of RAM for Solr/Lucene caching (which caches need what depends on the type of data you're searching). If you have a server that is doing a lot of both indexing and searching, you should consider breaking them out using replication and possibly using load balancers (if you have lots of concurrent querying going on). It stands to reason that the bigger the index gets, the more memory will generally be required for working on various aspects of it. When you get into very large indexes, it becomes more efficient to distribute the indexing across servers (and replicating those servers), so that no single machine has huge cache lists to traverse. Again, the 'Scaling Lucene and Solr' page goes into these scenarios and is well worth studying. On Wed, Mar 17, 2010 at 12:29 AM, KaktuChakarabati jimmoe...@gmail.com wrote: Hey Peter, Thanks for your reply. My question was mainly about the fact there seems to be two different aspects to the solr RAM usage: in-process and out-of-process. By that I mean, yes, I know the many different parameters/caches to do with solr in-process memory usage and related culprits; however, I also understand that as for actual index access (posting list, positional index etc.), solr mostly delegates the access/caching of this to the OS/disk cache. So I guess my question is more about that: namely, what would be a good way to calculate an overall RAM requirement profile for a server running solr? Also, I was under the impression benefits from RAMDirectory would be minimal when disk caches are effective, no? And does RAMDirectory work with replication? If so, doesn't it slow it down? (on each replication, load up the entire index into RAM at once?) Peter Sturge wrote: On Tue, Mar 16, 2010 at 9:08 PM, KaktuChakarabati jimmoe...@gmail.com wrote: Hey, I am trying to understand what kind of calculation I should do in order to come up with a reasonable RAM size for a given solr machine. Suppose the index size is at 16GB. The Max heap allocated to JVM is about 12GB. The machine I'm trying now has 24GB. When the machine is running for a while serving production, I can see in top that the resident memory taken by the jvm is indeed at 12gb. Now, on top of this i should assume that if i want the whole index to fit in disk cache i need about 12gb+16gb = 28GB of RAM just for that. Is this kind of calculation correct or am i off here? Hmmm..not quite. The idea of the ram usage isn't to simply hold the index in memory - if you want this use a RAMDirectory.
The memory being used will be a combination of various caches (Lucene and Solr), index buffers et al., and of course the server itself. The specifics depend very much on what your server is doing at any given time - e.g. lots of concurrent searches, lots of indexing, both etc., and how things are setup in your solrconfig.xml. A really excellent resource that's worth looking at regarding all this can be found here: http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Scaling-Lucene-and-Solr Any other recommendations Anyone could make w.r.t these numbers ? Thanks, -Chak -- View this message in context: http://old.nabble.com/Solr-RAM-Requirements-tp27924551p27924551.html Sent from the Solr - User mailing list archive at Nabble.com. -- View this message in context: http://old.nabble.com/Solr-RAM-Requirements-tp27924551p27926536.html Sent from the Solr - User mailing list archive at Nabble.com.
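The Solr-side caches mentioned above are configured per core in the <query> section of solrconfig.xml; a hedged starting point, with sizes that are illustrative only and need tuning against real query patterns:

<filterCache class="solr.FastLRUCache" size="16384" initialSize="4096" autowarmCount="1024" />
<queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="128" />
<documentCache class="solr.LRUCache" size="512" initialSize="512" />

Faceting and fq-heavy workloads lean hardest on the filterCache, which is usually the first one worth growing.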
Stopwords
I was reading Scaling Lucene and Solr (http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Scaling-Lucene-and-Solr/) and I came across the section StopWords. In there it mentioned that it's not recommended to remove stop words at index time. Why is this the case? Don't all the extraneous stopwords bloat the index and lead to less relevant results? Can someone please explain this to me? Thanks -- View this message in context: http://old.nabble.com/Stopwords-tp27927028p27927028.html Sent from the Solr - User mailing list archive at Nabble.com.
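One middle ground that often comes up is to keep stopwords in the index but fold them into common-grams, so that frequent terms stay cheap at query time and phrases made entirely of stopwords (to be or not to be) remain searchable. A hedged sketch of such an analyzer chain using Solr's CommonGramsFilterFactory:

<fieldType name="text_cg" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory" />
    <filter class="solr.CommonGramsFilterFactory" words="stopwords.txt" ignoreCase="true" />
    <filter class="solr.LowerCaseFilterFactory" />
  </analyzer>
</fieldType>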
APR setup
[java] INFO: The APR based Apache Tomcat Native library which allows optimal performance in production environments was not found on the java.library.path: .:/Library/Java/Extensions:/System/Library/Java/Extensions:/usr/lib/java What the heck is this and why is it recommended for production settings? Anyone? -- View this message in context: http://old.nabble.com/APR-setup-tp27927553p27927553.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Trouble Implementing Extracting Request Handler
org/apache/solr/util/plugin/SolrCoreAware in the stack trace refers to an interface in the main Solr jar. I think this means that putting all of the libs in apache-tomcat-6.0.20/lib is a mistake: the classloader finds ExtractingRequestHandler in apache-tomcat-6.0.20/lib/apache-solr-cell-1.4.1-dev.jar, but that jar then wants the above interface. The main Solr jar is not available somehow. Since the solr-cell jar is in multiple places, we don't know exactly how Tomcat finds it. I suggest that you go back to a clean, empty Tomcat, and the original Solr distribution. Copy the solr war file to the right directory in Tomcat. Get Solr talking to your solr/ directory (-Dsolr.solr.home=path). Now, check if the <lib/> directives in the solrconfig.xml are right. On Tue, Mar 16, 2010 at 4:19 PM, Steve Reichgut sreich...@axtaweb.com wrote: Lance, I tried that but no luck. Just in case the relative paths were causing a problem, I also tried using absolute paths, but neither seemed to help. First, I tried adding <lib dir="/path/to/example/solr/lib" /> as the full directory so it would hopefully include everything. When that didn't work, I tried adding paths directly to the two Tika jar files in the lib directory, like this: <lib dir="/path/to/example/solr/lib/tika-core-0.4.jar" /> and <lib dir="/path/to/example/solr/lib/tika-parsers-0.4.jar" /> Am I including them incorrectly somehow? Steve On 3/16/2010 3:38 PM, Lance Norskog wrote: NoClassDefFoundError usually means that the class itself was found, but it needs other classes and those were not found. That is, Solr finds the ExtractingRequestHandler jar but cannot find the Tika jars. In example/solr/conf/solrconfig.xml, there are several <lib dir="path"/> elements. These give classpath directories and jar files to include when loading classes (and resource files). Try adding the paths for your Tika jars as <lib/> directives. On Mon, Mar 15, 2010 at 9:02 PM, Steve Reichgut sreich...@axtaweb.com wrote: Sure. I've attached two docs that have the stack trace and the full list of .jar files. On 3/15/2010 8:34 PM, Lance Norskog wrote: Please post the complete stack trace. Also, it will help if you make a full listing of all .jar files in the example/ directory. On Mon, Mar 15, 2010 at 7:12 PM, Steve Reichgut sreich...@axtaweb.com wrote: Thanks Lance. That helped (we are using Solr-1.4). We've run into a follow-on error though. It is giving the following error: ClassNotFoundException: org.apache.solr.util.plugin.SolrCoreAware Did we miss something else in the setup? Is there something else we haven't copied? Steve On 3/15/2010 6:12 PM, Lance Norskog wrote: This assumes you use the Solr-1.4 release or the Solr-1.5-dev trunk. The ExtractingRequestHandler libraries are in contrib/extraction/lib. You need to make a directory example/solr/lib and copy into it the apache-solr-cell jar from dist/ and all of the libraries from contrib/extraction/lib. The Wiki page has not been updated for the Solr 1.4 release. I just added a TODO to this effect. On 3/12/10, Steve Reichgut sreich...@axtaweb.com wrote: Hi Grant, Thanks for the feedback. In reading the Wiki, it recommended that you copy everything from the example/solr/libs directory into a /libs directory in your instance. I went into my example/solr directory and only see two directories - bin and conf. There is no libs directory. Where else can I get the contents of what should be in libs?
Steve On 3/12/2010 2:15 PM, Grant Ingersoll wrote: On Mar 12, 2010, at 2:20 PM, Steve Reichgut wrote: Now that I have configured my Solr instance for standard indexing, I wanted to start indexing PDF's, MS Doc's, etc. When I tried to test it with a simple PDF file, I got the following error: org.apache.solr.common.SolrException: lazy loading error Caused by: org.apache.solr.common.SolrException: Error loading class 'org.apache.solr.handler.extraction.ExtractingRequestHandler' Based on the error, it appeared that the problem is caused by certain components not being installed or installed correctly. Since I am not a Java guy, I had my Java person try to install the ExtractingRequestHandler to no avail. He had said that he was having real trouble finding good documentation on how to install and enable this handler. Could anyone point me to good documentation on how to install/troubleshoot this? http://wiki.apache.org/solr/ExtractingRequestHandler Essentially, you need to make sure the ERH stuff is in Solr/lib before starting. -Grant -- Lance Norskog goks...@gmail.com
spanish solr tutorial
Hi all, we translated the Solr tutorial into Spanish at a client's request. For all you Spanish speakers/readers out there, you can have a look at it here: http://www.linebee.com/?p=155 We hope this can expand the usage of the project and lower the language barrier for non-English speakers. Thanks Juan Danculovic CTO - www.linebee.com
Re: APR setup
That would be a Tomcat question :) On Tue, Mar 16, 2010 at 8:36 PM, blargy zman...@hotmail.com wrote: [java] INFO: The APR based Apache Tomcat Native library which allows optimal performance in production environments was not found on the java.library.path: .:/Library/Java/Extensions:/System/Library/Java/Extensions:/usr/lib/java What the heck is this and why is it recommended for production settings? Anyone? -- View this message in context: http://old.nabble.com/APR-setup-tp27927553p27927553.html Sent from the Solr - User mailing list archive at Nabble.com. -- Lance Norskog goks...@gmail.com
Re: problem during benchmarking solr query
Use a + sign or %20 for the space. In a URL query string, a plus is decoded as a space; %20 works anywhere in a URL. On Tue, Mar 16, 2010 at 6:06 PM, KshamaPai kshamapai2...@gmail.com wrote: Hi, I am using autobench to benchmark Solr with the query http://localhost:8983/solr/select/?q=body:hotel AND _val_:recip(hsin(0.7113258,-1.291311553,lat_rad,lng_rad,30),1,1,0)^100 But if I specify the same in the autobench command as autobench --file bar1.tsv --high_rate 100 --low_rate 20 --rate_step 20 --host1 localhost --single_host --port1 8983 --num_conn 10 --num_call 10 --uri1 /solr/select/?q=body:hotel AND _val_:recip(hsin(0.7113258,-1.291311553,lat_rad,lng_rad,30),1,1,0)^100 it takes body:hotel as the URI but not the _val_ part, which I think is because of the space after hotel. Even if I try escaping this in autobench using '\', it gives a parse error in Solr. Can anyone suggest how to handle this, so that the entire query is treated as the URI and Solr responds appropriately? thank you. -- View this message in context: http://old.nabble.com/problem-during-benchmarking-solr-query-tp27926801p27926801.html Sent from the Solr - User mailing list archive at Nabble.com. -- Lance Norskog goks...@gmail.com
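Concretely, the --uri1 argument would be percent-encoded to something like the line below (shown on one line; the caret may also need escaping as %5E depending on how strict the client is):

/solr/select/?q=body:hotel+AND+_val_:recip(hsin(0.7113258,-1.291311553,lat_rad,lng_rad,30),1,1,0)%5E100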
Re: PDFBox/Tika Performance Issues
Hi Giovanni, Comments below: I'm pretty unclear on how to patch the Tika 0.7-trunk on our Solr instance. This is what I've tried so far (which was really just me guessing): 1. Got the latest version of the trunk code from http://svn.apache.org/repos/asf/lucene/tika/trunk 2. Built this using Maven (mvn install) On track so far. 3. I took the resulting tika-app-0.7-SNAPSHOT.jar, copied it to the /Lib folder for my Solr Core, and renamed it to the name of the existing Tika Jar (tika-0.3.jar). I don't think you need to do this (w.r.t. the renaming). I think what you need to do is to drop tika-core-0.7-SNAPSHOT.jar and tika-parsers-0.7-SNAPSHOT.jar into your Solr core /lib folder. Also you should make sure to take the updated PDFBox 1.0.0 jar (you can get this by running mvn dependency:copy-dependencies in the tika-parsers project, see here: http://maven.apache.org/plugins/maven-dependency-plugin/copy-dependencies-mojo.html), along with the rest of the jar deps for tika-parsers, and drop them in there as well. Then, make sure to remove the existing tika-0.3.jar, as well as any of the existing parser lib jar files, and replace them with the new deps. A bunch of manual labor, yes, but you're on the bleeding edge, so c'est la vie, right? :) The alternative is to wait for Tika 0.7 to be released and then for Solr to upgrade to it. 4. Then I bounced my servlet server and tried indexing a document. The document was successfully indexed, and there were no errors logged as a result, but the PDF data does not appear to have been extracted (the field I used for map.content had an empty-string as a value). I think this probably has to do with the lib deps. Try what I mentioned above and let's go from there. Cheers, Chris -Original Message- From: Giovanni Fernandez-Kincade [mailto:gfernandez-kinc...@capitaliq.com] Sent: Tuesday, March 16, 2010 5:41 PM To: solr-user@lucene.apache.org Subject: RE: PDFBox/Tika Performance Issues Thanks Chris! I'll try the patch. -Original Message- From: Mattmann, Chris A (388J) [mailto:chris.a.mattm...@jpl.nasa.gov] Sent: Tuesday, March 16, 2010 5:37 PM To: solr-user@lucene.apache.org Subject: Re: PDFBox/Tika Performance Issues Guys, I think this is an issue with PDFBOX and the version that Tika 0.6 depends on. Tika 0.7-trunk upgraded to PDFBox 1.0.0 (see [1]), so it may include a fix for the problem you're seeing. See this discussion [2] on how to patch Tika to use the new PDFBox if you can't wait for the 0.7 release, which should happen soon (hopefully next few weeks). Cheers, Chris [1] http://issues.apache.org/jira/browse/TIKA-380 [2] http://www.mail-archive.com/tika-u...@lucene.apache.org/msg00302.html On 3/16/10 2:31 PM, Giovanni Fernandez-Kincade gfernandez-kinc...@capitaliq.com wrote: Originally 16 (the number of CPUs on the machine), but even with 5 threads it's not looking so hot. -Original Message- From: Grant Ingersoll [mailto:gsi...@gmail.com] On Behalf Of Grant Ingersoll Sent: Tuesday, March 16, 2010 5:15 PM To: solr-user@lucene.apache.org Subject: Re: PDFBox/Tika Performance Issues Hmm, that is an ugly thing in PDFBox. We should probably take this over to the PDFBox project. How many threads are you indexing with? FWIW, for that many documents, I might consider using Tika on the client side to save on a lot of network traffic.
-Grant On Mar 16, 2010, at 4:37 PM, Giovanni Fernandez-Kincade wrote: I've been trying to bulk index about 11 million PDFs, and while profiling our Solr instance, I noticed that all of the threads that are processing indexing requests are constantly blocking each other during this call: http-8080-Processor39 [BLOCKED] CPU time: 9:35 java.util.Collections$SynchronizedMap.get(Object) org.pdfbox.pdmodel.font.PDFont.getAFM() org.pdfbox.pdmodel.font.PDSimpleFont.getFontHeight(byte[], int, int) org.pdfbox.util.PDFStreamEngine.showString(byte[]) org.pdfbox.util.operator.ShowTextGlyph.process(PDFOperator, List) org.pdfbox.util.PDFStreamEngine.processOperator(PDFOperator, List) org.pdfbox.util.PDFStreamEngine.processSubStream(PDPage, PDResources, COSStream) org.pdfbox.util.PDFStreamEngine.processStream(PDPage, PDResources, COSStream) org.pdfbox.util.PDFTextStripper.processPage(PDPage, COSStream) org.pdfbox.util.PDFTextStripper.processPages(List) org.pdfbox.util.PDFTextStripper.writeText(PDDocument, Writer) org.pdfbox.util.PDFTextStripper.getText(PDDocument) org.apache.tika.parser.pdf.PDF2XHTML.process(PDDocument, ContentHandler, Metadata) org.apache.tika.parser.pdf.PDFParser.parse(InputStream, ContentHandler, Metadata) org.apache.tika.parser.CompositeParser.parse(InputStream, ContentHandler, Metadata) org.apache.tika.parser.AutoDetectParser.parse(InputStream, ContentHandler, Metadata)
Re: field length normalization
You need to change your Similarity object to be more sensitive at the short end. Here is a patch that shows how to do this: http://issues.apache.org/jira/browse/LUCENE-2187 It involves Lucene coding. On Fri, Mar 12, 2010 at 3:19 AM, muneeb muneeba...@hotmail.com wrote: Ah I see. Thanks very much Jay for your explanation, it really helped a lot. I guess I have to deal with this in some other way, since I am working with short titles and I really want short titles to appear at top. Can you suggest anything to bring titles with length 3 to appear before titles with length 4 (given they have similar scores)? Thanks, Jay Hill wrote: The fieldNorm is computed like this: fieldNorm = lengthNorm * documentBoost * documentFieldBoosts and the lengthNorm is: lengthNorm = 1/(numTermsInField)**.5 [note that the value is encoded as a single byte, so there is some precision loss] So the values are not pre-set for the lengthNorm, but for some counts the fieldLength value winds up being the same because of the precision loss. Here is a list of lengthNorm values for 1 to 10 term fields:

# of terms   lengthNorm
 1           1.0
 2           .625
 3           .5
 4           .5
 5           .4375
 6           .375
 7           .375
 8           .3125
 9           .3125
10           .3125

That's why, in your example, the lengthNorm for 3 and 4 is the same. -Jay http://www.lucidimagination.com On Thu, Mar 11, 2010 at 9:50 AM, muneeb muneeba...@hotmail.com wrote: : : Did you reindex after setting omitNorms to false? I'm not sure whether or : not it is needed, but it makes sense. Yes, I deleted the old index and reindexed it. Just to add another fact: the titles' length is less than 10. I am not sure if Solr has pre-set values for length normalization, because for titles with 3 as well as 4 terms the fieldNorm is coming up as 0.5 (in the debugQuery section). -- View this message in context: http://old.nabble.com/field-length-normalization-tp27862618p27867025.html Sent from the Solr - User mailing list archive at Nabble.com. -- View this message in context: http://old.nabble.com/field-length-normalization-tp27862618p27874123.html Sent from the Solr - User mailing list archive at Nabble.com. -- Lance Norskog goks...@gmail.com
Issue in search
In Solr, how can I perform AND, OR, and NOT searches when querying the data? -- View this message in context: http://old.nabble.com/Issue-in-search-tp27927828p27927828.html Sent from the Solr - User mailing list archive at Nabble.com.
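For reference, the standard (lucene) query parser spells the boolean operators in uppercase, and there is an equivalent +/- prefix syntax; the field names below are illustrative:

q=title:solr AND body:search     (both clauses must match)
q=title:solr OR title:lucene     (either clause may match)
q=title:solr NOT body:oracle     (matches the first, excludes the second)
q=+title:solr -body:oracle      (same result using prefix operators)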
Re: Solr RAM Requirements
Just turn your entire disk into RAM http://www.hyperossystems.co.uk/ 800X faster. Who cares if it swaps to 'disk' then :-) Dennis Gearon Signature Warning EARTH has a Right To Life, otherwise we all die. Read 'Hot, Flat, and Crowded' Laugh at http://www.yert.com/film.php --- On Tue, 3/16/10, Peter Sturge peter.stu...@googlemail.com wrote: From: Peter Sturge peter.stu...@googlemail.com Subject: Re: Solr RAM Requirements To: solr-user@lucene.apache.org Date: Tuesday, March 16, 2010, 6:25 PM There are certainly a number of widely varying opinions on the use of RAM directory. Basically, though, if you need the index to be persistent at some point (i.e. saved across reboots, crashes etc.), you'll need to write to a disk, so RAM directory becomes somewhat superfluous in this case. Generally, good hardware and fast disks are a better bet, since you'll probably want to have them anyway :-) From my own experiences with varying types/sizes of indexes, and the general wisdom gleaned from the experts, the amount of memory required for a given environment is very much a 'how long is a piece of string' type of scenario. It depends on so many factors that it's impractical to come up with an easy 'standardized' formula. What I've found useful as rough guidance (in addition to the very useful URL I mentioned earlier) is: if your server is doing lots of indexing and not much searching, you want your OS fs cache to have access to a healthy amount of memory. If you're doing lots of searching/reading (and particularly faceting), you'll want a good amount of RAM for Solr/Lucene caching (which caches need what depends on the type of data you're searching). If you have a server that is doing a lot of both indexing and searching, you should consider breaking them out using replication and possibly using load balancers (if you have lots of concurrent querying going on). It stands to reason that the bigger the index gets, the more memory will generally be required for working on various aspects of it. When you get into very large indexes, it becomes more efficient to distribute the indexing across servers (and replicating those servers), so that no single machine has huge cache lists to traverse. Again, the 'Scaling Lucene and Solr' page goes into these scenarios and is well worth studying. On Wed, Mar 17, 2010 at 12:29 AM, KaktuChakarabati jimmoe...@gmail.com wrote: Hey Peter, Thanks for your reply. My question was mainly about the fact there seems to be two different aspects to the solr RAM usage: in-process and out-of-process. By that I mean, yes, I know the many different parameters/caches to do with solr in-process memory usage and related culprits; however, I also understand that as for actual index access (posting list, positional index etc.), solr mostly delegates the access/caching of this to the OS/disk cache. So I guess my question is more about that: namely, what would be a good way to calculate an overall RAM requirement profile for a server running solr? Also, I was under the impression benefits from RAMDirectory would be minimal when disk caches are effective, no? And does RAMDirectory work with replication? If so, doesn't it slow it down? (on each replication, load up the entire index into RAM at once?) Peter Sturge wrote: On Tue, Mar 16, 2010 at 9:08 PM, KaktuChakarabati jimmoe...@gmail.com wrote: Hey, I am trying to understand what kind of calculation I should do in order to come up with a reasonable RAM size for a given solr machine. Suppose the index size is at 16GB. The Max heap allocated to JVM is about 12GB.
The machine I'm trying now has 24GB. When the machine is running for a while serving production, I can see in top that the resident memory taken by the jvm is indeed at 12gb. Now, on top of this i should assume that if i want the whole index to fit in disk cache i need about 12gb+16gb = 28GB of RAM just for that. Is this kind of calculation correct or am i off here? Hmmm..not quite. The idea of the ram usage isn't to simply hold the index in memory - if you want this use a RAMDirectory. The memory being used will be a combination of various caches (Lucene and Solr), index buffers et al., and of course the server itself. The specifics depend very much on what your server is doing at any given time - e.g. lots of concurrent searches, lots of indexing, both etc., and how things are setup in your solrconfig.xml. A really excellent resource that's worth looking at regarding all this can be found here: http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Scaling-Lucene-and-Solr Any other recommendations Anyone could make w.r.t these numbers ? Thanks, -Chak -- View this message in context: