Re: A few questions about solr and tika
Everything about Tika extraction is written under those links. Basically what you need is the following:
1) a requestHandler for Tika in solrconfig.xml
2) keep all the fields in schema.xml that are needed for Tika (they are marked in the example schema.xml) and set those you don't need to indexed="false" and stored="false"
3) if you want to limit the fields returned in the query response, use the query parameter 'fl'.

Primoz

From: wonder a-wonde...@rambler.ru
To: solr-user@lucene.apache.org
Date: 17.10.2013 14:44
Subject: Re: A few questions about solr and tika

Thanks for the answer. If I don't want to store and index any fields, do I do:

<field name="links" type="string" indexed="false" stored="false" multiValued="true"/> <!-- removing unneeded TIKA fields -->
<field name="link" type="string" indexed="false" stored="false" multiValued="true"/> <!-- removing unneeded TIKA fields -->
<field name="img" type="string" indexed="false" stored="false" multiValued="true"/> <!-- removing unneeded TIKA fields -->
<field name="iframe" type="string" indexed="false" stored="false" multiValued="true"/> <!-- removing unneeded TIKA fields -->
<field name="area" type="string" indexed="false" stored="false" multiValued="true"/> <!-- removing unneeded TIKA fields -->
<field name="map" type="string" indexed="false" stored="false" multiValued="true"/> <!-- removing unneeded TIKA fields -->
<field name="pragma" type="string" indexed="false" stored="false" multiValued="true"/> <!-- removing unneeded TIKA fields -->
<field name="expires" type="string" indexed="false" stored="false" multiValued="true"/> <!-- removing unneeded TIKA fields -->
<field name="keywords" type="string" indexed="false" stored="false" multiValued="true"/> <!-- removing unneeded TIKA fields -->
<field name="stream_source_info" type="string" indexed="false" stored="false" multiValued="true"/> <!-- removing unneeded TIKA fields -->

The other questions are still open for me.
17.10.2013 14:26, primoz.sk...@policija.si wrote:

Why don't you check these:
- Content extraction with Apache Tika (http://www.youtube.com/watch?v=ifgFjAeTOws)
- ExtractingRequestHandler (http://wiki.apache.org/solr/ExtractingRequestHandler)
- Uploading Data with Solr Cell using Apache Tika (https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika)

Primož

From: wonder a-wonde...@rambler.ru
To: solr-user@lucene.apache.org
Date: 17.10.2013 12:23
Subject: A few questions about solr and tika

Hello everyone! Please tell me how and where to set Tika options in Solr? Where is the Tika config? I want to know how I can eliminate response attributes I don't need (such as links or images). I am also interested in how I can get and index only the metadata of several file formats.
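For reference, the setup discussed in this thread (a Tika request handler plus schema rules that swallow the Tika attributes you don't want) can be sketched as below. This is a hedged sketch modeled on the stock Solr 4.x example configs, not a drop-in file; the "ignored" field type together with the uprefix parameter is the usual way to drop unwanted Tika metadata without listing every field by hand.

```xml
<!-- solrconfig.xml: Tika extraction handler (sketch, after the Solr 4.x example) -->
<requestHandler name="/update/extract"
                class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <!-- map Tika's body text into the schema's main text field -->
    <str name="fmap.content">text</str>
    <!-- any Tika field the schema doesn't declare gets the ignored_ prefix -->
    <str name="uprefix">ignored_</str>
  </lst>
</requestHandler>

<!-- schema.xml: everything prefixed ignored_ is neither indexed nor stored -->
<fieldType name="ignored" class="solr.StrField"
           indexed="false" stored="false" multiValued="true"/>
<dynamicField name="ignored_*" type="ignored"/>
```

With that in place, point 3 at query time becomes e.g. fl=id,title,text, which keeps links, images and the like out of the response.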
Re: ExtractRequestHandler, skipping errors
Hi,

We already configured the ExtractingRequestHandler to ignore Tika exceptions, but it is Solr that complains. The customer managed to reproduce the problem. Following is the error from solr.log. The file type causing this exception was WMZ. It seems that something is missing in a Solr class. We use Solr 4.4.

ERROR - 2013-10-17 18:13:48.902; org.apache.solr.common.SolrException; null:java.lang.RuntimeException: java.lang.NoSuchMethodError: org.apache.commons.compress.compressors.CompressorStreamFactory.setDecompressConcatenated(Z)V
    at org.apache.solr.servlet.SolrDispatchFilter.sendError(SolrDispatchFilter.java:673)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:383)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:158)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)
    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:222)
    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:123)
    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:171)
    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:99)
    at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:953)
    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:118)
    at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:408)
    at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1023)
    at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:589)
    at org.apache.tomcat.util.net.AprEndpoint$SocketProcessor.run(AprEndpoint.java:1852)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)
Caused by: java.lang.NoSuchMethodError: org.apache.commons.compress.compressors.CompressorStreamFactory.setDecompressConcatenated(Z)V
    at org.apache.tika.parser.pkg.CompressorParser.parse(CompressorParser.java:102)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:219)
    at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
    at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:241)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1904)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:659)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:362)
    ... 16 more

On Thu, Oct 17, 2013 at 5:19 PM, Koji Sekiguchi k...@r.email.ne.jp wrote:

Hi Roland,

(13/10/17 20:44), Roland Everaert wrote:

Hi, I helped a customer to deploy Solr + ManifoldCF and everything is going quite smoothly, but every time Solr raises an exception, the ManifoldCF job feeding Solr aborts. I would like to know if it is possible to configure the ExtractingRequestHandler to ignore errors, as seems to be possible with the DataImportHandler and entity processors. I know that it is possible to configure the ExtractingRequestHandler to ignore Tika exceptions (we already do that), but the errors that now stop the ManifoldCF jobs are generated by Solr itself. While it would be interesting to have such an option in Solr, I plan to post to the ManifoldCF mailing list anyway, to ask whether it is possible to configure ManifoldCF to be less picky about Solr errors.
ignoreTikaException flag might help you? https://issues.apache.org/jira/browse/SOLR-2480

koji
--
http://soleami.com/blog/automatically-acquiring-synonym-knowledge-from-wikipedia.html
Proximity search with wildcard
Hi, I am new to Solr. Is it possible to do a proximity search with a wildcard in Solr? For example: comp* engage~5.

--
View this message in context: http://lucene.472066.n3.nabble.com/Proximity-search-with-wildcard-tp4096285.html
Sent from the Solr - User mailing list archive at Nabble.com.
Complex Queries in solr
Hi, Is it possible to run complex queries like (consult* or advis*) NEAR(40) (fee or retainer or salary or bonus) in Solr?

- Sayeed

--
View this message in context: http://lucene.472066.n3.nabble.com/Complex-Queries-in-solr-tp4096288.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: solrconfig.xml carrot2 params
Hi,

Out of curiosity -- what would you like to achieve by changing Tokenizer.documentFields? If you want to have clustering applied to more than one document field, you can provide a comma-separated list of fields in the carrot.title and/or carrot.snippet parameters.

Thanks,

Staszek
--
Stanislaw Osinski, stanislaw.osin...@carrotsearch.com
http://carrotsearch.com

On Thu, Oct 17, 2013 at 11:49 PM, youknow...@heroicefforts.net wrote:

Would someone help me out with the syntax for setting Tokenizer.documentFields in the ClusteringComponent engine definition in solrconfig.xml? Carrot2 is expecting a Collection of Strings. There's no schema definition for this XML file and a big TODO on the wiki wrt init params. Every permutation I have tried results in an error stating: Cannot set java.util.Collection field ... to java.lang.String.

--
Sent from my Android phone with K-9 Mail. Please excuse my brevity.
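Passing several fields through carrot.title/carrot.snippet, as suggested above, looks roughly like this in solrconfig.xml. A hedged sketch modeled on the Solr 4.x clustering example; the field names title and content are placeholders for whatever your schema actually uses.

```xml
<!-- solrconfig.xml: clustering over multiple fields (sketch) -->
<requestHandler name="/clustering" class="solr.SearchHandler">
  <lst name="defaults">
    <bool name="clustering">true</bool>
    <str name="clustering.engine">default</str>
    <!-- cluster on the title plus a comma-separated list of body fields,
         instead of trying to set Tokenizer.documentFields directly -->
    <str name="carrot.title">title</str>
    <str name="carrot.snippet">title,content</str>
  </lst>
  <arr name="last-components">
    <str>clustering</str>
  </arr>
</requestHandler>
```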
Re: Proximity search with wildcard
Hi Sayeed, you can use a fuzzy search: comp engage~0.2.

Regards,

Harshvardhan Ojha

On Fri, Oct 18, 2013 at 10:28 AM, sayeed abdulsayeed...@gmail.com wrote: Hi, I am new to Solr. Is it possible to do a proximity search with a wildcard in Solr? For example: comp* engage~5. -- View this message in context: http://lucene.472066.n3.nabble.com/Proximity-search-with-wildcard-tp4096285.html Sent from the Solr - User mailing list archive at Nabble.com.
how to retireve content page in solr
Hi, I'm new to Solr. I use Nutch 1.1 to crawl web pages and Solr to index those pages. My problem is: how do I retrieve the content of a document stored in Solr? For example, if I have a page http://www.prova.com/prova.html that contains the text "This is a web page", is there a way to retrieve the text "This is a web page"? Any ideas? My application is written in Java.

Thanks,

Danilo

--
View this message in context: http://lucene.472066.n3.nabble.com/how-to-retireve-content-page-in-solr-tp4096302.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: ExtractRequestHandler, skipping errors
Hi,

I think the flag cannot ignore a NoSuchMethodError. There may be something wrong here... I've just checked my Solr 4.5 directories and found the Tika version is 1.4. Tika 1.4 seems to use Commons Compress 1.5:

http://svn.apache.org/viewvc/tika/tags/1.4/tika-parsers/pom.xml?view=markup

But I see commons-compress-1.4.1.jar in the solr/contrib/extraction/lib/ directory. Can you open a JIRA issue? For now, you can get Commons Compress 1.5 and put it in that directory (don't forget to remove the 1.4.1 jar file).

koji

(13/10/18 16:37), Roland Everaert wrote:

Hi, We already configured the ExtractingRequestHandler to ignore Tika exceptions, but it is Solr that complains. The customer managed to reproduce the problem. The file type causing this exception was WMZ. It seems that something is missing in a Solr class. We use Solr 4.4.
Re: how to retireve content page in solr
Hi Danilo,

What do you mean by content information? The whole document? Metadata? Do you keep it separate in some fields? Or is this about Solr search queries?

Regards,

Harshvardhan Ojha

On Fri, Oct 18, 2013 at 1:09 PM, javozzo danilo.domen...@gmail.com wrote: Hi, I'm new to Solr. I use Nutch 1.1 to crawl web pages and Solr to index those pages. My problem is: how do I retrieve the content of a document stored in Solr? For example, if I have a page http://www.prova.com/prova.html that contains the text "This is a web page", is there a way to retrieve the text "This is a web page"? Any ideas? My application is written in Java. Thanks Danilo -- View this message in context: http://lucene.472066.n3.nabble.com/how-to-retireve-content-page-in-solr-tp4096302.html Sent from the Solr - User mailing list archive at Nabble.com.
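If the goal is to get the crawled page text itself back from Solr, the field Nutch writes the body into has to be stored. A hedged sketch of the relevant schema.xml line, assuming a Nutch-style schema where the body lands in a field named content (Nutch ships it with stored="false" by default, which is why the text cannot be retrieved):

```xml
<!-- schema.xml: make the crawled page body retrievable (sketch) -->
<field name="content" type="text" indexed="true" stored="true"/>
```

After reindexing with that change, a request such as /select?q=url:"http://www.prova.com/prova.html"&fl=content should return the stored text in the response.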
Re: Debugging update request
@Michael: Yep, that's the bit that's addressed by the two patches I referenced. If you can try this with 4.5 (or the soon to be done 4.5.1), the problem should go away. @Chris: I think you have a different issue. A very quick glance at your stack trace doesn't really show anything outstanding. There are always a bunch of threads waiting around for something to do that show up as blocked. So I'm pretty puzzled. Are your Solr logs showing anything when you try to update after this occurs? On Wed, Oct 16, 2013 at 11:32 AM, Chris Geeringh geeri...@gmail.com wrote: Here is my jstack output... Lots of blocked threads. http://pastebin.com/1ktjBYbf On 16 October 2013 10:28, michael.boom my_sky...@yahoo.com wrote: I got the trace from jstack. I found references to semaphore but not sure if this is what you meant. Here's the trace: http://pastebin.com/15QKAz7U -- View this message in context: http://lucene.472066.n3.nabble.com/Debugging-update-request-tp4095619p4095847.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Concurent indexing
Chris:

OK, one of those stack traces does have the problem I referenced in the other thread. Are you sending updates to the server with SolrJ? And are you using CloudSolrServer? If you are, I'm surprised... Here are the important lines:

- java.util.concurrent.Semaphore.acquire() @bci=5, line=317 (Compiled frame)
- org.apache.solr.util.AdjustableSemaphore.acquire() @bci=4, line=61 (Compiled frame)
- org.apache.solr.update.SolrCmdDistributor.submit(org.apache.solr.update.SolrCmdDistributor$Request) @bci=22, line=418 (Compiled frame)
- org.apache.solr.update.SolrCmdDistributor.submit(org.apache.solr.client.solrj.request.UpdateRequest,

On Wed, Oct 16, 2013 at 2:04 PM, Chris Geeringh geeri...@gmail.com wrote: Here's another jstack http://pastebin.com/8JiQc3rb

On 16 October 2013 11:53, Chris Geeringh geeri...@gmail.com wrote: Hi Erick, here is a paste from the other thread (debugging update request) with my input, as I am seeing errors too: I ran an import last night, and this morning my cloud wouldn't accept updates. I'm running the latest 4.6 snapshot. I was importing with the latest SolrJ snapshot, using the javabin transport with CloudSolrServer. The cluster had indexed ~1.3 million docs before no further updates were accepted; querying still works. I'll run jstack shortly and provide the results. Here is my jstack output... Lots of blocked threads. http://pastebin.com/1ktjBYbf

On 16 October 2013 11:46, Erick Erickson erickerick...@gmail.com wrote: Run jstack on the Solr process (standard with Java) and look for the word "semaphore". You should see your servers blocked on this in the Solr code. That'll pretty much nail it. There's an open JIRA to fix the underlying cause, see SOLR-5232, but that's currently slated for 4.6, which won't be cut for a while. Also, there's a patch that will fix this as a side effect, assuming you're using SolrJ: see SOLR-4816, which is available in 4.5.

Best,

Erick

On Tue, Oct 15, 2013 at 1:33 PM, michael.boom my_sky...@yahoo.com wrote: Here are some of Solr's last words (log content before it stopped accepting updates); maybe someone can help me interpret it. http://pastebin.com/mv7fH62H -- View this message in context: http://lucene.472066.n3.nabble.com/Concurent-indexing-tp4095409p4095642.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: measure result set quality
bq: How do you compare the quality of your search results in order to decide which schema is better?

Well, that's actually a hard problem. There's the various TREC data, but that's a generic solution, and most every individual application of this generic thing called search has its own version of "good results". Note that scores are NOT comparable across different queries, even in the same data set, so don't go down that path.

I'd fire the question back at you: can you define what good (or better) results are in such a way that you can program an evaluation? Often the answer is no...

One common technique is to have knowledgeable users do what's called A/B testing. You fire the query at two separate Solr instances and display the results side by side, and the user says "A is more relevant" or "B is more relevant". Kind of like an eye doctor. In sophisticated A/B testing, the program randomly changes which side the results appear on, so you remove sidedness bias.

FWIW,

Erick

On Thu, Oct 17, 2013 at 11:28 AM, Alvaro Cabrerizo topor...@gmail.com wrote: Hi, Imagine the following situation. You have a corpus of documents and a list of queries extracted from a production environment. The corpus hasn't been manually annotated with relevant/non-relevant tags for every query. Then you configure various Solr instances, changing the schema (adding synonyms, stopwords...). After indexing, you prepare and execute the test over the different schema configurations. How do you compare the quality of your search results in order to decide which schema is better? Regards.
XLSB files not indexed
Hi,

Can someone tell me if Tika is supposed to extract data from XLSB files (the new MS Office format in binary form)? If so, then it seems that Solr is not able to index them, just as it is not able to index ODF files (a JIRA is already open for ODF: https://issues.apache.org/jira/browse/SOLR-4809). Can someone confirm the problem, or tell me what to do to make Solr work with XLSB files?

Regards,

Roland.
Re: ExtractRequestHandler, skipping errors
I will open a JIRA issue; I suppose that I just have to create an account first?

Regards,

Roland.

On Fri, Oct 18, 2013 at 12:05 PM, Koji Sekiguchi k...@r.email.ne.jp wrote: Hi, I think the flag cannot ignore a NoSuchMethodError. There may be something wrong here... I've just checked my Solr 4.5 directories and found the Tika version is 1.4. Tika 1.4 seems to use Commons Compress 1.5: http://svn.apache.org/viewvc/tika/tags/1.4/tika-parsers/pom.xml?view=markup But I see commons-compress-1.4.1.jar in the solr/contrib/extraction/lib/ directory. Can you open a JIRA issue? For now, you can get Commons Compress 1.5 and put it in that directory (don't forget to remove the 1.4.1 jar file). koji

(13/10/18 16:37), Roland Everaert wrote: Hi, We already configured the ExtractingRequestHandler to ignore Tika exceptions, but it is Solr that complains. The customer managed to reproduce the problem. The file type causing this exception was WMZ. It seems that something is missing in a Solr class. We use Solr 4.4.
Re: ExtractRequestHandler, skipping errors
Here is the link to the issue: https://issues.apache.org/jira/browse/SOLR-5365

Thanks for your help.

Roland Everaert.

On Fri, Oct 18, 2013 at 1:40 PM, Roland Everaert reveatw...@gmail.com wrote: I will open a JIRA issue; I suppose that I just have to create an account first? Regards, Roland.

On Fri, Oct 18, 2013 at 12:05 PM, Koji Sekiguchi k...@r.email.ne.jp wrote: Hi, I think the flag cannot ignore a NoSuchMethodError. There may be something wrong here... I've just checked my Solr 4.5 directories and found the Tika version is 1.4. Tika 1.4 seems to use Commons Compress 1.5: http://svn.apache.org/viewvc/tika/tags/1.4/tika-parsers/pom.xml?view=markup But I see commons-compress-1.4.1.jar in the solr/contrib/extraction/lib/ directory. Can you open a JIRA issue? For now, you can get Commons Compress 1.5 and put it in that directory (don't forget to remove the 1.4.1 jar file). koji

(13/10/18 16:37), Roland Everaert wrote: Hi, We already configured the ExtractingRequestHandler to ignore Tika exceptions, but it is Solr that complains. The customer managed to reproduce the problem. The file type causing this exception was WMZ. It seems that something is missing in a Solr class. We use Solr 4.4.
Re: ExtractRequestHandler, skipping errors
Dont, commons compress 1.5 is broken, either use 1.4.1 or later. Our app stopped compressing properly for a maven update. Guido. On 18/10/13 12:40, Roland Everaert wrote: I will open a JIRA issue, I suppose that I just have to create an account first? Regards, Roland. On Fri, Oct 18, 2013 at 12:05 PM, Koji Sekiguchi k...@r.email.ne.jp wrote: Hi, I think the flag cannot ignore NoSuchMethodError. There may be something wrong here? ... I've just checked my Solr 4.5 directories and I found Tika version is 1.4. Tika 1.4 seems to use commons compress 1.5: http://svn.apache.org/viewvc/**tika/tags/1.4/tika-parsers/** pom.xml?view=markuphttp://svn.apache.org/viewvc/tika/tags/1.4/tika-parsers/pom.xml?view=markup But I see commons-compress-1.4.1.jar in solr/contrib/extraction/lib/ directory. Can you open a JIRA issue? For now, you can get commons compress 1.5 and put it to the directory (don't forget to remove 1.4.1 jar file). koji (13/10/18 16:37), Roland Everaert wrote: Hi, We already configure the extractrequesthandler to ignore tika exceptions, but it is solr that complains. The customer manage to reproduce the problem. Following is the error from the solr.log. The file type cause this exception was WMZ. It seems that something is missing in a solr class. We use SOLR 4.4. 
ERROR - 2013-10-17 18:13:48.902; org.apache.solr.common.**SolrException; null:java.lang.**RuntimeException: java.lang.NoSuchMethodError: org.apache.commons.compress.**compressors.**CompressorStreamFactory.** setDecompressConcatenated(Z)V at org.apache.solr.servlet.**SolrDispatchFilter.sendError(** SolrDispatchFilter.java:673) at org.apache.solr.servlet.**SolrDispatchFilter.doFilter(** SolrDispatchFilter.java:383) at org.apache.solr.servlet.**SolrDispatchFilter.doFilter(** SolrDispatchFilter.java:158) at org.apache.catalina.core.**ApplicationFilterChain.**internalDoFilter(** ApplicationFilterChain.java:**243) at org.apache.catalina.core.**ApplicationFilterChain.**doFilter(** ApplicationFilterChain.java:**210) at org.apache.catalina.core.**StandardWrapperValve.invoke(** StandardWrapperValve.java:222) at org.apache.catalina.core.**StandardContextValve.invoke(** StandardContextValve.java:123) at org.apache.catalina.core.**StandardHostValve.invoke(** StandardHostValve.java:171) at org.apache.catalina.valves.**ErrorReportValve.invoke(** ErrorReportValve.java:99) at org.apache.catalina.valves.**AccessLogValve.invoke(** AccessLogValve.java:953) at org.apache.catalina.core.**StandardEngineValve.invoke(** StandardEngineValve.java:118) at org.apache.catalina.connector.**CoyoteAdapter.service(** CoyoteAdapter.java:408) at org.apache.coyote.http11.**AbstractHttp11Processor.**process(** AbstractHttp11Processor.java:**1023) at org.apache.coyote.**AbstractProtocol$**AbstractConnectionHandler.** process(AbstractProtocol.java:**589) at org.apache.tomcat.util.net.**AprEndpoint$SocketProcessor.** run(AprEndpoint.java:1852) at java.util.concurrent.**ThreadPoolExecutor.runWorker(**Unknown Source) at java.util.concurrent.**ThreadPoolExecutor$Worker.run(**Unknown Source) at java.lang.Thread.run(Unknown Source) Caused by: java.lang.NoSuchMethodError: org.apache.commons.compress.**compressors.**CompressorStreamFactory.** setDecompressConcatenated(Z)V at 
    at org.apache.tika.parser.pkg.CompressorParser.parse(CompressorParser.java:102)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:219)
    at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
    at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:241)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1904)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:659)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:362)
    ... 16 more
On Thu, Oct 17, 2013 at 5:19 PM, Koji Sekiguchi k...@r.email.ne.jp wrote: Hi Roland, (13/10/17 20:44), Roland Everaert wrote: Hi, I helped a customer deploy Solr + ManifoldCF and everything is going quite smoothly, but every time Solr raises an exception, the ManifoldCF job feeding Solr aborts. I would like to know if it is possible to configure the ExtractingRequestHandler to ignore errors, like it seems to be possible with the DataImportHandler and entity processors. I know that it is possible to configure the
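For reference, the "ignore Tika exceptions" configuration mentioned in this thread goes into the ExtractingRequestHandler definition in solrconfig.xml. A sketch (handler name/path and field mapping are typical defaults, adjust to your setup; note that, as discussed above, this flag swallows Tika parse exceptions but cannot catch a NoSuchMethodError from a jar mismatch):

```xml
<requestHandler name="/update/extract"
                class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <!-- swallow Tika parse errors instead of failing the whole update -->
    <str name="ignoreTikaException">true</str>
    <!-- map Tika's extracted body into the index's text field -->
    <str name="fmap.content">text</str>
  </lst>
</requestHandler>
```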
Facet performance
I am working with Solr facet fields and have come across a performance problem I don't understand. Consider these two queries: 1. q=word&facet.field=CONTENT&facet=true&facet.prefix=&facet.limit=10&facet.mincount=1&facet.method=enum&rows=0 2. q=word&facet.field=CONTENT&facet=true&facet.prefix=a&facet.limit=10&facet.mincount=1&facet.method=enum&rows=0 The only difference is an empty facet.prefix in the first query. The first query returns after some 20 seconds (QTime 2 in the result) while the second one takes only 80 msec (QTime 80). Why is this? And as a side note: facet.method=fc makes the queries run 'forever' and eventually fail with org.apache.solr.common.SolrException: Too many values for UnInvertedField faceting on field CONTENT. This is with Solr 1.4.
Re: feedback on Solr 4.x LotsOfCores feature
15K cores is around 4 minutes: no network drive, just a spinning disk. But, one important thing: to simulate a cold start or a useless Linux buffer cache, I used the following command to empty the Linux buffer cache: sync; echo 3 > /proc/sys/vm/drop_caches Then I started Solr and I found the result above. On 11/10/2013 13:06, Erick Erickson wrote: bq: sharing the underlying solrconfig object the configset introduced in JIRA SOLR-4478 seems to be the solution for non-SolrCloud mode SOLR-4478 will NOT share the underlying config objects, it simply shares the underlying directory. Each core will, at least as presently envisioned, simply read the files that exist there and create its own solrconfig object. Schema objects may be shared, but not config objects. It may turn out to be relatively easy to do in the configset situation, but last time I looked at sharing the underlying config object it was too fraught with problems. bq: 15K cores is around 4 minutes I find this very odd. On my laptop, spinning disk, I think I was seeing 1K cores discovered/sec. You're seeing roughly 16x slower, so I have no idea what's going on here. If this is just reading the files, you should be seeing horrible disk contention. Are you on some kind of networked drive? bq: To do that in background and to block on that request until core discovery is complete, should not work for us (due to the worst case). What other choices are there? Either you have to do it up front or with some kind of blocking. Hmmm, I suppose you could keep some kind of custom store (DB? File? ZooKeeper?) that would keep the last known layout. You'd still have some kind of worst-case situation where the core you were trying to load wouldn't be in your persistent store and you'd _still_ have to wait for the discovery process to complete. bq: and we will use the cores Auto option to create load or only load the core on Interesting.
I can see how this could all work without any core discovery but it does require a very specific setup. On Thu, Oct 10, 2013 at 11:42 AM, Soyez Olivier olivier.so...@worldline.com wrote: The corresponding patch for Solr 4.2.1 LotsOfCores can be found in SOLR-5316, including the new Cores options: - numBuckets to create a subdirectory based on a hash on the corename % numBuckets in the core dataDir - Auto with 3 different values: 1) false: default behaviour 2) createLoad: create, if it does not exist, and load the core on the fly on the first incoming request (update, select) 3) onlyLoad: load the core on the fly on the first incoming request (update, select), if it exists on disk Concerning: - sharing the underlying solrconfig object: the configset introduced in JIRA SOLR-4478 seems to be the solution for non-SolrCloud mode. We need to test it for our use case. If another solution exists, please tell me. We are very interested in such functionality and in contributing, if we can. - the possibility of lotsOfCores in SolrCloud: we don't know in detail how SolrCloud works. But one possible limit is the maximum number of entries that can be added to a ZooKeeper node. Maybe a solution would be just a kind of hashing in the ZooKeeper tree. - the time to discover cores in Solr 4.4: with a spinning disk under Linux, all cores with transient=true and loadOnStartup=false, and the Linux buffer cache empty before starting Solr: 15K cores takes around 4 minutes. It's linear in the number of cores, so for 50K it's more than 13 minutes. In fact, it corresponds to the time needed to read all the core.properties files. Doing that in the background and blocking on a request until core discovery is complete would not work for us (due to the worst case). So, we will just disable core discovery, because we don't need to know all cores from the start.
Start Solr without any core entries in solr.xml, and we will use the cores Auto option to create and load, or only load, the core on the fly, based on the existence of the core on the disk (absolute path calculated from the core name). Thanks for your interest, Olivier From: Erick Erickson [erickerick...@gmail.com] Sent: Monday, 7 October 2013 14:33 To: solr-user@lucene.apache.org Subject: Re: feedback on Solr 4.x LotsOfCores feature Thanks for the great writeup! It's always interesting to see how a feature plays out in the real world. A couple of questions though: bq: We added 2 Cores options: Do you mean you patched Solr? If so, are you willing to share the code back? If both are yes, please open a JIRA, attach the patch and assign it to me. bq: the number of file descriptors, it used a lot (need to increase global max and per process fd) Right, this makes sense since you have a bunch of cores all with their own descriptors open. I'm assuming that you hit a rather high max number and
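The numBuckets option described in this thread maps each core to a subdirectory via a hash of the core name modulo the bucket count, so no single directory has to hold tens of thousands of core dirs. A minimal sketch of that bucketing idea (the hash function and path layout here are illustrative assumptions, not the actual SOLR-5316 code):

```python
# Toy sketch of LotsOfCores-style bucketing: spread N cores over
# numBuckets subdirectories under the data root.
import hashlib

def bucket_path(data_root: str, core_name: str, num_buckets: int = 256) -> str:
    # Use a stable hash (Python's built-in hash() is salted per process)
    h = int(hashlib.md5(core_name.encode("utf-8")).hexdigest(), 16)
    return f"{data_root}/{h % num_buckets}/{core_name}"

# The same core name always lands in the same bucket directory.
print(bucket_path("/var/solr/cores", "customer_00042"))
```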
Re: Proximity search with wildcard
Generally in Solr, if we give "Company engage"~5 it will return results containing "engage" within 5 words of "Company". So here I want to get the same kind of results if I give the query with a wildcard, such as "Compa* engage"~5 - Sayeed -- View this message in context: http://lucene.472066.n3.nabble.com/Proximity-search-with-wildcard-tp4096285p4096354.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Filter cache pollution during sharded edismax queries
Hi Ken, Have you managed to find out why these entries were stored in the filterCache and if they have an impact on the hit ratio? We noticed the same problem; there are entries of this type: item_+(+(title:western^10.0 | ... in our filterCache. Thanks, Anca On 07/02/2013 09:01 PM, Ken Krugler wrote: Hi all, After upgrading from Solr 3.5 to 4.2.1, I noticed our filterCache hit ratio had dropped significantly. Previously it was at 95+%, but now it's 50%. I enabled recording 100 entries for debugging, and in looking at them it seems that edismax (and faceting) is creating entries for me. This is in a sharded setup, so it's a distributed search. If I do a search for the string bogus text using edismax on two fields, I get an entry in each of the shard's filter caches that looks like: item_+(((field1:bogus | field2:bogu) (field1:text | field2:text))~2): Is this expected? I have a similar situation happening during faceted search, even though my fields are single-value/untokenized strings, and I'm not using the enum facet method. But I'll get many, many entries in the filterCache for facet values, and they all look like item_facet field:facet value: The net result of the above is that even with a very big filterCache size of 2K, the hit ratio is still only 60%. Thanks for any insights, -- Ken -- Ken Krugler +1 530-210-6378 http://www.scaleunlimited.com custom big data solutions training Hadoop, Cascading, Cassandra, Solr
Re: Concurent indexing
Erick, yes. Using SolrJ and CloudSolrServer - both 4.6 snapshots from 13 Oct On 18 October 2013 12:17, Erick Erickson erickerick...@gmail.com wrote: Chris: OK, one of those stack traces does have the problem I referenced in the other thread. Are you sending updates to the server with SolrJ? And are you using CloudSolrServer? If you are, I'm surprised... Here are the important lines: 1. - java.util.concurrent.Semaphore.acquire() @bci=5, line=317 (Compiled frame) 2. - org.apache.solr.util.AdjustableSemaphore.acquire() @bci=4, line=61 (Compiled frame) 3. - org.apache.solr.update.SolrCmdDistributor.submit(org.apache.solr.update.SolrCmdDistributor$Request) @bci=22, line=418 (Compiled frame) 4. - org.apache.solr.update.SolrCmdDistributor.submit(org.apache.solr.client.solrj.request.UpdateRequest, On Wed, Oct 16, 2013 at 2:04 PM, Chris Geeringh geeri...@gmail.com wrote: Here's another jstack http://pastebin.com/8JiQc3rb On 16 October 2013 11:53, Chris Geeringh geeri...@gmail.com wrote: Hi Erick, here is a paste from the other thread (debugging update request) with my input, as I am seeing errors too: I ran an import last night, and this morning my cloud wouldn't accept updates. I'm running the latest 4.6 snapshot. I was importing with the latest SolrJ snapshot, and using javabin transport with CloudSolrServer. The cluster had indexed ~1.3 million docs before no further updates were accepted; querying still worked. I'll run jstack shortly and provide the results. Here is my jstack output... Lots of blocked threads. http://pastebin.com/1ktjBYbf On 16 October 2013 11:46, Erick Erickson erickerick...@gmail.com wrote: Run jstack on the Solr process (standard with Java) and look for the word semaphore. You should see your servers blocked on this in the Solr code. That'll pretty much nail it. There's an open JIRA to fix the underlying cause, see SOLR-5232, but that's currently slated for 4.6 which won't be cut for a while.
Also, there's a patch that will fix this as a side effect, assuming you're using SolrJ: see SOLR-4816. This is available in 4.5. Best, Erick On Tue, Oct 15, 2013 at 1:33 PM, michael.boom my_sky...@yahoo.com wrote: Here's some of Solr's last words (log content before it stopped accepting updates), maybe someone can help me interpret that. http://pastebin.com/mv7fH62H -- View this message in context: http://lucene.472066.n3.nabble.com/Concurent-indexing-tp4095409p4095642.html Sent from the Solr - User mailing list archive at Nabble.com.
querying nested entity fields
Hi, can someone help me determine if the query below is possible? Schema: <tag> <category>A <product>product1</product> <product>product2</product> </category> <category>B <product>product12</product> <product>product23</product> </category> </tag> Is it possible to query like this: q=tag.category:A AND tag.category.product=product1 ??? -- View this message in context: http://lucene.472066.n3.nabble.com/querying-nested-entity-fields-tp4096382.html Sent from the Solr - User mailing list archive at Nabble.com.
RE: Facet performance
Lemke, Michael SZ/HZA-ZSW [lemke...@schaeffler.com] wrote: 1. q=word&facet.field=CONTENT&facet=true&facet.prefix=&facet.limit=10&facet.mincount=1&facet.method=enum&rows=0 2. q=word&facet.field=CONTENT&facet=true&facet.prefix=a&facet.limit=10&facet.mincount=1&facet.method=enum&rows=0 The only difference is an empty facet.prefix in the first query. The first query returns after some 20 seconds (QTime 2 in the result) while the second one takes only 80 msec (QTime 80). Why is this? If your index was just opened when you issued your queries, the first request will be notably slower than the second as the facet values might not be in the disk cache. Furthermore, for enum the difference between no prefix and some prefix is huge. As enum iterates values first (as opposed to fc, which iterates hits first), limiting to only the values that start with 'a' ought to speed up retrieval by a factor of 10 or more. And as a side note: facet.method=fc makes the queries run 'forever' and eventually fail with org.apache.solr.common.SolrException: Too many values for UnInvertedField faceting on field CONTENT. An internal memory structure optimization in Solr limits the number of possible unique values when using fc. It is not a bug as such, but more a consequence of a choice. Unfortunately the enum solution is normally quite slow when there are enough unique values to trigger the 'too many values' exception. I know too little about the structures for DocValues to say if they will help here, but you might want to take a look at those. - Toke Eskildsen
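Toke's point about enum iterating values first can be illustrated with a toy model (pure illustration, not Solr's actual code): with a prefix, the range of the sorted term dictionary that must be scanned shrinks before any per-term counting happens.

```python
# Toy model of facet.method=enum: walk a sorted term dictionary and count
# docs per term. A facet.prefix lets us binary-search to the prefix range
# and skip every term outside it entirely.
from bisect import bisect_left, bisect_right

term_index = sorted(["aardvark", "apple", "axe", "banana", "cherry", "zebra"])

def enum_facet_terms(prefix: str):
    # Restrict the scan to [prefix, prefix + U+FFFF) via binary search
    lo = bisect_left(term_index, prefix)
    hi = bisect_right(term_index, prefix + "\uffff") if prefix else len(term_index)
    return term_index[lo:hi]

print(len(enum_facet_terms("")))   # empty prefix: every term must be visited
print(len(enum_facet_terms("a")))  # prefix 'a': only the 'a...' terms
```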
Re: solrconfig.xml carrot2 params
Thanks, I'm new to the clustering libraries. I finally made this connection when I started browsing through the carrot2 source. I had pulled down a smaller MM document collection from our test environment. It was not ideal as it was mostly structured, but small. I foolishly thought I could cluster on the text copy field before realizing that it was index only. Doh! Our documents are indexed in SolrCloud, but stored in HBase. I want to allow users to page through Solr hits, but would like to cluster on all (or at least several thousand) of the top search hits. Now I'm puzzling over how to efficiently cluster over possibly several thousand Solr hits when the documents are in HBase. I thought of an HBase coprocessor, but carrot2 isn't designed for distributed computation. Mahout, in the Hadoop M/R context, seems slow and heavy-handed for this scale; maybe I just need to dig deeper into their library. Or I could just be missing something fundamental? :) -----Original Message----- From: Stanislaw Osinski stanislaw.osin...@carrotsearch.com Sent: Friday, October 18, 2013 5:04am To: solr-user@lucene.apache.org Subject: Re: solrconfig.xml carrot2 params Hi, Out of curiosity -- what would you like to achieve by changing Tokenizer.documentFields? If you want to have clustering applied to more than one document field, you can provide a comma-separated list of fields in the carrot.title and/or carrot.snippet parameters. Thanks, Staszek -- Stanislaw Osinski, stanislaw.osin...@carrotsearch.com http://carrotsearch.com On Thu, Oct 17, 2013 at 11:49 PM, youknow...@heroicefforts.net youknow...@heroicefforts.net wrote: Would someone help me out with the syntax for setting Tokenizer.documentFields in the ClusteringComponent engine definition in solrconfig.xml? Carrot2 is expecting a Collection of Strings. There's no schema definition for this XML file and a big TODO on the Wiki wrt init params. Every permutation I have tried results in an error stating: Cannot set java.util.Collection field ...
to java.lang.String. -- Sent from my Android phone with K-9 Mail. Please excuse my brevity.
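For reference, the carrot.title/carrot.snippet parameters Staszek mentions live in the clustering engine definition in solrconfig.xml. A rough sketch (field names are placeholders; check your own schema and the clustering contrib docs for the exact component setup):

```xml
<searchComponent name="clustering" class="solr.clustering.ClusteringComponent">
  <lst name="engine">
    <str name="name">default</str>
    <!-- comma-separated lists of fields to feed the clustering algorithm -->
    <str name="carrot.title">title</str>
    <str name="carrot.snippet">title,body</str>
  </lst>
</searchComponent>
```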
Re: how to retireve content page in solr
Hi Harshvardhan Ojha, I'm using Nutch 1.1 and Solr 3.6.0. I mean the whole document. I am trying to create a search engine with Nutch and Solr, and I would like an interface like this: name1 http://www.prova.com/name1.html first rows of content document name2 http://www.prova.com/name2.html first rows of content document name3 http://www.prova.com/name3.html first rows of content document Any ideas? Thanks Danilo -- View this message in context: http://lucene.472066.n3.nabble.com/how-to-retireve-content-page-in-solr-tp4096302p4096333.html Sent from the Solr - User mailing list archive at Nabble.com.
Solr timeout after reboot
I have a SolrCloud environment with 4 shards, each having a replica and a leader. The index size is about 70M docs and 60GB, running with Jetty + ZooKeeper, on 2 EC2 instances, each with 4 CPUs and 15GB RAM. I'm using SolrMeter for stress testing. If I restart Jetty and then try to use SolrMeter to bomb an instance with queries, using a query-per-minute rate of 3000, then that Solr instance somehow times out and I need to restart it again. If instead of using 3000 qpm I start up slowly with 200 for a minute or two, then 1800 and then 3000, everything is good. I assume this happens because Solr is not warmed up. What settings could I tweak so that Solr doesn't time out anymore when getting many requests? Is there a way to limit how many requests it can serve? - Thanks, Michael -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-timeout-after-reboot-tp4096408.html Sent from the Solr - User mailing list archive at Nabble.com.
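One common answer to this kind of cold-start collapse is static warming in solrconfig.xml, so caches are primed before the first real query hits a freshly opened searcher. A sketch (the queries themselves are placeholders; use queries representative of your real traffic, including your typical sorts and facets):

```xml
<listener event="firstSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <!-- run representative queries before the searcher serves traffic -->
    <lst><str name="q">*:*</str><str name="rows">10</str></lst>
    <lst><str name="q">popular term</str><str name="facet">true</str></lst>
  </arr>
</listener>
```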
Fwd: Searching within list of regions with 1:1 document-region mapping
Hi, I have a Solr index of around 100 million documents, with each document being given a region id, growing at a rate of about 10 million documents per month - the average document size being around 10KB of pure text. The total number of region ids is itself in the range of 2.5 million. I want to search for a query within a given list of region ids. The number of region ids in this list is usually around 250-300 (most of the time), but can be up to 500, with a maximum cap of around 2000 ids in one request. What is the best way to model such queries, besides using an IN-style param in the query, a filter (fq) in the query, or some other means? If it may help, the index is on a VM with 4 virtual cores and currently has 4GB of Java memory allocated out of the 16GB in the machine. The number of queries does not exceed 1 per minute for now. If needed, we can throw more hardware at the index - but the index will still be only on a single machine for at least 6 months. Best Regards, Sandeep Gupta
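One straightforward way to express such a restriction is a boolean filter query over the region field, which Solr caches independently of the main query (field name and ids below are illustrative):

```text
q=user+query&fq=region_id:(17 42 93 1024)
```

With several hundred ids the fq gets long, and with the larger lists you may bump into the maxBooleanClauses limit in solrconfig.xml (default 1024), which would then need raising. Whether this stays fast at ~2000 ids is worth benchmarking against your own index.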
RE: Facet performance
Toke Eskildsen [mailto:t...@statsbiblioteket.dk] wrote: Lemke, Michael SZ/HZA-ZSW [lemke...@schaeffler.com] wrote: 1. q=word&facet.field=CONTENT&facet=true&facet.prefix=&facet.limit=10&facet.mincount=1&facet.method=enum&rows=0 2. q=word&facet.field=CONTENT&facet=true&facet.prefix=a&facet.limit=10&facet.mincount=1&facet.method=enum&rows=0 The only difference is an empty facet.prefix in the first query. The first query returns after some 20 seconds (QTime 2 in the result) while the second one takes only 80 msec (QTime 80). Why is this? If your index was just opened when you issued your queries, the first request will be notably slower than the second as the facet values might not be in the disk cache. I know, but it shouldn't be orders of magnitude as in this example, should it? Furthermore, for enum the difference between no prefix and some prefix is huge. As enum iterates values first (as opposed to fc, which iterates hits first), limiting to only the values that start with 'a' ought to speed up retrieval by a factor of 10 or more. Thanks. That is what we sort of figured, but it's good to know for sure. Of course it begs the question whether there is a way to speed this up? And as a side note: facet.method=fc makes the queries run 'forever' and eventually fail with org.apache.solr.common.SolrException: Too many values for UnInvertedField faceting on field CONTENT. An internal memory structure optimization in Solr limits the number of possible unique values when using fc. It is not a bug as such, but more a consequence of a choice. Unfortunately the enum solution is normally quite slow when there are enough unique values to trigger the 'too many values' exception. I know too little about the structures for DocValues to say if they will help here, but you might want to take a look at those. What is DocValues? I haven't heard of it yet. And yes, the fc method was terribly slow in a case where it did work. Something like 20 minutes, whereas enum returned within a few seconds. Michael
Re: Check if dynamic columns exists and query else ignore
Bumping this one, any suggestions? Looks like if() and exists() are meant to solve this problem, but I am using them in a wrong way. -Utkarsh On Thu, Oct 17, 2013 at 1:16 PM, Utkarsh Sengar utkarsh2...@gmail.com wrote: I am trying to do this: if (US_offers_i exists): fq=US_offers_i:[1 TO *] else: fq=offers_count:[1 TO *] Where: US_offers_i is a dynamic field containing an int, and offers_count is a status field containing an int. I have tried this so far but it doesn't work: http://solr_server/solr/col1/select? q=iphone+5s fq=if(exist(US_offers_i),US_offers_i:[1 TO *], offers_count:[1 TO *]) Also, is there a heavy performance penalty for this condition? I am planning to use this for all my queries. -- Thanks, -Utkarsh
Issues with Language detection in Solr
Hi All, I am trying to detect the language of the business name field and the address field. I am using Solr's LangDetect (the Google library), not Tika. It works OK in most cases, but in some it detects the language wrongly. For example, the document - OrgName: EXPLOITS VALLEY HIGHGREENWOOD, StreetLine1: 19 GREENWOOD AVE, StreetLine2: , SOrgName: EXPLOITS VALLEY HIGHGREENWOOD, StandardizedStreetLine1: 19 GREENWOOD AVE, language_s: [de] - the language is detected as German (de) here, which is wrong. My configuration is: fields = OrgName,StreetLine1,StreetLine2,SOrgName,StandardizedStreetLine1; language field = language_s; threshold = 0.9; fallback = en. Why is there an issue? Why is the language detection wrong? Please help! Vibhor -- View this message in context: http://lucene.472066.n3.nabble.com/Issues-with-Language-detection-in-Solr-tp4096433.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Issues with Language detection in Solr
I would say that in general you need at least 15 or 20 words in a text field for language to be detected reasonably well. Sure, sometimes it can work for 8 to 12 words, but it's a coin flip how reliable it will be. You haven't shown us any true text fields. I would say that language detection against simple name fields is a misuse of the language detection feature. I mean, it is designed for larger blocks of text, not very short phrases. See some examples in my e-book. -- Jack Krupansky -----Original Message----- From: vibhoreng04 Sent: Friday, October 18, 2013 2:01 PM To: solr-user@lucene.apache.org Subject: Issues with Language detection in Solr Hi All, I am trying to detect the language of the business name field and the address field. I am using Solr's LangDetect (the Google library), not Tika. It works OK in most cases, but in some it detects the language wrongly. For example, the document - OrgName: EXPLOITS VALLEY HIGHGREENWOOD, StreetLine1: 19 GREENWOOD AVE, StreetLine2: , SOrgName: EXPLOITS VALLEY HIGHGREENWOOD, StandardizedStreetLine1: 19 GREENWOOD AVE, language_s: [de] - the language is detected as German (de) here, which is wrong. My configuration is: fields = OrgName,StreetLine1,StreetLine2,SOrgName,StandardizedStreetLine1; language field = language_s; threshold = 0.9; fallback = en. Why is there an issue? Why is the language detection wrong? Please help! Vibhor -- View this message in context: http://lucene.472066.n3.nabble.com/Issues-with-Language-detection-in-Solr-tp4096433.html Sent from the Solr - User mailing list archive at Nabble.com.
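Jack's point - that a detector's guess on a handful of short, all-caps address tokens is essentially noise - is exactly what the threshold/fallback pattern in the configuration above is for. A toy illustration of that logic (the detector here is a stand-in stub, not Solr's or Google's actual detector; it only shows how a confidence threshold plus a fallback language behaves):

```python
# Toy illustration of threshold-plus-fallback logic, similar in spirit to
# langid-style settings: when the detector's confidence on a field is below
# the threshold, fall back rather than trust a shaky guess like "de".
def pick_language(detected: str, confidence: float,
                  threshold: float = 0.9, fallback: str = "en") -> str:
    # A short all-caps name line gives a detector little evidence, so its
    # low-confidence guess is discarded in favour of the fallback.
    return detected if confidence >= threshold else fallback

print(pick_language("de", 0.55))  # weak evidence on a short field
print(pick_language("fr", 0.97))  # strong evidence on a long paragraph
```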
Seeking New Moderators for solr-user@lucene
It looks like it's time to inject some fresh blood into the solr-user@lucene moderation team. If you'd like to volunteer to be a moderator, please reply back to this thread and specify which email address you'd like to use as a moderator (if different from the one you use when sending the email). Being a moderator is really easy: you'll get some extra emails in your inbox with MODERATE in the subject, which you skim to see if they are spam -- if they are, you delete them; if not, you reply-all to let them get sent to the list, and authorize that person to send future messages w/o moderation. Occasionally, you'll see an explicit email to solr-user-owner@lucene from a user asking for help related to their subscription (usually unsubscribing problems), and you and the other moderators chime in with assistance when possible. More details can be found here... https://wiki.apache.org/solr/MailingListModeratorInfo (I'll wait ~72+ hours to see who responds, and then file the appropriate jira with INFRA) -Hoss
Re: Switching indexes
I was able to get the new collections working dynamically (via Collections RESTful calls). I was having some other issues with my development environment that I had to fix up to get it going. I had to upgrade to 4.5 in order for the aliases to work at all though. Not sure what the deal was with that. Thanks Shawn -- I have a much better understanding of all this now. -- Chris On Thu, Oct 17, 2013 at 7:31 PM, Shawn Heisey s...@elyograg.org wrote: On 10/17/2013 12:51 PM, Christopher Gross wrote: OK, super confused now. http://index1:8080/solr/admin/cores?action=CREATE&name=test2&collection=test2&numshards=1&replicationFactor=3 Nets me this: <response> <lst name="responseHeader"> <int name="status">400</int> <int name="QTime">15007</int> </lst> <lst name="error"> <str name="msg">Error CREATEing SolrCore 'test2': Could not find configName for collection test2 found:[xxx, xxx, , x, xx]</str> <int name="code">400</int> </lst> </response> For that node (test2), in my solr data directory, I have a folder with the conf files and an existing data dir (copied the index from another location). Right now it seems like the only way that I can add in a collection is to load the configs into zookeeper, stop tomcat, add it to the solr.xml file, and restart tomcat. The config does need to be loaded into zookeeper. That's how SolrCloud works. Because you have existing collections, you're going to have at least one config set already uploaded; you may be able to use that directly. You don't need to stop anything, though. Michael Della Bitta's response indicates the part you're missing on your create URL - the collection.configName parameter. The basic way to get things done with collections is this: 1) Upload one or more named config sets to zookeeper. This can be done with zkcli and its upconfig command, or with the bootstrap startup options that are intended to be used once.
2) Create the collection, referencing the proper collection.configName. You can have many collections that all share one config name. You can also change which config an existing collection uses with the zkcli linkconfig command, followed by a collection reload. If you upload a new configuration with an existing name, a collection reload (or Solr restart) is required to use the new config. For uploading configs, I find zkcli to be a lot cleaner than the bootstrap options - it doesn't require stopping Solr or giving it different startup options. Actually, it doesn't even require Solr to be started - it talks only to zookeeper, and we strongly recommend standalone zookeeper, not the zk server that can be run embedded in Solr. Thanks, Shawn
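Putting Shawn's two steps together, a typical sequence looks roughly like this (host names, paths, and the config name are placeholders; note the Collections API endpoint /admin/collections rather than the CoreAdmin /admin/cores used in the failing URL above):

```text
# 1) upload a named config set to ZooKeeper
zkcli.sh -cmd upconfig -zkhost zk1:2181 -confdir /path/to/conf -confname myconf

# 2) create the collection against that named config
http://index1:8080/solr/admin/collections?action=CREATE&name=test2&numShards=1&replicationFactor=3&collection.configName=myconf
```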
Re: Check if dynamic columns exists and query else ignore
: I trying to do this: : : if (US_offers_i exists): :fq=US_offers_i:[1 TO *] : else: :fq=offers_count:[1 TO *] if() and exist() are functions, so you would have to explicitly use them in a function context (ie: {!func} parser, or {!frange} parser), and to use those nested queries inside of functions you'd need to use the query() function. But nothing about your problem description suggests that you really need to worry about this. If a document doesn't contain US_offers_i then US_offers_i:[1 TO *] won't match that document, and neither will US_offers_i:[* TO *] -- so you can implement the logic you describe with a simple query... fq=(US_offers_i:[1 TO *] (offers_count:[1 TO *] -US_offers_i:[* TO *])) Which you can read as: Match docs with 1 or more US offers, or docs that have 1 or more offers but no US offer field at all. : Also, there is a heavy performance penalty for this condition? I am : planning to use this for all my queries. Any logic that you do at query time which can be precomputed into a specific field in your index will *always* make the queries faster (at the expense of a little more time spent indexing and a little more disk used). If you know in advance that you are frequently going to want to restrict on this type of logic, then unless you index docs more often than you search them, you should almost certainly index a has_offers boolean field that captures this logic. -Hoss
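Hoss's filter query can be sanity-checked with a tiny model of its matching logic (illustration only; field names are the ones from the thread):

```python
# Model of fq=(US_offers_i:[1 TO *] (offers_count:[1 TO *] -US_offers_i:[* TO *]))
# A doc matches if US_offers_i >= 1, OR it has no US_offers_i field at all
# but offers_count >= 1. Note: a doc with US_offers_i=0 matches neither clause.
def matches(doc: dict) -> bool:
    if "US_offers_i" in doc:
        return doc["US_offers_i"] >= 1
    return doc.get("offers_count", 0) >= 1

docs = [
    {"US_offers_i": 3, "offers_count": 0},  # matches: has US offers
    {"US_offers_i": 0, "offers_count": 9},  # no match: US field present but 0
    {"offers_count": 2},                    # matches: falls back to offers_count
    {"offers_count": 0},                    # no match
]
print([matches(d) for d in docs])
```

The second doc is the interesting edge case: because the US field is *present* (just zero), the fallback clause is excluded by -US_offers_i:[* TO *], matching the "else" semantics the original poster asked for.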
Re: Issues with Language detection in Solr
I agree with you, Jack. But I request you to see here that this filter otherwise works perfectly fine. Only in one case, where even all the words are Latin, is the language detected as German. My question is why and how? If it works perfectly for the other docs, what in this case is making it behave abnormally? -- View this message in context: http://lucene.472066.n3.nabble.com/Issues-with-Language-detection-in-Solr-tp4096433p4096443.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Seeking New Moderators for solr-user@lucene
Hey Hoss, I'd be happy to moderate. Sent from my iPhone On 19-Oct-2013, at 0:22, Chris Hostetter hossman_luc...@fucit.org wrote: It looks like it's time to inject some fresh blood into the solr-user@lucene moderation team. If you'd like to volunteer to be a moderator, please reply back to this thread and specify which email address you'd like to use as a moderator (if different from the one you use when sending the email) Being a moderator is really easy: you'll get a some extra emails in your inbox with MODERATE in the subject, which you skim to see if they are spam -- if they are you delete them, if not you reply all to let them get sent to the list, and authorize that person to send future messages w/o moderation. Occasionally, you'll see an explicit email to solr-user-owner@lucene from a user asking for help realted to their subscription (usually unsubscribing problems) and you and the other moderators chime in with assistance when possible. More details can be found here... https://wiki.apache.org/solr/MailingListModeratorInfo (I'll wait ~72+ hours to see who responds, and then file the appropriate jira with INFRA) -Hoss
Re: Questions developing custom functionquery
: Field-Type: org.apache.solr.schema.TextField ... : DocTermsIndexDocValues (http://grepcode.com/file/repo1.maven.org/maven2/org.apache.lucene/lucene-queries/4.3.0/org/apache/lucene/queries/function/docvalues/DocTermsIndexDocValues.java) : Calling getVal() on a DocTermsIndexDocValues does some really weird stuff : that I really don't understand. Your TextField is being analyzed in some way you haven't clarified, and the DocTermsIndexDocValues you get contains the details of each term in that TextField : Its possible I'm going about this wrong and need to re-do my approach. I'm : just currently at a loss for what that approach is. Based on your initial goal, you are most certainly going about this in a much more complicated way than you need to... :My goal is to be able to implement a custom sorting technique. :Example: <str name="resname">/some/example/data/here/2013/09/12/testing.text</str> : :I would like to do a custom sort based on this resname field. :Basically, I would like to parse out that date there (2013/09/12) and : sort :on that date. You are going to be *MUCH* happier (both in terms of effort, and in terms of performance) if, instead of writing a custom function to parse strings at query time when sorting, you implement the parsing logic when indexing the doc and index it up front as a date field that you can sort on. I would suggest something like CloneFieldUpdateProcessorFactory + RegexReplaceProcessorFactory could save you the work of needing to implement any custom logic -- but as Jack pointed out in SOLR-4864 it doesn't currently allow you to do capture group replacements (but maybe you could contribute a patch to fix that instead of needing to write completely custom code for yourself). Or maybe, as is, you could use RegexReplaceProcessorFactory to throw away non-digits - and then use ParseDateFieldUpdateProcessorFactory to get what you want?
(I'm not certain - I haven't played with ParseDateFieldUpdateProcessorFactory much) https://issues.apache.org/jira/browse/SOLR-4864 https://lucene.apache.org/solr/4_5_0/solr-core/org/apache/solr/update/processor/RegexReplaceProcessorFactory.html https://lucene.apache.org/solr/4_5_0/solr-core/org/apache/solr/update/processor/CloneFieldUpdateProcessorFactory.html https://lucene.apache.org/solr/4_5_0/solr-core/org/apache/solr/update/processor/ParseDateFieldUpdateProcessorFactory.html -Hoss
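For what it's worth, Hoss's clone-then-parse idea could be wired up as an update processor chain in solrconfig.xml. The sketch below is untested; the chain name, the resname_dt field, and the regex are assumptions, and stripping all non-digits only yields a clean yyyyMMdd value if the rest of the path contains no other digits:

```xml
<!-- Hypothetical chain: copy resname, strip non-digits, parse as a date. -->
<updateRequestProcessorChain name="parse-resname-date">
  <processor class="solr.CloneFieldUpdateProcessorFactory">
    <str name="source">resname</str>
    <str name="dest">resname_dt</str>
  </processor>
  <processor class="solr.RegexReplaceProcessorFactory">
    <str name="fieldName">resname_dt</str>
    <str name="pattern">[^0-9]</str>
    <str name="replacement"></str>
  </processor>
  <processor class="solr.ParseDateFieldUpdateProcessorFactory">
    <str name="fieldName">resname_dt</str>
    <arr name="format">
      <str>yyyyMMdd</str>
    </arr>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```

The chain would then be selected per request with update.chain=parse-resname-date (or made the default on the update handler), with resname_dt declared as a sortable date field in schema.xml.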
Re: Seeking New Moderators for solr-user@lucene
Hi Chris, I would like to moderate and you can use the mail id vibhoren...@gmail.com for this purpose . Regards, Vibhor Jaiswal -- View this message in context: http://lucene.472066.n3.nabble.com/Seeking-New-Moderators-for-solr-user-lucene-tp4096447p4096448.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Seeking New Moderators for solr-user@lucene
Hello! I can help with moderation. -- Regards, Rafał Kuć Sematext :: http://sematext.com/ :: Solr - Lucene - ElasticSearch It looks like it's time to inject some fresh blood into the solr-user@lucene moderation team. If you'd like to volunteer to be a moderator, please reply back to this thread and specify which email address you'd like to use as a moderator (if different from the one you use when sending the email) Being a moderator is really easy: you'll get some extra emails in your inbox with MODERATE in the subject, which you skim to see if they are spam -- if they are you delete them, if not you reply all to let them get sent to the list, and authorize that person to send future messages w/o moderation. Occasionally, you'll see an explicit email to solr-user-owner@lucene from a user asking for help related to their subscription (usually unsubscribing problems) and you and the other moderators chime in with assistance when possible. More details can be found here... https://wiki.apache.org/solr/MailingListModeratorInfo (I'll wait ~72+ hours to see who responds, and then file the appropriate jira with INFRA) -Hoss
Re: Facet performance
DocValues is the new black http://wiki.apache.org/solr/DocValues Otis -- Solr ElasticSearch Support -- http://sematext.com/ SOLR Performance Monitoring -- http://sematext.com/spm On Fri, Oct 18, 2013 at 12:30 PM, Lemke, Michael SZ/HZA-ZSW lemke...@schaeffler.com wrote: Toke Eskildsen [mailto:t...@statsbiblioteket.dk] wrote: Lemke, Michael SZ/HZA-ZSW [lemke...@schaeffler.com] wrote: 1. q=word&facet.field=CONTENT&facet=true&facet.prefix=&facet.limit=10&facet.mincount=1&facet.method=enum&rows=0 2. q=word&facet.field=CONTENT&facet=true&facet.prefix=a&facet.limit=10&facet.mincount=1&facet.method=enum&rows=0 The only difference is an empty facet.prefix in the first query. The first query returns after some 20 seconds (QTime 2 in the result) while the second one takes only 80 msec (QTime 80). Why is this? If your index was just opened when you issued your queries, the first request will be notably slower than the second as the facet values might not be in the disk cache. I know but it shouldn't be orders of magnitude as in this example, should it? Furthermore, for enum the difference between no prefix and some prefix is huge. As enum iterates values first (as opposed to fc, which iterates hits first), limiting to only the values that start with 'a' ought to speed up retrieval by a factor of 10 or more. Thanks. That is what we sort of figured but it's good to know for sure. Of course it begs the question if there is a way to speed this up? And as a side note: facet.method=fc makes the queries run 'forever' and eventually fail with org.apache.solr.common.SolrException: Too many values for UnInvertedField faceting on field CONTENT. An internal memory structure optimization in Solr limits the number of possible unique values when using fc. It is not a bug as such, but more a consequence of a choice. Unfortunately the enum solution is normally quite slow when there are enough unique values to trigger the 'too many values' exception. 
I know too little about the structures for DocValues to say if they will help here, but you might want to take a look at those. What is DocValues? Haven't heard of it yet. And yes, the fc method was terribly slow in a case where it did work. Something like 20 minutes whereas enum returned within a few seconds. Michael
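For reference, since the question "What is DocValues?" came up: DocValues (available since Solr 4.2) are enabled per field in schema.xml. An untested sketch; note that docValues requires a non-tokenized type, so an analyzed CONTENT field would need a string-typed copy, and whether this helps at very high cardinality is exactly the open question in this thread:

```xml
<field name="CONTENT_dv" type="string" indexed="true" stored="false" docValues="true"/>
<copyField source="CONTENT" dest="CONTENT_dv"/>
```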
RE: Facet performance
: 1. q=word&facet.field=CONTENT&facet=true&facet.prefix=&facet.limit=10&facet.mincount=1&facet.method=enum&rows=0 : 2. q=word&facet.field=CONTENT&facet=true&facet.prefix=a&facet.limit=10&facet.mincount=1&facet.method=enum&rows=0 : : The only difference is an empty facet.prefix in the first query. : If your index was just opened when you issued your queries, the first : request will be notably slower than the second as the facet values might : not be in the disk cache. : : I know but it shouldn't be orders of magnitude as in this example, should it? in and of itself: it can be if your index is large enough and none of the disk pages are in the file system buffer. More significantly, however, depending on how big your filterCache is, the first request could easily be caching all of the filters needed for the second query -- at a minimum it's definitely caching your main query, which will be re-used and save a lot of time independent of the faceting. -Hoss
SOLRJ replace document
How do I replace a document in solr using solrj library? I keep getting this error back: org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Atomic document updates are not supported unless <updateLog/> is configured I don't want to do partial updates, I just want to replace it... Thanks, Brent
Re: Check if dynamic columns exists and query else ignore
Thanks Chris! That worked! I overengineered my query! Thanks, -Utkarsh On Fri, Oct 18, 2013 at 12:02 PM, Chris Hostetter hossman_luc...@fucit.org wrote: : I'm trying to do this: : : if (US_offers_i exists): :fq=US_offers_i:[1 TO *] : else: :fq=offers_count:[1 TO *] if() and exists() are functions, so you would have to explicitly use them in a function context (i.e. {!func} parser, or {!frange} parser) and to use those nested queries inside of functions you'd need to use the query() function. but nothing about your problem description suggests that you really need to worry about this. If a document doesn't contain the US_offers_i field, then US_offers_i:[1 TO *] won't match that document, and neither will US_offers_i:[* TO *] -- so you can implement the logic you describe with a simple query... fq=(US_offers_i:[1 TO *] (offers_count:[1 TO *] -US_offers_i:[* TO *])) Which you can read as "Match docs with 1 or more US offers, or: docs that have 1 or more offers but no US offer field at all" : Also, is there a heavy performance penalty for this condition? I am : planning to use this for all my queries. Any logic that you do at query time, which can be precomputed into a specific field in your index, will *always* make the queries faster (at the expense of a little more time spent indexing and a little more disk used). If you know in advance that you are frequently going to want to restrict on this type of logic, then unless you index docs more often than you search, you should almost certainly index a has_offers boolean field that captures this logic. -Hoss -- Thanks, -Utkarsh
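For completeness, the function-based form Chris alludes to would look roughly like this, combining the if(), exists(), and {!frange} pieces he names (an untested sketch; the plain boolean fq above is simpler and likely faster since it caches as an ordinary filter):

```
fq={!frange l=1}if(exists(US_offers_i),US_offers_i,offers_count)
```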
loading djvu xml into solr
Does anyone have a schema they'd be willing to share for loading djvu xml into solr?
Re: loading djvu xml into solr
On Fri, Oct 18, 2013, at 10:11 PM, Sara Amato wrote: Does anyone have a schema they'd be willing to share for loading djvu xml into solr? I assume that djvu XML is a particular XML format? In which case, there is no schema that can do it. That's not how Solr works. You need to use the XML format expected by Solr. Or, you can add tr=.xsl to the URL, and use an XSL stylesheet to transform your XML into Solr's XML format. The schema defines the fields that are present in the index, not the format of the XML used. Upayavira
Re: Solr 4.3 Startup with Multiple Cores Hangs on Registering Core
Hello, I still have this issue using Solr 4.4, removing firstSearcher queries did make the problem go away. Note that I'm using Tomcat 7 and that if I'm using my own Java application launching an Embedded Solr Server pointing to the same Solr configuration the server fully starts with no hang. What is the xml tag syntax to have spellcheck=false for firstSearcher discussed above? Cheers, /jonatan --- HANG with Tomcat 7 (firstSearcher queries on) --- ... 2409 [coreLoadExecutor-3-thread-3] INFO org.apache.solr.handler.component.SpellCheckComponent – No queryConverter defined, using default converter 2409 [coreLoadExecutor-3-thread-3] INFO org.apache.solr.handler.component.QueryElevationComponent – Loading QueryElevation from: /var/lib/myapp/conf/elevate.xml 2415 [coreLoadExecutor-3-thread-3] INFO org.apache.solr.handler.ReplicationHandler – Commits will be reserved for 1 2415 [searcherExecutor-16-thread-1] INFO org.apache.solr.core.SolrCore – QuerySenderListener sending requests to Searcher@5c43ecf0main{StandardDirectoryReader(segments_3:23 _9(4.4):C57862)} 2417 [searcherExecutor-16-thread-1] INFO org.apache.solr.core.SolrCore – [foo-20130912] webapp=null path=null params={event=firstSearcherq=static+firstSearcher+warming+in+solrconfig.xmldistrib=false} hits=0 status=0 QTime=1 2417 [searcherExecutor-16-thread-1] INFO org.apache.solr.core.SolrCore – QuerySenderListener done. 
2417 [searcherExecutor-16-thread-1] INFO org.apache.solr.handler.component.SpellCheckComponent – Loading spell index for spellchecker: default 2417 [searcherExecutor-16-thread-1] INFO org.apache.solr.handler.component.SpellCheckComponent – Loading spell index for spellchecker: wordbreak 2418 [searcherExecutor-16-thread-1] INFO org.apache.solr.core.SolrCore – [foo-20130912] Registered new searcher Searcher@5c43ecf0main{StandardDirectoryReader(segments_3:23 _9(4.4):C57862)} 2420 [coreLoadExecutor-3-thread-3] INFO org.apache.solr.core.CoreContainer – registering core: foo-20130912 --- NO HANG EmbeddedSolrServer (firstSearcher queries on) --- ... 1797 [coreLoadExecutor-3-thread-1] INFO org.apache.solr.handler.component.SpellCheckComponent – No queryConverter defined, using default converter 1797 [coreLoadExecutor-3-thread-1] INFO org.apache.solr.handler.component.QueryElevationComponent – Loading QueryElevation from: /var/lib/myapp/conf/elevate.xml 1800 [coreLoadExecutor-3-thread-1] INFO org.apache.solr.handler.ReplicationHandler – Commits will be reserved for 1 1801 [searcherExecutor-15-thread-1] INFO org.apache.solr.core.SolrCore – QuerySenderListener sending requests to Searcher@27b104d7main{StandardDirectoryReader(segments_3:23 _9(4.4):C57862)} 1801 [searcherExecutor-15-thread-1] INFO org.apache.solr.core.SolrCore – QuerySenderListener done. 
1801 [searcherExecutor-15-thread-1] INFO org.apache.solr.handler.component.SpellCheckComponent – Loading spell index for spellchecker: default 1801 [coreLoadExecutor-3-thread-1] INFO org.apache.solr.core.CoreContainer – registering core: foo-20130912 1801 [searcherExecutor-15-thread-1] INFO org.apache.solr.handler.component.SpellCheckComponent – Loading spell index for spellchecker: wordbreak 1801 [searcherExecutor-15-thread-1] INFO org.apache.solr.core.SolrCore – [foo-20130912] Registered new searcher Searcher@27b104d7main{StandardDirectoryReader(segments_3:23 _9(4.4):C57862)} On Fri, Sep 6, 2013 at 4:29 PM, Austin Rasmussen arasmus...@directs.com wrote: : Do all of your cores have newSearcher event listeners configured or just : 2 (I'm trying to figure out if it's a timing fluke that these two are stalled, or if it's something special about the configs) All of my cores have both the newSearcher and firstSearcher event listeners configured. (The firstSearcher actually doesn't have any queries configured against it, so it probably should just be removed altogether) : Can you try removing the newSearcher listeners to confirm that that does in fact make the problem go away? Removing the newSearcher listeners does not make the problem go away; however, removing the firstSearcher listener (even if the newSearcher listener is still configured) does make the problem go away. : With the newSearcher listeners in place, can you try setting spellcheck=false as a query param on the newSearcher listeners you have configured and : see if that works around the problem? Adding the spellcheck=false param to the firstSearcher listener does appear to work around the problem. : Assuming it's just 2 cores using these listeners: can you reproduce this problem with a simpler setup where only one of the affected cores is in use? Since it's not just these two cores, I'm not sure how to produce much of a simpler setup. 
I did attempt to limit how many cores are loaded in the solr.xml, and found that if I cut it down to 56, it was able to load successfully (without any of the above config changed). If I cut it down to 57 cores, it doesn't
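To answer the syntax question asked earlier in this thread: the spellcheck=false parameter goes on each warming query inside the firstSearcher listener in solrconfig.xml, roughly like this (untested; the query text here is just the stock example query):

```xml
<listener event="firstSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst>
      <str name="q">static firstSearcher warming in solrconfig.xml</str>
      <str name="spellcheck">false</str>
    </lst>
  </arr>
</listener>
```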
Re: SOLRJ replace document
To replace a Solr document, simply add it again using the same technique used to insert the original document. The set option for atomic update is only used when you wish to selectively update only some of the fields for a document, and that does require that the update log be enabled using updateLog. -- Jack Krupansky -Original Message- From: Brent Ryan Sent: Friday, October 18, 2013 4:59 PM To: solr-user@lucene.apache.org Subject: SOLRJ replace document How do I replace a document in solr using solrj library? I keep getting this error back: org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Atomic document updates are not supported unless updateLog/ is configured I don't want to do partial updates, I just want to replace it... Thanks, Brent
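The distinction Jack describes shows up directly in the XML update format: a full replace is a plain add reusing the same uniqueKey, while an atomic update marks fields with an update attribute (illustrative sketch; the field names are made up):

```xml
<!-- Full replace: the old document with the same uniqueKey is deleted. -->
<add>
  <doc>
    <field name="id">doc1</field>
    <field name="title">replacement title</field>
  </doc>
</add>

<!-- Atomic (partial) update: requires <updateLog/> in solrconfig.xml. -->
<add>
  <doc>
    <field name="id">doc1</field>
    <field name="title" update="set">new title only</field>
  </doc>
</add>
```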
Re: SOLRJ replace document
I wish that was the case but calling addDoc() is what's triggering that exception. On Friday, October 18, 2013, Jack Krupansky wrote: To replace a Solr document, simply add it again using the same technique used to insert the original document. The set option for atomic update is only used when you wish to selectively update only some of the fields for a document, and that does require that the update log be enabled using <updateLog/>. -- Jack Krupansky -Original Message- From: Brent Ryan Sent: Friday, October 18, 2013 4:59 PM To: solr-user@lucene.apache.org Subject: SOLRJ replace document How do I replace a document in solr using solrj library? I keep getting this error back: org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Atomic document updates are not supported unless <updateLog/> is configured I don't want to do partial updates, I just want to replace it... Thanks, Brent
Re: SOLRJ replace document
On 10/18/2013 2:59 PM, Brent Ryan wrote: How do I replace a document in solr using solrj library? I keep getting this error back: org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Atomic document updates are not supported unless <updateLog/> is configured I don't want to do partial updates, I just want to replace it... Replacing a document is done by simply adding the document, in the same way as if you were adding a new one. If you have properly configured Solr, the old one will be deleted before the new one is inserted. Properly configuring Solr means that you have a uniqueKey field in your schema, and that it is a simple type like string, int, long, etc, and is not multivalued. A TextField type that is tokenized cannot be used as the uniqueKey field. Thanks, Shawn
Re: loading djvu xml into solr
Ah, thanks for the clarification - I was having a serious misunderstanding! (As you can tell I'm newly off the tutorial and blundering ahead...) On Oct 18, 2013, at 2:22 PM, Upayavira wrote: On Fri, Oct 18, 2013, at 10:11 PM, Sara Amato wrote: Does anyone have a schema they'd be willing to share for loading djvu xml into solr? I assume that djvu XML is a particular XML format? In which case, there is no schema that can do it. That's not how Solr works. You need to use the XML format expected by Solr. Or, you can add tr=.xsl to the URL, and use an XSL stylesheet to transform your XML into Solr's XML format. The schema defines the fields that are present in the index, not the format of the XML used. Upayavira
Re: SOLRJ replace document
My schema is pretty simple and has a string field called solr_id as my unique key. Once I get back to my computer I'll send some more details. Brent On Friday, October 18, 2013, Shawn Heisey wrote: On 10/18/2013 2:59 PM, Brent Ryan wrote: How do I replace a document in solr using solrj library? I keep getting this error back: org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Atomic document updates are not supported unless <updateLog/> is configured I don't want to do partial updates, I just want to replace it... Replacing a document is done by simply adding the document, in the same way as if you were adding a new one. If you have properly configured Solr, the old one will be deleted before the new one is inserted. Properly configuring Solr means that you have a uniqueKey field in your schema, and that it is a simple type like string, int, long, etc, and is not multivalued. A TextField type that is tokenized cannot be used as the uniqueKey field. Thanks, Shawn
Re: Issues with Language detection in Solr
Sorry, but Latin is not on the list of supported languages: https://code.google.com/p/language-detection/wiki/LanguageList -- Jack Krupansky -Original Message- From: vibhoreng04 Sent: Friday, October 18, 2013 3:07 PM To: solr-user@lucene.apache.org Subject: Re: Issues with Language detection in Solr I agree with you, Jack. But I request you to see here that this filter still works perfectly fine. Only in one case, where even all the words are Latin, is the language getting detected as German. My question is why and how? If it works perfectly for the other docs, what in this case is making it behave abnormally? -- View this message in context: http://lucene.472066.n3.nabble.com/Issues-with-Language-detection-in-Solr-tp4096433p4096443.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: SOLRJ replace document
On 10/18/2013 3:36 PM, Brent Ryan wrote: My schema is pretty simple and has a string field called solr_id as my unique key. Once I get back to my computer I'll send some more details. If you are trying to use a Map object as the value of a field, that is probably why it is interpreting your add request as an atomic update. If this is the case, and you're doing it because you have a multivalued field, you can use a List object rather than a Map. If this doesn't sound like what's going on, can you share your code, or a simplification of the SolrJ parts of it? Thanks, Shawn
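Shawn's point, restated as an untested SolrJ sketch (this requires the SolrJ library and a running Solr instance, so it is illustrative only; the tags field and values are made up, solr_id is from this thread): a Map field value is serialized as an atomic update, while a List is an ordinary multivalued field.

```java
SolrInputDocument doc = new SolrInputDocument();
doc.addField("solr_id", "doc1");

// OK: a List is sent as a plain multivalued field
doc.addField("tags", Arrays.asList("red", "blue"));

// NOT OK for a plain add: a Map becomes an atomic update, which
// fails unless <updateLog/> is configured
// doc.addField("tags", Collections.singletonMap("set", "red"));

server.add(doc);
server.commit();
```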
Re: Seeking New Moderators for solr-user@lucene
I'll be happy to moderate. I do it for some other lists already. Regards, Alex
Leader election fails in some point.
Hi, In this screenshot I have a shard with two replicas without leader, http://picpaste.com/qf2jdkj8.png On machine with shard green I found this exception: INFO - dat5 - 2013-10-18 22:48:04.775; org.apache.solr.handler.admin.CoreAdminHandler; Going to wait for coreNodeName: 192.168.20.106:8983_solr_statistics-13_shard18_replica4, state: recovering, checkLive: true, onlyIfLeader: true ERROR - dat5 - 2013-10-18 22:48:04.775; org.apache.solr.common.SolrException; org.apache.solr.common.SolrException: We are not the leader at org.apache.solr.handler.admin.CoreAdminHandler.handleWaitForStateAction(CoreAdminHandler.java:824) at org.apache.solr.handler.admin.CoreAdminHandler.handleRequestBody(CoreAdminHandler.java:192) at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135) at org.apache.solr.servlet.SolrDispatchFilter.handleAdminRequest(SolrDispatchFilter.java:655) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:246) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:195) at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419) at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455) at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137) at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557) -- at org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:942) at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1004) at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:640) at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235) at org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72) at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264) at 
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608) at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543) at java.lang.Thread.run(Unknown Source) On the machine with the shard in recovery state I found this exception: INFO - dat6 - 2013-10-18 22:48:44.131; org.apache.solr.cloud.ShardLeaderElectionContext; Running the leader process for shard shard18 INFO - dat6 - 2013-10-18 22:48:44.137; org.apache.solr.cloud.ShardLeaderElectionContext; Checking if I should try and be the leader. INFO - dat6 - 2013-10-18 22:48:44.138; org.apache.solr.cloud.ShardLeaderElectionContext; My last published State was recovering, I won't be the leader. INFO - dat6 - 2013-10-18 22:48:44.139; org.apache.solr.cloud.ShardLeaderElectionContext; There may be a better leader candidate than us - going back into recovery INFO - dat6 - 2013-10-18 22:48:44.142; org.apache.solr.update.DefaultSolrCoreState; Running recovery - first canceling any ongoing recovery WARN - dat6 - 2013-10-18 22:48:44.142; org.apache.solr.cloud.RecoveryStrategy; Stopping recovery for zkNodeName=192.168.20.106:8983_solr_statistics-13_shard18_replica4core=statistics-13_shard18_replica4 INFO - dat6 - 2013-10-18 22:48:45.131; org.apache.solr.cloud.RecoveryStrategy; Finished recovery process. core=statistics-13_shard18_replica4 INFO - dat6 - 2013-10-18 22:48:45.131; org.apache.solr.cloud.RecoveryStrategy; Starting recovery process. 
core=statistics-13_shard18_replica4 recoveringAfterStartup=false INFO - dat6 - 2013-10-18 22:48:45.131; org.apache.solr.cloud.ZkController; publishing core=statistics-13_shard18_replica4 state=recovering INFO - dat6 - 2013-10-18 22:48:45.132; org.apache.solr.cloud.ZkController; numShards not found on descriptor - reading it from system property INFO - dat6 - 2013-10-18 22:48:45.141; org.apache.solr.client.solrj.impl.HttpClientUtil; Creating new http client, config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false ERROR - dat6 - 2013-10-18 22:48:45.143; org.apache.solr.common.SolrException; Error while trying to recover. core=statistics-13_shard18_replica4:org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: We are not the leader at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:424) at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:180) at org.apache.solr.cloud.RecoveryStrategy.sendPrepRecoveryCmd(RecoveryStrategy.java:198) at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:342) at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:219) No leader means we can't index data because a 503 http status code is returned. Is this the normal behaviour or a bug? - Best regards
Re: Solr timeout after reboot
Michael, The servlet container controls timeouts, max threads and such. That's not a high query rate, but yes, it could be that Solr or OS caches are cold. You will be able to see all this in SPM for Solr while you hammer your poor Solr servers :) Otis Solr ElasticSearch Support http://sematext.com/ On Oct 18, 2013 11:38 AM, michael.boom my_sky...@yahoo.com wrote: I have a SolrCloud environment with 4 shards, each having a replica and a leader. The index size is about 70M docs and 60Gb, running with Jetty + Zookeeper, on 2 EC2 instances, each with 4 CPUs and 15G RAM. I'm using SolrMeter for stress testing. If I restart Jetty and then try to use SolrMeter to bomb an instance with queries, using a query-per-minute rate of 3000, then that solr instance somehow times out and I need to restart it again. If instead of using 3000 qpm I start up slowly with 200 for a minute or two, then 1800 and then 3000, everything is good. I assume this happens because Solr is not warmed up. What settings could I tweak so that Solr doesn't time out anymore when getting many requests? Is there a way to limit how many req it can serve? - Thanks, Michael -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-timeout-after-reboot-tp4096408.html Sent from the Solr - User mailing list archive at Nabble.com.
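One common mitigation for this cold-start pattern is cache autowarming plus newSearcher warming queries in solrconfig.xml; a rough, untested sketch (the cache sizes and the query are placeholders to adapt to your own traffic):

```xml
<filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="128"/>
<queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="64"/>
<listener event="newSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst><str name="q">a typical production query</str></lst>
  </arr>
</listener>
```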
Re: XLSB files not indexed
Hi Roland, It looks like: Tika - yes Solr - no? Based on http://search-lucene.com/?q=xlsb ODF != XLSB though, I think... Otis -- Solr ElasticSearch Support -- http://sematext.com/ Performance Monitoring -- http://sematext.com/spm On Fri, Oct 18, 2013 at 7:36 AM, Roland Everaert reveatw...@gmail.com wrote: Hi, Can someone tells me if tika is supposed to extract data from xlsb files (the new MS Office format in binary form)? If so then it seems that solr is not able to index them like it is not able to index ODF files (a JIRA is already opened for ODF https://issues.apache.org/jira/browse/SOLR-4809) Can someone confirm the problem, or tell me what to do to make solr works with XLSB files. Regards, Roland.
Re: SolrCloud Performance Issue
Hi, What happens if you have just 1 shard - no distributed search, like before? SPM for Solr or any other monitoring tool that captures OS and Solr metrics should help you find the source of the problem faster. Is disk IO the same? utilization of caches? JVM version, heap, etc.? CPU usage? network? I'd look at each of these things side by side and look for big differences. Otis -- Solr ElasticSearch Support -- http://sematext.com/ SOLR Performance Monitoring -- http://sematext.com/spm On Fri, Oct 18, 2013 at 1:38 AM, shamik sham...@gmail.com wrote: I tried commenting out NOW in bq, but didn't make any difference in the performance. I do see minor entry in the queryfiltercache rate which is a meager 0.02. I'm really struggling to figure out the bottleneck, any known pain points I should be checking ? -- View this message in context: http://lucene.472066.n3.nabble.com/SolrCloud-Performance-Issue-tp4095971p4096277.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: SOLRJ replace document
So I think the issue might be related to the tech stack we're using, which is SOLR within DataStax Enterprise, which doesn't support atomic updates. But I think it must have some sort of bug around this because it doesn't appear to work correctly for this use case when using solrj ... Anyways, I've contacted support so let's see what they say. On Fri, Oct 18, 2013 at 5:51 PM, Shawn Heisey s...@elyograg.org wrote: On 10/18/2013 3:36 PM, Brent Ryan wrote: My schema is pretty simple and has a string field called solr_id as my unique key. Once I get back to my computer I'll send some more details. If you are trying to use a Map object as the value of a field, that is probably why it is interpreting your add request as an atomic update. If this is the case, and you're doing it because you have a multivalued field, you can use a List object rather than a Map. If this doesn't sound like what's going on, can you share your code, or a simplification of the SolrJ parts of it? Thanks, Shawn
Re: SOLRJ replace document
Keep in mind that DataStax has a custom update handler, and as such isn't exactly a vanilla Solr implementation (even though in many ways it still is). Since updates are co-written to Cassandra and Solr you should always tread a bit carefully when slightly outside what they perceive to be norms. On Oct 18, 2013, at 7:21 PM, Brent Ryan brent.r...@gmail.com wrote: So I think the issue might be related to the tech stack we're using which is SOLR within DataStax enterprise which doesn't support atomic updates. But I think it must have some sort of bug around this because it doesn't appear to work correctly for this use case when using solrj ... Anyways, I've contacted support so lets see what they say. On Fri, Oct 18, 2013 at 5:51 PM, Shawn Heisey s...@elyograg.org wrote: On 10/18/2013 3:36 PM, Brent Ryan wrote: My schema is pretty simple and has a string field called solr_id as my unique key. Once I get back to my computer I'll send some more details. If you are trying to use a Map object as the value of a field, that is probably why it is interpreting your add request as an atomic update. If this is the case, and you're doing it because you have a multivalued field, you can use a List object rather than a Map. If this doesn't sound like what's going on, can you share your code, or a simplification of the SolrJ parts of it? Thanks, Shawn
Re: SOLRJ replace document
By all means please do file a support request with DataStax, either as an official support ticket or as a question on StackOverflow. But, I do think the previous answer of avoiding the use of a Map object in your document is likely to be the solution. -- Jack Krupansky -Original Message- From: Brent Ryan Sent: Friday, October 18, 2013 10:21 PM To: solr-user@lucene.apache.org Subject: Re: SOLRJ replace document So I think the issue might be related to the tech stack we're using which is SOLR within DataStax enterprise which doesn't support atomic updates. But I think it must have some sort of bug around this because it doesn't appear to work correctly for this use case when using solrj ... Anyways, I've contacted support so lets see what they say. On Fri, Oct 18, 2013 at 5:51 PM, Shawn Heisey s...@elyograg.org wrote: On 10/18/2013 3:36 PM, Brent Ryan wrote: My schema is pretty simple and has a string field called solr_id as my unique key. Once I get back to my computer I'll send some more details. If you are trying to use a Map object as the value of a field, that is probably why it is interpreting your add request as an atomic update. If this is the case, and you're doing it because you have a multivalued field, you can use a List object rather than a Map. If this doesn't sound like what's going on, can you share your code, or a simplification of the SolrJ parts of it? Thanks, Shawn