Re: NRT or similar for Solr 3.5?
The onclick handler does not seem to be called on Google Chrome (Ubuntu). Also, I don't seem to receive the email with the confirmation link on registering (I have checked my spam). Regards, Vikram Kamath

2011/12/12 Nagendra Nagarajayya nnagaraja...@transaxtions.com:
Steven: There is an onclick handler that allows you to download the src. BTW, an early access Solr 3.5 with RankingAlgorithm 1.3 (NRT) release is available for download, so please give it a try. Regards, Nagendra Nagarajayya http://solr-ra.tgels.org http://rankingalgorithm.tgels.org

On 12/10/2011 11:18 PM, Steven Ou wrote:
All the links in the download section link to http://solr-ra.tgels.org/# -- Steven Ou | 歐偉凡 *ravn.com* | Chief Technology Officer steve...@gmail.com | +1 909-569-9880

2011/12/11 Nagendra Nagarajayya nnagaraja...@transaxtions.com:
Steven: Not sure why you had problems; #downloads (http://solr-ra.tgels.org/#downloads) should point you to the downloads section showing the different versions available for download. Please share if this is not so (there were downloads yesterday with no problems). Regarding NRT, you can switch between RA and Lucene at the query level or at the config level; in the current version, NRT is in effect with RA, while with Lucene it is not. You can get more information here: http://solr-ra.tgels.org/papers/Solr34_with_RankingAlgorithm13.pdf Solr 3.5 with RankingAlgorithm 1.3 should be available next week. Regards, Nagendra Nagarajayya

On 12/9/2011 4:49 PM, Steven Ou wrote:
Hey Nagendra, I took a look and Solr-RA looks promising, but: I could not figure out how to download it (it seems like all the download links just point to #), and I wasn't looking for another ranking algorithm, so would it be possible for me to use NRT but *not* RA (i.e. just use the normal Lucene library)? -- Steven Ou | 歐偉凡

On Sat, Dec 10, 2011 at 5:13 AM, Nagendra Nagarajayya nnagaraja...@transaxtions.com wrote:
Steven: Please take a look at Solr with RankingAlgorithm. It offers NRT functionality. You can set your autoCommit to about 15 mins. You can get more information here: http://solr-ra.tgels.com/wiki/en/Near_Real_Time_Search_ver_3.x Regards, Nagendra Nagarajayya

On 12/8/2011 9:30 PM, Steven Ou wrote:
Hi guys, I'm looking for NRT functionality or similar in Solr 3.5. Is that possible? From what I understand there's NRT in Solr 4, but I can't figure out whether or not 3.5 can do it as well. If not, is it feasible to use an autoCommit every 1000ms? We don't currently process *that* much data, so I wonder if it's OK to just commit very often. Obviously that is not scalable at a large scale, but is it feasible for a relatively small amount of data? I recently upgraded from Solr 1.4 to 3.5. I had a hard time getting everything working smoothly, and the process ended up taking my site down for a couple of hours, so I am very hesitant to upgrade to Solr 4 if it's not necessary to get some sort of NRT functionality. Can anyone help me? Thanks! -- Steven Ou | 歐偉凡
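For readers wondering what the frequent-commit approach looks like in practice: on Solr 3.x, automatic commits are configured in solrconfig.xml. A minimal sketch, assuming the stock DirectUpdateHandler2 and using Steven's hypothetical one-second interval (note that every commit on 3.x opens and warms a new searcher, which is why very frequent commits get expensive):

<!-- solrconfig.xml: commit pending documents at most every 1000 ms -->
<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <maxDocs>10000</maxDocs>  <!-- or after this many queued docs, whichever comes first -->
    <maxTime>1000</maxTime>   <!-- milliseconds -->
  </autoCommit>
</updateHandler>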
Re: cache monitoring tools?
Hoss, I can't see why network IO is the issue, as the shards and the front-end Solr resided on the same server. I say resided, because I got rid of the front end (which, according to my measurements, was taking at least as much time for merging as it took to find the actual data in the shards) and the shards. Now I have only one shard holding all the data. Filter-cache tuning also helped to reduce the number of evictions to a minimum. Dmitry

On Fri, Dec 9, 2011 at 10:42 PM, Chris Hostetter hossman_luc...@fucit.org wrote:
: The culprit seems to be the merger (frontend) SOLR. Talking to one shard
: directly takes substantially less time (1-2 sec).
...
: facet.limit=50
Your problem most likely has very little to do with your caches at all -- a facet.limit that high requires sending a very large amount of data over the wire, multiplied by the number of shards, multiplied by some constant (I think it's 2 but it might be higher) in order to over-request facet constraint counts from each shard to aggregate them. The dominant factor in the slow speed you are seeing is most likely network IO between the shards. -Hoss

-- Regards, Dmitry Kan
Re: cache monitoring tools?
Paul, have you checked solrmeter and Zabbix? Dmitry

On Fri, Dec 9, 2011 at 11:16 PM, Paul Libbrecht p...@hoplahup.net wrote:
Allow me to chime in and ask a generic question about monitoring tools for people close to developers: are any of the tools mentioned in this thread actually able to show graphs of loads, e.g. cache counts or CPU load, in parallel to a console log or to an HTTP request log? I am working on such a tool currently but I have a bad feeling of reinventing the wheel. Thanks in advance, Paul

On 8 Dec 2011, at 08:53, Dmitry Kan wrote:
Otis, Tomás: thanks for the great links!

2011/12/7 Tomás Fernández Löbbe tomasflo...@gmail.com:
Hi Dimitry, I pointed to the wiki page to enable JMX; then you can use any tool that visualizes JMX stuff, like Zabbix. See http://www.lucidimagination.com/blog/2011/10/02/monitoring-apache-solr-and-lucidworks-with-zabbix/

On Wed, Dec 7, 2011 at 11:49 AM, Dmitry Kan dmitry@gmail.com wrote:
The culprit seems to be the merger (frontend) Solr. Talking to one shard directly takes substantially less time (1-2 sec).

On Wed, Dec 7, 2011 at 4:10 PM, Dmitry Kan dmitry@gmail.com wrote:
Tomás: thanks. The page you gave didn't mention caches specifically; is there more documentation on that? I have used the solrmeter tool; it draws the cache diagrams. Is there a similar tool that would use JMX directly and present the cache usage at runtime?

pravesh: I have increased the size of filterCache, but the search hasn't become any faster, taking almost 9 sec on average :(

name: search
class: org.apache.solr.handler.component.SearchHandler
version: $Revision: 1052938 $
description: Search using components: org.apache.solr.handler.component.QueryComponent, org.apache.solr.handler.component.FacetComponent, org.apache.solr.handler.component.MoreLikeThisComponent, org.apache.solr.handler.component.HighlightComponent, org.apache.solr.handler.component.StatsComponent, org.apache.solr.handler.component.DebugComponent
stats:
  handlerStart : 1323255147351
  requests : 100
  errors : 3
  timeouts : 0
  totalTime : 885438
  avgTimePerRequest : 8854.38
  avgRequestsPerSecond : 0.008789442

the stats (copying fieldValueCache as well here, to show term statistics):

name: fieldValueCache
class: org.apache.solr.search.FastLRUCache
version: 1.0
description: Concurrent LRU Cache(maxSize=1, initialSize=10, minSize=9000, acceptableSize=9500, cleanupThread=false)
stats:
  lookups : 79
  hits : 77
  hitratio : 0.97
  inserts : 1
  evictions : 0
  size : 1
  warmupTime : 0
  cumulative_lookups : 79
  cumulative_hits : 77
  cumulative_hitratio : 0.97
  cumulative_inserts : 1
  cumulative_evictions : 0
  item_shingleContent_trigram : {field=shingleContent_trigram,memSize=326924381,tindexSize=4765394,time=215426,phase1=213868,nTerms=14827061,bigTerms=35,termInstances=114359167,uses=78}

name: filterCache
class: org.apache.solr.search.FastLRUCache
version: 1.0
description: Concurrent LRU Cache(maxSize=153600, initialSize=4096, minSize=138240, acceptableSize=145920, cleanupThread=false)
stats:
  lookups : 1082854
  hits : 940370
  hitratio : 0.86
  inserts : 142486
  evictions : 0
  size : 142486
  warmupTime : 0
  cumulative_lookups : 1082854
  cumulative_hits : 940370
  cumulative_hitratio : 0.86
  cumulative_inserts : 142486
  cumulative_evictions : 0

index size: 3.25 GB

Does anyone have some pointers to where to look and how to optimize for query time?

2011/12/7 Tomás Fernández Löbbe tomasflo...@gmail.com:
Hi Dimitry, cache information is exposed via JMX, so you should be able to monitor that information with any JMX tool. See http://wiki.apache.org/solr/SolrJmx

On Wed, Dec 7, 2011 at 6:19 AM, Dmitry Kan dmitry@gmail.com wrote:
Yes, we do require that much. OK, thanks, I will try increasing the maxSize.

On Wed, Dec 7, 2011 at 10:56 AM, pravesh suyalprav...@yahoo.com wrote:
facet.limit=50
Your facet.limit seems too high. Do you actually require this much? Since there are a lot of evictions from the filterCache, increase its maxSize value to your acceptable limit. Regards, Pravesh

--
View this message in context: http://lucene.472066.n3.nabble.com/cache-monitoring-tools-tp3566645p3566811.html
Sent from the Solr - User mailing list archive at Nabble.com.

-- Regards, Dmitry Kan
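As an aside for readers following this thread: once <jmx/> is enabled in solrconfig.xml, the cache statistics quoted above can be polled programmatically rather than eyeballed in a GUI. A minimal Java sketch follows; the port and the exact MBean ObjectName are assumptions (Solr's JMX names depend on the deployment, so verify them in jconsole first):

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class SolrCacheProbe {
    public static void main(String[] args) throws Exception {
        // Assumed endpoint; enable with -Dcom.sun.management.jmxremote.port=9999 on the Solr JVM.
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(url);
        try {
            MBeanServerConnection mbsc = connector.getMBeanServerConnection();
            // Hypothetical ObjectName; check the real one in jconsole.
            ObjectName filterCache = new ObjectName(
                    "solr:type=filterCache,id=org.apache.solr.search.FastLRUCache");
            // Read the same stats shown on the admin stats page.
            System.out.println("hitratio  = " + mbsc.getAttribute(filterCache, "hitratio"));
            System.out.println("evictions = " + mbsc.getAttribute(filterCache, "evictions"));
        } finally {
            connector.close();
        }
    }
}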
limiting the content of content field in search results
I am developing an application which indexes whole PDFs and other documents into Solr. I have completed a working version of my application, but there are some problems. The main one is that when I do a search, the whole indexed document is shown. I have used SolrJ and need some help to reduce this content. How can I limit the content of the content field in search results, so the display looks like this?

*Grammer1.docx* Blazing – burring Faceted Cluster – to gather Geospatial Replication – coping Distinguish – apart from Flawlessly – perfectly Recipe – method Concentrated inscription Last Modified : 2011-12-11T14:42:27Z

*who.pdf* Who We Are Table of contents 1 Solr Committers (in alphabetical order)fgfgfgfg2 2 Inactive Committers (in alphabetical orde

*version_control.pdf* Solr Version Control System Table of contents 1 Overview.gfgfgfg 2 Web Acce

--
View this message in context: http://lucene.472066.n3.nabble.com/limiting-the-content-of-content-field-in-search-results-tp3578859p3578859.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solr 3.4 problem with words separated by comma without space
Thanks for the answer. Yes, in fact when I look at the debugQuery output, I notice that name and number are never treated as single entries. I have ((text:name text:number) (text:ru) (text:tain) (text:paris)), so name and number are in the same parentheses, but not exactly treated as a phrase, as far as I know, since a phrase would be more like text:"name number". Could you tell me what the difference is between (text:name text:number) and text:"name number"? I'll check autoGeneratePhraseQueries. Best regards, Elisabeth

2011/12/8 Chris Hostetter hossman_luc...@fucit.org:
: If I check in the solr.admin.analyzer, I get the same analysis for the two
: different requests. But it seems, in fact, that the missing space after the
: comma prevents name and number from matching.
Query analysis is only part of the picture ... did you look at the debugQuery output? ... I believe you are seeing the effects of the QueryParser analyzing "name," distinctly from "number" in one case, vs analyzing the entire string "name,number" in the second case, and treating the latter as a phrase query (because one input clause produces multiple tokens). There is a recently added autoGeneratePhraseQueries option that affects this. -Hoss
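For reference, the option Hoss mentions is an attribute on the field type in schema.xml (Solr 3.1 and later). A sketch, with a placeholder analyzer:

<!-- schema.xml: with autoGeneratePhraseQueries="false", a single input
     clause like "name,number" that analyzes to multiple tokens is issued
     as (text:name text:number) rather than the phrase text:"name number". -->
<fieldType name="text" class="solr.TextField"
           autoGeneratePhraseQueries="false" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>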
Re: cache monitoring tools?
Justin, in terms of overhead, have you noticed whether Munin adds much of it when used in production? In terms of the Solr farm: how big is a shard's index (given you have a sharded architecture)? Dmitry

On Sun, Dec 11, 2011 at 6:39 PM, Justin Caratzas justin.carat...@gmail.com wrote:
At my work, we use Munin and Nagios for monitoring and alerts. Munin is great because writing a plugin for it is so simple, and with Solr's statistics handler we can track almost any Solr stat we want. It also comes with included plugins for load, file system stats, processes, etc. http://munin-monitoring.org/ Justin
InvalidTokenOffsetsException in conjunction with highlighting and ICU folding and edgeNgrams
Hi there, when highlighting a field with this definition:

<fieldType name="name" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.ICUTransformFilterFactory" id="Any-Latin"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1"
            catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15" side="front"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.ICUTransformFilterFactory" id="Any-Latin"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1"
            catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15" side="front"/>
  </analyzer>
</fieldType>

containing this string: Mosfellsbær

I get the following exception if that field is among the highlight fields:

SEVERE: org.apache.solr.common.SolrException: org.apache.lucene.search.highlight.InvalidTokenOffsetsException: Token mosfellsbaer exceeds length of provided text sized 11
    at org.apache.solr.highlight.DefaultSolrHighlighter.doHighlightingByHighlighter(DefaultSolrHighlighter.java:497)
    at org.apache.solr.highlight.DefaultSolrHighlighter.doHighlighting(DefaultSolrHighlighter.java:401)
    at org.apache.solr.handler.component.HighlightComponent.process(HighlightComponent.java:131)
    at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:194)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1372)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
    at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:298)
    at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:857)
    at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:588)
    at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:489)
    at java.lang.Thread.run(Thread.java:636)
Caused by: org.apache.lucene.search.highlight.InvalidTokenOffsetsException: Token mosfellsbaer exceeds length of provided text sized 11
    at org.apache.lucene.search.highlight.Highlighter.getBestTextFragments(Highlighter.java:233)
    at org.apache.solr.highlight.DefaultSolrHighlighter.doHighlightingByHighlighter(DefaultSolrHighlighter.java:490)

I tried with Solr 3.4 and 3.5, same error for both. Removing the char filter didn't fix the problem either.
It seems like there is some weird stuff going on when folding the string; it can be seen in the analysis view, too: http://i.imgur.com/6B2Uh.png The end offset remains 11 even after folding and transforming æ to ae, which seems wrong to me. I also stumbled upon https://issues.apache.org/jira/browse/LUCENE-1500, which seems like a similar issue. Is there a workaround for this problem, or is the field configuration wrong?
Question about the Solr cache
When I delete or add data from my application through SolrJ, or import an index through the command nutch solrindex, Solr's caches are not updated unless I restart Solr. Could anyone tell me how to update the Solr caches without restarting, using a shell command? When I recreate the index with Nutch, I need the updated data to appear in Solr. I use java -jar start.jar to run Solr. Thanks!
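No answer appears in this digest, but the standard behaviour is that Solr's caches belong to a searcher, and a commit opens a new searcher with fresh caches; no restart is required. A sketch, assuming the default example URL from java -jar start.jar:

# After external indexing (e.g. nutch solrindex), an explicit commit
# makes the new documents visible and discards the stale caches.
curl 'http://localhost:8983/solr/update?commit=true'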
Re: InvalidTokenOffsetsException in conjunction with highlighting and ICU folding and edgeNgrams
On Mon, Dec 12, 2011 at 5:18 AM, Max nas...@gmail.com wrote:
> The end offset remains 11 even after folding and transforming æ to ae,
> which seems wrong to me.

End offsets refer to the *original text*, so this is correct. What is wrong is the EdgeNGramFilter: see how it turns that 11 into a 12?

> I also stumbled upon https://issues.apache.org/jira/browse/LUCENE-1500
> which seems like a similar issue. Is there a workaround for that problem
> or is the field configuration wrong?

For now, don't use EdgeNGrams.

-- lucidimagination.com
Re: InvalidTokenOffsetsException in conjunction with highlighting and ICU folding and edgeNgrams
On Mon, Dec 12, 2011 at 5:18 AM, Max nas...@gmail.com wrote: It seems like there is some weird stuff going on when folding the string, it can be seen in the analysis view, too: http://i.imgur.com/6B2Uh.png I created a bug here, https://issues.apache.org/jira/browse/LUCENE-3642 Thanks for the screenshot, makes it easy to do a test case here. -- lucidimagination.com
Re: InvalidTokenOffsetsException in conjunction with highlighting and ICU folding and edgeNgrams
Robert, thank you for creating the issue in JIRA. However, I need ngrams on that field – is there an alternative to the EdgeNGramFilterFactory? Thanks!

On Mon, Dec 12, 2011 at 1:25 PM, Robert Muir rcm...@gmail.com wrote:
I created a bug here, https://issues.apache.org/jira/browse/LUCENE-3642 Thanks for the screenshot, makes it easy to do a test case here. -- lucidimagination.com
Re: Setting group.ngroups=true considerably slows down queries
Hi! As far as I know there currently isn't another way. Unfortunately the performance degrades badly when there are a lot of unique groups. I think an issue should be opened to investigate how we can improve this... Question: does Solr have a decent chunk of heap space (-Xmx)? Grouping requires quite some heap space (also without group.ngroups=true). Martijn

On 9 December 2011 23:08, Michael Jakl jakl.mich...@gmail.com wrote:
Hi! On Fri, Dec 9, 2011 at 17:41, Martijn v Groningen martijn.v.gronin...@gmail.com wrote:
> On what field type are you grouping, and what version of Solr are you
> using? Grouping by a string field is faster.

The field is defined as follows:

<field name="signature" type="string" indexed="true" stored="true"/>

Grouping itself is quite fast; only computing the number of groups seems to increase significantly (linearly) with the number of documents. I was hoping for a faster way to compute the total number of distinct documents, or in other terms, the number of distinct values in the signature field. Facets came to mind, but as far as I could see they don't offer a total count of facet values either. I'm using Solr 3.5 (upgraded from Solr 3.4 without reindexing). Thanks, Michael

On 9 December 2011 12:46, Michael Jakl jakl.mich...@gmail.com wrote:
Hi, I'm using the grouping feature of Solr to return a list of unique documents together with a count of the duplicates. Essentially I use Solr's signature algorithm to create the signature field and group on it. To provide good numbers for paging through my result list, I'd like to compute the total number of documents found (= matches) and the number of unique documents (= ngroups). Unfortunately, enabling group.ngroups considerably slows down the query (from 500 ms to 23000 ms for a result list of roughly 30 documents). Is there a faster way to compute the number of groups (or unique values in the signature field) in the search result? My Solr instance currently contains about 50 million documents, and around 10% of them are duplicates. Thank you, Michael

-- Kind regards, Martijn van Groningen
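For context, a sketch of the kind of request under discussion (core URL and field values are assumptions):

# group=true collapses duplicates on the signature field;
# group.ngroups=true additionally returns the number of unique groups,
# which is the expensive part on an index with many distinct signatures.
curl 'http://localhost:8983/solr/select?q=*:*&group=true&group.field=signature&group.ngroups=true&rows=10'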
ExtractingRequestHandler and HTML
I am submitting HTML documents to Solr using the ExtractingRequestHandler. Is it possible to store the contents of the document (including all markup) in a field? Using fmap.content (I am assuming this comes from Tika) stores the extracted text of the document in a field, but not the markup. I want the whole unaltered document. Is this possible? thanks --mike
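No reply appears in this digest. The ExtractingRequestHandler itself only stores what Tika extracts, so one workaround is to read the file in the client and pass the raw markup as a literal field alongside the extraction. This is a sketch under several assumptions: SolrJ 3.x, a stored raw_html field added to the schema, and commons-io on the classpath; the paths and field names are hypothetical:

import java.io.File;
import org.apache.commons.io.FileUtils;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

public class ExtractWithRawHtml {
    public static void main(String[] args) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
        File html = new File("/tmp/page.html");  // hypothetical input document

        ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
        req.addFile(html);                       // Tika still extracts the text
        req.setParam("literal.id", "page-1");
        req.setParam("fmap.content", "text");    // extracted text goes to the text field
        // Store the unaltered markup in a separate stored field.
        req.setParam("literal.raw_html", FileUtils.readFileToString(html, "UTF-8"));

        server.request(req);
        server.commit();
    }
}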
Re: performance of json vs xml?
How are you getting your documents into Solr? Because if you're using SolrJ it's a moot point: a binary format is used. I haven't done any specific comparisons, but I'd be surprised if JSON took longer. And removing a whole operation from your update chain that had to be kept fed and watered is worth the risk of a bit of slowdown. In other words, try it and see <g>...

Best, Erick

On Sun, Dec 11, 2011 at 3:16 PM, Jason Toy jason...@gmail.com wrote:
I'm thinking about modifying my index process to use JSON, because all my docs are originally in JSON anyway. Are there any performance issues if I insert JSON docs instead of XML docs? A colleague recommended that I stay with XML because Solr is highly optimized for XML.
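For anyone who does want to skip the XML conversion, a sketch of a raw JSON add against the Solr 3.x example config (URL and field names are assumptions; /update/json is mapped by default in the example solrconfig.xml):

# One JSON object per document; commit=true makes it visible immediately.
curl 'http://localhost:8983/solr/update/json?commit=true' \
     -H 'Content-type: application/json' \
     --data-binary '[{"id": "doc1", "title": "hello json"}]'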
Re: Setting group.ngroups=true considerably slows down queries
Hi!

On Mon, Dec 12, 2011 at 13:57, Martijn v Groningen martijn.v.gronin...@gmail.com wrote:
> As far as I know there currently isn't another way. Unfortunately the
> performance degrades badly when there are a lot of unique groups. I think
> an issue should be opened to investigate how we can improve this...
> Question: does Solr have a decent chunk of heap space (-Xmx)? Grouping
> requires quite some heap space (also without group.ngroups=true).

Thanks for answering. The server has been given as much memory as the machine can afford (without swapping):

-Xmx21g \
-Xms4g \

Shall I open an issue as a subtask of SOLR-236, even though there is already a performance-related task (SOLR-2205)?

Cheers, Michael
Re: Setting group.ngroups=true considerably slows down queries
I'd not make a subtask under SOLR-236, because it is related to a completely different implementation which was never committed. SOLR-2205 is related to general result grouping and I think should be closed. I'd make a new issue for improving the performance of group.ngroups=true when there are a lot of unique groups.

Martijn
-- Kind regards, Martijn van Groningen
manipulate the results coming back from SOLR? (was: possible to do arithmetic on returned values?)
I'm hoping I just got lost in the shuffle due to posting on a Friday night. Is there a way to change a field's data via some function, e.g. add, subtract, multiply, etc.?

On 12/9/11 4:17 PM, Gabriel Cooper wrote:
Is there a way to manipulate the results coming back from Solr? I have a Solr 3.5 index that contains values in cents (e.g. 100 in the index represents $1.00), and in certain contexts (e.g. CSV export) I'd like to divide that field by 100 to provide a user-friendly in-dollars number. To do this I played around with function queries for a while before realizing they're limited to relevancy scores, and later found DocTransformers in 4.0, whose description sounded right but which don't exist in 3.5. Is there anything else I haven't considered? Thanks for any help, Gabriel Cooper.
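No reply appears in this digest. On 3.5, without DocTransformers, the usual fallback is to do the arithmetic client-side after the query returns. A minimal SolrJ sketch under that assumption; the price_cents field name is hypothetical:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrDocument;

public class CentsToDollars {
    public static void main(String[] args) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
        for (SolrDocument doc : server.query(new SolrQuery("*:*")).getResults()) {
            // The index stores cents; divide by 100 at presentation time.
            long cents = ((Number) doc.getFieldValue("price_cents")).longValue();
            System.out.printf("%s: $%.2f%n", doc.getFieldValue("id"), cents / 100.0);
        }
    }
}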
Re: cache monitoring tools?
Dmitry, the only added stress that Munin puts on each box is the one request per stat per five minutes to our admin stats handler. Given that we get 25 requests per second, this doesn't make much of a difference. We don't have a sharded index (yet), as our index is only 2-3 GB, but we do have slave servers with replicated indexes that handle the queries, while our master handles updates/commits. Justin

Dmitry Kan dmitry@gmail.com writes:
Justin, in terms of overhead, have you noticed whether Munin adds much of it when used in production? In terms of the Solr farm: how big is a shard's index (given you have a sharded architecture)? Dmitry

On Sun, Dec 11, 2011 at 6:39 PM, Justin Caratzas justin.carat...@gmail.com wrote:
At my work, we use Munin and Nagios for monitoring and alerts. Munin is great because writing a plugin for it is so simple, and with Solr's statistics handler we can track almost any Solr stat we want. It also comes with included plugins for load, file system stats, processes, etc. http://munin-monitoring.org/ Justin
Re: RegexQuery performance
On Sat, Dec 10, 2011 at 9:25 PM, Erick Erickson erickerick...@gmail.com wrote:
My off-the-top-of-my-head notion is that you implement a Filter whose job is to emit some special tokens when you find strings like this, which then allows you to search without regexes. For instance, in the example you give, you could index something like... oh... I don't know, ###VER### as well as the normal text of IRAS-A-FPA-3-RDR-IMPS-V6.0. Now, when searching for docs with the pattern you used as an example, you look for ###VER### instead. I guess it all depends on how many regexes you need to allow. This wouldn't work at all if you allow users to put in arbitrary regexes, but if you have a small enough number of patterns you'll allow, something like this could work.

This is a great suggestion. I think the number of users that need this feature, as well as the variety of regexes that would be used, is small enough that it could definitely work. It turns it into a problem of collecting the necessary regexes, plus the UI details. Thanks! --jay
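To make Erick's suggestion concrete, here is a sketch of such a filter against the Lucene 3.x TokenStream API. The marker token, the version-string pattern, and the class name are all made up for illustration, and a small TokenFilterFactory would still be needed to reference it from schema.xml:

import java.io.IOException;
import java.util.regex.Pattern;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

/** Stacks a marker token on any token that looks like a versioned ID. */
public final class VersionMarkerFilter extends TokenFilter {
    // Hypothetical pattern for IDs ending in -V<major>.<minor>, e.g. ...-IMPS-V6.0
    private static final Pattern VERSIONED = Pattern.compile(".*-V\\d+\\.\\d+");
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private final PositionIncrementAttribute posIncAtt = addAttribute(PositionIncrementAttribute.class);
    private State pending;

    public VersionMarkerFilter(TokenStream input) {
        super(input);
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (pending != null) {
            // Second call for a matched token: emit the marker at the same position.
            restoreState(pending);
            pending = null;
            termAtt.setEmpty().append("###VER###");
            posIncAtt.setPositionIncrement(0); // stacked on the original token
            return true;
        }
        if (!input.incrementToken()) {
            return false;
        }
        if (VERSIONED.matcher(termAtt).matches()) {
            pending = captureState(); // remember it; marker is emitted next call
        }
        return true; // the original token passes through unchanged
    }

    @Override
    public void reset() throws IOException {
        super.reset();
        pending = null;
    }
}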
Re: limiting the content of content field in search results
Hi, it sounds like highlighting might be the solution for you. See http://wiki.apache.org/solr/HighlightingParameters *Juan*
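A sketch of the kind of request Juan is pointing at: highlighting returns short fragments of the content field, and fl keeps the full stored text out of the response (handler, field names, and sizes are assumptions):

# Each hit carries up to two ~100-character snippets from the content
# field instead of the whole stored document text.
curl 'http://localhost:8983/solr/select?q=solr&fl=id,title&hl=true&hl.fl=content&hl.snippets=2&hl.fragsize=100'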
Solr Load Testing
Hi, I ran some JMeter load testing on my Solr instance, version 3.5.0, running on Tomcat 6.0.29, using 1000 concurrent users, and the error below is thrown after a certain number of requests. My Solr configuration is basically the default configuration at this time. Has anybody done something similar? Should Solr be able to handle 1000 concurrent users with the default configuration? Any ideas, let me know. Thanks.

12-Dec-2011 15:56:02 org.apache.solr.common.SolrException log
SEVERE: ClientAbortException: java.io.IOException
    at org.apache.catalina.connector.OutputBuffer.doFlush(OutputBuffer.java:319)
    at org.apache.catalina.connector.OutputBuffer.flush(OutputBuffer.java:288)
    at org.apache.catalina.connector.CoyoteOutputStream.flush(CoyoteOutputStream.java:98)
    at sun.nio.cs.StreamEncoder.implFlush(StreamEncoder.java:278)
    at sun.nio.cs.StreamEncoder.flush(StreamEncoder.java:122)
    at java.io.OutputStreamWriter.flush(OutputStreamWriter.java:212)
    at org.apache.solr.common.util.FastWriter.flush(FastWriter.java:115)
    at org.apache.solr.servlet.SolrDispatchFilter.writeResponse(SolrDispatchFilter.java:344)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:265)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
    at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:298)
    at org.apache.coyote.http11.Http11AprProcessor.process(Http11AprProcessor.java:861)
    at org.apache.coyote.http11.Http11AprProtocol$Http11ConnectionHandler.process(Http11AprProtocol.java:579)
    at org.apache.tomcat.util.net.AprEndpoint$Worker.run(AprEndpoint.java:1584)
    at java.lang.Thread.run(Thread.java:619)
Caused by: java.io.IOException
    at org.apache.coyote.http11.InternalAprOutputBuffer.flushBuffer(InternalAprOutputBuffer.java:696)
    at org.apache.coyote.http11.InternalAprOutputBuffer.flush(InternalAprOutputBuffer.java:284)
    at org.apache.coyote.http11.Http11AprProcessor.action(Http11AprProcessor.java:1016)
    at org.apache.coyote.Response.action(Response.java:183)
    at org.apache.catalina.connector.OutputBuffer.doFlush(OutputBuffer.java:314)
    ... 20 more
Re: Virtual Memory very high
On 12/11/2011 4:57 AM, Rohit wrote:
What is the difference between the different DirectoryFactory implementations?

http://lucene.apache.org/java/3_3_0/api/core/org/apache/lucene/store/MMapDirectory.html
http://lucene.apache.org/java/3_3_0/api/core/org/apache/lucene/store/NIOFSDirectory.html
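For reference, the choice between those two is made in solrconfig.xml. A sketch; on 3.x the default StandardDirectoryFactory already picks a sensible implementation per platform, so force one only deliberately, and check that your Solr version ships MMapDirectoryFactory:

<!-- solrconfig.xml: force memory-mapped index access. MMapDirectory maps
     the index files into virtual address space, which is why virtual
     memory looks huge; NIOFSDirectory reads via java.nio positional
     reads instead. -->
<directoryFactory name="DirectoryFactory" class="solr.MMapDirectoryFactory"/>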
Re: MySQL data import
Hi all, any tips on this one? Thanks, Brian Lamb

On Sun, Dec 11, 2011 at 3:54 PM, Brian Lamb brian.l...@journalexperts.com wrote:
Hi all, I have a few questions about how the MySQL data import works. It seems it creates a separate connection for each entity I create. Is there any way to avoid this? By the nature of my schema, I have several multivalued fields. Each one I populate with a separate entity. Is there a better way to do it? For example, could I pull in all the singular data in one sitting and then come back later and populate the multivalued items? An alternate approach in some cases would be to do a GROUP_CONCAT and then populate the multivalued column with some transformation. Is that possible? Lastly, is it possible to use copyField to copy three regular fields into one multiValued field and have all the data show up? Thanks, Brian Lamb
URLDataSource delta import
Hi all, according to http://wiki.apache.org/solr/DataImportHandler#Usage_with_XML.2BAC8-HTTP_Datasource a delta-import is not currently implemented for URLDataSource. I say currently because I've noticed that this documentation is out of date in many places. I wanted to see if this feature has been added yet, or if there are plans to do so. Thanks, Brian Lamb
Possible to configure the fq caching settings on the server?
Is it possible to configure Solr such that the filter query cache setting defaults to fq={!cache=false}?

-- Andrew Lundgren lundg...@familysearch.org
Re: MySQL data import
On Mon, Dec 12, 2011 at 2:24 AM, Brian Lamb brian.l...@journalexperts.com wrote:
> Hi all, I have a few questions about how the MySQL data import works. It
> seems it creates a separate connection for each entity I create. Is there
> any way to avoid this?

Not sure, but I do not think that it is possible. However, from your description below, I think that you are unnecessarily multiplying entities.

> By the nature of my schema, I have several multivalued fields. Each one I
> populate with a separate entity. Is there a better way to do it? For
> example, could I pull in all the singular data in one sitting and then come
> back later and populate the multivalued items?

Not quite sure what you mean. Would it be possible for you to post your schema.xml and the DIH configuration file? Preferably, put these on pastebin.com and send us links. Also, you should obfuscate details like access passwords.

> An alternate approach in some cases would be to do a GROUP_CONCAT and then
> populate the multivalued column with some transformation. Is that possible?
[...]

This is how we have been handling it. A complete description would be long, but here is the gist of it:

* A transformer will be needed. In this case, we found it easiest to use a Java-based transformer. Thus, your entity should include something like:

<entity name="myname" dataSource="mysource"
        transformer="com.mycompany.search.solr.handler.JobsNumericTransformer" ...>
  ...
</entity>

Here, the class name used for the transformer attribute follows the usual Java rules, and the .jar needs to be made available to Solr.

* The SELECT statement for the entity looks something like: select group_concat( myfield SEPARATOR '@||@')... The separator should be something that does not occur in your normal data stream.

* Within the entity, define <field column="myfield"/>

* There are complications involved if NULL values are allowed for the field, in which case you would need to use COALESCE, maybe along with CAST.

* The transformer would look up myfield, split along the separator, and populate the multi-valued field (see the sketch after this message).

This *is* a little complicated, so I would also like to hear about possible alternatives.

Regards, Gora
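To make the outline above concrete, a minimal sketch of such a transformer; the class name, column name, and separator echo the hypothetical ones in Gora's description, and DIH invokes any class exposing this transformRow signature:

import java.util.Arrays;
import java.util.Map;
import java.util.regex.Pattern;

public class GroupConcatSplitTransformer {
    private static final String SEP = "@||@"; // must match the SQL SEPARATOR

    // DIH calls this reflectively for each row produced by the entity's query.
    public Object transformRow(Map<String, Object> row) {
        Object joined = row.get("myfield");
        if (joined != null) {
            // Replace the concatenated string with a List, which DIH
            // maps onto a multiValued Solr field.
            row.put("myfield",
                    Arrays.asList(joined.toString().split(Pattern.quote(SEP))));
        }
        return row;
    }
}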
Re: Trim and copy a solr field
Hi Swapna, you could try using a copyField to a field that uses PatternReplaceFilterFactory:

<fieldType class="solr.TextField" name="path_location">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="(.*)/.*" replacement="$1"/>
  </analyzer>
</fieldType>

The regular expression may not be exactly what you want, but it will give you an idea of how to do it. I'm pretty sure there must be other ways of doing this, but this is the first that comes to my mind. *Juan*

On Mon, Dec 12, 2011 at 4:46 AM, Swapna Vuppala swapna.vupp...@arup.com wrote:
Hi, I have a Solr field that contains the absolute path of the file that is indexed, which will be something like file:/myserver/Folder1/SubFol1/Sub-Fol2/Test.msg. I am interested in indexing the location in a separate field. I was looking for some way to trim the field value at the last occurrence of the character /, so that I get the location value, something like file:/myserver/Folder1/SubFol1/Sub-Fol2, and store it in a new field. Can you please suggest some way to achieve this? Thanks and Regards, Swapna.
Re: MySQL data import
You might want to consider just doing the whole thing in SolrJ with a JDBC connection. When things get complex, it's sometimes more straightforward.

Best, Erick...

P.S. Yes, it's pretty standard to have a single field be the destination for several copyField directives.
Re: Solr Load Testing
Hi, 1000 *concurrent* *queries* is a lot. If your index is small relative to the hardware specs, sure. If not, then tuning may be needed, including maybe Tomcat- and JVM-level tuning. The error below is from Tomcat, not really tied to Solr...

Otis
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/

----- Original Message -----
From: Kissue Kissue kissue...@gmail.com
To: solr-user@lucene.apache.org
Sent: Monday, December 12, 2011 11:43 AM
Subject: Solr Load Testing

Hi, I ran some JMeter load testing on my Solr instance, version 3.5.0, running on Tomcat 6.0.29, using 1000 concurrent users, and the error below is thrown after a certain number of requests. My Solr configuration is basically the default configuration at this time. Has anybody done something similar? Should Solr be able to handle 1000 concurrent users with the default configuration? Any ideas, let me know. Thanks.
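As one concrete example of the Tomcat-level tuning mentioned above, the HTTP connector's thread pool is usually the first thing to check: with the default maxThreads of 200, 1000 truly concurrent requests will queue or abort, and the ClientAbortException itself usually just means the client closed the connection before Solr finished writing the response. A sketch for server.xml; the numbers are illustrative, not recommendations:

<!-- server.xml: raise the connector's concurrency ceiling. This trades
     memory for concurrency; load-test rather than copying these values. -->
<Connector port="8080" protocol="HTTP/1.1"
           maxThreads="1000" acceptCount="200"
           connectionTimeout="20000"/>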
Re: performance of json vs xml?
On Sun, Dec 11, 2011 at 3:16 PM, Jason Toy jason...@gmail.com wrote: I'm thinking about modifying my index process to use json because all my docs are originally in json anyway . Are there any performance issues if I insert json docs instead of xml docs? A colleague recommended to me to stay with xml because solr is highly optimized for xml. I'd make a big bet the JSON parsing is faster than the xml parsing. And you have the cost of converting your docs to XML... If you are too worried, do some testing. I'd simply use JSON. The JSON support should be considered first class - it just came after the XML support. -- - Mark http://www.lucidimagination.com
Re: SmartChineseAnalyzer
: Subject: SmartChineseAnalyzer
: References:
:   CAMye=3oOSfePwDEy4Off89jBTUN=K3G0=btaaxghtxvpc_v...@mail.gmail.com
:   can4yxvdc21zehiio+kkws53d_vrqn8tqc3-0qn8kq31unq7...@mail.gmail.com
:   CAMye=3ot32a02px6yotopkkkmobexw7xpv9sxzc32xkra-u...@mail.gmail.com
: In-Reply-To:
:   CAMye=3ot32a02px6yotopkkkmobexw7xpv9sxzc32xkra-u...@mail.gmail.com

http://people.apache.org/~hossman/#threadhijack
Thread Hijacking on Mailing Lists

When starting a new discussion on a mailing list, please do not reply to an existing message; instead, start a fresh email. Even if you change the subject line of your email, other mail headers still track which thread you replied to, and your question is hidden in that thread and gets less attention. It makes following discussions in the mailing list archives particularly difficult.

-Hoss
Facet on same date field multiple times
I've Googled around a bit and seen this referenced a few times, but cannot seem to get it to work. I have a query that looks like this:

facet=true
facet.date={!key=foo}date
f.foo.facet.date.start=2010-12-12T00:00:00Z
f.foo.facet.date.end=2011-12-12T00:00:00Z
f.foo.facet.date.gap=%2B1DAY

Eventually the goal is to do different ranges on the same field: month by day, day by hour, year by week, something to that effect. But I thought I'd start simple to see if I could get the syntax right, and what I have above doesn't seem to work. I get:

message: Missing required parameter: f.date.facet.date.start (or default: facet.date.start)
description: The request sent by the client was syntactically incorrect (Missing required parameter: f.date.facet.date.start (or default: facet.date.start)).

So it doesn't seem interested in me using the local key. From reading here: http://lucene.472066.n3.nabble.com/Date-Faceting-on-Solr-3-1-td3302499.html#a3309517 it would seem I should be able to do it (see the note at the bottom). I know one option is to copyField the date into a few other spots, and I can use that as a last resort, but maybe this works and I'm just arsing something up...

--
View this message in context: http://lucene.472066.n3.nabble.com/Facet-on-same-date-field-multiple-times-tp3580449p3580449.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: Facet on same date field multiple times
: Eventually the goal is to do different ranges on the same field. Month by
: day. Day by hour. Year by week. Something to that effect. But I thought
: I'd start simple to see if I could get the syntax right and what I have
: above doesn't seem to work.
...
: So it doesn't seem interested in me using the local key. From reading here:
: http://lucene.472066.n3.nabble.com/Date-Faceting-on-Solr-3-1-td3302499.html#a3309517
: it would seem I should be able to do it (see the note at the bottom).

That was me, and I was wrong in that post ... what worked was changing the output key, but using that key to specify the various date (i.e. range) based params has never worked, and I didn't realize that at the time. The work to try and fix this is currently being tracked in this Jira issue; I recently spelled out what I think would be needed to finish it up, but I don't think anyone is actively working on it (if you want to jump in, patches would certainly be welcome)...

https://issues.apache.org/jira/browse/SOLR-1351

-Hoss
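Until SOLR-1351 is fixed, the copyField route mentioned in the original post does work, because the per-field params can then hang off genuinely distinct field names. A sketch (field names and dates are placeholders):

<!-- schema.xml: clone the date into one field per desired granularity -->
<field name="date"         type="date" indexed="true" stored="true"/>
<field name="date_by_day"  type="date" indexed="true" stored="false"/>
<field name="date_by_week" type="date" indexed="true" stored="false"/>
<copyField source="date" dest="date_by_day"/>
<copyField source="date" dest="date_by_week"/>

# query: each clone gets its own range params
facet=true&facet.date=date_by_day&facet.date=date_by_week
&f.date_by_day.facet.date.start=2011-11-12T00:00:00Z
&f.date_by_day.facet.date.end=2011-12-12T00:00:00Z
&f.date_by_day.facet.date.gap=%2B1DAY
&f.date_by_week.facet.date.start=2010-12-12T00:00:00Z
&f.date_by_week.facet.date.end=2011-12-12T00:00:00Z
&f.date_by_week.facet.date.gap=%2B7DAY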
Re: MySQL data import
Thanks all. Erick, is there documentation on doing things with SolrJ and a JDBC connection?

On Mon, Dec 12, 2011 at 1:34 PM, Erick Erickson erickerick...@gmail.com wrote:
You might want to consider just doing the whole thing in SolrJ with a JDBC connection. When things get complex, it's sometimes more straightforward.
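There isn't a dedicated wiki page for this pattern; it is just plain JDBC plus the SolrJ client. A minimal sketch for SolrJ 3.x, under assumed connection details, table, and field names:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class JdbcToSolr {
    public static void main(String[] args) throws Exception {
        Class.forName("com.mysql.jdbc.Driver"); // assumed MySQL driver on the classpath
        SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");
        Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost/mydb", "user", "pass");
        try {
            Statement st = conn.createStatement();
            ResultSet rs = st.executeQuery("SELECT id, title FROM articles");
            List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
            while (rs.next()) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", rs.getLong("id"));
                doc.addField("title", rs.getString("title"));
                batch.add(doc);
                if (batch.size() == 1000) { solr.add(batch); batch.clear(); } // index in batches
            }
            if (!batch.isEmpty()) solr.add(batch);
            solr.commit();
        } finally {
            conn.close();
        }
    }
}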
Re: Possible to configure the fq caching settings on the server?
: Is it possible to configure solr such that the filter query cache
: settings is set to fq={!cache=false} by default?

Well, you could always disable the filterCache -- but I get the impression you want *most* fq filters to not be cached, but sometimes you'll specify some that you *do* want cached? Is that it?

I don't know of any way to do that (or even any way to change Solr easily to make that possible) for *only* the fq params. I was going to suggest that something like this should work as a way to disable caching of all queries unless you explicitly re-enable it...

?cache=false&q={!cache=true}foo&fq=bar&fq={!cache=true}yak

...in which case you could change up your q param so it would default to being cached (and move that cache=false to a default in your solrconfig if you desired)...

?cache=false&q={!cache=true v=$qq}&qq=foo&fq=bar&fq={!cache=true}yak

...but evidently that doesn't work. Apparently cache is only consulted as a local param, and doesn't default up to the other request (or configured default) SolrParams. I'm not sure if that was intentional or an oversight -- but if you'd like to open a Jira requesting that it work, someone could probably look into it (patches welcome!)

-Hoss
Re: sub query parsing bug???
Thanks for the reply! I do believe I have set (or have tried setting) all of those options for the default query and none of them seem to help. Anytime an OR appears inside the query, the default operator for that query becomes OR. At least that's the anecdotal evidence I've encountered. Also, in this case the results do match what the parser is telling me, so I'm not getting the results I expect. As for the second suggestion, the actual fields searched are controlled by the user, so it can get more complicated. But even in the single-field search I do believe I need to use the edismax parser. I have tried the regular query syntax for searching one field and find that it can't handle the more complex queries. Something like ref_expertise:(nonlinear OR soliton) AND "optical lattice" won't return any documents even though there are many that satisfy those requirements. Is there some other way I could be executing this query even in the single field case? Thanks and Thanks in Advance for all help Steve

On Dec 6, 2011, at 8:26 AM, Erick Erickson wrote: Hmmm, does this help? "In Solr 1.4 and prior, you should basically set mm=0 if you want the equivalent of q.op=OR, and mm=100% if you want the equivalent of q.op=AND. In 3.x and trunk the default value of mm is dictated by the q.op param (q.op=AND => mm=100%; q.op=OR => mm=0%). Keep in mind the default operator is affected by your schema.xml <solrQueryParser defaultOperator="xxx"/> entry. In older versions of Solr the default value is 100% (all clauses must match)" (from http://wiki.apache.org/solr/DisMaxQParserPlugin). I don't think you'll see the query parsed as you expect, but the results of the query should be what you expect. Tricky, eh? I'm assuming you've simplified the example for clarity and your qf will be on more than one field when you use it for real, but if not the actual query doesn't need edismax at all. Best Erick

On Mon, Dec 5, 2011 at 10:52 AM, Steve Fuchs st...@aps.org wrote: Hello All, I have my field description listed below, but I don't think it's pertinent, as my issue seems to be with the query parser. I'm currently using an edismax subquery clause to help with my searching, as such: _query_:"{!type=edismax qf='ref_expertise'}\(nonlinear OR soliton\) AND \"optical lattice\"" translates correctly to +(+((ref_expertise:nonlinear) (ref_expertise:soliton)) +(ref_expertise:"optical lattice")) but the users expect the default operator to be AND (it is in all simpler searches); however nothing I can do here gets me that same result as above when the search is: _query_:"{!type=edismax qf='ref_expertise'}\(nonlinear OR soliton\) \"optical lattice\"" this gets converted to: +(((ref_expertise:nonlinear) (ref_expertise:soliton)) (ref_expertise:"optical lattice")) where the "optical lattice" is optional. These produce the same results, trying q.op and mm. Also the default search operator as set in the solrconfig is AND. _query_:"{!type=edismax q.op=AND qf='ref_expertise'}\(nonlinear OR soliton\) \"optical lattice\"" _query_:"{!type=edismax mm=1.0 qf='ref_expertise'}\(nonlinear OR soliton\) \"optical lattice\"" Any ideas???
Thanks In Advance Steven Fuchs

<fieldType name="intl_string" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="25"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
  </analyzer>
</fieldType>
Re: NRT or similar for Solr 3.5?
Yeah, running Chrome on OSX and it doesn't do anything. Just switched to Firefox and it works. *But*, I also don't seem to be receiving the confirmation email. -- Steven Ou | 歐偉凡 *ravn.com* | Chief Technology Officer steve...@gmail.com | +1 909-569-9880
Removing whitespace
Hello, I am having trouble finding how to remove/ignore whitespace when indexing. The only answer I have found suggested that it is necessary to write my own tokenizer. Is this true? I want to remove whitespace and special characters from the phrase and create N-grams from the result. Ultimately, the effect I am after is that searching "bobdole" would match "Bob Dole", "Bo B. Dole", and maybe "Bobdo". Maybe there is a better way... can anyone lend some assistance? Thanks! Dev B
Re: Removing whitespace
That sounds like a strange requirement, but I think you can use CharFilters instead of implementing your own Tokenizer. Take a look at this section, maybe it helps. http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#CharFilterFactories -- Alireza Salimi Java EE Developer
RE: Removing whitespace
Hi Devon, Something like this should work for you (untested!):

<analyzer>
  <!-- Remove non-word characters; only underscores, letters & numbers allowed -->
  <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\W+" replacement=""/>
  <tokenizer class="solr.KeywordTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="2"/>
</analyzer>

Steve
Re: Removing whitespace
(11/12/13 6:51), Devon Baumgarten wrote: [...] How about using one of the existing CharFilters? https://builds.apache.org/job/Solr-3.x/javadoc/org/apache/solr/analysis/PatternReplaceCharFilterFactory.html https://builds.apache.org/job/Solr-3.x/javadoc/org/apache/solr/analysis/MappingCharFilterFactory.html koji -- Check out Query Log Visualizer for Apache Solr http://www.rondhuit-demo.com/loganalyzer/loganalyzer.html http://www.rondhuit.com/en/
RE: Removing whitespace
Thanks Alireza, Steven and Koji for the quick responses! I'll read up on those and give it a shot. Devon Baumgarten
Re: MySQL data import
Here's a quick demo I wrote at one point. I haven't run it in a while, but you should be able to get the idea.

package jdbc;

import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer;
import org.apache.solr.client.solrj.impl.XMLResponseParser;
import org.apache.solr.common.SolrInputDocument;

import java.io.IOException;
import java.sql.*;
import java.util.ArrayList;
import java.util.Collection;

public class Indexer {

    public static void main(String[] args) {
        startIndex("http://localhost:8983/solr");
    }

    private static void startIndex(String url) {
        Connection con = DataSource.getConnection();
        try {
            long start = System.currentTimeMillis();
            // Create a multi-threaded communications channel to the Solr server. Full interface (3.3) at:
            // http://lucene.apache.org/solr/api/org/apache/solr/client/solrj/impl/StreamingUpdateSolrServer.html
            StreamingUpdateSolrServer server = new StreamingUpdateSolrServer(url, 10, 4);

            // You may want to set these timeouts higher, Solr occasionally will have
            // long pauses while segments merge.
            server.setSoTimeout(1000); // socket read timeout
            server.setConnectionTimeout(100);
            //server.setDefaultMaxConnectionsPerHost(100);
            //server.setMaxTotalConnections(100);
            //server.setFollowRedirects(false); // defaults to false
            // allowCompression defaults to false.
            // Server side must support gzip or deflate for this to have any effect.
            //server.setAllowCompression(true);
            server.setMaxRetries(1); // defaults to 0. > 1 not recommended.
            server.setParser(new XMLResponseParser()); // binary parser is used by default

            doDocuments(server, con);
            server.commit(); // Only needs to be done at the end; autocommit or
                             // commitWithin should do the rest.
            long endTime = System.currentTimeMillis();
            System.out.println("Total Time Taken -> " + (endTime - start) + " mils");
        } catch (Exception e) {
            e.printStackTrace();
            String msg = e.getMessage();
            System.out.println(msg);
        }
    }

    private static void doDocuments(StreamingUpdateSolrServer server, Connection con)
            throws SQLException, IOException, SolrServerException {
        Statement st = con.createStatement();
        ResultSet rs = st.executeQuery("select id, title, text from test");

        // SolrInputDocument interface (3.3) at
        // http://lucene.apache.org/solr/api/org/apache/solr/common/SolrInputDocument.html
        Collection<SolrInputDocument> docs = new ArrayList<SolrInputDocument>();
        int total = 0;
        int counter = 0;
        while (rs.next()) {
            // DO NOT move this outside the while loop, or be sure to call doc.clear()
            SolrInputDocument doc = new SolrInputDocument();
            String id = rs.getString("id");
            String title = rs.getString("title");
            String text = rs.getString("text");
            doc.addField("id", id);
            doc.addField("title", title);
            doc.addField("text", text);
            docs.add(doc);
            ++counter;
            ++total;
            if (counter > 1000) { // Completely arbitrary, just batch up more than one document for throughput!
                server.add(docs);
                docs.clear();
                counter = 0;
            }
        }
        if (!docs.isEmpty()) {
            server.add(docs); // flush the final partial batch
        }
        System.out.println("Total " + total + " Docs added successfully");
    }
}

// Trivial class showing connecting to a MySql database server via jdbc...
class DataSource {

    public static Connection getConnection() {
        Connection conn = null;
        try {
            Class.forName("com.mysql.jdbc.Driver").newInstance();
            System.out.println("Driver Loaded..");
            conn = DriverManager.getConnection("jdbc:mysql://172.16.0.169:3306/test?"
                    + "user=testuser&password=test123");
            System.out.println("Connection build..");
        } catch (Exception ex) {
            System.out.println(ex);
        }
        return conn;
    }

    public static void closeConnection(Connection con) {
        try {
            if (con != null) con.close();
        } catch (SQLException e) {
            e.printStackTrace();
        }
    }
}
Re: Images for the DataImportHandler page
: There is some very useful information on the : http://wiki.apache.org/solr/DataImportHandler page about indexing : database contents, but the page contains three images whose links are : broken. The descriptions of those images sound like it would be quite : handy to see them in the page. Could someone please fix the links so the : images are displayed? Images, and all attachments in general, were disabled some time back for all of wiki.apache.org. Pages that still refer/link to old attachments just never got updated after the fact to reflect this. ASF Infra has a policy permitting individual wikis to re-enable attachment support, but doing so would require switching the entire wiki over to a new ACL model, where only people who had been granted explicit access to perform edits would be allowed to do so. My personal opinion is that i'd rather have a low barrier for editing the wiki (ie: register and do a textcha) and live w/o images, rather than have images but a high barrier to editing (ie: register, ask for edit permission from a committer, *and* do textchas). But i'm open to other suggestions... https://wiki.apache.org/general/OurWikiFarm https://wiki.apache.org/general/OurWikiFarm#Attachments -Hoss
Re: server down caused by complex query
: Because our user send very long and complex queries with asterisk and near : operator. : Sometimes near operator exceeds 1,000 and keywords almost include asterisk. : If such query is sent to server, jvm memory is full. (our jvm memory near operator isn't something I know of as a built-in feature of Solr (definitely not Solr 1.4) ... which query parser are you using? what is the value of your maxBooleanClauses setting in solrconfig.xml? that's the mechanism that should help to limit the risk of query explosion if users try to overwhelm the server with really large queries, but for wildcard and prefix queries (ie: using *) even Solr 1.4 implemented those using ConstantScoreQuery instead of using query expansion, so i'm not sure how/why a single query could eat up so much ram. In general, there have been a lot of improvements in memory usage in recent versions of Solr, so i suggest you upgrade to Solr 3.5 -- but beyond that basic advice any other suggestions will require a *lot* more specifics about exactly what your configs look like, the full requests (all params) of queries that are causing you problems, details on your JVM configuration, etc... -Hoss
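For reference, that cap lives in the <query> section of solrconfig.xml; 1024 is the value Solr ships with (adjust to your own risk tolerance):

<!-- solrconfig.xml: maximum number of clauses a BooleanQuery may expand to.
     Queries that exceed it fail fast instead of quietly consuming memory. -->
<maxBooleanClauses>1024</maxBooleanClauses>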
FTP mount crash when crawling with solrj
I have a lot of files in my FTP account, and I use curlftpfs to mount them to a folder and then start indexing them with the solrj API. But after a few minutes something strange happens: the mounted folder becomes inaccessible and crashes, and I cannot unmount it either; the message "device is in use" appears. My solrj code is OK (I tested it with my local files and the result is great), but indexing the mounted folder is my terrible problem. I mention that I have used curlftpfs with CentOS, Fedora and Ubuntu, and the result of crashing is the same. How can I fix this problem? Is the problem with my code? Has somebody ever faced this problem when indexing a mounted folder?
Solr-3.5.0/Nutch-1.4 - SolrDeleteDuplicates fails
Greetings! On the Nutch Tutorial: I can run the following commands with Solr-3.5.0/Nutch-1.4: bin/nutch crawl urls -dir crawl -depth 3 -topN 5 then: bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb -linkdb crawl/linkdb crawl/segments/* successfully. But, if I run: bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 3 -topN 5 It fails with the following messages: SolrIndexer: starting at 2011-12-11 14:01:27 Adding 11 documents SolrIndexer: finished at 2011-12-11 14:01:28, elapsed: 00:00:01 SolrDeleteDuplicates: starting at 2011-12-11 14:01:28 SolrDeleteDuplicates: Solr url: http://localhost:8983/solr/ Exception in thread main java.io.IOException: Job failed! at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1252) at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDuplicates.java:373) at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDuplicates.java:353) at org.apache.nutch.crawl.Crawl.run(Crawl.java:153) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at org.apache.nutch.crawl.Crawl.main(Crawl.java:55) I am running on Ubuntu 10.10 with 12 GB of memory, Java version 1.6.0_26. I can delete the crawl directory and replicate this error consistently. Suggestions? Other than ...use the way that doesn't fail. ;-) I am concerned that a different invocation of Solr failing consistently represents something that may cause trouble elsewhere when least expected. (And hard to isolate as the problem.) Thanks! Hope everyone is having a great weekend! Patrick PS: From the hadoop log (when it fails) if that's helpful: 2011-12-11 15:21:51,436 INFO solr.SolrWriter - Adding 11 documents 2011-12-11 15:21:52,250 INFO solr.SolrIndexer - SolrIndexer: finished at 2011-12-11 15:21:52, elapsed: 00:00:01 2011-12-11 15:21:52,251 INFO solr.SolrDeleteDuplicates - SolrDeleteDuplicates: starting at 2011-12-11 15:21:52 2011-12-11 15:21:52,251 INFO solr.SolrDeleteDuplicates - SolrDeleteDuplicates: Solr url: http://localhost:8983/solr/ 2011-12-11 15:21:52,330 WARN mapred.LocalJobRunner - job_local_0020 java.lang.NullPointerException at org.apache.hadoop.io.Text.encode(Text.java:388) at org.apache.hadoop.io.Text.set(Text.java:178) at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat$1.next(SolrDeleteDuplicates.java:270) at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat$1.next(SolrDeleteDuplicates.java:241) at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:192) at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:176) at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48) at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307) at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177) -- Patrick Durusau patr...@durusau.net Chair, V1 - US TAG to JTC 1/SC 34 Convener, JTC 1/SC 34/WG 3 (Topic Maps) Editor, OpenDocument Format TC (OASIS), Project Editor ISO/IEC 26300 Co-Editor, ISO/IEC 13250-1, 13250-5 (Topic Maps) OASIS Technical Advisory Board (TAB) - member Another Word For It (blog): http://tm.durusau.net Homepage: http://www.durusau.net Twitter: patrickDurusau
highlighting questions
I am trying to figure out how to display search query fields highlighted in HTML. I can enable highlighting in the query, and I think I get the correct response back (see below: I search using 'contents' and the highlighting is marked up with '<strong>' and '</strong>'). However, I can't figure out what to add to the XSLT file to display it in HTML. I think it is a question of defining the appropriate XPath(?), but I am stuck. Can someone point me in the right direction? Thanks in advance! Here is the result I get back:

<?xml version="1.0" encoding="UTF-8"?>
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">20</int>
    <lst name="params">
      <str name="explainOther"/>
      <str name="indent">on</str>
      <str name="hl.simple.pre">'&lt;strong&gt;'</str>
      <str name="hl.fl">*</str>
      <str name="wt"/>
      <str name="hl">on</str>
      <str name="rows">10</str>
      <str name="version">2.2</str>
      <str name="fl"/>
      <str name="start">0</str>
      <str name="q">contents</str>
      <str name="hl.simple.post">'&lt;/strong&gt;'</str>
      <str name="qt"/>
      <str name="fq"/>
    </lst>
  </lst>
  <result name="response" numFound="1" start="0">
    <doc>
      <arr name="content">
        <str>Start with the Table of Contents. See if you can find the topic that you are interested in. Look through the section to see if there is a resource that can help you. If you find one, you may want to attach a Post-it tab so you can find the page later. Write down all of the information that you need to find out more information about the resource: agency name, name of contact person, telephone number, email and website addresses. If you were unable to find a resource that will help you in this resource guide, a good first step would be to call your local Independent Living Center. They will have a good idea of what is available in your area. A second step would be to call or email us at the Rehabilitation Research Center. We have a ROBOT resource specialist who may be able to assist. You can reach Lois Roberts, the “Back On Track …To Success” Mentoring Program Assistant, at 408-793-6426 or email her at lois.robe...@hhs.sccgov.org</str>
      </arr>
      <arr name="doclink">
        <str>robot.pdf#page=11</str>
      </arr>
      <str name="heading1">CHAPTER 1: How to Use This Resource Guide</str>
      <str name="id">1-1</str>
    </doc>
  </result>
  <lst name="highlighting">
    <lst name="1-1">
      <arr name="content">
        <str>Start with the Table of '&lt;strong&gt;'Contents'&lt;/strong&gt;'. See if you can find the topic that you are interested in. Look</str>
      </arr>
    </lst>
  </lst>
</response>
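A minimal XSLT sketch of one way to render that section (untested; it assumes the stylesheet is applied to the whole response document, and the p markup is a placeholder to adapt to your own stylesheet's structure):

<xsl:template match="lst[@name='highlighting']">
  <xsl:for-each select="lst/arr[@name='content']/str">
    <!-- disable-output-escaping lets the embedded <strong> markers
         pass through to the HTML output as real tags -->
    <p><xsl:value-of select="." disable-output-escaping="yes"/></p>
  </xsl:for-each>
</xsl:template>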
Re: server down caused by complex query
Hello, Hoss. We're using ComplexPhraseQueryParser and the maxBooleanClauses setting is 100. I know maxBooleanClauses is so big, but we are an expert-search organization, and queries are very complex and include wildcards, so we need it. Our application receives queries of the form ((A* OR B* OR C*,...) n/2 (X* OR Y* OR Z*,...)) AND (...) from users. Each one is then converted into a solr query like ("A* X*"~2 OR "A* Y*"~2 OR "A* Z*"~2 OR "B* X*"~2 OR ...) AND (...). As you can see, the near expressions are written out repeatedly. I suspect this is inefficient and is why the JVM memory fills up. I think the surround query parser may be our solution, so we are now customizing the surround query parser because it is very limited. Below is our Tomcat setenv...

==
export CATALINA_OPTS="-Xms112640m -Xmx112640m"
export CATALINA_OPTS="$CATALINA_OPTS -Dserver"
export CATALINA_OPTS="$CATALINA_OPTS -Djava.library.path=/usr/local/lib:/usr/local/apr/lib"
export CATALINA_OPTS="$CATALINA_OPTS -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port=9014 -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false"
export CATALINA_OPTS="$CATALINA_OPTS -Dfile.encoding=utf-8"
export CATALINA_OPTS="$CATALINA_OPTS -XX:+UseConcMarkSweepGC"
==

Thanks Jason
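For what it's worth, the stock surround query parser expresses this kind of query with a single distance operator instead of expanding every pairwise phrase; roughly (an untested sketch of the surround syntax, with placeholder terms):

(A* OR B* OR C*) 2N (X* OR Y* OR Z*)

Here N is the unordered-near operator (W is the ordered variant), so each OR group participates in one proximity clause rather than being multiplied out.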
Re: sub query parsing bug???
Well, your query below becomes ref_expertise:(nonlinear OR soliton) AND default_search:"optical lattice". The regular Solr/Lucene query parser should handle pretty much anything you can throw at it. But do be aware that Solr/Lucene syntax is not true boolean logic; you have to think in terms of SHOULD, MUST, MUST_NOT. But this works: q={!type=edismax qf='name'}(nonlinear OR soliton) AND "optical lattice" giving this: +(+((name:nonlinear) (name:soliton)) +(name:"optical lattice")) Best Erick
Re: Reducing heap space consumption for large dictionaries?
Hi, in my index schema I have defined a DictionaryCompoundWordTokenFilterFactory and a HunspellStemFilterFactory. Each FilterFactory has a dictionary with about 100k entries. To avoid an out-of-memory error I have to set the heap space to 128m for 1 index. Is there a way to reduce the memory consumption when parsing the dictionary? I need to create several indexes and 128m for each index is too much. Same problem here - even with an empty index (no data yet) and two fields using Hunspell (pl_PL) I had to increase the heap size to over 2GB for Solr to start at all. Stempel, using the very same dictionary, works fine with 128M. -- Maciej Lisiewski
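For comparison, a minimal Polish field type using Stempel instead of Hunspell might look like this (a sketch, assuming the analysis-stempel contrib jar is on the classpath; field-type name is a placeholder):

<fieldType name="text_pl_stempel" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Stempel: algorithmic stemmer for Polish, much lighter on the heap -->
    <filter class="solr.StempelPolishStemFilterFactory"/>
  </analyzer>
</fieldType>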
Re: Reducing heap space consumption for large dictionaries?
Hi, It's good to hear some feedback on using the Hunspell dictionaries. Lucene's support is pretty new so we're obviously looking to improve it. Could you open a JIRA issue so we can explore whether there are ways to reduce memory consumption? -- Chris Male | Software Developer | DutchWorks | www.dutchworks.nl
RE: Trim and copy a solr field
Hi Juan, Thanks for the reply. I tried using this, but I don't see any effect of the analyzer/filter. I tried copying my Solr field to another field of the type defined below. Then I indexed a couple of documents with the new schema, but I see that both fields have got the same value. I'm looking at the indexed data in Luke. I'm assuming that analyzers process the field value (as specified by various filters etc.) and then store the modified value. Is that true? What else could I be missing here? Thanks and Regards, Swapna.

-----Original Message----- From: Juan Grande [mailto:juan.gra...@gmail.com] Sent: Monday, December 12, 2011 11:50 PM To: solr-user@lucene.apache.org Subject: Re: Trim and copy a solr field

Hi Swapna, You could try using a copyField to a field that uses PatternReplaceFilterFactory:

<fieldType class="solr.TextField" name="path_location">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="(.*)/.*" replacement="$1"/>
  </analyzer>
</fieldType>

The regular expression may not be exactly what you want, but it will give you an idea of how to do it. I'm pretty sure there must be some other ways of doing this, but this is the first that comes to my mind. *Juan*

On Mon, Dec 12, 2011 at 4:46 AM, Swapna Vuppala swapna.vupp...@arup.com wrote: Hi, I have a Solr field that contains the absolute path of the file that is indexed, which will be something like file:/myserver/Folder1/SubFol1/Sub-Fol2/Test.msg. Am interested in indexing the location in a separate field. I was looking for some way to trim the field value from the last occurrence of the char "/", so that I can get the location value, something like file:/myserver/Folder1/SubFol1/Sub-Fol2, and store it in a new field. Can you please suggest some way to achieve this? Thanks and Regards, Swapna.
Generic RemoveDuplicatesTokenFilter
Hi All, Currently, Solr's existing RemoveDuplicatesTokenFilter (http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.RemoveDuplicatesTokenFilterFactory) filters out duplicate tokens that have the same text and sit at the same logical position. In my case, if the same term appears several times in a row, then I need to remove the duplicates and consume only a single occurrence of the term (even if the position increment gap == 1). For example, for the input stream: quick brown brown brown fox jumps jumps over the little little lazy brown dog the output should be: quick brown fox jumps over the little lazy brown dog. To achieve this, I implemented my own version of RemoveDuplicatesTokenFilter with an overridden process() method:

protected Token process(Token t) throws IOException {
    Token nextTok = peek(1);
    if (t != null && nextTok != null) {
        if (t.termText().equalsIgnoreCase(nextTok.termText())) {
            return null;
        }
    }
    return t;
}

The above implementation works as desired and consecutive duplicates are getting removed :) Any advice/feedback on the above implementation :) Regards Pravesh
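To actually use such a filter from schema.xml it also needs a factory; a minimal sketch (assuming Solr 3.x's BaseTokenFilterFactory; the package name and DedupeAdjacentTokenFilter, standing in for Pravesh's filter class, are placeholders):

package com.mycompany.analysis;

import org.apache.lucene.analysis.TokenStream;
import org.apache.solr.analysis.BaseTokenFilterFactory;

// Factory so the filter can be referenced from schema.xml, e.g.
// <filter class="com.mycompany.analysis.DedupeAdjacentTokenFilterFactory"/>
public class DedupeAdjacentTokenFilterFactory extends BaseTokenFilterFactory {
    @Override
    public TokenStream create(TokenStream input) {
        return new DedupeAdjacentTokenFilter(input);
    }
}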
Re: NRT or similar for Solr 3.5?
@Steven .. try some alternate email address (besides google/yahoo) and check your spam. Regards Vikram Kamath