Re: frange not working in query
The default sort is on relevance. I want to give users an option to sort the results by date (latest on top). This works fine for queries which have few results (up to 100), but it brings inaccurate results as soon as the figure reaches the 1000s. I am trying to limit the sorting to the top few results only, hoping that through frange I will be able to define a lower limit on the relevance score and get better results on the date sort. Is there any other way to do this? Hope it's clear. - Amit

On 10-Aug-2011, at 7:52 PM, simon wrote: I meant the frange query, of course

On Wed, Aug 10, 2011 at 10:21 AM, simon mtnes...@gmail.com wrote: Could you tell us what you're trying to achieve with the range query? It's not clear. -Simon

On Wed, Aug 10, 2011 at 5:57 AM, Amit Sawhney sawhney.a...@gmail.com wrote: Hi All, I am trying to sort the results on a unix timestamp using this query:

http://url.com:8983/solr/db/select/?indent=on&version=2.1&q={!frange%20l=0.25}query($qq)&qq=nokia&sort=unix-timestamp%20desc&start=0&rows=10&qt=dismax&wt=dismax&fl=*,score&hl=on&hl.snippets=1

When I run this query, it says 'no field name specified in query and no defaultSearchField defined in schema.xml'. As soon as I remove the frange query and run this, it starts working fine:

http://url.com:8983/solr/db/select/?indent=on&version=2.1&q=nokia&sort=unix-timestamp%20desc&start=0&rows=10&qt=dismax&wt=dismax&fl=*,score&hl=on&hl.snippets=1

Any pointers? Thanks, Amit
Re: Solr 3.3 crashes after ~18 hours?
Hi, googling "hotspot server 19.1-b02" shows that you are not alone with hanging threads and crashes, and not only with solr. Maybe try another Java version? Bernd

On 10.08.2011 17:00, alexander sulz wrote: Okay, with this command it hangs. Also: I managed to get a Thread Dump (attached). regards

On 05.08.2011 15:08, Yonik Seeley wrote: On Fri, Aug 5, 2011 at 7:33 AM, alexander sulz a.s...@digiconcept.net wrote: Usually you get an XML response when doing commits or optimizes; in this case I get nothing in return, but the site ( http://[...]/solr/update?optimize=true ) DOESN'T load forever or anything. It doesn't hang! I just get a blank page / empty response.

Sounds like you are doing it from a browser? Can you try it from the command line? It should give back some sort of response (or hang waiting for a response): curl "http://localhost:8983/solr/update?commit=true" -Yonik http://www.lucidimagination.com

I use the stuff in the example folder; the only changes I made were enabling logging and changing the port to 8985. I'll try getting a thread dump if it happens again! So far it's looking good after having allocated more memory to it.

On 04.08.2011 16:08, Yonik Seeley wrote: On Thu, Aug 4, 2011 at 8:09 AM, alexander sulz a.s...@digiconcept.net wrote: Thank you for the many replies! Like I said, I couldn't find anything in the logs created by solr. I just had a look at /var/logs/messages and there wasn't anything either. What I mean by crash is that the process is still there and http GET pings would return 200, but when I try visiting /solr/admin, I'd get a blank page! The server ignores any incoming updates or commits.

ignores means what? The request hangs? If so, could you get a thread dump? Do queries work (like /solr/select?q=*:*) ?

thus throwing no errors, no 503s.. It's like the server has a blackout and stares blankly into space.

Are you using a different servlet container than what is shipped with solr? If you did start with the solr example server, what jetty configuration changes have you made? -Yonik http://www.lucidimagination.com

--
* Bernd Fehling
Dipl.-Inform. (FH)
Universitätsbibliothek Bielefeld
Universitätsstr. 25
33615 Bielefeld
Tel. +49 521 106-4060
Fax. +49 521 106-4052
bernd.fehl...@uni-bielefeld.de
BASE - Bielefeld Academic Search Engine - www.base-search.net *
Re: How to start troubleshooting a content extraction issue
You can test the standalone content extraction with the tika-app.jar.

Command to output in text format:

java -jar tika-app-0.8.jar --text file_path

For more options:

java -jar tika-app-0.8.jar --help

Use the correct tika-app version jar matching the Solr build. Regards, Jayendra

On Wed, Aug 10, 2011 at 1:53 PM, Tim AtLee timat...@gmail.com wrote: Hello So, I'm a newbie to Solr and Tika and whatnot, so please use simple words for me :P I am running Solr on Tomcat 7 on Windows Server 2008 R2, running as the search engine for a Drupal web site. Up until recently, everything has been fine - searching works, faceting works, etc. Recently a user uploaded a 5 MB xltm file, which seems to be causing Tomcat to spike in CPU usage and eventually error out. When the documents are submitted to be indexed, the Tomcat process spikes up to use 100% of 1 available CPU, with the eventual error in Drupal of Exception occured sending *sites/default/files/nodefiles/533/June 30, 2011.xltm* to Solr 0 Status: Communication Error. I am looking for some help in figuring out where to troubleshoot this. I assume it's this file, but I'd like to be sure - so how can I submit this file for content extraction manually to see what happens? Thanks, Tim
Need help indexing/querying a particular type of hierarchy
Hi all, I have a particular data structure I'm trying to index into a solr document so that I can query and facet it in a particular way, and I can't quite figure out the best way to go about it. One sample object is here: https://gist.github.com/1139065

The part that's tripping me up is the workflows. Each workflow has a name (in this case, digitizationWF and accessionWF). Each workflow is made up of a number of processes, each of which has its own current status. Every time the status of a process within a workflow changes, the object is reindexed.

What I'd like to be able to do is present several hierarchies of facets. In one, the workflow name is the top-level facet, with the second level showing each process, under which is listed each status (completed, waiting, or error) and the number of documents with that status for that process (some values omitted for brevity):

accessionWF (583)
  publish (583)
    completed (574)
    waiting (6)
    error (3)
  shelve (583)
    completed (583)
  etc.

I'd also like to be able to invert that presentation:

accessionWF (583)
  completed (583)
    publish (574)
    shelve (583)
  waiting (6)
    publish (6)
  error (3)
    publish (3)

or even:

completed (583)
  accessionWF (583)
    publish (574)
    shelve (583)
  digitizationWF (583)
    initiate (583)
error (3)
  accessionWF (3)
    shelve (3)
  etc.

I don't think Solr 4.0's pivot/hierarchical facets are what I'm looking for, because the status values are ambiguous when not qualified by the process name -- the object itself has no completed status, only a publish:completed and a shelve:completed that I want to be able to group together into a count/list of objects with completed processes. I also don't think PathHierarchyTokenizerFactory is quite the answer either. What kind of Solr magic, if any, am I looking for here? Thanks in advance for any help or advice. Michael

---
Michael B. Klein
Digitization Workflow Engineer
Stanford University Libraries
Re: strip html from data
I am sorry, but I do not really understand the difference between the indexed and the returned result set. I look at the returned dataset via this command: solr/select/?q=id:533563&terms=true which gives me html tags like these: </b><br />. I also tried to turn on TermsComponent, but it did not change anything: solr/select/?q=id:533563&terms=true. The schema browser does not show any html tags inside the text field, just the indexed words of the one dataset. Is there a way to strip the html tags completely and not index them? If not, how do I retrieve the results without html tags? Thank you for your help.

2011/8/9 Erick Erickson erickerick...@gmail.com: OK, what does not working mean? You never answered Markus' question: Are you looking at the returned result set or what you've actually indexed? Analyzers are not run on the stored data, only on indexed data. If not working means that your returned results contain the markup, then you're confusing indexing and storing. All the analysis chains operate on data sent into the indexing process. But the verbatim data is *stored* prior to (or separate from) indexing. So my assumption is that you see data returned in the document with markup, which is just as it should be, and there's no problem at all. And your actual indexed terms (try looking at the data with TermsComponent, or the admin/schema browser) will NOT have any markup. Perhaps you can back up a bit and describe what's failing vs. what you expect. Best Erick

On Mon, Aug 8, 2011 at 6:50 AM, Merlin Morgenstern merlin.morgenst...@googlemail.com wrote: Unfortunately I still can't get it running.
The code I am using is the following:

<analyzer type="index">
  <charFilter class="solr.HTMLStripCharFilterFactory"/>
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.KeywordMarkerFilterFactory"/>
  <filter class="solr.PorterStemFilterFactory"/>
</analyzer>
<analyzer type="query">
  <charFilter class="solr.HTMLStripCharFilterFactory"/>
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.KeywordMarkerFilterFactory"/>
  <filter class="solr.PorterStemFilterFactory"/>
</analyzer>

I also tried this one:

<types>
  <fieldType name="text" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
    <analyzer>
      <charFilter class="solr.HTMLStripCharFilterFactory"/>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.StandardFilterFactory"/>
    </analyzer>
  </fieldType>
</types>

<field name="text" type="text" indexed="true" stored="true" required="false"/>

None of those worked. I restarted solr after the schema update and reindexed the data. No change, the html tags are still in there. Any other ideas? Maybe this is a bug in solr? I am using solr 3.3.0 on SUSE Linux. Thank you for any help on this.

2011/7/25 Mike Sokolov soko...@ifactory.com: Hmm that looks like it's working fine. I stand corrected. On 07/25/2011 12:24 PM, Markus Jelsma wrote: I've seen that issue too and read comments on the list, yet I've never had trouble with the order; don't know what's going on.
Check this analyzer, i've moved the charFilter to the bottom:

<analyzer type="index">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="false" expand="true"/>
  <filter class="solr.StopFilterFactory" ignoreCase="false" words="stopwords.txt"/>
  <filter class="solr.ASCIIFoldingFilterFactory"/>
  <filter class="solr.SnowballPorterFilterFactory" protected="protwords.txt" language="Dutch"/>
  <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  <charFilter class="solr.HTMLStripCharFilterFactory"/>
</analyzer>

The analysis chain still does its job as i expect for the input: <span>bla bla</span>

Index Analyzer
org.apache.solr.analysis.HTMLStripCharFilterFactory {luceneMatchVersion=LUCENE_34}
text: bla bla
org.apache.solr.analysis.WhitespaceTokenizerFactory
Re: LockObtainFailedException
Hi, When you get this exception with no other error or explanation in the logs, this is almost always because the JVM has run out of memory. Have you checked/profiled your mem usage/GC during the stream operation?

On Thu, Aug 11, 2011 at 3:18 AM, Naveen Gupta nkgiit...@gmail.com wrote: Hi, We are doing streaming update to solr for multiple users. We are getting:

Aug 10, 2011 11:56:55 AM org.apache.solr.common.SolrException log SEVERE: org.apache.lucene.store.LockObtainFailedException: Lock obtain timed out: NativeFSLock@/var/lib/solr/data/index/write.lock at org.apache.lucene.store.Lock.obtain(Lock.java:84) at org.apache.lucene.index.IndexWriter.init(IndexWriter.java:1097) at org.apache.solr.update.SolrIndexWriter.init(SolrIndexWriter.java:83) at org.apache.solr.update.UpdateHandler.createMainIndexWriter(UpdateHandler.java:102) at org.apache.solr.update.DirectUpdateHandler2.openWriter(DirectUpdateHandler2.java:174) at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:222) at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:61) at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:147) at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:77) at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:55) at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129) at org.apache.solr.core.SolrCore.execute(SolrCore.java:1360) at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252) at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235) at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206) at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233) at 
org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191) at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127) at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102) at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109) at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:298) at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:859) at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:588) at org.apache.tomcat.util.net.JIoEndpoint Aug 10, 2011 12:00:16 PM org.apache.solr.common.SolrException log SEVERE: org.apache.lucene.store.LockObtainFailedException: Lock obtain timed out: NativeFSLock@/var/lib/solr/data/index/write.lock at org.apache.lucene.store.Lock.obtain(Lock.java:84) at org.apache.lucene.index.IndexWriter.init(IndexWriter.java:1097) at org.apache.solr.update.SolrIndexWriter.init(SolrIndexWriter.java:83) at org.apache.solr.update.UpdateHandler.createMainIndexWriter(UpdateHandler.java:102) at org.apache.solr.update.DirectUpdateHandler2.openWriter(DirectUpdateHandler2.java:174) at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:222) at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:61) at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:147) at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:77) at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:55) at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129) at org.apache.solr.core.SolrCore.execute(SolrCore.java:1360) at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252) at 
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235) at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206) at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233) at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191) at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127) at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102) at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109) at
Re: strip html from data
Is there a way to strip the html tags completely and not index them? If not, how do I retrieve the results without html tags?

How do you push documents to solr? You need to strip html tags before the analysis chain. For example, if you are using the Data Import Handler, you can use HTMLStripTransformer. http://wiki.apache.org/solr/DataImportHandler#HTMLStripTransformer
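If the tags should be kept out of the stored copy as well (so that search results come back clean), another option is to strip them on the client before the document is ever posted to Solr. A minimal sketch using Python's standard-library html.parser — a generic client-side approach, not a Solr feature:

```python
from html.parser import HTMLParser

class TagStripper(HTMLParser):
    """Keeps only text content, dropping tags such as </b> and <br />."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called only for text between tags; tags themselves are skipped.
        self.parts.append(data)

def strip_tags(html):
    stripper = TagStripper()
    stripper.feed(html)
    stripper.close()
    return "".join(stripper.parts)

print(strip_tags("a <b>bold</b> word<br />next line"))  # -> a bold wordnext line
```

Note that adjacent text nodes are joined without whitespace; append a separator in handle_data if tags should act as word boundaries.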
Re: how to change default response fromat as json in solr configuration?
You can set default="true" in solrconfig.xml on the JSON response writer, like this:

<queryResponseWriter name="json" default="true" class="solr.JSONResponseWriter"/>

Or you can add <str name="wt">json</str> to any request handler definition. Erik

On Aug 11, 2011, at 07:36, nagarjuna wrote: Hi everybody, whenever I enter a search term in solr I get the response in XML format (the default). I can change that response by adding wt=json to the url, but instead of that I need to change the default format from XML to JSON. How can I do that? Please help me. Thanks in advance... -- View this message in context: http://lucene.472066.n3.nabble.com/how-to-change-default-response-fromat-as-json-in-solr-configuration-tp3245629p3245629.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: LockObtainFailedException
Yes, this was happening because of the JVM heap size. But the real issue is that as our index grows (very large), indexing time gets very long (using streaming). Earlier, indexing 15,000 docs at a time (commit after 15,000 docs) was taking 3 mins 20 secs; after deleting the index data, it takes 9 secs. What would be the right approach to get better indexing performance while also keeping the index size in check? The index size was around 4.5 GB. Thanks Naveen

On Thu, Aug 11, 2011 at 3:47 PM, Peter Sturge peter.stu...@gmail.com wrote: Hi, When you get this exception with no other error or explananation in the logs, this is almost always because the JVM has run out of memory. Have you checked/profiled your mem usage/GC during the stream operation?

On Thu, Aug 11, 2011 at 3:18 AM, Naveen Gupta nkgiit...@gmail.com wrote: Hi, We are doing streaming update to solr for multiple user, We are getting Aug 10, 2011 11:56:55 AM org.apache.solr.common.SolrException log SEVERE: org.apache.lucene.store.LockObtainFailedException: Lock obtain timed out: NativeFSLock@/var/lib/solr/data/index/write.lock [...]
RE: Building a facet query in SolrJ
Thanks! I actually found a page online that explained this. -Rich

-Original Message- From: Chris Hostetter [mailto:hossman_luc...@fucit.org] Sent: Wednesday, August 10, 2011 4:01 PM To: solr-user@lucene.apache.org Cc: Simon, Richard T Subject: RE: Building a facet query in SolrJ

: query.addFacetQuery(MyField + ":" + "\"" + uri + "\"");
...
: But when I examine queryResponse.getFacetFields, it's an empty list,

facet.query constraints+counts do not come back in the facet.field section of the response. They come back in the facet.query section of the response (look at the XML in your browser and you'll see what I mean)... https://lucene.apache.org/solr/api/org/apache/solr/client/solrj/response/QueryResponse.html#getFacetQuery%28%29 -Hoss
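The distinction Hoss points out is visible in the response layout itself: facet.query counts come back under facet_queries, not facet_fields. A small illustration using a hand-built dict shaped like Solr's JSON response (the field name and count are invented for the example):

```python
# Shape of the facet_counts section when only facet.query was requested.
response = {
    "facet_counts": {
        "facet_queries": {'MyField:"some-uri"': 42},  # invented count
        "facet_fields": {},  # empty: no facet.field in the request
    }
}

facets = response["facet_counts"]
print(facets["facet_fields"])   # {} -- why getFacetFields() came back empty
print(facets["facet_queries"])  # the facet.query constraints + counts live here
```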
RE: Hudson build issues
Hi arian487, You apparently are not using the official Ant build? (Maven is officially unsupported.) The scripts used by the Lucene and Solr Jenkins builds at the ASF are available here: http://svn.apache.org/repos/asf/lucene/dev/nightly/ The ASF Jenkins jobs checkout the above directory in addition to the Lucene/Solr branch/trunk to be tested, and then invoke the appropriate script from the above directory. There are Maven build scripts there - the artifact you're looking for is installed in the local repository by calling the equivalent of: mvn -N -Pbootstrap install When the Maven jobs run under ASF Jenkins, the results are published nightly. More details here: http://wiki.apache.org/solr/NightlyBuilds Steve -Original Message- From: arian487 [mailto:akarb...@tagged.com] Sent: Wednesday, August 10, 2011 9:54 PM To: solr-user@lucene.apache.org Subject: Hudson build issues Whenever I try to build this on our hudson server it says it can't find org.apache.lucene:lucene-xercesImpl:jar:4.0-SNAPSHOT. Is the Apache repo lacking this artifact? -- View this message in context: http://lucene.472066.n3.nabble.com/Hudson- build-issues-tp3244563p3244563.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: LockObtainFailedException
Optimizing indexing time is a very different question. I'm guessing your 3mins+ time you refer to is the commit time. There are a whole host of things to take into account regarding indexing, like: number of segments, schema, how many fields, storing fields, omitting norms, caching, autowarming, search activity etc. - the list goes on... The trouble is, you can look at 100 different Solr installations with slow indexing, and find 200 different reasons why each is slow. The best place to start is to get a full understanding of precisely how your data is being stored in the index, starting with adding docs, going through your schema, Lucene segments, solrconfig.xml etc, looking at caches, commit triggers etc. - really getting to know how each step is affecting performance. Once you really have a handle on all the indexing steps, you'll be able to spot the bottlenecks that relate to your particular environment. An index of 4.5GB isn't that big (but the number of documents tends to have more of an effect than the physical size), so the bottleneck(s) should be findable once you trace through the indexing operations. On Thu, Aug 11, 2011 at 1:02 PM, Naveen Gupta nkgiit...@gmail.com wrote: Yes this was happening because of JVM heap size But the real issue is that if our index size is growing (very high) then indexing time is taking very long (using streaming) earlier for indexing 15,000 docs at a time (commit after 15000 docs) , it was taking 3 mins 20 secs time, after deleting the index data, it is taking 9 secs What would be approach to have better indexing performance as well as index size should also at the same time. The index size was around 4.5 GB Thanks Naveen On Thu, Aug 11, 2011 at 3:47 PM, Peter Sturge peter.stu...@gmail.comwrote: Hi, When you get this exception with no other error or explananation in the logs, this is almost always because the JVM has run out of memory. Have you checked/profiled your mem usage/GC during the stream operation? 
On Thu, Aug 11, 2011 at 3:18 AM, Naveen Gupta nkgiit...@gmail.com wrote: Hi, We are doing streaming update to solr for multiple user, We are getting Aug 10, 2011 11:56:55 AM org.apache.solr.common.SolrException log SEVERE: org.apache.lucene.store.LockObtainFailedException: Lock obtain timed out: NativeFSLock@/var/lib/solr/data/index/write.lock [...]
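As a concrete starting point for the tuning Peter describes, these are the indexing-related knobs in a Solr 3.x solrconfig.xml that are usually examined first; the values shown are illustrative defaults, not recommendations:

```xml
<indexDefaults>
  <!-- RAM Lucene may buffer before flushing a segment to disk;
       larger values mean fewer, larger flushes during bulk indexing -->
  <ramBufferSizeMB>32</ramBufferSizeMB>
  <!-- Segments that accumulate before a merge is triggered; higher
       values favor indexing speed over search speed -->
  <mergeFactor>10</mergeFactor>
  <!-- The native file lock behind the NativeFSLock in the error above -->
  <lockType>native</lockType>
</indexDefaults>
```

Commit frequency (the autoCommit block) and cache autowarming settings in the same file are the other usual suspects when commits take minutes.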
Re: frange not working in query
On Wed, Aug 10, 2011 at 5:57 AM, Amit Sawhney sawhney.a...@gmail.com wrote: Hi All, I am trying to sort the results on a unix timestamp using this query:

http://url.com:8983/solr/db/select/?indent=on&version=2.1&q={!frange%20l=0.25}query($qq)&qq=nokia&sort=unix-timestamp%20desc&start=0&rows=10&qt=dismax&wt=dismax&fl=*,score&hl=on&hl.snippets=1

When I run this query, it says 'no field name specified in query and no defaultSearchField defined in schema.xml'

The default query type for embedded queries is lucene, so your qq=nokia is equivalent to qq={!lucene}nokia. So one way is to explicitly make it dismax: qq={!dismax}nokia

Another way is to declare the sub-query to be of type dismax: q={!frange l=0.25}query({!dismax v=$qq})&qq=nokia

-Yonik http://www.lucidimagination.com

As soon as I remove the frange query and run this, it starts working fine:

http://url.com:8983/solr/db/select/?indent=on&version=2.1&q=nokia&sort=unix-timestamp%20desc&start=0&rows=10&qt=dismax&wt=dismax&fl=*,score&hl=on&hl.snippets=1

Any pointers? Thanks, Amit
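For reference, the corrected request can be assembled with properly separated and URL-encoded parameters. A small Python sketch using the host, field, and parameter names from the thread (this only builds the URL; a running Solr instance would be needed to actually issue it):

```python
from urllib.parse import urlencode

# Parameters from the thread: frange keeps only docs whose relevance
# score from the $qq sub-query is >= 0.25, then the survivors are
# sorted by the unix-timestamp field.
params = {
    "q": "{!frange l=0.25}query($qq)",
    "qq": "{!dismax}nokia",  # sub-query type made explicit, per Yonik's fix
    "sort": "unix-timestamp desc",
    "start": 0,
    "rows": 10,
    "fl": "*,score",
}

url = "http://url.com:8983/solr/db/select/?" + urlencode(params)
print(url)
```

urlencode also takes care of escaping the local-params braces in {!dismax}, which would otherwise need manual percent-encoding.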
Re: Need help indexing/querying a particular type of hierarchy
Hi, Can you keep your hierarchy flat in SOLR and then use filter queries (fq=wf:accessionWF) inside your facet queries (facet.field=status)? Or is the requirement to have one single facet query producing the hierarchical facet counts?

On Thu, Aug 11, 2011 at 10:43 AM, Michael B. Klein mbkl...@gmail.com wrote: Hi all, I have a particular data structure I'm trying to index into a solr document so that I can query and facet it in a particular way, and I can't quite figure out the best way to go about it. One sample object is here: https://gist.github.com/1139065 [...]

-- Regards, Dmitry Kan
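One common workaround for the ambiguity Michael describes is to keep the hierarchy flat, as Dmitry suggests, and index each workflow/process/status triple as fully qualified facet values in a multi-valued field, once per ordering you want to drill down by. A Python sketch of the value generation (the helper and field layout are hypothetical; the sample data is a stripped-down version of the gist):

```python
# Generate qualified facet values so a bare status like "completed"
# is never ambiguous: each value carries its full context.
workflows = {
    "accessionWF": {"publish": "completed", "shelve": "completed"},
    "digitizationWF": {"initiate": "completed"},
}

def facet_values(workflows):
    values = []
    for wf, processes in workflows.items():
        for proc, status in processes.items():
            values.append(f"{wf}:{proc}:{status}")   # workflow -> process -> status
            values.append(f"{wf}:{status}:{proc}")   # workflow -> status -> process
            values.append(f"{status}:{wf}:{proc}")   # status -> workflow -> process
    return values

vals = facet_values(workflows)
print(vals)
```

Faceting on such a field with facet.prefix (e.g. facet.prefix=completed: or facet.prefix=accessionWF:error:) then returns the counts for exactly one branch, which lets each of the three presentations above be rendered level by level without pivot facets.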
Re: Solr 3.3 crashes after ~18 hours?
I know it seems like my problem may not be the same as the original poster's, but in investigating this, I did find this Jetty issue that may be related: http://jira.codehaus.org/browse/JETTY-1377 Stephen Duncan Jr www.stephenduncanjr.com

On Thu, Aug 4, 2011 at 1:54 PM, Stephen Duncan Jr stephen.dun...@gmail.com wrote: On Thu, Aug 4, 2011 at 10:08 AM, Yonik Seeley yo...@lucidimagination.com wrote: ignores means what? The request hangs? If so, could you get a thread dump? Do queries work (like /solr/select?q=*:*) ? thus throwing no errors, no 503s.. It's like the server has a blackout and stares blankly into space. Are you using a different servlet container than what is shipped with solr? If you did start with the solr example server, what jetty configuration changes have you made? -Yonik http://www.lucidimagination.com

We're seeing something similar here. Not sure exactly what the circumstances are, but occasionally our Solr 3.3 test instance hangs; nothing seems to be happening for several minutes. It does seem to happen while data is being added and continuous queries are being sent. It may also be related to an optimize happening (we attempt to optimize after adding all the new data from our database).
The last log message is:

2011-08-04 13:46:56,418 [qtp30604342-451] INFO org.apache.solr.core.SolrCore - [report] webapp= path=/update params={optimize=true&waitSearcher=true&maxSegments=1&waitFlush=true&wt=javabin&version=2} status=0 QTime=109109

Here is our thread dump:

2011-08-04 13:47:16
Full thread dump Java HotSpot(TM) Client VM (20.1-b02 mixed mode):

"RMI TCP Connection(13)-172.16.10.102" daemon prio=6 tid=0x47a4a400 nid=0x1384 runnable [0x4861f000]
   java.lang.Thread.State: RUNNABLE
    at java.net.SocketInputStream.socketRead0(Native Method)
    at java.net.SocketInputStream.read(SocketInputStream.java:129)
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
    - locked <0x183a55a0> (a java.io.BufferedInputStream)
    at java.io.FilterInputStream.read(FilterInputStream.java:66)
    at sun.rmi.transport.tcp.TCPTransport.handleMessages(TCPTransport.java:517)
    at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(TCPTransport.java:790)
    at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(TCPTransport.java:649)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:662)

   Locked ownable synchronizers:
    - <0x183a7c68> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)

"qtp30604342-451" prio=6 tid=0x475c4800 nid=0x1a58 waiting on condition [0x4897f000]
   java.lang.Thread.State: TIMED_WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for <0x18214c08> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
    at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:198)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2025)
    at org.eclipse.jetty.util.BlockingArrayQueue.poll(BlockingArrayQueue.java:320)
    at org.eclipse.jetty.util.thread.QueuedThreadPool.idleJobPoll(QueuedThreadPool.java:512)
    at org.eclipse.jetty.util.thread.QueuedThreadPool.access$600(QueuedThreadPool.java:38)
    at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:558)
    at java.lang.Thread.run(Thread.java:662)

   Locked ownable synchronizers:
    - None

"qtp30604342-450" prio=6 tid=0x47ad1c00 nid=0x1ca4 waiting on condition [0x49d2f000]
   java.lang.Thread.State: TIMED_WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for <0x18214c08> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
    at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:198)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2025)
    at org.eclipse.jetty.util.BlockingArrayQueue.poll(BlockingArrayQueue.java:320)
    at org.eclipse.jetty.util.thread.QueuedThreadPool.idleJobPoll(QueuedThreadPool.java:512)
    at org.eclipse.jetty.util.thread.QueuedThreadPool.access$600(QueuedThreadPool.java:38)
    at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:558)
    at java.lang.Thread.run(Thread.java:662)

   Locked ownable synchronizers:
    - None

"qtp30604342-449" prio=6 tid=0x47a57c00 nid=0xb2c waiting on condition [0x49c2f000]
   java.lang.Thread.State: TIMED_WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    -
NRT in Master- Slave setup, crazy?
Thinking aloud, and grateful for your sparing the time .. I need to support a high commit rate (low update latency) in a master-slave setup, and I have a bad feeling about it, even with warmup disabled and everything stripped down that slows down a refresh. I will try it anyway, but I have started thinking about a backup plan, like NRT on the slaves. The idea is to have the master working on disk, doing commits in a throughput-friendly manner (e.g. every 5-10 minutes), but to let the slaves apply the same updates with softCommit. I am basically going to let the slaves possibly run out of sync with the master, by issuing the same updates on all slaves with softCommit ... every now and then syncing with the master. Could this work? The trick is, the index is big (it can fit in ca. 16-20G RAM), but the update rate is small and unevenly distributed in time (a few documents every couple of seconds); one hard commit on the master plus a slave update would probably cost much more than an add(document) with softCommit on every slave (2-5 of them). So all in all, the master remains the real master and is there to ensure: a) seeding if a slave restarts, b) an authoritative index, if the slaves run out of sync (a small diff is OK if it gets corrected once a day). In general, do you find this idea wrong for some reason? Should I be doing something else/better to achieve low update latency in master-slave (for low update throughput)? Is there anything I can do to improve standard master-slave latency apart from disabling warmup? Would putting the index on an OS ramdisk (tmpfs, forced in RAM) on the slaves bring much? I am talking about a target of ca. 1 second (plus/minus) update latency from update to search on a slave... but not more than 0.5-2 updates every second. And from what I have understood so far about how Solr works, this is only going to be possible with NRT on the slaves (analysis in my case is fast, so not an issue)...
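For reference, the slave-side soft-commit idea described above might look like this on a trunk/4.x build with NRT support (a sketch; the intervals are assumptions, not from the original message):

```xml
<!-- solrconfig.xml on each slave: hard commits stay rare and cheap,
     soft commits make updates searchable quickly (values are illustrative). -->
<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <maxTime>600000</maxTime>          <!-- hard commit every 10 minutes -->
    <openSearcher>false</openSearcher> <!-- don't pay for a new searcher here -->
  </autoCommit>
  <autoSoftCommit>
    <maxTime>1000</maxTime>            <!-- soft commit roughly once per second -->
  </autoSoftCommit>
</updateHandler>
```

Alternatively, updates can request it per call with /update?softCommit=true instead of configuring autoSoftCommit.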
SolR : Spellchecking Autocomplete
Hello, I posted on the Lucene forums, and someone told me to e-mail it here. Instead of writing my question again here, I take the liberty of linking my post. It's about SolR, autocompletion, spellchecking and case-sensitiveness (?). http://lucene.472066.n3.nabble.com/SolR-Spellchecking-amp-Autocomplete-td3243107.html Thanks for all, Valentin
Re: Solr 3.3: DIH configuration for Oracle
On 8/10/2011 2:52 PM, Eugeny Balakhonov wrote: java.lang.IllegalArgumentException: deltaQuery has no column to resolve to declared primary key pk='T1_ID_RECORD, T2_ID_RECORD' I have analyzed the source code of DIH. I found that in the DocBuilder class, the collectDelta() method treats the value of the entity's pk attribute as a simple string. But in my case this is an array with two values: T1_ID_RECORD, T2_ID_RECORD Whatever you declare as the DIH primary key must exist as a field name in the result set, or Solr will complain. I had a perfectly working config in 1.4.1, with identical text in query and deltaImportQuery. It didn't work when I tried to upgrade to 3.1. The problem was that I was using a deltaQuery that just returned MAX(did), to tell Solr that something needed to be done. I had to add 'AS did' to the deltaQuery so that it matched my primary key. I am controlling the delta-import from outside Solr, so I do not need to use the result set from deltaQuery. The point is to pick something that will exist in all of your result sets. You might need to include an 'AS xxx' (with something you choose for xxx) in your queries and use the xxx value as your pk. Because you have only provided a simple example, I can't really tell you what you should use. The pk value is only used to coordinate your queries. It only has meaning in the DIH, not the Solr index. Uniqueness in the Solr index is controlled by the uniqueKey value in schema.xml. In my case, pk and uniqueKey are not the same field. Side note: I'm not much of an expert, so I can't guarantee I can help further. I will give it a try, though. Thanks, Shawn
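As a sketch of the aliasing trick Shawn describes (the entity, table and column names here are made up, not Eugeny's actual config):

```xml
<!-- data-config.xml: alias the changed-row id so the declared pk name
     appears as a column in every result set (all names hypothetical). -->
<entity name="item" pk="did"
        query="SELECT id AS did, title FROM docs"
        deltaImportQuery="SELECT id AS did, title FROM docs
                          WHERE id = '${dataimporter.delta.did}'"
        deltaQuery="SELECT id AS did FROM docs
                    WHERE last_modified > '${dataimporter.last_index_time}'"/>
```

The key point is that deltaQuery, deltaImportQuery and query all expose a column named exactly like the pk attribute.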
copyfields in schema.xml
Hi all. If in schema.xml we put something like:

<field name="title" type="string" indexed="false" stored="false" multiValued="true"/>
<field name="titulo" type="string" indexed="true" stored="true" multiValued="true"/>
<field name="text" type="text_general" indexed="true" stored="false" multiValued="true"/>

<copyField source="title" dest="titulo"/>
<copyField source="titulo" dest="text"/>

can I expect that the 'text' field will contain both the 'title' and the 'titulo' contents? Thanks ;) Note: in our app, the titles refer to books that can be named in several different ways. --- Rode González No viruses found in this message. Checked by AVG - www.avg.com Version: 10.0.1392 / Virus database: 1520/3826 - Release date: 08/10/11
RE: copyfields in schema.xml
Nope. The 'text' field will just have the 'titulo' contents. To have both, you would have to do something like this:

<copyField source="title" dest="titulo"/>
<copyField source="title" dest="text"/>
<copyField source="titulo" dest="text"/>

-Michael
RE: Hudson build issues
I downloaded the official build (4.0) and I've been customizing it for my needs. I'm not really sure how to use these scripts. Is there somewhere in Hudson where I can apply these scripts or something? -- View this message in context: http://lucene.472066.n3.nabble.com/Hudson-build-issues-tp3244563p3246645.html Sent from the Solr - User mailing list archive at Nabble.com.
need some guidance about how to configure a specific solr solution.
Hi there, I work in IT on a project based on Liferay 605 with solr-3.2 as the indexer/search engine. I presently have only one server that does both indexing and searching, but the Liferay Support suggestions point to the need of having: - 2 to n SOLR read-servers for searching from any member of the Liferay cluster - 1 SOLR write-server where all Liferay cluster members write. Going down to the detail of implementing that on the Liferay side, I think I know how to do it: the Solr plugin has a solr-spring.xml in the WEB-INF/classes/META-INF folder. Open this file in a text editor and you will see that there are two entries which define where the Solr server can be found by Liferay:

<bean id="indexSearcher" class="com.liferay.portal.search.solr.SolrIndexSearcherImpl">
  <property name="serverURL" value="http://localhost:8080/solr/select" />
</bean>
<bean id="indexWriter" class="com.liferay.portal.search.solr.SolrIndexWriterImpl">
  <property name="serverURL" value="http://localhost:8080/solr/update" />
</bean>

However, I don't know how to replicate the writer Solr server's content to the readers. Can you please provide advice about that? Thanks, Pablo This e-mail may contain confidential and/or privileged information for the sole use of the intended recipient. Any review or distribution by anyone other than the person for whom it was originally intended is strictly prohibited. If you have received this e-mail in error, please contact the sender and delete all copies. Opinions, conclusions or other information contained in this e-mail may not be that of the organization.
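Replicating the write server's index to the read servers is usually done with Solr's built-in HTTP replication handler. A minimal sketch (host name and poll interval are assumptions):

```xml
<!-- solrconfig.xml on the write (master) server -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <str name="replicateAfter">commit</str>
    <str name="confFiles">schema.xml,stopwords.txt</str>
  </lst>
</requestHandler>

<!-- solrconfig.xml on each read (slave) server -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <str name="masterUrl">http://writer-host:8080/solr/replication</str>
    <str name="pollInterval">00:00:60</str>
  </lst>
</requestHandler>
```

Each slave then polls the master and pulls new index segments after every commit; the Liferay beans keep pointing reads at the slaves and writes at the master.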
Searching For Term 'OR'
Hello, I am looking for some advice on how to index and search a field that contains a two-character state name without the query parser dying on OR and also not treating it as an 'OR' Boolean operator. For example, the following query with a filter query key/value pair causes an exception:

q=*:*&fq=(state:OR)

Caused by: org.apache.lucene.queryParser.ParseException: Encountered "OR" at line 1, column 7. Was expecting one of: "(" ... "*" ... <QUOTED> ... <TERM> ... <PREFIXTERM> ... <WILDTERM> ... "[" ... "{" ... <NUMBER> ...

Note: we had the same issue with Indiana (IN), but removing that stop word fixed it. Removing the stopword 'or' has not helped. The field itself is indexed and stored as a string field during indexing:

<field name="state" type="string" indexed="true" stored="true"/>

Thanks in advance, John Brewer
Re: Searching For Term 'OR'
I guess this is because the Lucene QP is interpreting the 'OR' operator. You can either use lowercase, or use another query parser, like the term query parser. See http://lucene.apache.org/solr/api/org/apache/solr/search/TermQParserPlugin.html Also, if you just removed the or term from the stopwords, you'll probably have to reindex if you want it in the index. Regards, Tomás On Thu, Aug 11, 2011 at 2:38 PM, John Brewer john.bre...@atozdatabases.com wrote: Hello, I am looking for some advice on how to index and search a field that contains a two-character state name without the query parser dying on OR and also not treating it as an 'OR' Boolean operator. For example, the following query with a filter query key/value pair causes an exception: q=*:*&fq=(state:OR) Caused by: org.apache.lucene.queryParser.ParseException: Encountered "OR" at line 1, column 7. Was expecting one of: "(" ... "*" ... <QUOTED> ... <TERM> ... <PREFIXTERM> ... <WILDTERM> ... "[" ... "{" ... <NUMBER> ... Note: we had the same issue with Indiana (IN), but removing that stop word fixed it. Removing the stopword 'or' has not helped. The field itself is indexed and stored as a string field during indexing: <field name="state" type="string" indexed="true" stored="true"/> Thanks in advance, John Brewer
Re: Searching For Term 'OR'
Thanks for the feedback. I'll give these a try. Tomás Fernández Löbbe tomasflo...@gmail.com wrote: I guess this is because Lucene QP is interpreting the 'OR' operator. You can either: use lowercase use other query parser, like the term query parser. See http://lucene.apache.org/solr/api/org/apache/solr/search/TermQParserPlugin.html Also, if you just removed the or term from the stopwords, you'll probably have to reindex if you want it in the index. Regards, Tomás On Thu, Aug 11, 2011 at 2:38 PM, John Brewer john.bre...@atozdatabases.comwrote: Hello, I am looking for some advice on how to index and search a field that contains a two character state name without the query parser dying on the OR and also not treating it as an 'OR' Boolean operator. For example: The following query with a filter query key/value pair causes an exception: q=*:*fq=(state:OR) Caused by: org.apache.lucene.queryParser.ParseException: Encountered OR OR at line 1, column 7. Was expecting one of: ( ... * ... QUOTED ... TERM ... PREFIXTERM ... WILDTERM ... [ ... { ... NUMBER ... Note: we had the same issue with Indiana (IN), but removing that stop word fixed it. Removing the stopword 'or', has not helped. The field itself is indexed and stored as string field during indexing. field name=state type=string indexed=true stored=true/ Thanks in advance, John Brewer
RE: Searching For Term 'OR'
hi, use the filter LowerCaseFilterFactory (it doesn't work with the string type; you must create a new fieldtype of text type) or use escaped forms: \OR \AND I tried it a moment ago and it works. regards --- Rode González -----Original Message----- From: Tomás Fernández Löbbe [mailto:tomasflo...@gmail.com] Sent: Thursday, August 11, 2011 19:58 To: solr-user@lucene.apache.org Subject: Re: Searching For Term 'OR' I guess this is because the Lucene QP is interpreting the 'OR' operator. You can either use lowercase, or use another query parser, like the term query parser. See http://lucene.apache.org/solr/api/org/apache/solr/search/TermQParserPlugin.html Also, if you just removed the or term from the stopwords, you'll probably have to reindex if you want it in the index. Regards, Tomás On Thu, Aug 11, 2011 at 2:38 PM, John Brewer john.bre...@atozdatabases.com wrote: Hello, I am looking for some advice on how to index and search a field that contains a two-character state name without the query parser dying on OR and also not treating it as an 'OR' Boolean operator. For example, the following query with a filter query key/value pair causes an exception: q=*:*&fq=(state:OR) Caused by: org.apache.lucene.queryParser.ParseException: Encountered "OR" at line 1, column 7. Was expecting one of: "(" ... "*" ... <QUOTED> ... <TERM> ... <PREFIXTERM> ... <WILDTERM> ... "[" ... "{" ... <NUMBER> ... Note: we had the same issue with Indiana (IN), but removing that stop word fixed it. Removing the stopword 'or' has not helped. The field itself is indexed and stored as a string field during indexing: <field name="state" type="string" indexed="true" stored="true"/> Thanks in advance, John Brewer
Re: Searching For Term 'OR'
: I am looking for some advice on how to index and search a field that : contains a two character state name without the query parser dying on : the OR and also not treating it as an 'OR' Boolean operator. fq={!term f=state}OR ...for this kind of filter you don't want a query parser that has any metacharacters. -Hoss
RE: Searching For Term 'OR'
Thanks for the advice everyone. I am rebuilding the index with a lowercase field instead of string. -Original Message- From: Chris Hostetter [mailto:hossman_luc...@fucit.org] Sent: Thursday, August 11, 2011 1:10 PM To: solr-user@lucene.apache.org Subject: Re: Searching For Term 'OR' : I am looking for some advice on how to index and search a field that : contains a two character state name without the query parser dying on : the OR and also not treating it as an 'OR' Boolean operator. fq={!term f=state}OR ...this kind of filter you don't want a query parser that has any metacharacters. -Hoss
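John's "lowercase field instead of string" could look like the following fieldType (a sketch; the type name string_lc is invented):

```xml
<!-- Keeps the whole value as one token but lowercases it, so a query on
     state:or matches documents indexed with "OR" and never reaches the
     query parser as an uppercase operator. -->
<fieldType name="string_lc" class="solr.TextField" sortMissingLast="true">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
<field name="state" type="string_lc" indexed="true" stored="true"/>
```

This keeps the stored value as typed while making matching case-insensitive.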
Re: strip html from data
You can use <charFilter class="solr.HTMLStripCharFilterFactory"/> like in this example. Check the docs for your specific SOLR version, because something has changed in the htmlstrip syntax between 1.4 and 3.x:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <charFilter class="solr.HTMLStripCharFilterFactory"/>
</fieldType>

2011/8/11 Merlin Morgenstern merlin.morgenst...@googlemail.com I am sorry, but I do not really understand the difference between the indexed and the returned result set. I look at the returned dataset via this command: solr/select/?q=id:533563&terms=true which gives me html tags like these: </b><br /> I also tried to turn on the TermsComponent, but it did not change anything: solr/select/?q=id:533563&terms=true The schema browser does not show any html tags inside the text field, just the indexed words of the one dataset. Is there a way to strip the html tags completely and not index them? If not, how do I retrieve the results without html tags? Thank you for your help. 2011/8/9 Erick Erickson erickerick...@gmail.com OK, what does "not working" mean? You never answered Markus' question: are you looking at the returned result set or at what you've actually indexed? Analyzers are not run on the stored data, only on indexed data. If "not working" means that your returned results contain the markup, then you're confusing indexing and storing. All the analysis chains operate on data sent into the indexing process. But the verbatim data is *stored* prior to (or separate from) indexing. So my assumption is that you see data returned in the document with markup, which is just as it should be, and there's no problem at all. And your actual indexed terms (try looking at the data with the TermsComponent, or admin/schema browser) will NOT have any markup. Perhaps you can back up a bit and describe what's failing vs. what you expect.
Best Erick On Mon, Aug 8, 2011 at 6:50 AM, Merlin Morgenstern merlin.morgenst...@googlemail.com wrote: Unfortunately I still can't get it running. The code I am using is the following:

<analyzer type="index">
  <charFilter class="solr.HTMLStripCharFilterFactory"/>
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.KeywordMarkerFilterFactory"/>
  <filter class="solr.PorterStemFilterFactory"/>
</analyzer>
<analyzer type="query">
  <charFilter class="solr.HTMLStripCharFilterFactory"/>
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.KeywordMarkerFilterFactory"/>
  <filter class="solr.PorterStemFilterFactory"/>
</analyzer>

I also tried this one:

<types>
  <fieldType name="text" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
    <analyzer>
      <charFilter class="solr.HTMLStripCharFilterFactory"/>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.StandardFilterFactory"/>
    </analyzer>
  </fieldType>
</types>

<field name="text" type="text" indexed="true" stored="true" required="false"/>

None of those worked. I restarted solr after the schema update and reindexed the data. No change, the html tags are still in there. Any other ideas? Maybe this is a bug in solr? I am using solr 3.3.0 on SUSE Linux. Thank you for any help on this. 2011/7/25 Mike Sokolov soko...@ifactory.com Hmm that looks like it's working fine. I stand corrected. On 07/25/2011 12:24 PM, Markus Jelsma wrote: I've seen that issue too and read comments on the list, yet I've never had trouble with the order, don't know what's going on.
Check this analyzer, i've moved the charFilter to the bottom:

<analyzer type="index">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="false" expand="true"/>
  <filter class="solr.StopFilterFactory" ignoreCase="false" words="stopwords.txt"/>
Re: Need help indexing/querying a particular type of hierarchy
I've been experimenting with that, but that fq wouldn't limit my facet counts adequately. Since the document has both an accessionWF and a digitizationWF, the fq would match (and count) the document no matter what the status for each process. I suppose I could do something like this:

<field name="status_wps">accessionWF:start-accession:completed</field>
<field name="status_wps">accessionWF:cleanup:waiting</field>
<field name="status_wps">accessionWF:descriptive-metadata:completed</field>
<field name="status_wps">accessionWF:content-metadata:completed</field>
<field name="status_wps">accessionWF:rights-metadata:completed</field>
<field name="status_wps">accessionWF:publish:completed</field>
<field name="status_wps">accessionWF:shelve:error</field>
<field name="status_wsp">accessionWF:completed:start-accession</field>
<field name="status_wsp">accessionWF:waiting:cleanup</field>
<field name="status_wsp">accessionWF:completed:descriptive-metadata</field>
<field name="status_wsp">accessionWF:completed:content-metadata</field>
<field name="status_wsp">accessionWF:completed:rights-metadata</field>
<field name="status_wsp">accessionWF:completed:publish</field>
<field name="status_wsp">accessionWF:error:shelve</field>
<field name="status_swp">completed:accessionWF:start-accession</field>
<field name="status_swp">waiting:accessionWF:cleanup</field>
<field name="status_swp">completed:accessionWF:descriptive-metadata</field>
<field name="status_swp">completed:accessionWF:content-metadata</field>
<field name="status_swp">completed:accessionWF:rights-metadata</field>
<field name="status_swp">completed:accessionWF:publish</field>
<field name="status_swp">error:accessionWF:shelve</field>

and use a PathHierarchyTokenizerFactory with ":" as the delimiter. Then I could use facet.field=status_wps&f.status_wps.facet.prefix=accessionWF: to get the counts for all the accessionWF processes and statuses, then repeat using status_wsp and status_swp for the various inversions. I was hoping for something easier.
:) On Thu, Aug 11, 2011 at 6:40 AM, Dmitry Kan dmitry@gmail.com wrote: Hi, Can you keep your hierarchy flat in SOLR and then use filter queries (fq=wf:accessionWF) inside your facet queries (facet.field=status)? Or is the requirement to have one single facet query producing the hierarchical facet counts? On Thu, Aug 11, 2011 at 10:43 AM, Michael B. Klein mbkl...@gmail.com wrote: Hi all, I have a particular data structure I'm trying to index into a solr document so that I can query and facet it in a particular way, and I can't quite figure out the best way to go about it. One sample object is here: https://gist.github.com/1139065 The part that's tripping me up is the workflows. Each workflow has a name (in this case, digitizationWF and accessionWF). Each workflow is made up of a number of processes, each of which has its own current status. Every time the status of a process within a workflow changes, the object is reindexed. What I'd like to be able to do is present several hierarchies of facets. In one, the workflow name is the top-level facet, with the second level showing each process, under which is listed each status (completed, waiting, or error) and the number of documents with that status for that process (some values omitted for brevity):

accessionWF (583)
  publish (583)
    completed (574)
    waiting (6)
    error (3)
  shelve (583)
    completed (583)
etc.

I'd also like to be able to invert that presentation:

accessionWF (583)
  completed (583)
    publish (574)
    shelve (583)
  waiting (6)
    publish (6)
  error (3)
    publish (3)

or even

completed (583)
  accessionWF (583)
    publish (574)
    shelve (583)
  digitizationWF (583)
    initiate (583)
error (3)
  accessionWF (3)
    shelve (3)
etc.

I don't think Solr 4.0's pivot/hierarchical facets are what I'm looking for, because the status values are ambiguous when not qualified by the process name -- the object itself has no completed status, only a publish:completed and a shelve:completed that I want to be able to group together into a count/list of objects with completed processes. I also don't think PathHierarchyTokenizerFactory is quite the answer either. What kind of Solr magic, if any, am I looking for here? Thanks in advance for any help or advice. Michael --- Michael B. Klein Digitization Workflow Engineer Stanford University Libraries -- Regards, Dmitry Kan
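The PathHierarchyTokenizerFactory-with-":" idea from Michael's reply could be wired up like this (a sketch; the type name wf_path is invented):

```xml
<!-- Emits one token per hierarchy prefix: the value
     "accessionWF:publish:completed" indexes as "accessionWF",
     "accessionWF:publish", and "accessionWF:publish:completed",
     so facet.prefix can count at any level of the hierarchy. -->
<fieldType name="wf_path" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.PathHierarchyTokenizerFactory" delimiter=":"/>
  </analyzer>
</fieldType>
<field name="status_wps" type="wf_path" indexed="true" stored="true" multiValued="true"/>
```

With this analysis, the separate precomposed prefix values would not need to be generated by the indexing client; only the full path per process would be indexed.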
Re: how to ignore case in solr search field?
Here's an example. Since I only query this for spelling, I can lowercase both at index and query time:

<fieldType name="textSpell" class="solr.TextField" positionIncrementGap="10" stored="false" multiValued="true">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

2011/8/10 nagarjuna nagarjuna.avul...@gmail.com Hi, please help me .. how to ignore case while searching in solr? ex: I need the same results for the keywords abc, ABC, aBc, AbC and all the cases. Thank you in advance -- View this message in context: http://lucene.472066.n3.nabble.com/how-to-ignore-case-in-solr-search-field-tp3242967p3242967.html Sent from the Solr - User mailing list archive at Nabble.com. -- *Alexei Martchenko* | *CEO* | Superdownloads ale...@superdownloads.com.br | ale...@martchenko.com.br | (11) 5083.1018/5080.3535/5080.3533
Re: bug in termfreq? was Re: is it possible to do a sort without query?
are you boosting your docs? 2011/8/8 Jason Toy jason...@gmail.com I am trying to test out and compare different sorts and scoring. When I use dismax to search for indie music with: qf=all_lists_text&q=indie+music&defType=dismax&rows=100 I see some stuff that seems irrelevant, meaning in the top results I see only 1 or 2 mentions of indie music, but when I look further down the list I do see other docs that have more occurrences of indie music. So I want to test by comparing the different queries versus seeing a list of docs ranked specifically by the count of occurrences of the phrase indie music. On Mon, Aug 8, 2011 at 2:19 PM, Markus Jelsma markus.jel...@openindex.io wrote: Dismax queries can. But sort=termfreq(all_lists_text,'indie+music') is not using dismax. Apparently the termfreq function can not? I am not familiar with the termfreq function. It simply returns the TF of the given _term_ as it is indexed for the current document. Sorting on TF like this seems strange, as by default queries are already sorted that way since TF plays a big role in the final score. To understand why you'd need to reindex, you might want to read up on how lucene actually works, to get a basic understanding of how different indexing choices affect what is possible at query time. Lucene In Action is a pretty good book. On 8/8/2011 5:02 PM, Jason Toy wrote: Are Dismax queries not able to search for phrases using the default index (which is what I am using)? If I can already do phrase searches, I don't understand why I would need to reindex to be able to access phrases from a function. On Mon, Aug 8, 2011 at 1:49 PM, Markus Jelsma markus.jel...@openindex.io wrote: Alexei, thank you, that does seem to work. My sort results seem to be totally wrong though, I'm not sure if it's because of my sort function or something else. My query consists of: sort=termfreq(all_lists_text,'indie+music')+desc&q=*:*&rows=100 And I get back 4571232 hits. That's normal, you issued a catch-all query.
Sorting should work but.. All the results don't have the phrase indie music anywhere in their data. Does termfreq not support phrases? No, it is TERM frequency and indie music is not one term. I don't know how this function parses your input, but it might not understand your + escape and think it's one term consisting of exactly that. If not, how can I sort specifically by the termfreq of a phrase? You cannot. What you can do is index multiple terms as one term using the shingle filter. Take care, it can significantly increase your index size and number of unique terms. On Mon, Aug 8, 2011 at 1:08 PM, Alexei Martchenko ale...@superdownloads.com.br wrote: You can use the standard query parser and pass q=*:* 2011/8/8 Jason Toy jason...@gmail.com I am trying to list some data based on a function I run, specifically termfreq(post_text,'indie music'), and I am unable to do it without passing in data to the q parameter. Is it possible to get a sorted list without searching for any terms? -- *Alexei Martchenko* | *CEO* | Superdownloads ale...@superdownloads.com.br | ale...@martchenko.com.br | (11) 5083.1018/5080.3535/5080.3533 -- - sent from my mobile 6176064373 -- *Alexei Martchenko* | *CEO* | Superdownloads ale...@superdownloads.com.br | ale...@martchenko.com.br | (11) 5083.1018/5080.3535/5080.3533
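The shingle filter Markus mentions could be sketched like this (the type name is invented, and as he warns, index size will grow):

```xml
<!-- Indexes adjacent word pairs as single terms, so "indie music"
     becomes the one term "indie music" and a term-level function
     can see the phrase. -->
<fieldType name="text_shingle" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ShingleFilterFactory" maxShingleSize="2" outputUnigrams="true"/>
  </analyzer>
</fieldType>
```

A copyField into a field of this type would then allow sorting on the frequency of the two-word shingle, assuming a build where the termfreq() function is available.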
Re: unique terms and multi-valued fields
Thant makes sense. There are actually stored fields. I was mostly just trying to figure out how much my index size might grow. These fields I am dealing with are large and repetitive (but mixed). From: Erick Erickson erickerick...@gmail.com To: solr-user@lucene.apache.org; Kevin Osborn osbo...@yahoo.com Sent: Wednesday, August 10, 2011 7:08 AM Subject: Re: unique terms and multi-valued fields Well, it depends (tm). If you're talking about *indexed* terms, then the value is stored only once in both the cases you mentioned below. There's really very little difference between a non-multi-valued field and a multi-valued field in terms of how it's stored in the searchable portion of the index, except for some position information. So, having an XML doc with a single-valued field field name=categorycomputers laptops/field is almost identical (except for position info as positionIncrementGap) as a field name=categorycomputers/field field name=categorylaptops/field multiValued refers to the *input*, not whether more than one word is allowed in that field. Now, about *stored* fields. If you store the data, verbatim copies are kept in the storage-specific files in each segment, and the values will be on disk for each document. But you probably don't care much because this data is only referenced when you assemble a document for return to the client, it's irrelevant for searching. Best Erick On Tue, Aug 9, 2011 at 8:02 PM, Kevin Osborn osbo...@yahoo.com wrote: Please verify my understanding. I have a field called category and it has a value computers. If I use this same field and value for all of my documents, it is really only stored on disk once because category:computers is a unique term. Is this correct? But, what about multi-valued fields. So, I have a field called category. For 100 documents, it has the values computers and laptops. For 100 other documents, it has the values computers and tablets. 
Is this stored as category:computers, category:laptops, category:tablets, meaning 3 unique terms. Or is it stored as category:computers,laptops and category:computers,tablets. I believe it is the first case (hopefully), but I am not sure. Thanks.
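Erick's equivalence can be written out as a hedged schema-plus-document sketch. The field declaration and document fragments below are illustrative assumptions (only the field name "category" comes from the thread):

```xml
<!-- Hedged sketch: one multiValued field declaration in schema.xml. -->
<field name="category" type="text" indexed="true" stored="true" multiValued="true"/>

<!-- These two input documents index essentially the same terms for
     "category" (differing only by the positionIncrementGap between values): -->
<doc>
  <field name="category">computers laptops</field>
</doc>
<doc>
  <field name="category">computers</field>
  <field name="category">laptops</field>
</doc>
```

Either way, the term dictionary ends up with unique terms like category:computers and category:laptops once each, per Erick's first case.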
Re: Unbuffered entity enclosing request can not be repeated Invalid chunk header
Hi, We see these errors too once in a while, but there is no real answer on the mailing list here, except one user suspecting Tomcat is responsible (connection timeouts). Another user proposed limiting the number of documents per batch, but that, of course, increases the number of connections made. We do only 250 docs/batch to limit RAM usage on the client and start to see these errors very occasionally. There may be a coincidence.. or not. Anyway, it's really hard to reproduce, if not impossible. It happens when connecting directly as well as when connecting through a proxy. What you can do is simply retry the batch, and it usually works out fine. At least you don't lose a batch in the process. We retry all failures at least a couple of times before giving up an indexing job. Cheers, Hello folks, I use solr 1.4.1, and every 2 to 6 hours I have indexing errors in my log files. On the client side: 2011-08-04 12:01:18,966 ERROR [Worker-242] IndexServiceImpl - Indexing failed with SolrServerException. Details: org.apache.commons.httpclient.ProtocolException: Unbuffered entity enclosing request can not be repeated.: Stacktrace: org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:469) . . On the server side: INFO: [] webapp=/solr path=/update params={wt=javabin&version=1} status=0 QTime=3 04.08.2011 12:01:18 org.apache.solr.update.processor.LogUpdateProcessor finish INFO: {} 0 0 04.08.2011 12:01:18 org.apache.solr.common.SolrException log SCHWERWIEGEND: org.apache.solr.common.SolrException: java.io.IOException: Invalid chunk header . . . I'm indexing ONE document per call, 15-20 documents per second, 24/7. What may be the problem? Best regards, Vadim
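The "retry the batch a couple of times before giving up" approach above can be sketched generically. This is a hedged sketch: the Callable stands in for a SolrJ call like server.add(batch); attempt counts, sleep interval, and all names are illustrative assumptions.

```java
import java.util.concurrent.Callable;

public class RetryDemo {
    // Retry a unit of work (e.g. sending one batch of documents) up to
    // maxAttempts times, sleeping briefly between attempts, and rethrow
    // the last failure only after all attempts are exhausted.
    static <T> T withRetries(Callable<T> work, int maxAttempts, long sleepMs) throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return work.call();
            } catch (Exception e) {
                last = e;
                if (attempt < maxAttempts) Thread.sleep(sleepMs);
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        final int[] calls = {0};
        // Stand-in for server.add(batch): fails twice, then succeeds.
        String result = withRetries(() -> {
            calls[0]++;
            if (calls[0] < 3) throw new RuntimeException("Unbuffered entity enclosing request can not be repeated.");
            return "indexed";
        }, 3, 10L);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

Wrapping each batch submission this way means an occasional ProtocolException costs a short delay rather than a lost batch.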
Why is boost not always listed in explain when debug is on?
using Solr Specification Version: 4.0.0.2011.08.09.11.02.13 While trying to understand scoring, I noticed that boost is intermittently displayed in the explain. For example, using edismax with the query string q=Starbucks&qf=name.search name^2, my first result has the boost explicitly listed in the explain as 2.0 = boost. When I change the boost to 20, however, I no longer see the boost listed. Should boost be displayed in both cases? Any help understanding this behavior would be greatly appreciated. Thanks!
Boost of 2:
f278968e-b2c6-4bbd-8e69-85ab938fa554:
8.609146 = (MATCH) max of:
  8.609146 = (MATCH) weight(name:starbucks^2.0 in 163) [DefaultSimilarity], result of:
    8.609146 = score(doc=163,freq=1.0 = termFreq=1), product of:
      0.9994 = queryWeight, product of:
        2.0 = boost
        8.609147 = idf(docFreq=8644, maxDocs=17433139)
        0.05807776 = queryNorm
      8.609147 = fieldWeight in 163, product of:
        1.0 = tf(freq=1.0), with freq of: 1.0 = termFreq=1
        8.609147 = idf(docFreq=8644, maxDocs=17433139)
        1.0 = fieldNorm(doc=163)
  4.278918 = (MATCH) weight(name.search:starbuck in 163) [DefaultSimilarity], result of:
    4.278918 = score(doc=163,freq=1.0 = termFreq=1), product of:
      0.49850774 = queryWeight, product of:
        8.583453 = idf(docFreq=8869, maxDocs=17433139)
        0.05807776 = queryNorm
      8.583453 = fieldWeight in 163, product of:
        1.0 = tf(freq=1.0), with freq of: 1.0 = termFreq=1
        8.583453 = idf(docFreq=8869, maxDocs=17433139)
        1.0 = fieldNorm(doc=163)
Boost of 20:
f278968e-b2c6-4bbd-8e69-85ab938fa554:
8.609147 = (MATCH) max of:
  8.609147 = (MATCH) weight(name:starbucks^20.0 in 163) [DefaultSimilarity], result of:
    8.609147 = fieldWeight in 163, product of:
      1.0 = tf(freq=1.0), with freq of: 1.0 = termFreq=1
      8.609147 = idf(docFreq=8644, maxDocs=17433139)
      1.0 = fieldNorm(doc=163)
  0.42789182 = (MATCH) weight(name.search:starbuck in 163) [DefaultSimilarity], result of:
    0.42789182 = score(doc=163,freq=1.0 = termFreq=1), product of:
      0.049850777 = queryWeight, product of:
        8.583453 = idf(docFreq=8869, maxDocs=17433139)
        0.0058077765 = queryNorm
      8.583453 = fieldWeight in 163, product of:
        1.0 = tf(freq=1.0), with freq of: 1.0 = termFreq=1
        8.583453 = idf(docFreq=8869, maxDocs=17433139)
        1.0 = fieldNorm(doc=163)
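One possible reading of the two explain outputs, sketched as arithmetic. This is a hedged sketch, assuming DefaultSimilarity behavior (queryNorm = 1/sqrt(sumOfSquaredWeights)) and a dismax tie of 0 so the strongest clause dominates the norm: the boost is multiplied into queryWeight but divided right back out by queryNorm, so the final score does not change, and the explain appears to drop the queryWeight branch once it works out to exactly 1.0. The idf value below is taken from the explains; everything else follows from it.

```java
public class BoostNorm {
    public static void main(String[] args) {
        double idf = 8.609147; // idf(docFreq=8644) from the explain output
        for (double boost : new double[] {2.0, 20.0}) {
            // Assumed: queryNorm normalises by the strongest clause,
            // i.e. queryNorm = 1 / (boost * idf) in this case.
            double queryNorm = 1.0 / (boost * idf);
            double queryWeight = boost * idf * queryNorm; // works out to 1.0
            double fieldWeight = 1.0 * idf * 1.0;         // tf * idf * fieldNorm
            System.out.printf("boost=%.1f queryNorm=%.10f score=%.6f%n",
                    boost, queryNorm, queryWeight * fieldWeight);
        }
    }
}
```

Under these assumptions both boosts yield score 8.609147, matching the two explains, and the queryNorm values (0.05807776 vs 0.0058077765) match as well. If so, the missing "boost" line is a display artifact of the cancellation, not a lost boost; the boost still affects the relative weight of the other clause (0.49850774 vs 0.049850777 above).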
Re: SEVERE: org.apache.solr.common.SolrException: Error loading class 'solr.ICUTokenizerFactory'
: I copied the file apache-solr-analysis-extras-3.3.0.jar into solr's lib : folder. Now the error is different - ... : I also added the following files to my apache-solr-3.3.0\example\lib : folder: Deja-Vu... http://www.lucidimagination.com/search/document/5967b87c6fa56fd1/error_loading_a_custom_request_handler_in_solr_4_0 And another blast from the past (all the details still accurate)... http://www.lucidimagination.com/search/document/ef9f4bd49f8b3576/fw_customanalyzer_class_not_loaded_error -Hoss
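One common way to make extra analysis jars visible to Solr is a <lib> directive in solrconfig.xml rather than copying jars around. This is a hedged sketch: the dir paths below are illustrative and depend entirely on your layout; point them at wherever the analysis-extras jar and its ICU dependencies actually live.

```xml
<!-- Hedged sketch for solrconfig.xml: paths are illustrative. -->
<!-- Pick up the analysis-extras jar from the distribution's dist/ folder. -->
<lib dir="../../dist/" regex="apache-solr-analysis-extras-\d.*\.jar" />
<!-- Pick up its dependencies (e.g. the ICU jars shipped in contrib). -->
<lib dir="../../contrib/analysis-extras/lib/" />
```

As the linked threads explain, the factory class alone is not enough; its supporting jars must also be on the same classpath.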
Re: Dates off by 1 day?
: In Solr the date is stored as Zulu time zone and Solrj is returning date in : CDT timezone (jvm is picking system time zone.) Strictly speaking, Solrj is not returning the date in CDT timezone ... Date objects in java are absolute moments in time, that know nothing about timezones. Where the system time zone of your client comes into play is when you do an implicit conversion to a String because of the + operator... : System.out.println(-- + resultDoc.getFieldValue(FILE_DATE)); http://download.oracle.com/javase/6/docs/api/java/util/Date.html#toString%28%29 -Hoss
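Hoss's point about Date being an absolute instant, with the timezone appearing only at String-conversion time, can be shown in a few lines. A minimal sketch; the ISO-style UTC pattern is an illustrative choice, not something from the thread.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

public class DateDemo {
    public static void main(String[] args) {
        Date d = new Date(0L); // one absolute instant: 1970-01-01T00:00:00Z
        // Implicit conversion via + calls Date.toString(), which renders
        // the instant in the JVM's default time zone (e.g. CDT).
        System.out.println("-- " + d);
        // To render the same instant in UTC ("Zulu"), format explicitly:
        SimpleDateFormat utc = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss'Z'");
        utc.setTimeZone(TimeZone.getTimeZone("UTC"));
        System.out.println("-- " + utc.format(d));
    }
}
```

Both lines print the same instant; only the rendering differs, which is why the Solrj client appears to "return CDT" while Solr stores Zulu time.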
Timeout trying to index from nutch
I am a new user and I have Solr installed. I can use the admin page and query the example data. However, when I was using nutch to load the index with intranet web pages, I got this message: SolrIndexer: starting at 2011-08-12 16:52:44 org.apache.solr.client.solrj.SolrServerException: java.net.ConnectException: Connection timed out The timeout happened after about 12 minutes. I can't seem to find this message in an archive search. Can anyone give me some clues?
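A ConnectException like the one above usually means the crawling machine cannot reach the Solr port at all (wrong URL, firewall, or Solr bound to a different interface), so a first step is to check raw TCP reachability from that machine. A minimal sketch; the host and port are assumptions — substitute whatever Solr URL you pass to nutch.

```java
import java.net.InetSocketAddress;
import java.net.Socket;

public class PingSolr {
    // Returns true if a TCP connection to host:port succeeds within timeoutMs.
    static boolean canConnect(String host, int port, int timeoutMs) {
        try (Socket s = new Socket()) {
            s.connect(new InetSocketAddress(host, port), timeoutMs);
            return true;
        } catch (Exception e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumed host/port: use the ones from your nutch Solr URL.
        System.out.println(canConnect("localhost", 8983, 5000));
    }
}
```

If this prints false from the machine running nutch while Solr's admin page works from your browser, the problem is network-level (host name, port, or firewall) rather than anything in nutch or Solr itself.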