Thought exercise: features for Solr client
Hello, I am trying to imagine what a new, fresh Solr client library would look like. There have been a number of features added to Solr recently, so some of the older libraries do not necessarily support them well (e.g. multi-collections, soft commits, multiple handler end-points, schema auto-discovery, etc.). If one were to write a new client, what would a useful version 1 look like for modern Solr? At the moment, I am not talking about a specific implementation language. Still, if you have any thoughts on that, they are welcome too. My own thoughts center around two directions that a library would need to support: 1) Indexing on the backend 2) Middle-layers between the website and Solr doing some sort of query security, enhancement, normalization, etc. Any thoughts? Regards, Alex. Personal website: http://www.outerthoughts.com/ LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch - Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book)
Re: exceeded limit of maxWarmingSearchers ERROR
Hi Naveen, I am also getting a similar problem, and I do not know how to use the commitWithin tag. Can you show me how to use it? Can you give me an example? -- View this message in context: http://lucene.472066.n3.nabble.com/exceeded-limit-of-maxWarmingSearchers-ERROR-tp3252844p4100864.html Sent from the Solr - User mailing list archive at Nabble.com.
Configure maxConnectionsPerHost
Hi, Where can I configure maxConnectionsPerHost in Solr? I'm using Solr 4.5.1 with the old style of solr.xml (I have a lot of collections, and switching to the new style of solr.xml is too much work). - Best regards -- View this message in context: http://lucene.472066.n3.nabble.com/Configure-maxConnectionsPerHost-tp4100870.html Sent from the Solr - User mailing list archive at Nabble.com.
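For reference, in Solr 4.x the shard handler's HTTP connection settings can be set per request handler in solrconfig.xml. A minimal sketch, assuming the stock HttpShardHandlerFactory (the value 100 is illustrative, not a recommendation; the element name has appeared as both shardHandler and shardHandlerFactory in different docs/versions, so check against your release):

  <requestHandler name="/select" class="solr.SearchHandler">
    <shardHandler class="HttpShardHandlerFactory">
      <int name="maxConnectionsPerHost">100</int>
    </shardHandler>
  </requestHandler>

A global shardHandlerFactory declaration in solr.xml may also be possible, but whether the legacy solr.xml format accepts it should be verified for your version.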
Optimizing cores in SolrCloud
A few weeks ago optimization in SolrCloud was discussed in this thread: http://lucene.472066.n3.nabble.com/SolrCloud-optimizing-a-core-triggers-optimization-of-another-td4097499.html#a4098020 The thread was covering the distributed optimization inside a collection. My use case requires manually running optimizations every week or so, because I delete by query often, the deleted-docs count grows huge, and the only way to regain that space is by optimizing. Since I have a pretty steady high load, I can't do it overnight, and I was thinking of doing it one core at a time - meaning optimizing shard1_replica1, then shard1_replica2 and so on, using curl 'http://localhost:8983/solr/collection1_shard1_replica1/update?optimize=true&distrib=false' My question is how this would reflect on the performance of the system. All queries routed to that shard replica would be very slow, I assume. Would there be any problems if one replica is optimized and another is not? Has anybody tried something like this? Any tips or stories? Thank you! - Thanks, Michael -- View this message in context: http://lucene.472066.n3.nabble.com/Optimizing-cores-in-SolrCloud-tp4100871.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Thought exercise: features for Solr client
Here goes my wishlist: - Transaction management - Access control at document level Regards. On Thu, Nov 14, 2013 at 10:35 AM, Alexandre Rafalovitch arafa...@gmail.com wrote: Hello, I am trying to imagine what a new, fresh Solr client library would look like. There have been a number of features added to Solr recently, so some of the older libraries do not necessarily support them well (e.g. multi-collections, soft commits, multiple handler end-points, schema auto-discovery, etc.). If one were to write a new client, what would a useful version 1 look like for modern Solr? At the moment, I am not talking about a specific implementation language. Still, if you have any thoughts on that, they are welcome too. My own thoughts center around two directions that a library would need to support: 1) Indexing on the backend 2) Middle-layers between the website and Solr doing some sort of query security, enhancement, normalization, etc. Any thoughts? Regards, Alex. Personal website: http://www.outerthoughts.com/ LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch - Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book)
Re: Thought exercise: features for Solr client
I think there is a place for a client-side query hierarchy. It would be nice if you could build a Lucene Query and the Solr client would serialize it for you. If there were a general-purpose query serialization library, then you could support a similar programming model for Lucene-only and for Solr. It would be useful for all kinds of things, since you wouldn't be tied to the query-parser zoo. The XML QP is a possible starting place for a serialization format, but I think that ultimately, to do this, Query would have to add support for some kind of generic representation (e.g. a map of children which could be primitives or queries). -Mike On 11/14/13 4:35 AM, Alexandre Rafalovitch wrote: Hello, I am trying to imagine what a new, fresh Solr client library would look like. There have been a number of features added to Solr recently, so some of the older libraries do not necessarily support them well (e.g. multi-collections, soft commits, multiple handler end-points, schema auto-discovery, etc.). If one were to write a new client, what would a useful version 1 look like for modern Solr? At the moment, I am not talking about a specific implementation language. Still, if you have any thoughts on that, they are welcome too. My own thoughts center around two directions that a library would need to support: 1) Indexing on the backend 2) Middle-layers between the website and Solr doing some sort of query security, enhancement, normalization, etc. Any thoughts? Regards, Alex. Personal website: http://www.outerthoughts.com/ LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch - Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book)
RE: distributed search is significantly slower than direct search
Hi, We tried returning just the id field and got exactly the same performance. Our system is distributed, but all shards are on a single machine, so network issues are not a factor. The code we found where Solr is spending its time is on the shard and not on the routing core; again, all shards are local. We investigated the getFirstMatch() method and noticed that MultiTermEnum.reset (inside MultiTerm.iterator) and MultiTerm.seekExact take 99% of the time. Inside these methods, the call to BlockTreeTermsReader$FieldReader$SegmentTermsEnum$Frame.loadBlock takes most of the time. Out of the 7-second run these methods take ~5 seconds, and BinaryResponseWriter.write takes the rest (~2 seconds). We tried increasing cache sizes and got hits, but it only improved the query time by a second (to ~6), so no major effect. We are not indexing during our tests; the performance is similar. (How do we measure doc size? Is it important, given that performance is the same when returning only the id field?) We still don't completely understand why the query takes this much longer although the cores are on the same machine. Is there a way to improve the performance (code, configuration, query)? -Original Message- From: idokis...@gmail.com [mailto:idokis...@gmail.com] On Behalf Of Manuel Le Normand Sent: Thursday, November 14, 2013 1:30 AM To: solr-user@lucene.apache.org Subject: Re: distributed search is significantly slower than direct search It's surprising such a query takes a long time. I would assume that after consistently trying q=*:* you should be getting cache hits and times should be faster. Check in the admin UI how your query/doc caches perform. Moreover, the query in itself is just asking for the first 5000 docs that were indexed (returning the first [docid]), so it seems all this time is wasted on transfer. Out of these 7 secs, how much is spent on the above method? What do you return by default? How big is every doc you display in your results? It might be that both collections work on the same resources. Try elaborating your use-case. Anyway, it seems like you just made a test to see what the performance hit would be in a distributed environment, so I'll try to explain some things we encountered in our benchmarks, with a case that is at least similar in the number of docs fetched. We reclaim 2000 docs on every query, running over 40 shards. This means every shard is actually transferring 2000 docs to our frontend on every document-match request (the first one you were referring to). Even if lazily loaded, reading 2000 ids (on 40 servers) and lazy-loading the fields is a tough job. Waiting for the slowest shard to respond, then sorting the docs and reloading (lazy or not) the top 2000 docs can take a long time. Our times are 4-8 secs, but it's not really possible to compare cases. We've taken a few steps that improved things along the way, steps that led to others. These were our starters: 1. Profile these queries from different servers and Solr instances; try putting your finger on which collection is working hard and why. Check if you're stuck on components that don't add value for you but are used by default. 2. Consider eliminating the doc cache (see the config sketch after this message). It loads lots of (partly) lazy documents whose probability of secondary usage is low. There's no such thing as popular docs when requesting so many docs. You may be using your memory in a better way. 3. Bottleneck check - inner server metrics such as cpu user / iowait, packets transferred over the network, page faults etc. are excellent in order to understand whether the disk/network/cpu is slowing you down. Then upgrade hardware on one of the shards to check if it helps, comparing the upgraded shard's qTime to the others. 4. Warm up the index after committing - benchmark how queries perform before and after some warm-up, say a few hundred queries (from your previous system), in order to warm up the OS cache (assuming you're using NRTDirectoryFactory). Good luck, Manu On Wed, Nov 13, 2013 at 2:38 PM, Erick Erickson erickerick...@gmail.com wrote: One thing you can try, and this is more diagnostic than a cure, is to return just the id field (and ensure that lazy field loading is true). That'll tell you whether the issue is actually fetching the document off disk and decompressing, although frankly that's unlikely since you can get your 5,000 rows from a single machine quickly. The code you found where Solr is spending its time - is that on the routing core or on the shards? I actually have a hard time understanding how that code could take a long time; doesn't seem right. You are transferring 5,000 docs across the network, so it's possible that your network is just slow - that's certainly a difference between the local and remote case, but that's a stab in the dark. Not much help I know, Erick On Wed, Nov 13, 2013 at 2:52 AM, Elran
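Regarding point 2 in Manu's list above (eliminating the doc cache): the documentCache is declared in solrconfig.xml, and removing or commenting out the entry disables it. For reference, this is the stock declaration from the example config that you would remove or shrink (the sizes shown are the example defaults):

  <documentCache class="solr.LRUCache"
                 size="512"
                 initialSize="512"
                 autowarmCount="0"/>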
Re: Solr Synonym issue
Hello! Could you please describe the issue you are having? -- Regards, Rafał Kuć Performance Monitoring * Log Analytics * Search Analytics Solr Elasticsearch Support * http://sematext.com/ Hi Team, I have implemented Solr with my Magento Enterprise Edition. I am trying to implement synonyms in Solr but it's not working. Please find attached the synonyms.txt, schema.xml and solrconfig.xml files. I have been debugging the issue for 2 days but have not found any solution. Please help me as soon as possible. Hope I will get your reply soon. Thanks & Regards Jyoti Kadam
Re: solrcloud - forward update to a shard failed
Thanks Michael. Followed your advice - no commits from indexing clients; let auto commit take care of things. It worked; so far no errors. The config params need some more tweaking to get the right balance, specifically maxTime, maxDocs and the soft commit interval, but otherwise Solr is a lot healthier... Thanks for your help. I did something like that also, and I was getting some nasty problems when one of my clients would try to commit before a commit issued by another one had finished. Might be the same problem for you too. Try not doing explicit commits from the indexing client and instead set the autocommit to 1000 docs or whichever value fits you best. - Thanks, Michael -- View this message in context: http://lucene.472066.n3.nabble.com/solrcloud-forward-update-to-a-shard-failed-tp4100608p4100670.html Sent from the Solr - User mailing list archive at Nabble.com.
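For reference, the auto commit settings being discussed live in solrconfig.xml. A minimal sketch with illustrative values - tune maxDocs, maxTime and the soft commit interval to your load:

  <autoCommit>
    <maxDocs>1000</maxDocs>
    <maxTime>60000</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>
  <autoSoftCommit>
    <maxTime>5000</maxTime>
  </autoSoftCommit>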
Re: Updating Document Score With Payload of Multivalued Field?
Any ideas? 2013/11/13 Furkan KAMACI furkankam...@gmail.com PS: I use Solr 4.5.1 2013/11/13 Furkan KAMACI furkankam...@gmail.com Here is my case: I have a field in my schema named *elmo_field*. I want *elmo_field* to have multiple values and multiple payloads, i.e. dorothy|0.46 sesame|0.37 big bird|0.19 bird|0.22 When a user searches for a keyword, e.g. *dorothy*, I want to add 0.46 to the score; for *big bird*, 0.19; and for *bird*, 0.22. I mean, I will make a search on the other fields of my Solr schema, and at the same time I will make another search (an exact-match search) against *elmo_field*, and if it matches something I will increase the score with the payloads. How can I do that: adding something to the score from a multivalued payload field (with a nested query or not)? And do you have any other ideas to achieve this?
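For what it's worth, the indexing side of a field like *elmo_field* can be expressed with the stock payload support. A sketch of a schema.xml field type, assuming float-encoded payloads and the | delimiter shown above; note that making the payloads actually contribute to the score still requires a payload-aware query and/or custom similarity, which is the open part of the question:

  <fieldtype name="payloads" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.DelimitedPayloadTokenFilterFactory" encoder="float" delimiter="|"/>
    </analyzer>
  </fieldtype>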
Solr Release Management Process
Hi; I've asked the same question on the dev list but I could not get an answer. This question is related to Solr contributors too, so I wanted to ask it here, on the solr-user list. My question was: I resolved 2 issues last week. One of them was created by me and one of them was an existing issue. There is also a 3rd issue that is a duplicate of the second one. When I create an issue I have the right to edit Fix Version/s. I entered 4.6 as the fix version of the first issue. The second issue was not created by me, so I cannot edit its Fix Version/s. I just wonder about, and want to learn, the commit process of the Solr project. What do committers do before a new release process starts? If they filter for resolved issues that have a Fix Version/s of the new release, they will not see all resolved issues. If they filter for issues resolved since the last release, then they are not using the benefits of the Fix Version/s section. People have the right to edit the Fix Version/s section when they create an issue, but not the right to edit existing ones (ones created by other people). There are many issues in the Solr project and frequent commits every day. Should I point to the relevant user in comments (with an @ tag) for such situations (I follow from the dev-list who is responsible for the next release), or do you handle it yourselves (as you have handled it until now)? I just wanted to learn the internal process of release management. Thanks; Furkan KAMACI
Solr xml img parsing exception
Hi, I have installed a Solr 4.3 instance and we have configured manifoldcf to pass web content to the shard collection, but during the crawling we have noticed a lot of this exception:

ERROR - 2013-11-14 15:13:57.954; org.apache.solr.common.SolrException; org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: XML parse error
  at com.lsegroup.solr.handler.CwsExtractingDocumentLoader.load(CwsExtractingDocumentLoader.java:150)
  at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
  at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
  at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:242)
  at org.apache.solr.core.SolrCore.execute(SolrCore.java:1816)
  at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:656)
  at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:359)
  at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155)
  at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241)
  at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
  at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:221)
  at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:107)
  at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:155)
  at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:76)
  at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:934)
  at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:90)
  at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:515)
  at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1012)
  at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:642)
  at org.apache.coyote.http11.Http11NioProtocol$Http11ConnectionHandler.process(Http11NioProtocol.java:223)
  at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun(NioEndpoint.java:1597)
  at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.run(NioEndpoint.java:1555)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
  at java.lang.Thread.run(Thread.java:724)
Caused by: org.apache.tika.exception.TikaException: XML parse error
  at org.apache.tika.parser.xml.XMLParser.parse(XMLParser.java:78)
  at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
  at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
  at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
  at com.lsegroup.solr.handler.CwsExtractingDocumentLoader.load(CwsExtractingDocumentLoader.java:147)
  ... 24 more
Caused by: org.xml.sax.SAXParseException; lineNumber: 91; columnNumber: 105; The element type "img" must be terminated by the matching end-tag "</img>".
  at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.createSAXParseException(ErrorHandlerWrapper.java:198)
  at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.fatalError(ErrorHandlerWrapper.java:177)
  at com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:441)
  at com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:368)
  at com.sun.org.apache.xerces.internal.impl.XMLScanner.reportFatalError(XMLScanner.java:1388)
  at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEndElement(XMLDocumentFragmentScannerImpl.java:1753)
  at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2951)
  at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:606)
  at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:116)
  at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:511)
  at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:846)
  at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:775)
  at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:123)
  at
Re: My setup - init script and other info
Shawn: Would you be willing to put this on the Wiki? I think it'd be really useful to have it there... I'm pretty sure you have edit rights to the wiki, but they're free for the asking if not... Erick On Wed, Nov 13, 2013 at 1:07 PM, Shawn Heisey s...@elyograg.org wrote: In the hopes that it will help someone get Solr running in a very clean way, here's an informational email. For my Solr install on CentOS 6, I use /opt/solr4 as my installation path, and /index/solr4 as my solr home. The /index directory is a dedicated filesystem, /opt is part of the root filesystem. From the example directory, I copied cloud-scripts, contexts, etc, lib, webapps, and start.jar over to /opt/solr4. My stuff was created before 4.3.0, so the resources directory didn't exist. I was already using log4j with a custom Solr build, and I put my log4j.properties file in etc instead. I created a logs directory and a run directory in /opt/solr4. My data structure in /index/solr4 is complex. All a new user really needs to know is that solr.xml goes here and dictates the rest of the structure. There is a symlink at /index/solr4/lib, pointing to /opt/solr4/solrlib - so that jars placed in ${solr.solr.home}/lib are actually located in the program directory, not the data directory. That makes for a much cleaner version control scenario - both directories are git repositories cloned from our internal git server. Unlike the example configs, my solrconfig.xml files do not have lib directives for loading jars. That gets automatically handled by the jars living in that symlinked lib directory. See SOLR-4852 for caveats regarding central lib directories. https://issues.apache.org/jira/browse/SOLR-4852 If you want to run SolrCloud, you would need to install zookeeper separately and put your zkHost parameter in solr.xml. Due to a bug, putting zkHost in solr.xml doesn't work properly until 4.4.0. Here's the current state of my init script. It's redhat-specific. I used /bin/bash (instead of /bin/sh) in the shebang because I am pretty sure that there are bash-isms in it, and bash is always available on the systems that I use: http://apaste.info/9fVA Notable features: * Runs Solr as an unprivileged user. * Has three methods for stopping Solr, tries graceful methods first. 1) The jetty STOPPORT/STOPKEY mechanism. 2) PID saved by the 'start' action. 3) Any program using the Solr listening port. * Before killing by PID, tries to make sure that the process actually is Solr. * Sets up remote JMX, by default without authentication or SSL. * Highly tuned CMS garbage collection. * Sets up GC logging. * Virtually everything is overridable via /etc/sysconfig/solr4. * Points at an overridable log4j config file, by default in /opt/solr4/etc. * Removes the existing PID file if the server is just booting up -- which it knows by noting that server uptime is less than three minutes. It shouldn't be too hard to convert this so it works on debian-derived systems. That would involve rewriting portions that use redhat init routines, and probably start-stop-daemon. What I'd really like is one script that will work on any system, but that will require a fair amount of work. It's a work in progress. It should load log4j.properties from resources instead of etc. I'd like to include it in the Solr download, but without a fair amount of documentation and possibly an installation script, which still must be written, that won't be possible. Feel free to ask questions about anything that doesn't seem clear. 
I welcome ideas for improvement on both my own setup and the solr example. Thanks, Shawn
Re: Atomic Update at Solrj For a Newly Added Schema Field
I don't think this is a problem - what are you seeing? Have you tried it and gotten an error? The only reason you need to have fields stored is so that _existing_ documents with _existing_ data get carried into the new doc. Since you've just added a field, you should be fine. It's just that documents already in your index won't have any value in the new field unless you specifically add one in the new version. So yes, to get values into all of your _existing_ records you need to at least add all the docs again, in which case you might as well re-index. But if you can live with some of the docs not having the value, you shouldn't need to. If you're seeing other behavior, tell us what you're seeing. Best, Erick On Wed, Nov 13, 2013 at 1:10 PM, Furkan KAMACI furkankam...@gmail.com wrote: I use Solr 4.5.1. I have indexed some documents and decided to add a new field to my schema some time later. I want to use atomic updates for that newly added field. I use SolrJ for indexing. However, because my existing documents were indexed before the newly added field existed, Solr does not make an atomic update on them. I do not want to reindex my whole data. Any ideas?
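For reference, an atomic update only needs an update modifier on the field being set. A minimal sketch in Solr's XML update format (the field names are illustrative):

  <add>
    <doc>
      <field name="id">doc1</field>
      <field name="newfield" update="set">some value</field>
    </doc>
  </add>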
Re: Using data-config.xml from DIH in SolrJ
There's nothing that I know of that takes a DIH configuration and uses it through SolrJ. You can use Tika directly in SolrJ if you need to parse structured documents though, see: http://searchhub.org/2012/02/14/indexing-with-solrj/ Yep, you're going to be kind of reinventing the wheel a bit I'm afraid. Best, Erick On Wed, Nov 13, 2013 at 1:55 PM, P Williams williams.tricia.l...@gmail.comwrote: Hi All, I'm building a utility (Java jar) to create SolrInputDocuments and send them to a HttpSolrServer using the SolrJ API. The intention is to find an efficient way to create documents from a large directory of files (where multiple files make one Solr document) and be sent to a remote Solr instance for update and commit. I've already solved the problem using the DataImportHandler (DIH) so I have a data-config.xml that describes the templated fields and cross-walking of the source(s) to the schema. The original data won't always be able to be co-located with the Solr server which is why I'm looking for another option. I've also already solved the problem using ant and xslt to create a temporary (and unfortunately a potentially large) document which the UpdateHandler will accept. I couldn't think of a solution that took advantage of the XSLT support in the UpdateHandler because each document is created from multiple files. Our current dated Java based solution significantly outperforms this solution in terms of disk and time. I've rejected it based on that and gone back to the drawing board. Does anyone have any suggestions on how I might be able to reuse my DIH configuration in the SolrJ context without re-inventing the wheel (or DIH in this case)? If I'm doing something ridiculous I hope you'll point that out too. Thanks, Tricia
Re: field collapsing performance in sharded environment
bq: Of the 10k docs, most have a unique near duplicate hash value, so there are about 10k unique values for the field that I'm grouping on. I suspect (but don't know the grouping code well) that this is the issue. You're getting the top N groups, right? But in the general case, you can't ensure that the top N from shard1 have any relation to the top N from shard2. So I _suspect_ that the code returns all of the groups. Say that group 5 has 3 docs on shard1 but 3,000 docs on shard2. To get the true top N, you need to collate all the values from all the groups; you can't just return the top 10 groups from each shard and get correct counts. Since your group cardinality is about 10K/shard, you're pushing 10 packets each containing 10K entries back to the originating shard, which has to combine/sort them all to get the true top N. At least that's my theory. Your situation is special in that you say your groups don't appear on more than one shard, so you'd probably have to write something that aborted this behavior and returned only the top N, if I'm right. But that begs the question of why you're doing this. What purpose is served by grouping on documents whose groups probably have only 1 member? Best, Erick On Wed, Nov 13, 2013 at 2:46 PM, David Anthony Troiano dtroi...@basistech.com wrote: Hello, I'm hitting a performance issue when using field collapsing in a distributed Solr setup, and I'm wondering if others have seen it and if anyone has an idea for a workaround. I'm using field collapsing to deduplicate documents that have the same near-duplicate hash value, and deduplicating at query time (as opposed to filtering at index time) is a requirement. I have a sharded setup with 10 cores (not SolrCloud), each having ~1000 documents. Of the 10k docs, most have a unique near-duplicate hash value, so there are about 10k unique values for the field that I'm grouping on. The grouping parameters that I'm using are: group=true group.field=<near dupe hash field> group.main=true I'm attempting distributed queries (shards=s1,s2,...,s10) where the only difference is the absence or presence of these three grouping parameters, and I'm consistently seeing a marked difference in performance (as a representative data point, 200ms latency without grouping and 1600ms with grouping). Interestingly, if I put all 10k docs on the same core and query that core independently with and without grouping, I don't see much of a latency difference, so the performance degradation seems to exist only in the sharded setup. Is there a known performance issue when field collapsing in a sharded setup (perhaps one that only manifests when the grouping field has many unique values), or have other people observed this? Any ideas for a workaround? Note that docs in my sharded setup can only have the same signature if they're in the same shard, so perhaps that can be used to boost perf, though I don't see an exposed way to do so. A follow-on question is whether we're likely to see the same issue if/when we move to SolrCloud. Thanks, Dave
Re: queries including time zone
IMO you will save yourself endless grief by just biting the bullet and working with UTC at all times. The instant you have users in even adjacent but different time zones, you'll have to deal with this anyway. FWIW, Erick On Thu, Nov 14, 2013 at 12:26 AM, Jack Krupansky j...@basetechnology.com wrote: I believe it is the TZ column from this table: http://en.wikipedia.org/wiki/List_of_tz_database_time_zones Yeah, it's on my TODO list for my book. I suspect that tz will not affect NOW, which is probably UTC. I suspect that tz only affects literal dates in date math. -- Jack Krupansky -Original Message- From: Eric Katherman Sent: Wednesday, November 13, 2013 11:38 PM To: solr-user@lucene.apache.org Subject: queries including time zone Can anybody provide any insight about using the tz param? It doesn't seem to be affecting date math and /DAY rounding. What format do the tz values need to be in? I'm not finding any documentation on this. Sample query we're using: path=/select params={tz=America/Chicago&sort=id+desc&start=0&q=application_id:51b30ed9bc571bd96773f09c+AND+object_key:object_26+AND+values_field_215_date:[*+TO+NOW/DAY%2B1DAY]&wt=json&rows=25} Thanks! Eric
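For reference, the parameter is spelled TZ (upper case) in the Solr sources, and it shifts the day boundary used by /DAY rounding in date math. A sketch of a query reusing the field from the sample above (host, core and the rest are illustrative):

  http://localhost:8983/solr/select?q=*:*&fq=values_field_215_date:[*+TO+NOW/DAY%2B1DAY]&TZ=America/Chicago&wt=json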
Re: exceeded limit of maxWarmingSearchers ERROR
CommitWithin-style behavior is configured either in solrconfig.xml, via the maxTime tag inside the autoCommit or autoSoftCommit tags, or, if you're using SolrJ, via one of the forms of the server.add() method that takes a number of milliseconds within which to force a commit. I recommend you use this. You really, really do NOT want to use ridiculously short times for this, like a few milliseconds. That will cause new searchers to be warmed, and when too many of them are warming at once you get this error. Seriously, make your commitWithin or autocommit parameters as long as you can, for many reasons. Here's a bunch of background: http://searchhub.org/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/ Best, Erick On Thu, Nov 14, 2013 at 5:13 AM, Loka lokanadham.ga...@zensar.in wrote: Hi Naveen, I am also getting a similar problem, and I do not know how to use the commitWithin tag. Can you show me how to use it? Can you give me an example? -- View this message in context: http://lucene.472066.n3.nabble.com/exceeded-limit-of-maxWarmingSearchers-ERROR-tp3252844p4100864.html Sent from the Solr - User mailing list archive at Nabble.com.
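For completeness, commitWithin can also be attached to the update request itself. A sketch of an XML update message - 60000 ms is an illustrative value; keep it generous, per the advice above:

  <add commitWithin="60000">
    <doc>
      <field name="id">doc-1</field>
    </doc>
  </add>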
Re: Optimizing cores in SolrCloud
I'm going to answer with something completely different <g> First, though, optimization happens in the background, so it shouldn't have too big an impact on query performance outside of I/O contention. There also shouldn't be any problem with one shard being optimized and one not. Second, have you considered tweaking some of the TieredMergePolicy knobs? In particular, reclaimDeletesWeight, which defaults to 2.0. You can set this in your solrconfig.xml. Through a clever bit of reflection, you can actually set most (all?) of the member vars in TieredMergePolicy.java. Bumping up the weight might cause the segment merges to merge away the deleted docs frequently enough to satisfy you. Best, Erick On Thu, Nov 14, 2013 at 5:39 AM, michael.boom my_sky...@yahoo.com wrote: A few weeks ago optimization in SolrCloud was discussed in this thread: http://lucene.472066.n3.nabble.com/SolrCloud-optimizing-a-core-triggers-optimization-of-another-td4097499.html#a4098020 The thread was covering the distributed optimization inside a collection. My use case requires manually running optimizations every week or so, because I delete by query often, the deleted-docs count grows huge, and the only way to regain that space is by optimizing. Since I have a pretty steady high load, I can't do it overnight, and I was thinking of doing it one core at a time - meaning optimizing shard1_replica1, then shard1_replica2 and so on, using curl 'http://localhost:8983/solr/collection1_shard1_replica1/update?optimize=true&distrib=false' My question is how this would reflect on the performance of the system. All queries routed to that shard replica would be very slow, I assume. Would there be any problems if one replica is optimized and another is not? Has anybody tried something like this? Any tips or stories? Thank you! - Thanks, Michael -- View this message in context: http://lucene.472066.n3.nabble.com/Optimizing-cores-in-SolrCloud-tp4100871.html Sent from the Solr - User mailing list archive at Nabble.com.
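A sketch of what bumping that knob could look like in the indexConfig section of solrconfig.xml - 4.0 is an illustrative value (2.0 is the default mentioned above), and the <double> element relies on the reflection-based setter mechanism described here:

  <indexConfig>
    <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
      <double name="reclaimDeletesWeight">4.0</double>
    </mergePolicy>
  </indexConfig>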
Re: solrcloud - forward update to a shard failed
Here's a writeup on the interactions between a number of the parameters for soft/hard commits, NRT, and transaction logs. FWIW. http://searchhub.org/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/ Best, Erick On Thu, Nov 14, 2013 at 8:22 AM, Aileen ail...@kriel.org wrote: Thanks Michael. Followed your advice - no commits from indexing clients; let auto commit take care of things. It worked; so far no errors. The config params need some more tweaking to get the right balance, specifically maxTime, maxDocs and the soft commit interval, but otherwise Solr is a lot healthier... Thanks for your help. I did something like that also, and I was getting some nasty problems when one of my clients would try to commit before a commit issued by another one had finished. Might be the same problem for you too. Try not doing explicit commits from the indexing client and instead set the autocommit to 1000 docs or whichever value fits you best. - Thanks, Michael -- View this message in context: http://lucene.472066.n3.nabble.com/solrcloud-forward-update-to-a-shard-failed-tp4100608p4100670.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solr xml img parsing exception
It looks like bad data. The XML you're sending to Solr looks malformed, so I suspect this is completely outside of Solr's purview. Best, Erick On Thu, Nov 14, 2013 at 9:26 AM, Marcello Lorenzi mlore...@sorint.it wrote: Hi, I have installed a Solr 4.3 instance and we have configured manifoldcf to pass web content to the shard collection, but during the crawling we have noticed a lot of this exception: [stack trace snipped - see the original message above]
Re: Solr xml img parsing exception
Also, there's a custom loader here that is the culprit: com.lsegroup.solr.handler.CwsExtractingDocumentLoader On Nov 14, 2013, at 10:20, Erick Erickson erickerick...@gmail.com wrote: It looks like bad data. The XML you're sending to Solr looks malformed, so I suspect this is completely outside of Solr's purview. Best, Erick On Thu, Nov 14, 2013 at 9:26 AM, Marcello Lorenzi mlore...@sorint.it wrote: [original message and stack trace snipped]
Re: Optimizing cores in SolrCloud
Thanks Erick! That's a really interesting idea, I'll try it! Another question would be: when does the merging actually happen? Is it triggered or conditioned by something? Currently I have a core with ~13M maxDocs and ~3M deleted docs, and although I see a lot of merges in SPM, deleted documents aren't really going anywhere. For merging I have the example settings; I haven't changed them. - Thanks, Michael -- View this message in context: http://lucene.472066.n3.nabble.com/Optimizing-cores-in-SolrCloud-tp4100871p4100936.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Query on multi valued field
Hi, I want to search in a multivalued field. For example, my field FormIds contains (1,2,3) as comma-separated values. If I search for 1 or (1,2) or (1,3) or (2,3) or (1,2,3), any combination like this should work. How do I define this multivalued integer field type? Thank you. -- View this message in context: http://lucene.472066.n3.nabble.com/Query-on-multi-valued-field-tp3209343p4100937.html Sent from the Solr - User mailing list archive at Nabble.com.
Document routing question.
Hi, I read this post http://searchhub.org/2013/06/13/solr-cloud-document-routing and I have some questions. When a tenant is too large to fit on one shard, we can specify the number of bits from the shardKey that we want to use. If we set a doc's key as tenant1/4!docXXX, we are saying to spread the docs over 1/4th of the collection. If the collection has 4 shards, does this mean that all docs with the same shardKey will go to the same shard, or will they be spread 25% to each shard? The other question is: at query time, should we configure the shard.keys param as shard.keys=tenant1! or as shard.keys=tenant1/4!? /Yago - Best regards -- View this message in context: http://lucene.472066.n3.nabble.com/Document-routing-question-tp4100938.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Query on multi valued field
On Thu, Nov 14, 2013, at 03:45 PM, giridhar wrote: Hi, I want to search in a multivalued field. For example, my field FormIds contains (1,2,3) as comma-separated values. If I search for 1 or (1,2) or (1,3) or (2,3) or (1,2,3), any combination like this should work. How do I define this multivalued integer field type? Surely this is just how multivalued fields work. If you have an integer field type defined with multiValued="true", then you can have three values in that field: 1, 2 and 3. Then a query for FormIds:(1 AND 2) will return all documents that have both 1 and 2 in that field. Am I missing something? Upayavira
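For reference, a sketch of the schema.xml declaration being described - the type name int is the one from the example schema; adjust to your own types:

  <field name="FormIds" type="int" indexed="true" stored="true" multiValued="true"/>

With that in place each value is indexed separately, and queries like FormIds:(1 AND 2) or FormIds:(1 OR 3) behave as described.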
Re: Solr xml img parsing exception
Hi Erik, but in this case does the custom loader receive an HTTP Error 500 from Solr? Thanks, Marcello On 11/14/2013 04:29 PM, Erik Hatcher wrote: Also, there's a custom loader here that is the culprit: com.lsegroup.solr.handler.CwsExtractingDocumentLoader [earlier messages and stack trace snipped]
Re: Solr xml img parsing exception
The actual error appears to be: Caused by: org.xml.sax.SAXParseException; lineNumber: 91; columnNumber: 105; The element type "img" must be terminated by the matching end-tag "</img>". So, check the input document at line 91, column 105. There should be an <img> tag there, but SAX is complaining that there is no matching </img>. -- Jack Krupansky -Original Message- From: Marcello Lorenzi Sent: Thursday, November 14, 2013 9:26 AM To: solr-user@lucene.apache.org Subject: Solr xml img parsing exception Hi, I have installed a Solr 4.3 instance and we have configured manifoldcf to pass web content to the shard collection, but during the crawling we have noticed a lot of this exception: [stack trace snipped - see the original message above]
Re: Optimizing cores in SolrCloud
Earlier, you said that optimize is the only way that deleted documents are expunged. That is false. They are expunged when the segment they are in is merged. A forced merge (optimize) merges all segments, so it will expunge all deleted documents. But those documents will be expunged by merges eventually. When you have deleted docs in the largest segment, you have to wait for a merge of that segment. My best advice is to stop looking at the deleted-documents count and worry about something that makes a difference to your users. For about 10 years, I worked on Ultraseek Server, a search engine with the same design for merging and document deletion. With over 10K installations, we never had a customer who had a problem caused by deleted documents. wunder On Nov 14, 2013, at 7:41 AM, michael.boom my_sky...@yahoo.com wrote: Thanks Erick! That's a really interesting idea, I'll try it! Another question would be: when does the merging actually happen? Is it triggered or conditioned by something? Currently I have a core with ~13M maxDocs and ~3M deleted docs, and although I see a lot of merges in SPM, deleted documents aren't really going anywhere. For merging I have the example settings; I haven't changed them. - Thanks, Michael -- View this message in context: http://lucene.472066.n3.nabble.com/Optimizing-cores-in-SolrCloud-tp4100871p4100936.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Query on multi valued field
I suppose you could define the field as tokenized text with the work delimiter filter and with autoGeneratePhraseQueries="false" and the default query operator set to OR, and then queries might just work close enough to what you want.

Otherwise... You could do a custom update processor that parses the string and expands it into separate integer values for a multivalued field, and then you would need to do either a custom query parser or a query preprocessor that expands that special syntax into normal Solr query syntax using AND or OR as desired. You could implement the update processor as a JavaScript script. The simplest approach to the query side would be to expand the special query syntax in your application layer.

-- Jack Krupansky

-----Original Message----- From: giridhar Sent: Thursday, November 14, 2013 10:45 AM To: solr-user@lucene.apache.org Subject: Re: Query on multi valued field

Hi, I want to search in a multivalued field. For example, my field FormIds contains (1,2,3) as comma-separated values. If I search for 1 or (1,2) or (1,3) or (2,3) or (1,2,3), any combination like this should work. How do I define this multivalued integer field type?

Thank you.

-- View this message in context: http://lucene.472066.n3.nabble.com/Query-on-multi-valued-field-tp3209343p4100937.html Sent from the Solr - User mailing list archive at Nabble.com.
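A schema.xml sketch of the first suggestion (type and attribute values are illustrative; note the correction in the next message: it is the word delimiter filter):

<fieldType name="delimitedInts" class="solr.TextField" autoGeneratePhraseQueries="false">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- splits "1,2,3" on the commas into tokens 1, 2, 3 -->
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateAll="0"/>
  </analyzer>
</fieldType>
<field name="FormIds" type="delimitedInts" indexed="true" stored="true"/>

With the default operator set to OR (q.op=OR), a query like FormIds:(1,3) analyzes "1,3" into the tokens 1 and 3 and, because phrase queries are not auto-generated, matches documents containing either value.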
Re: Query on multi valued field
s/work/word/ ... word delimiter filter

-- Jack Krupansky

-----Original Message----- From: Jack Krupansky Sent: Thursday, November 14, 2013 11:34 AM To: solr-user@lucene.apache.org Subject: Re: Query on multi valued field

I suppose you could define the field as tokenized text with the work delimiter filter and with autoGeneratePhraseQueries="false" and the default query operator set to OR, and then queries might just work close enough to what you want.

<snip - remainder quoted in full in the message above>
facet method=enum and uninvertedfield limitations
I am running into performance problems with faceted queries. If I do a

q=word&facet.field=CONTENT&facet=true&facet.limit=10&facet.mincount=1&facet.method=fc&facet.prefix=a&rows=0

I am getting an exception:

org.apache.solr.common.SolrException: Too many values for UnInvertedField faceting on field CONTENT
at org.apache.solr.request.UnInvertedField.uninvert(UnInvertedField.java:384)
at org.apache.solr.request.UnInvertedField.<init>(UnInvertedField.java:178)
at org.apache.solr.request.UnInvertedField.getUnInvertedField(UnInvertedField.java:839)
...

I understand it's got something to do with a 24-bit limit somewhere in the code, but I don't understand enough of it to be able to construct a specialized index that can be queried with facet.method=enum. A stripped-down index still doesn't work. It has exactly one field CONTENT with 178,000 terms and ~1 million documents. The top-ranking terms according to Luke are:

rank  docFreq  field    term
1     413950   CONTENT  word1
2     321223   CONTENT  word2
3     299036   CONTENT  word3
4     276757   CONTENT  word4
...

How would we have to strip the index?

Thanks, Michael
Re: queries including time zone
: Can anybody provide any insight about using the tz param? The behavior
: of this isn't affecting date math and /day rounding. What format does
: the tz variables need to be in? Not finding any documentation on this.

It's not "tz", it's "TZ". The input/output format is always in UTC, but TZ will affect all of the date math...

https://wiki.apache.org/solr/CoreQueryParameters#TZ

-Hoss
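For illustration, a request showing TZ in action (the field name and zone are illustrative): with the parameter set, NOW/DAY rounds to midnight in America/New_York rather than midnight UTC, which shifts which documents fall inside the day range.

http://localhost:8983/solr/select?q=*:*&fq=timestamp:[NOW/DAY TO NOW/DAY%2B1DAY]&TZ=America/New_York

(The + of date math has to be URL-encoded as %2B when the request is sent as a raw URL.)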
Re: Using data-config.xml from DIH in SolrJ
Hi,

I just discovered UpdateProcessorFactory (http://lucene.apache.org/solr/4_5_1/solr-core/org/apache/solr/update/processor/package-summary.html) in a big way. How did this completely slip by me? Working on two ideas.

1. I have used the DIH in a local EmbeddedSolrServer previously. I could write a ForwardingUpdateProcessorFactory to take that local update and send it to a HttpSolrServer.

2. I have code which walks the file-system to compose rough documents but haven't yet written the part that handles the templated fields and cross-walking of the source(s) to the schema. I could configure the update handler on the Solr server side to do this with the RegexReplace (http://lucene.apache.org/solr/4_5_1/solr-core/org/apache/solr/update/processor/RegexReplaceProcessorFactory.html) and DefaultValue (http://lucene.apache.org/solr/4_5_1/solr-core/org/apache/solr/update/processor/DefaultValueUpdateProcessorFactory.html) UpdateProcessorFactor(ies).

Any thoughts on the advantages/disadvantages of these approaches?

Thanks, Tricia

On Thu, Nov 14, 2013 at 7:49 AM, Erick Erickson erickerick...@gmail.com wrote:

There's nothing that I know of that takes a DIH configuration and uses it through SolrJ. You can use Tika directly in SolrJ if you need to parse structured documents though, see: http://searchhub.org/2012/02/14/indexing-with-solrj/ Yep, you're going to be kind of reinventing the wheel a bit I'm afraid. Best, Erick

On Wed, Nov 13, 2013 at 1:55 PM, P Williams williams.tricia.l...@gmail.com wrote:

Hi All, I'm building a utility (Java jar) to create SolrInputDocuments and send them to a HttpSolrServer using the SolrJ API. The intention is to find an efficient way to create documents from a large directory of files (where multiple files make one Solr document) and be sent to a remote Solr instance for update and commit. I've already solved the problem using the DataImportHandler (DIH), so I have a data-config.xml that describes the templated fields and cross-walking of the source(s) to the schema. The original data won't always be able to be co-located with the Solr server, which is why I'm looking for another option. I've also already solved the problem using ant and xslt to create a temporary (and unfortunately a potentially large) document which the UpdateHandler will accept. I couldn't think of a solution that took advantage of the XSLT support in the UpdateHandler because each document is created from multiple files. Our current dated Java-based solution significantly outperforms this solution in terms of disk and time; I've rejected it based on that and gone back to the drawing board. Does anyone have any suggestions on how I might be able to reuse my DIH configuration in the SolrJ context without re-inventing the wheel (or DIH in this case)? If I'm doing something ridiculous I hope you'll point that out too. Thanks, Tricia
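Since the thread is about pushing locally composed documents to a remote Solr, here is a minimal SolrJ sketch of the forwarding idea (class, URL and field names are illustrative, not from Tricia's code):

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class ForwardingIndexer {
  public static void main(String[] args) throws Exception {
    // remote Solr instance that receives the composed documents
    HttpSolrServer solr = new HttpSolrServer("http://remotehost:8983/solr/collection1");

    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "doc-1");
    doc.addField("title", "Composed from several source files");
    solr.add(doc);   // buffered server-side until a commit
    solr.commit();
    solr.shutdown();
  }
}

Any per-field templating or regex cross-walking can then live server-side in an updateRequestProcessorChain, as Tricia suggests, keeping the client a dumb forwarder.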
Re: facet method=enum and uninvertedfield limitations
On Thu, Nov 14, 2013 at 12:03 PM, Lemke, Michael SZ/HZA-ZSW lemke...@schaeffler.com wrote:

I am running into performance problems with faceted queries. If I do a

q=word&facet.field=CONTENT&facet=true&facet.limit=10&facet.mincount=1&facet.method=fc&facet.prefix=a&rows=0

I am getting an exception:

org.apache.solr.common.SolrException: Too many values for UnInvertedField faceting on field CONTENT
at org.apache.solr.request.UnInvertedField.uninvert(UnInvertedField.java:384)
at org.apache.solr.request.UnInvertedField.<init>(UnInvertedField.java:178)
at org.apache.solr.request.UnInvertedField.getUnInvertedField(UnInvertedField.java:839)
...

I understand it's got something to do with a 24-bit limit somewhere in the code but I don't understand enough of it to be able to construct a specialized index that can be queried with facet.method=enum.

You shouldn't need to do anything differently to try facet.method=enum (just replace facet.method=fc with facet.method=enum).

You may also want to add the parameter facet.enum.cache.minDf=100000 to lower memory usage by only using the filter cache for terms that match more than 100K docs.

-Yonik http://heliosearch.com -- making solr shine
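Put together, the enum variant of the original request would look like this (same parameters, only the method swapped and the optional cache threshold added):

q=word&facet=true&facet.field=CONTENT&facet.limit=10&facet.mincount=1&facet.prefix=a&rows=0&facet.method=enum&facet.enum.cache.minDf=100000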
SOLR DIH not indexing NFS share
I have SOLR with DIH using TIKA running fine on a local directory. It imports the data fine. I need it to work on an NFS mounted directory however, and it fails when I change it to use that. The tomcat6 user has access to the NFS mount (ls returns all files any way). The mount is NFS v3, if that matters. I've changed the tomcat's uid to match the tomcat user on the NFS server. Can anyone point me in the right direction for why this isn't fetching any files?

I get this while indexing: Requests: 0, Fetched: 1, Skipped: 0, Processed: 0

Here are the SOLR logs:

824741 [commitScheduler-6-thread-1] INFO org.apache.solr.update.UpdateHandler – No uncommitted changes. Skipping IW.commit.
824745 [commitScheduler-6-thread-1] INFO org.apache.solr.update.UpdateHandler – end_commit_flush
853486 [http-8080-1] INFO org.apache.solr.update.processor.LogUpdateProcessor – [collection1] webapp=/solr path=/dataimport params={optimize=false&indent=true&clean=true&commit=true&verbose=true&entity=f&command=full-import&debug=true&wt=json} {deleteByQuery=*:* (-1451628963612852224)} 0 44428
853488 [http-8080-1] ERROR org.apache.solr.handler.dataimport.DataImporter – Full Import failed: java.lang.RuntimeException: java.lang.RuntimeException: org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.ClassCastException: java.lang.NoClassDefFoundError cannot be cast to java.lang.Exception
at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:270)
at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:411)
at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:476)
at org.apache.solr.handler.dataimport.DataImportHandler.handleRequestBody(DataImportHandler.java:179)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1904)
at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:659)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:362)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:158)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:293)
at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:859)
at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:602)
at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:489)
at java.lang.Thread.run(Thread.java:679)
Caused by: java.lang.RuntimeException: org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.ClassCastException: java.lang.NoClassDefFoundError cannot be cast to java.lang.Exception
at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:410)
at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:323)
at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:231)
... 20 more
Caused by: org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.ClassCastException: java.lang.NoClassDefFoundError cannot be cast to java.lang.Exception
at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:539)
at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:408)
... 22 more
Caused by: java.lang.ClassCastException: java.lang.NoClassDefFoundError cannot be cast to java.lang.Exception
at org.apache.solr.handler.dataimport.DebugLogger.log(DebugLogger.java:140)
at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:537)
at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:495)
... 23 more
853489 [http-8080-1] INFO org.apache.solr.update.UpdateHandler – start rollback{}
853498 [http-8080-1] INFO org.apache.solr.update.DefaultSolrCoreState – Creating new IndexWriter...
853498 [http-8080-1] INFO org.apache.solr.update.DefaultSolrCoreState – Waiting until IndexWriter is unused... core=collection1
853498 [http-8080-1] INFO org.apache.solr.update.DefaultSolrCoreState – Rollback old IndexWriter... core=collection1
853509 [http-8080-1] INFO org.apache.solr.core.SolrCore – SolrDeletionPolicy.onInit: commits: num=1
Re: Boosting documents by categorical preferences
: I have a question around boosting. I wanted to use the boost= to write a
: nested query that will boost a document based on categorical preferences.

You have no idea how stoked I am to see you working on this in a real world application.

: Currently I have the weights set to the z-score equivalent of a user's
: preference for that category which is simply how many standard deviations
: above the global average is this user's preference for that movie category.
:
: My question though is basically whether or not semantically the equation
: query(category:Drama)*some weight + query(category:Comedy)*some weight
: + query(category:Action)*some weight makes sense?

My gut says that your approach makes sense -- but if I'm understanding you correctly, I think that you need to add 1 to all your weights: the boost is a multiplier, so if someone's rating for every category is 0 std devs above the average rating (ie: the most average person imaginable), you don't want to give every movie in every category a score of 0.

Are you picking the top 3 categories the user prefers as a cut off, or are you arbitrarily using N category boosts for however many N categories the user is above the global average in their pref for that category?

Are your preferences coming from explicit user feedback on the categories (ie: rate how much you like comedies on a scale of 1-5) or are you inferring it from user ratings of the movies themselves? (ie: rate this movie, which happens to be a scifi/action/comedy, on a scale of 1-5) ... because if it's the latter you probably want to be careful to also normalize based on how many categories the movie is in.

The other thing to consider is whether you want to include negative preferences (ie: weights less than 1) based on how many std devs the user's average is *below* the global average for a category .. in this case I *think* you'd want to divide the raw value from -1 to get a useful multiplier.

Alternatively: you could experiment with using the weights as exponents instead of multipliers...

b=sum(pow(query($cat1),1.482),pow(query($cat2),0.1199),pow(query($cat3),1.448))

...that would simplify the math you'd have to worry about both for the totally boring average user (x**0 = 1) and for the categories users hate (x**-5 = some positive fraction that will act as a penalty) ... but you'd definitely need to run some tests to see if it over-boosts as the std dev variations get really high (might want to take a root first before using them as the exponent).

-Hoss
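For readers wanting to try the multiplicative form, a sketch of one way the shifted version might be expressed with edismax (parameter names, field name and weights are illustrative; the sum(1, ...) baseline is one way to keep the multiplier at 1 for a fully average user rather than 0):

q=space adventure
&defType=edismax
&boost=sum(1,product(1.482,query($drama)),product(0.1199,query($comedy)),product(1.448,query($action)))
&drama=category:Drama
&comedy=category:Comedy
&action=category:Action

Each query($...) term contributes only for documents matching that category, so the whole sum acts as the per-document multiplier Hoss describes.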
Re: My setup - init script and other info
On 11/14/2013 7:43 AM, Erick Erickson wrote: Shawn: Would you be willing to put this on the Wiki? I think it'd be really useful to have it there... I'm pretty sure you have edit rights to the wiki, but they're free for the asking if not... Done. To make it more obvious that it's not an officially sanctioned script at this time, I've put it on my personal wiki page. https://wiki.apache.org/solr/ShawnHeisey#Init_script Thanks, Shawn
Re: queries including time zone
I've beefed up the ref guide page on dates to include more info about all of this... https://cwiki.apache.org/confluence/display/solr/Working+with+Dates -Hoss
RE: My setup - init script and other info
It's worth pointing out that there are init scripts for jetty which can be pulled from its regular distribution site and added to a Solr installation with only minor modifications. I do this with my rpm build process (I just pushed the updates for the 4.5.1 release):

https://github.com/boogieshafer/jetty-solr-rpm

You then put the JVM settings and Solr-specific variables in /etc/default/jetty (the regular jetty init script looks for this file). The init script, modular JMX and request-log configurations are all things I borrow from mainline jetty; they are stripped out by the existing Solr packaging of the embedded jetty and IMO are worth adding back in for a production deployment.

From: Palmer, Eric epal...@richmond.edu
Sent: Wednesday, November 13, 2013 10:09
To: solr-user@lucene.apache.org
Cc: solr-user@lucene.apache.org
Subject: Re: My setup - init script and other info

Thank you. This will help me a lot.

Sent from my iPhone

On Nov 13, 2013, at 10:08 AM, Shawn Heisey s...@elyograg.org wrote:

In the hopes that it will help someone get Solr running in a very clean way, here's an informational email.

For my Solr install on CentOS 6, I use /opt/solr4 as my installation path, and /index/solr4 as my solr home. The /index directory is a dedicated filesystem; /opt is part of the root filesystem. From the example directory, I copied cloud-scripts, contexts, etc, lib, webapps, and start.jar over to /opt/solr4. My stuff was created before 4.3.0, so the resources directory didn't exist. I was already using log4j with a custom Solr build, and I put my log4j.properties file in etc instead. I created a logs directory and a run directory in /opt/solr4.

My data structure in /index/solr4 is complex. All a new user really needs to know is that solr.xml goes here and dictates the rest of the structure. There is a symlink at /index/solr4/lib, pointing to /opt/solr4/solrlib - so that jars placed in ${solr.solr.home}/lib are actually located in the program directory, not the data directory. That makes for a much cleaner version control scenario - both directories are git repositories cloned from our internal git server. Unlike the example configs, my solrconfig.xml files do not have lib directives for loading jars. That gets automatically handled by the jars living in that symlinked lib directory. See SOLR-4852 for caveats regarding central lib directories: https://issues.apache.org/jira/browse/SOLR-4852

If you want to run SolrCloud, you would need to install zookeeper separately and put your zkHost parameter in solr.xml. Due to a bug, putting zkHost in solr.xml doesn't work properly until 4.4.0.

Here's the current state of my init script. It's redhat-specific. I used /bin/bash (instead of /bin/sh) in the shebang because I am pretty sure that there are bash-isms in it, and bash is always available on the systems that I use: http://apaste.info/9fVA

Notable features:
* Runs Solr as an unprivileged user.
* Has three methods for stopping Solr, tries graceful methods first.
  1) The jetty STOPPORT/STOPKEY mechanism.
  2) PID saved by the 'start' action.
  3) Any program using the Solr listening port.
* Before killing by PID, tries to make sure that the process actually is Solr.
* Sets up remote JMX, by default without authentication or SSL.
* Highly tuned CMS garbage collection.
* Sets up GC logging.
* Virtually everything is overridable via /etc/sysconfig/solr4.
* Points at an overridable log4j config file, by default in /opt/solr4/etc.
* Removes the existing PID file if the server is just booting up -- which it knows by noting that server uptime is less than three minutes.

It shouldn't be too hard to convert this so it works on debian-derived systems. That would involve rewriting portions that use redhat init routines, and probably start-stop-daemon. What I'd really like is one script that will work on any system, but that will require a fair amount of work.

It's a work in progress. It should load log4j.properties from resources instead of etc. I'd like to include it in the Solr download, but without a fair amount of documentation and possibly an installation script, which still must be written, that won't be possible.

Feel free to ask questions about anything that doesn't seem clear. I welcome ideas for improvement on both my own setup and the solr example.

Thanks, Shawn
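As a concrete companion to the /etc/default/jetty remark earlier in the thread, a minimal sketch of such a file; the variable names are the ones the stock jetty.sh init script reads, but the paths and JVM flags are purely illustrative:

# user the init script runs jetty/Solr as
JETTY_USER=solr
JETTY_HOME=/opt/solr4
JETTY_LOGS=/opt/solr4/logs
# JVM settings picked up by the init script
JAVA_OPTIONS="-Xms2g -Xmx2g -XX:+UseConcMarkSweepGC -Dsolr.solr.home=/index/solr4"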
Group and Field Collapsing in SOLR More like this
Hi, I have two types of profile, Shadow and DO, and I am trying to use MLT to bring back related recommendations for a userID. In the results I get both types, but I want to restrict the results to documents matching a field (type) that I pass in. Currently grouping and field collapsing do not seem to work. Is there any other way to achieve this? Thanks, Balaji -- View this message in context: http://lucene.472066.n3.nabble.com/Group-and-Field-Collapsing-in-SOLR-More-like-this-tp4101032.html Sent from the Solr - User mailing list archive at Nabble.com.
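One avenue worth trying, assuming the /mlt request handler is in play (a sketch; the handler path is the default and mlt.fl's field name is hypothetical): the MoreLikeThis handler accepts ordinary filter queries, so the type restriction can be expressed as an fq rather than via grouping:

http://localhost:8983/solr/mlt?q=id:user123&mlt.fl=profile_text&fq=type:DO

The fq constrains which similar documents come back without influencing the MLT term selection itself.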
Re: Date range faceting with various gap sizes?
: I'm experimenting with date range faceting, and would like to use
: different gaps depending on how old the date is. But I am not sure on
: how to do that.

What you are trying to do is possible, but the SolrJ helper methods you are using predate the ability and don't currently work the way they should...

: solrQuery.addDateRangeFacet(scheduledate_start_tdate, date1, date2, +1YEAR);
: solrQuery.addDateRangeFacet(scheduledate_start_tdate, date3, date4, +1MONTH);

The addDateRangeFacet method you are calling is just syntactic sugar for the add(String,String) method called on the various params: facet.range, facet.range.start, etc.

You can see that in the resulting URL you got the params are duplicated -- the problem is that when expressed this way, Solr doesn't know when the different values of the start/end/gap params should be applied -- it just loops over each of the facet.range fields (in your case: the same field twice) and then looks for a corresponding start/end/gap value and finds the first one, since there are duplicates.

What you want to do can be accomplished (as of Solr 4.3 - see SOLR-1351) by using local params in the facet.range (or facet.date) params...

http://localhost:8983/solr/select?q=*:*&rows=0&facet=true&facet.range={!facet.range.start=NOW/MONTH%20facet.range.end=NOW/MONTH%2B1MONTH%20facet.range.gap=%2B1DAY}manufacturedate_dt&facet.range={!facet.range.start=NOW/MONTH%20facet.range.end=NOW/MONTH%2B1MONTH%20facet.range.gap=%2B5DAY}manufacturedate_dt

I've opened a new issue to track fixing these sugar methods -- patches to improve this would certainly be welcome, but note that regardless of the SolrJ behavior you'll need to upgrade to at least Solr 4.3 for the server-side piece to work, and you can work around the client-side behavior by calling add(String,String) directly.

https://issues.apache.org/jira/browse/SOLR-5443

-Hoss
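A sketch of that client-side workaround in SolrJ, using the field name from the question; the concrete start/end/gap values are illustrative:

import org.apache.solr.client.solrj.SolrQuery;

SolrQuery q = new SolrQuery("*:*");
q.setRows(0);
q.setFacet(true);
// each facet.range value carries its own start/end/gap as local params
q.add("facet.range",
    "{!facet.range.start=NOW/YEAR-10YEARS facet.range.end=NOW/YEAR-1YEAR facet.range.gap=+1YEAR}scheduledate_start_tdate");
q.add("facet.range",
    "{!facet.range.start=NOW/YEAR-1YEAR facet.range.end=NOW facet.range.gap=+1MONTH}scheduledate_start_tdate");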
Document Security Model Question
I had earlier posted a similar discussion in LinkedIn and David Smiley rightly advised me that solr-user is a better place for technical discussions.

Our hosted product supports searching on educational resources. Our customers can choose to make specific resources unavailable for their users, and availability also depends on licensing. Our current solution uses the full-text search support in the database and handles availability as part of the SQL. My task is to move the search from the database full-text search into Solr. I searched through posts and found some that were kind of related, and I am thinking along the following lines:

a) Use the authorization model. I can add fields like allow and/or deny in the index which contain the list of customers. At query time, I can add the constraint based on the customer id. I am concerned about the performance if there are a lot of values for these fields, and it also requires constant reindexing if a value in these fields changes.

b) Use query-time join. Have the resource-to-availability mapping for each customer in separate inner documents. We are planning to deploy on SolrCloud, and I have read about some challenges with query-time join and SolrCloud, so this may not work for us.

c) Other ideas?

Excerpts from David Smiley's response: "You're right that there may be some re-indexing as security rules change. If many Lucene/Solr documents share identical access control with other documents, then it may make more sense to externally determine which unique set of access-control sets the user has access to, then finally search by id -- which will hopefully not be a huge number. I've seen this done both externally and with a Solr core to join on."

-- View this message in context: http://lucene.472066.n3.nabble.com/Document-Security-Model-Question-tp4101078.html Sent from the Solr - User mailing list archive at Nabble.com.
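For a sense of what option (a) looks like at query time, a sketch (the allow/deny field names and customer id are illustrative): index the customer lists into two multivalued string fields, then append filters such as

fq=allow:(cust42 OR public)&fq=-deny:cust42

The first clause admits documents licensed to that customer (or to everyone); the second, purely negative clause removes explicit denials. Both filters are cacheable per customer, which keeps the per-query cost low.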
Re: SOLR DIH not indexing NFS share
At a quick glance at the very first error:

java.lang.ClassCastException: java.lang.NoClassDefFoundError cannot be cast to java.lang.Exception

Looks like you have some weird jars in your classpath and/or are using a strange version of Java. But that's just a guess.

Erick

On Thu, Nov 14, 2013 at 1:57 PM, tegryan t...@jostle.me wrote:

I have SOLR with DIH using TIKA running fine on a local directory. It imports the data fine. I need it to work on an NFS mounted directory however, and it fails when I change it to use that. The tomcat6 user has access to the NFS mount (ls returns all files any way). The mount is NFS v3, if that matters. I've changed the tomcat's uid to match the tomcat user on the NFS server. Can anyone point me in the right direction for why this isn't fetching any files? I get this while indexing: Requests: 0, Fetched: 1, Skipped: 0, Processed: 0

<snip - full logs and stack trace quoted in the original message above>
Re: SOLR DIH not indexing NFS share
Hi Erick, I appreciate the answer. I just found out that it's failing on a .mov file with that error. I also noticed that I load the log4j jars twice, so I'm wondering if the wrong class loader is loading the logging and that's why it's giving me an unhelpful message. I've excluded .mov files for now since they can't be indexed anyway, and will look at why the logging is not working. Thanks again. -- View this message in context: http://lucene.472066.n3.nabble.com/SOLR-DIH-not-indexing-NFS-share-tp4100998p4101096.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: queries including time zone
We're still not seeing the proper result. I've included a gist of the query and its debug result. This was run on a clean index running 4.4.0 with just one document. That document has a date of 11/15/2013; in the included TZ it is still the 14th, yet that document is returned. Hoping someone can help.

https://gist.github.com/anonymous/7478773

On Nov 14, 2013, at 3:06 PM, Chris Hostetter hossman_luc...@fucit.org wrote:

I've beefed up the ref guide page on dates to include more info about all of this... https://cwiki.apache.org/confluence/display/solr/Working+with+Dates -Hoss
Re: Document Security Model Question
Hi,

For the case "it requires constant reindexing if a value in this field changes": if the ACLs on documents keep changing, a Solr PostFilter is one of the options. We use it in our system; we have close to a billion documents and approximately 5000 users. But it is important to check whether the ACL changes are frequent and to decide on a solution based on that.

The first option in your list works efficiently without affecting search performance. If the value changes are infrequent, re-indexing only the affected documents should not be a concern. If changes are frequent, a PostFilter can be used, at the cost of some added latency.

Thanks

On Fri, Nov 15, 2013 at 4:32 AM, kchellappa kannan.chella...@gmail.com wrote:

I had earlier posted a similar discussion in LinkedIn and David Smiley rightly advised me that solr-user is a better place for technical discussions.

<snip - remainder quoted in full in the original message above>
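To make the PostFilter suggestion concrete, a minimal Solr 4.x sketch; AclService and the userId plumbing are hypothetical stand-ins for whatever external ACL lookup exists, and a real implementation would also need a QParserPlugin to create the query plus proper equals()/hashCode():

import java.io.IOException;
import org.apache.lucene.search.IndexSearcher;
import org.apache.solr.search.DelegatingCollector;
import org.apache.solr.search.ExtendedQueryBase;
import org.apache.solr.search.PostFilter;

public class AclPostFilter extends ExtendedQueryBase implements PostFilter {
  private final String userId;
  private final AclService acls;  // hypothetical external ACL lookup

  public AclPostFilter(String userId, AclService acls) {
    this.userId = userId;
    this.acls = acls;
  }

  @Override public boolean getCache() { return false; }  // post filters must not be cached
  @Override public int getCost() { return Math.max(super.getCost(), 100); }  // cost >= 100 marks a post filter

  @Override
  public DelegatingCollector getFilterCollector(IndexSearcher searcher) {
    return new DelegatingCollector() {
      @Override
      public void collect(int doc) throws IOException {
        // docBase + doc is the index-wide docid; only pass allowed docs downstream
        if (acls.isAllowed(userId, docBase + doc)) {
          super.collect(doc);
        }
      }
    };
  }
}

Because a post filter runs after all cheaper queries and filters, only documents that already matched everything else pay the ACL-lookup cost.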
Solr spatial search within the polygon
Hi,

I'm experimenting with Solr spatial search: plotting points (latitude and longitude) on the map and retrieving results based on those values. As the first step I've defined the field type as

<fieldType name="location_rpt" class="solr.SpatialRecursivePrefixTreeFieldType"
    spatialContextFactory="com.spatial4j.core.context.jts.JtsSpatialContextFactory"
    distErrPct="0.025" maxDistErr="0.09" units="degrees" />

and then added the field location of type location_rpt:

<field name="location" type="location_rpt" indexed="true" stored="true" multiValued="true" />

I indexed the location field as "$latitude,$longitude", so the data in location looks like

location: [9.445890,76.540970]

Next I drew a polygon in Google Maps and collected the lat and lng coordinates of its vertices:

9.472992 76.540817, 9.441328 76.523651, 9.433708 76.555065, 9.458092 76.572403, 9.472992 76.540817

Based on these coordinates I queried Solr like this:

localhost:8983/solr/ha_poc/select?fl=id,name,district,locality&wt=json&json.nl=map&q=*:*&fq=location:"IsWithin(POLYGON((9.472992 76.540817, 9.441328 76.523651, 9.433708 76.555065, 9.458092 76.572403, 9.472992 76.540817))) distErrPct=0"

But I didn't get the result from Solr that I expected:

{
  "responseHeader": { "status": 0, "QTime": 2 },
  "response": { "numFound": 0, "start": 0, "docs": [] }
}

Is there anything that I missed? Can anybody help me solve this issue with Solr spatial search? I'm using Solr 4.4.0 and added the JTS jar (lib/ext/jts-1.13.jar) as an additional dependency for polygon support.

--
dhanesh s.r
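One detail worth double-checking here (an observation, not a reply from the thread): WKT coordinates are written x y, that is, longitude before latitude, while the comma-separated point format used at index time is latitude,longitude. If the polygon vertices above are lat lng pairs, the query would need each pair flipped, roughly:

fq=location:"IsWithin(POLYGON((76.540817 9.472992, 76.523651 9.441328, 76.555065 9.433708, 76.572403 9.458092, 76.540817 9.472992))) distErrPct=0"

A polygon whose "longitudes" are all near 9 degrees would sit far from points indexed around 9,76 and match nothing, which would explain the numFound: 0.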
Re: exceeded limit of maxWarmingSearchers ERROR
Hi Erickson,

Thanks for your reply. Basically, I used the commitWithin tag as below in the solrconfig.xml file:

<requestHandler name="/update" class="solr.XmlUpdateRequestHandler">
  <lst name="defaults">
    <str name="update.processor">dedupe</str>
  </lst>
  <add commitWithin="1"/>
</requestHandler>

<updateRequestProcessorChain name="dedupe">
  <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">id</str>
    <bool name="overwriteDupes">false</bool>
    <str name="fields">name,features,cat</str>
    <str name="signatureClass">org.apache.solr.update.processor.Lookup3Signature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

But this fix did not solve my problem; I got the same error again. PFA the schema.xml, solrconfig.xml, solr-spring.xml and messaging-spring.xml files. Can you suggest where I am going wrong?

Regards, Lokanadham Ganta

----- Original Message ----- From: Erick Erickson [via Lucene] ml-node+s472066n4100924...@n3.nabble.com To: Loka lokanadham.ga...@zensar.in Sent: Thursday, November 14, 2013 8:38:17 PM Subject: Re: exceeded limit of maxWarmingSearchers ERROR

CommitWithin is either configured in solrconfig.xml for the autoCommit or autoSoftCommit tags as the maxTime tag. I recommend you do use this. The other way you can do it is if you're using SolrJ: one of the forms of the server.add() method takes a number of milliseconds to force a commit.

You really, really do NOT want to use ridiculously short times for this, like a few milliseconds. That will cause new searchers to be warmed, and when too many of them are warming at once you get this error. Seriously, make your commitWithin or autocommit parameters as long as you can, for many reasons. Here's a bunch of background:

http://searchhub.org/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/

Best, Erick

On Thu, Nov 14, 2013 at 5:13 AM, Loka [hidden email] wrote: Hi Naveen, Iam also getting the similar problem where I do not know how to use the commitWithin Tag, can you help me how to use commitWithin Tag. can you give me the example

solr-spring.xml (2K) http://lucene.472066.n3.nabble.com/attachment/4101152/0/solr-spring.xml
messaging-spring.xml (2K) http://lucene.472066.n3.nabble.com/attachment/4101152/1/messaging-spring.xml
schema.xml (6K) http://lucene.472066.n3.nabble.com/attachment/4101152/2/schema.xml
solrconfig.xml (61K) http://lucene.472066.n3.nabble.com/attachment/4101152/3/solrconfig.xml

-- View this message in context: http://lucene.472066.n3.nabble.com/exceeded-limit-of-maxWarmingSearchers-ERROR-tp3252844p4101152.html Sent from the Solr - User mailing list archive at Nabble.com.
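For what it's worth, a sketch of the safer configuration Erick's quoted advice points at; the 30000/60000 ms values are illustrative, the point is simply seconds rather than the 1 ms in the config above:

<!-- solrconfig.xml: open a new searcher at most every 30s -->
<updateHandler class="solr.DirectUpdateHandler2">
  <autoSoftCommit>
    <maxTime>30000</maxTime>
  </autoSoftCommit>
  <autoCommit>
    <maxTime>60000</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>
</updateHandler>

The SolrJ equivalent is server.add(doc, 30000), which asks for a commit within 30 seconds instead of forcing one per document, so searchers stop piling up in warming.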
An UpdateHandler to run following a MySql DataImport
Hi All,

I have written a custom update request processor chain to do some custom processing of documents, and I have configured the /update handler to use it via the default update.chain. The same chain should be applied when the data import handler loads documents into the Solr index. Is there a way to configure the dataimport handler to use my custom chain via update.chain? If not, how can I perform the required custom processing of the documents while importing data from a MySQL database?

Thanks, Dileepa
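Assuming the chain is already registered in solrconfig.xml, one approach worth trying (a sketch; the chain name is illustrative, and it is worth verifying that your Solr version passes DIH updates through the processor chain, as recent 4.x releases do) is to point the DIH handler at it the same way /update is pointed at it:

<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">data-config.xml</str>
    <str name="update.chain">mychain</str>
  </lst>
</requestHandler>

The same parameter can alternatively be supplied per request, e.g. /dataimport?command=full-import&update.chain=mychain.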