Re: HTML Indexing error
On 18 April 2012 00:41, Chambeda chamb...@gmail.com wrote: Hi All, I am trying to parse some text that contains embedded HTML elements and am getting the following error: [...] According to the documentation the <br> should be removed correctly. Anything I am missing?

How are you indexing the XML documents? Using DIH? If so, please show us the DIH configuration file. Regards, Gora
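For context, index-time HTML stripping is normally configured with a char filter in schema.xml. A minimal sketch, where the type name and tokenizer are illustrative rather than the poster's actual schema:

<fieldType name="text_html" class="solr.TextField">
  <analyzer>
    <!-- removes tags such as <br> before tokenization -->
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
  </analyzer>
</fieldType>

If the <br> survives into the index, the first thing to check is that the field's type actually includes such a char filter.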
Re: searching and text highlighting
rpc29y wrote: Good afternoon: I would like to know whether Word documents or PDFs can be indexed with Solr.

Yes; you may first look at the Tika processor in Solr.

rpc29y wrote: If so, how do I modify solrconfig.xml to search these documents and highlight the found text?

I suggest you first follow the Solr tutorial to learn more about it: how the query parsers work and how to define your schema. Then you can use highlighting the right way. http://wiki.apache.org/solr/HighlightingParameters
Solr Core not able to access latest data indexed by multiple servers
Hi, I am using Solr's multicore approach in my app. We have two different servers (ServerA1 and ServerA2) for load balancing; both servers access the same index repository, and a request can go to either server per the load-balancing algorithm. The problem occurs as follows (note that both servers access the same physical index location):

- An ADD TO INDEX request for File1 goes to ServerA1 for core CR1; core CR1 is loaded in ServerA1 and indexing is done.
- An ADD TO INDEX request for File2 goes to ServerA2 for core CR1; core CR1 is loaded in ServerA2 and indexing is done.
- A SEARCH request for File2 goes to ServerA1. Core CR1 is already loaded there, so it accesses the index directly, but File2, added by ServerA2, is not found in the core loaded by ServerA1.

So this is the problem: File2, indexed by core CR1 loaded in ServerA2, is not visible in core CR1 loaded by ServerA1. I have searched and found that one solution is to reload the core; after a reload it sees the latest indexed data. But reloading the core for every request is a very heavy and time-consuming process. Please let me know if anyone has a solution for this. Waiting for your expert advice. Thanks, Paresh
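For reference, the reload the poster describes is exposed through the CoreAdmin API; the host and port below are placeholders:

http://ServerA1:8080/solr/admin/cores?action=RELOAD&core=CR1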
Re: need help to integrate SolrJ with my web application.
Hi Vijaya, Why not just make standard HTTP calls to Solr, as if it were a RESTful service? Just use an HTTP/REST client in Spring, ask Solr to return JSON responses, and get rid of all those war dependencies of SolrJ. --- Marcelo

On Monday, April 16, 2012, Ben McCarthy ben.mccar...@tradermedia.co.uk wrote: Hello, When I have seen this it usually means the Solr you are trying to connect to is not available. Do you have it installed at http://localhost:8080/solr ? Try opening that address in your browser. If you're running the example Solr using the embedded Jetty you won't be on 8080 :D Hope that helps

-Original Message- From: Vijaya Kumar Tadavarthy [mailto:vijaya.tadavar...@ness.com] Sent: 16 April 2012 12:15 To: 'solr-user@lucene.apache.org' Subject: need help to integrate SolrJ with my web application.

Hi All, I am trying to integrate Solr with my Spring application. I have performed the following steps:

1) Added the below list of jars to my webapp lib folder:

apache-solr-cell-3.5.0.jar
apache-solr-core-3.5.0.jar
apache-solr-solrj-3.5.0.jar
commons-codec-1.5.jar
commons-httpclient-3.1.jar
lucene-analyzers-3.5.0.jar
lucene-core-3.5.0.jar

2) I have added Tika jar files for processing binary files:

tika-core-0.10.jar
tika-parsers-0.10.jar
pdfbox-1.6.0.jar
poi-3.8-beta4.jar
poi-ooxml-3.8-beta4.jar
poi-ooxml-schemas-3.8-beta4.jar
poi-scratchpad-3.8-beta4.jar

3) I have modified web.xml, adding the setup below:

<filter>
  <filter-name>SolrRequestFilter</filter-name>
  <filter-class>org.apache.solr.servlet.SolrDispatchFilter</filter-class>
</filter>
<filter-mapping>
  <filter-name>SolrRequestFilter</filter-name>
  <url-pattern>/dataimport</url-pattern>
</filter-mapping>
<servlet>
  <servlet-name>SolrServer</servlet-name>
  <servlet-class>org.apache.solr.servlet.SolrServlet</servlet-class>
  <load-on-startup>1</load-on-startup>
</servlet>
<servlet>
  <servlet-name>SolrUpdate</servlet-name>
  <servlet-class>org.apache.solr.servlet.SolrUpdateServlet</servlet-class>
  <load-on-startup>2</load-on-startup>
</servlet>
<servlet>
  <servlet-name>Logging</servlet-name>
  <servlet-class>org.apache.solr.servlet.LogLevelSelection</servlet-class>
</servlet>
<servlet-mapping>
  <servlet-name>SolrUpdate</servlet-name>
  <url-pattern>/update/*</url-pattern>
</servlet-mapping>
<servlet-mapping>
  <servlet-name>Logging</servlet-name>
  <url-pattern>/admin/logging</url-pattern>
</servlet-mapping>

I am trying to test this setup by running a simple Java program which extracts the content of an MS Excel file, as below:

public SolrServer createNewSolrServer() {
  try {
    // setup the server...
    String url = "http://localhost:8080/solr";
    CommonsHttpSolrServer s = new CommonsHttpSolrServer(url);
    s.setConnectionTimeout(100); // 1/10th sec
    s.setDefaultMaxConnectionsPerHost(100);
    s.setMaxTotalConnections(100);
    // where the magic happens
    s.setParser(new BinaryResponseParser());
    s.setRequestWrit

--
Marcelo Carvalho Fernandes
+55 21 8272-7970
+55 21 2205-2786
DIH + JNDI
Hi All, I'm new to Solr and I don't have much experience in Java. I'm trying to set up two environments with configuration files that mirror each other, so that it's easy to copy files across after changes have been made. The problem is that they both access different SQL servers, so I want to separate the data source from the data-import.xml. I'm trying to do that with JNDI, following this doc: http://tomcat.apache.org/tomcat-6.0-doc/jndi-datasource-examples-howto.html

I put the datasource as a resource in my /etc/tomcat6/Catalina/localhost/solr.xml (Context):

<Resource name="jdbc/DATABASENAME" auth="Container" type="JdbcDataSource"
          driver="com.microsoft.sqlserver.jdbc.SQLServerDriver"
          url="jdbc:sqlserver://SQLSERVERNAME;databaseName=DATABASENAME;responseBuffering=adaptive;"
          user="USERNAME" password="PASSWORD" />

and the resource-ref in /var/lib/tomcat6/webapps/solr/WEB-INF/web.xml:

<resource-ref>
  <description>DB Connection</description>
  <res-ref-name>jdbc/DATABASENAME</res-ref-name>
  <res-type>JdbcDataSource</res-type>
  <res-auth>Container</res-auth>
</resource-ref>

Then I changed the data-config.xml to:

<dataSource jndiName="java:comp/env/jdbc/DATABASENAME" type="JdbcDataSource" user="" password=""/>

I restart the server, try to do a delta import, and get the following:

SEVERE: Delta Import Failed org.apache.solr.handler.dataimport.DataImportHandlerException: Unable to execute query: select 1 as report_id Processing Document # 1
    at org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:72)
    at org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.<init>(JdbcDataSource.java:253)
    at org.apache.solr.handler.dataimport.JdbcDataSource.getData(JdbcDataSource.java:210)
    at org.apache.solr.handler.dataimport.JdbcDataSource.getData(JdbcDataSource.java:39)
    at org.apache.solr.handler.dataimport.SqlEntityProcessor.initQuery(SqlEntityProcessor.java:59)
    at org.apache.solr.handler.dataimport.SqlEntityProcessor.nextModifiedRowKey(SqlEntityProcessor.java:84)
    at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextModifiedRowKey(EntityProcessorWrapper.java:262)
    at org.apache.solr.handler.dataimport.DocBuilder.collectDelta(DocBuilder.java:893)
    at org.apache.solr.handler.dataimport.DocBuilder.doDelta(DocBuilder.java:285)
    at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:179)
    at org.apache.solr.handler.dataimport.DataImporter.doDeltaImport(DataImporter.java:390)
    at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:429)
    at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:408)
Caused by: javax.naming.NamingException: Cannot create resource instance
    at org.apache.naming.factory.ResourceFactory.getObjectInstance(ResourceFactory.java:143)
    at javax.naming.spi.NamingManager.getObjectInstance(NamingManager.java:321)
    at org.apache.naming.NamingContext.lookup(NamingContext.java:793)
    at org.apache.naming.NamingContext.lookup(NamingContext.java:140)
    at org.apache.naming.NamingContext.lookup(NamingContext.java:781)
    at org.apache.naming.NamingContext.lookup(NamingContext.java:140)
    at org.apache.naming.NamingContext.lookup(NamingContext.java:781)
    at org.apache.naming.NamingContext.lookup(NamingContext.java:140)
    at org.apache.naming.NamingContext.lookup(NamingContext.java:781)
    at org.apache.naming.NamingContext.lookup(NamingContext.java:153)
    at org.apache.naming.SelectorContext.lookup(SelectorContext.java:152)
    at javax.naming.InitialContext.lookup(InitialContext.java:409)
    at org.apache.solr.handler.dataimport.JdbcDataSource$1.call(JdbcDataSource.java:140)
    at org.apache.solr.handler.dataimport.JdbcDataSource$1.call(JdbcDataSource.java:128)
    at org.apache.solr.handler.dataimport.JdbcDataSource.getConnection(JdbcDataSource.java:363)
    at org.apache.solr.handler.dataimport.JdbcDataSource.access$200(JdbcDataSource.java:39)
    at org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.<init>(JdbcDataSource.java:240)
    ... 11 more

I've tried a couple of different alterations; I've only really succeeded in changing the error I get. Anyone know how to fix this issue? I'm kind of lost here. Stephen
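One likely culprit, judging by "Cannot create resource instance" coming from Tomcat's naming layer rather than from Solr: in a Tomcat <Resource>, the type attribute must name the Java type of the resource (normally javax.sql.DataSource, not DIH's JdbcDataSource), and the driver attribute is spelled driverClassName. A sketch of the declaration following the Tomcat 6 JNDI how-to, with credentials and names kept as the poster's placeholders:

<Resource name="jdbc/DATABASENAME" auth="Container"
          type="javax.sql.DataSource"
          driverClassName="com.microsoft.sqlserver.jdbc.SQLServerDriver"
          url="jdbc:sqlserver://SQLSERVERNAME;databaseName=DATABASENAME;responseBuffering=adaptive;"
          username="USERNAME" password="PASSWORD"
          maxActive="8" maxIdle="4"/>

The <res-type> in web.xml would then also be javax.sql.DataSource; the type="JdbcDataSource" attribute belongs only on the <dataSource> element in data-config.xml.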
property substitution not working with multicore
Hi, I cannot seem to get the configuration right for using a properties file for cores (with 3.6.0). In the Solr 3 Enterprise Search Server book they say this: "This property substitution works in solr.xml, solrconfig.xml, schema.xml, and DIH configuration files." So my solr.xml is like this:

<cores adminPath="/admin/cores">
  <core name="core0" instanceDir="core0" dataDir="${config.datadir:/tmp/solr_data}" properties="core0.properties"/>
</cores>

core0.properties is in multicore/core0 (I tried with an absolute path too, but that does not work either). And my properties file has:

config.datadir=c:\\tmp\\core0\\data
config.db-data.jdbcUrl=jdbc:mysql:localhost\\...
config.db-data.username=root
config.db-data.password=

None of those values are taken into account. I think I read in JIRA that DIH does not support properties, but as they say in the book that it does, I just tried. The path to the data dir should work, right? But not even that one; I always get the index in ./tmp/solr_data. Any hints? xab
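One arrangement to try, offered only as a sketch since I cannot confirm at what point 3.6.0 makes per-core properties visible to solr.xml itself: keep core0.properties in the core0 instance directory and reference the property from core0's own solrconfig.xml instead of from the dataDir attribute in solr.xml:

<!-- in multicore/core0/conf/solrconfig.xml -->
<dataDir>${config.datadir:/tmp/solr_data}</dataDir>

If that substitution works while the solr.xml one does not, it narrows the problem down to where in the load order the per-core properties file is read.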
Re: Populating a filter cache by means other than a query
I guess my question is: what advantage are you trying to get here? At the start, this feels like an XY problem. How are you intending to use the fq after you've built it? Because if there's any way to just create an fq clause, Solr will take care of it for you: caching it, autowarming it when searchers are re-opened, etc. Otherwise, you're going to be re-inventing a bunch of stuff, it seems to me; you'll have to intercept the queries coming in in order to apply the filter from the cache, etc.

Which may also be another way of asking: how big is this set of document IDs? If it's in the 100s, I'd just go with an fq. If it's more than that, I'd index some kind of set identifier that you could create for your fqs. And if this is gibberish, ignore me <G>.

Best Erick

On Tue, Apr 17, 2012 at 4:34 PM, Chris Collins ch...@geekychris.com wrote: Hi, I am a long-time Lucene user but new to Solr. I would like to use something like the filterCache, but build such a cache not from a query but from custom code. I will ask my question using techniques and vocab I am familiar with; I'm not sure it's actually the right way, so I apologize if it's just the wrong approach. The scenario is that I would like to filter a result set by a set of labeled documents, which I will call set L. L contains app-specific document IDs that are indexed as literals in the Lucene field "myid". I imagine I could build an OpenBitSet by enumerating the term docs and looking for the intersecting IDs in my label set. Then I have my bitset that I assume I could use in a filter. Another approach would be to implement a hit collector, compute a field cache from that myid field, and look for the intersection in a hashtable of L at scoring time, throwing out results that are not contained in the hashtable. Of course I am working within the confines / concepts that Solr has laid out. Without going completely off the reservation, is there a neat way of doing such a thing with Solr? Glad to clarify if my question makes absolutely no sense. Best, C
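For concreteness, the "just go with an fq" route Erick describes would look like the following, with the filterCache doing the caching and autowarming; the document IDs here are hypothetical:

q=*:*&fq=myid:(id1 OR id2 OR id3)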
Re: How solrcloud distribute data among shards of the same cluster?
Try looking at DistributedUpdateProcessor; there's a hash(cmd) method in there. Best Erick

On Tue, Apr 17, 2012 at 4:45 PM, emma1023 smile.emma1...@gmail.com wrote: Thanks for your reply. In Solr 3.x, we need to manually hash the doc ID to the server. How does solrcloud do this instead? I am working on a project using solrcloud, but we need to monitor how solrcloud distributes the data. I cannot find which part of the source code this is in. Is it in the cloud part? Thanks.

On Tue, Apr 17, 2012 at 3:16 PM, Mark Miller-3 [via Lucene] ml-node+s472066n3918192...@n3.nabble.com wrote: On Apr 17, 2012, at 9:56 AM, emma1023 wrote: It hashes the id. The doc distribution is fairly even - but sizes may be fairly different.

How does solrcloud manage distributing data among shards of the same cluster when you query? Does it distribute the data equally? What is the basis? In which part of the code can I find this? Thank you so much!

- Mark Miller lucidimagination.com
Solr hanging
Hi, I am using Solr trunk and have 7 Solr instances running with 28 leaders and 28 replicas for a single collection. After indexing for a while (a couple of days) the Solr instances start hanging, and doing a thread dump on the JVM I see blocked threads like the following:

Thread 2369: (state = BLOCKED)
 - sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information may be imprecise)
 - java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14, line=158 (Compiled frame)
 - java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await() @bci=42, line=1987 (Compiled frame)
 - java.util.concurrent.LinkedBlockingQueue.take() @bci=29, line=399 (Compiled frame)
 - java.util.concurrent.ExecutorCompletionService.take() @bci=4, line=164 (Compiled frame)
 - org.apache.solr.update.SolrCmdDistributor.checkResponses(boolean) @bci=27, line=350 (Compiled frame)
 - org.apache.solr.update.SolrCmdDistributor.finish() @bci=18, line=98 (Compiled frame)
 - org.apache.solr.update.processor.DistributedUpdateProcessor.doFinish() @bci=4, line=299 (Compiled frame)
 - org.apache.solr.update.processor.DistributedUpdateProcessor.finish() @bci=1, line=817 (Compiled frame)
 ...
 - org.mortbay.thread.QueuedThreadPool$PoolThread.run() @bci=25, line=582 (Interpreted frame)

I read the stack trace as: my indexing client has indexed a document, and this Solr is now waiting for the replica(?) to respond before returning an answer to the client. The other Solrs have similar blocked threads. Any ideas on how I can get closer to the problem? Am I reading the stack trace correctly? Is there any further information that would be relevant for commenting on this problem? Thanks for any comments. Best regards, Trym
Re: SOLR 4 / Date Query: Spurious Results: Is it me or ... ?
Your schema didn't come through, but...

1. Why terms=-1? I don't know. I have a build from this morning and it's fine. When's yours?

2. date vs. tdate. Yes, that's kind of confusing, but the Trie types inject some extra stuff in the field that allows the faster range queries; I think of it as navigation data. These get displayed as 1970 dates (i.e. the epoch). Ignore them.

3. I don't quite understand here. If you're still talking about a tdate field, could the navigation data account for it? That data shouldn't belong to any document and isn't really putting multi-values in any doc. Changing the schema type to not be multivalued should show whether this is the case.

Best Erick

On Tue, Apr 17, 2012 at 7:18 PM, vybe3142 vybe3...@gmail.com wrote: I wrote a custom handler that uses externally injected metadata (bypassing Tika et al.). WRT dates, I see them associated with the correct docs when retrieving all docs. BUT: looking at the schema analyzer, things look weird:

1. Top terms = -1
2. The dates are all mixed up, with some spurious 1970 dates thrown in (I can get rid of the 1970 dates if I use type date vs. tdate)
3. Multi-valued values (there should only be one per doc, as per the input data, even though the schema allows more).

Any ideas what, if anything, I'm doing wrong? See pic http://lucene.472066.n3.nabble.com/file/n3918636/Capture.jpg Here's my SOLR schema:
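For reference, the stock example schema defines the two types roughly like this; precisionStep > 0 is what generates the extra coarse-grained "navigation" terms Erick describes, while precisionStep="0" indexes only the exact value:

<fieldType name="date"  class="solr.TrieDateField" precisionStep="0" positionIncrementGap="0"/>
<fieldType name="tdate" class="solr.TrieDateField" precisionStep="6" positionIncrementGap="0"/>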
Solr file size limit?
Dear fellow Solr users, I've been using Solr for a very short time now and I'm stuck. I'm trying to index a Drupal website consisting of 1.2 million smaller nodes and 300k larger nodes (~400kb avg). I'm using Solr 3.5 on a dedicated Ubuntu 10.04 box with 3TB of disk space and 16GB of memory. I've tried using the Sun JRE and OpenJDK, both resulting in the same problem. Indexing works great until my .fdt file reaches a size of 4.9GB (5,217,987,319 bytes). At that point, when Solr starts merging, it just keeps on merging, starting over and over. Java uses all the available memory even though Xmx is set at 8G. When I restart Solr everything looks fine until merging is triggered. Whenever it hangs, the server load averages 3; searching is possible but slow, and the Solr admin interface is reachable, but sending new documents leads to a time-out. I've tried several different settings for MergePolicy and started reindexing a couple of times, but the behavior stays the same. My current solrconfig.xml can be found here: http://pastebin.com/NXDT0B8f. I'm unable to find errors in the log, which makes it really difficult to debug. Could anyone point me in the right direction? I've already asked my question on Stack Overflow without receiving a solution: http://stackoverflow.com/questions/9993633/apache-solr-3-5-hangs-when-indexing. Maybe it can provide you with some more information. Kind regards! Bram Rongen
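For anyone experimenting along these lines, the merge-related knobs in a 3.x solrconfig.xml look like the following; the values are illustrative starting points, not a recommendation for this particular case:

<indexDefaults>
  <ramBufferSizeMB>128</ramBufferSizeMB>
  <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
    <int name="maxMergeAtOnce">10</int>
    <int name="segmentsPerTier">10</int>
  </mergePolicy>
</indexDefaults>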
Re: Options for automagically Scaling Solr (without needing distributed index/replication) in a Hadoop environment
I'm curious how on the fly updates are handled as a new shard is added to an alias. Eg, how does the system know to which shard to send an update? On Tue, Apr 17, 2012 at 4:00 PM, Lukáš Vlček lukas.vl...@gmail.com wrote: Hi, speaking about ES I think it would be fair to mention that one has to specify number of shards upfront when the index is created - that is correct, however, it is possible to give index one or more aliases which basically means that you can add new indices on the fly and give them same alias which is then used to search against. Given that you can add/remove indices, nodes and aliases on the fly I think there is a way how to handle growing data set with ease. If anyone is interested such scenario has been discussed in detail in ES mail list. Regards, Lukas On Tue, Apr 17, 2012 at 2:42 AM, Jason Rutherglen jason.rutherg...@gmail.com wrote: One of big weaknesses of Solr Cloud (and ES?) is the lack of the ability to redistribute shards across servers. Meaning, as a single shard grows too large, splitting the shard, while live updates. How do you plan on elastically adding more servers without this feature? Cassandra and HBase handle elasticity in their own ways. Cassandra has successfully implemented the Dynamo model and HBase uses the traditional BigTable 'split'. Both systems are complex though are at a singular level of maturity. Also Cassandra [successfully] implements multiple data center support, is that available in SC or ES? On Thu, Apr 12, 2012 at 7:23 PM, Otis Gospodnetic otis_gospodne...@yahoo.com wrote: Hello Ali, I'm trying to setup a large scale *Crawl + Index + Search *infrastructure using Nutch and Solr/Lucene. The targeted scale is *5 Billion web pages*, crawled + indexed every *4 weeks, *with a search latency of less than 0.5 seconds. That's fine. Whether it's doable with any tech will depend on how much hardware you give it, among other things. Needless to mention, the search index needs to scale to 5Billion pages. It is also possible that I might need to store multiple indexes -- one for crawled content, and one for ancillary data that is also very large. Each of these indices would likely require a logically distributed and replicated index. Yup, OK. However, I would like for such a system to be homogenous with the Hadoop infrastructure that is already installed on the cluster (for the crawl). In other words, I would much prefer if the replication and distribution of the Solr/Lucene index be done automagically on top of Hadoop/HDFS, instead of using another scalability framework (such as SolrCloud). In addition, it would be ideal if this environment was flexible enough to be dynamically scaled based on the size requirements of the index and the search traffic at the time (i.e. if it is deployed on an Amazon cluster, it should be easy enough to automatically provision additional processing power into the cluster without requiring server re-starts). There is no such thing just yet. There is no Search+Hadoop/HDFS in a box just yet. There was an attempt to automatically index HBase content, but that was either not completed or not committed into HBase. However, I'm not sure which Solr-based tool in the Hadoop ecosystem would be ideal for this scenario. I've heard mention of Solr-on-HBase, Solandra, Lily, ElasticSearch, IndexTank etc, but I'm really unsure which of these is mature enough and would be the right architectural choice to go along with a Nutch crawler setup, and to also satisfy the dynamic/auto-scaling aspects above. 
Here is a summary of all of them:

* Search on HBase - I assume you are referring to the same thing I mentioned above. Not ready.
* Solandra - uses Cassandra+Solr, plus DataStax now has a different (commercial) offering that combines search and Cassandra. Looks good.
* Lily - data stored in an HBase cluster gets indexed to a separate Solr instance(s) on the side. Not really integrated the way you want it to be.
* ElasticSearch - solid at this point, the most dynamic solution today, can scale well (we are working on a many-B documents index and hundreds of nodes with ElasticSearch right now), etc. But again, not integrated with Hadoop the way you want it.
* IndexTank - has some technical weaknesses, not integrated with Hadoop, not sure about its future considering LinkedIn uses Zoie and Sensei already.
* And there is SolrCloud, which is coming soon and will be solid, but is again not integrated.

If I were you and I had to pick today - I'd pick ElasticSearch if I were completely open. If I had a Solr bias I'd give SolrCloud a try first.

Lastly, how much hardware (assuming a medium sized EC2 instance) would you estimate my needing with this setup, for regular web-data (HTML text) at this scale?

I don't know off the top of my head, but I'm guessing several hundred
Re: Multiple document structure
On 18 April 2012 10:05, abhijit bashetti bashettiabhi...@rediffmail.com wrote: Hi, Is it possible to have 2 document structures in Solr? [...]

I do not think so, but why do you need that? Use two separate indices, either in a multi-core setup or in separate Solr instances. Regards, Gora
Re: Options for automagically Scaling Solr (without needing distributed index/replication) in a Hadoop environment
AFAIK it cannot. You can only add new shards by creating a new index, and you will then need to index new data into that new index. Index aliases are useful mainly for the searching part. So it means that you need to plan for this when you implement your indexing logic. On the other hand, the query logic does not need to change, as you only add new indices and give them all the same alias.

I am not an expert on this, but I think that index splitting and re-sharding can be expensive for a [near] real-time search system, and the point is that you can probably use different techniques to support your large-scale needs. Index aliasing and routing in elasticsearch can help a lot in supporting various large-scale data scenarios; check the following thread on the ES ML for some examples: https://groups.google.com/forum/#!msg/elasticsearch/49q-_AgQCp8/MRol0t9asEcJ

Just to sum it up: the fact that elasticsearch has a fixed number of shards per index and does not support resharding and index splitting does not mean you cannot scale your data easily. (I was not following this whole thread in every detail, so maybe you have specific needs that can be solved only by splitting or resharding; in such a case I would recommend you ask on the ES ML with further questions. I do not want to run into a system X vs. system Y flame here...)

Regards, Lukas

On Wed, Apr 18, 2012 at 2:22 PM, Jason Rutherglen jason.rutherg...@gmail.com wrote: I'm curious how on the fly updates are handled as a new shard is added to an alias. Eg, how does the system know to which shard to send an update?

On Tue, Apr 17, 2012 at 4:00 PM, Lukáš Vlček lukas.vl...@gmail.com wrote: Hi, speaking about ES I think it would be fair to mention that one has to specify number of shards upfront when the index is created - that is correct, however, it is possible to give index one or more aliases which basically means that you can add new indices on the fly and give them same alias which is then used to search against. Given that you can add/remove indices, nodes and aliases on the fly I think there is a way how to handle growing data set with ease. If anyone is interested such scenario has been discussed in detail in ES mail list. Regards, Lukas

On Tue, Apr 17, 2012 at 2:42 AM, Jason Rutherglen jason.rutherg...@gmail.com wrote: One of big weaknesses of Solr Cloud (and ES?) is the lack of the ability to redistribute shards across servers. Meaning, as a single shard grows too large, splitting the shard, while live updates. How do you plan on elastically adding more servers without this feature? Cassandra and HBase handle elasticity in their own ways. Cassandra has successfully implemented the Dynamo model and HBase uses the traditional BigTable 'split'. Both systems are complex though are at a singular level of maturity. Also Cassandra [successfully] implements multiple data center support, is that available in SC or ES?

On Thu, Apr 12, 2012 at 7:23 PM, Otis Gospodnetic otis_gospodne...@yahoo.com wrote: Hello Ali, I'm trying to setup a large scale *Crawl + Index + Search *infrastructure using Nutch and Solr/Lucene. The targeted scale is *5 Billion web pages*, crawled + indexed every *4 weeks, *with a search latency of less than 0.5 seconds. That's fine. Whether it's doable with any tech will depend on how much hardware you give it, among other things. Needless to mention, the search index needs to scale to 5Billion pages.
It is also possible that I might need to store multiple indexes -- one for crawled content, and one for ancillary data that is also very large. Each of these indices would likely require a logically distributed and replicated index. Yup, OK. However, I would like for such a system to be homogenous with the Hadoop infrastructure that is already installed on the cluster (for the crawl). In other words, I would much prefer if the replication and distribution of the Solr/Lucene index be done automagically on top of Hadoop/HDFS, instead of using another scalability framework (such as SolrCloud). In addition, it would be ideal if this environment was flexible enough to be dynamically scaled based on the size requirements of the index and the search traffic at the time (i.e. if it is deployed on an Amazon cluster, it should be easy enough to automatically provision additional processing power into the cluster without requiring server re-starts). There is no such thing just yet. There is no Search+Hadoop/HDFS in a box just yet. There was an attempt to automatically index HBase content, but that was either not completed or not committed into HBase. However, I'm not sure which Solr-based tool in the Hadoop ecosystem would be ideal for this scenario. I've heard mention of Solr-on-HBase, Solandra, Lily, ElasticSearch, IndexTank
pushing updates to solr from postgresql
i have a setup right this instant where the dataimporthandler is being used to pull data for an index from a postgresql server. i'd like to switch over to push, and am looking for some validation of my approach. i have perl installed as an untrusted language on my postgresql server and am planning to set up triggers on the tables where insert/update/delete operations should cause an update of the relevant solr indexes. the trigger functions will build xml in the format for UpdateXmlMessages and notify Solr via http requests. is this sensible, or am i missing something easier? also, does anyone have any thoughts about coordinating initial indexing/full reindexing via dataimporthandler with the trigger based push operations? thanks, richard
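For reference, the UpdateXmlMessages format the trigger functions would emit is small; something like the following for an insert/update and for a delete, where the field names are illustrative rather than taken from the poster's schema:

<add>
  <doc>
    <field name="id">12345</field>
    <field name="title">example row</field>
  </doc>
</add>

<delete><id>12345</id></delete>

These are POSTed to Solr's /update handler, with a <commit/> sent separately or batched.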
hierarchical faceting?
I have hierarchical colors:

<field name="colors" type="text_path" indexed="true" stored="true" multiValued="true"/>

text_path is a TextField with PathHierarchyTokenizerFactory as the tokenizer. Given these two documents:

Doc1: red
Doc2: red/pink

I want the result to be the following:

?fq=red == Doc1, Doc2
?fq=red/pink == Doc2

But with PathHierarchyTokenizer, Doc1 is included for the query:

?fq=red/pink == Doc1, Doc2

How can I query for hierarchical facets? http://wiki.apache.org/solr/HierarchicalFaceting describes facet.prefix, but it looks too cumbersome to me. Is there a simpler way to implement hierarchical facets?
Problems with edismax parser and solr3.6
I just looked through my logs of Solr 3.6 and saw several 0-hit queries which were not seen with Solr 3.5. While tracing this down, it turned out that edismax doesn't like queries of the type ...q=(text:ide)... any more. If there are parentheses around the query term, edismax fails with Solr 3.6. Can anyone confirm this and give me feedback? Bernd
Re: hierarchical faceting?
Put the parent term in all the child documents at index time, and then re-issue the facet query when you expand the parent, using the parent's term. Works perfectly.

On Wed, 2012-04-18 at 10:56 -0400, sam ” wrote: I have hierarchical colors:

<field name="colors" type="text_path" indexed="true" stored="true" multiValued="true"/>

text_path is TextField with PathHierarchyTokenizerFactory as tokenizer. Given these two documents, Doc1: red Doc2: red/pink I want the result to be the following: ?fq=red == Doc1, Doc2 ?fq=red/pink == Doc2 But, with PathHierarchyTokenizer, Doc1 is included for the query: ?fq=red/pink == Doc1, Doc2 How can I query for hierarchical facets? http://wiki.apache.org/solr/HierarchicalFaceting describes facet.prefix, but it looks too cumbersome to me. Is there a simpler way to implement hierarchical facets?
How to add/remove/customize search tabs
I have Apache Solr installed with my Drupal 7 site and noticed some default tabs available (Content, Site, Users). Is there a way to add/change that tabs section?
Re: How to add/remove/customize search tabs
This question is probably better asked on the Drupal groups page for Apache Solr, http://groups.drupal.org/lucene-nutch-and-solr, as this is more of a Drupal issue than a Solr issue.

On 18 Apr 2012, at 16:11, Valentin, AJ wrote: I have Apache Solr installed with my Drupal 7 site and noticed some default tabs available (Content, Site, Users). Is there a way to add/change that tabs section?

David Stuart
M +44(0) 778 854 2157
T +44(0) 845 519 5465
www.axistwelve.com
Axis12 Ltd | 7 Wynford Road | London | N1 9QN | UK
AXIS12 - Enterprise Web Solutions
Re: hierarchical faceting?
Yah, that's exactly what PathHierarchyTokenizer does.

<fieldType name="text_path" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.PathHierarchyTokenizerFactory"/>
  </analyzer>
</fieldType>

I think I have a query-time tokenizer that tokenizes at /:

?q=colors:red == Doc1, Doc2
?q=colors:redfoobar ==
?q=colors:red/foobarasdfoaijao == Doc1, Doc2

On Wed, Apr 18, 2012 at 11:10 AM, Darren Govoni dar...@ontrenet.com wrote: Put the parent term in all the child documents at index time and then re-issue the facet query when you expand the parent using the parent's term. Works perfectly.

On Wed, 2012-04-18 at 10:56 -0400, sam ” wrote: I have hierarchical colors:

<field name="colors" type="text_path" indexed="true" stored="true" multiValued="true"/>

text_path is TextField with PathHierarchyTokenizerFactory as tokenizer. Given these two documents, Doc1: red Doc2: red/pink I want the result to be the following: ?fq=red == Doc1, Doc2 ?fq=red/pink == Doc2 But, with PathHierarchyTokenizer, Doc1 is included for the query: ?fq=red/pink == Doc1, Doc2 How can I query for hierarchical facets? http://wiki.apache.org/solr/HierarchicalFaceting describes facet.prefix, but it looks too cumbersome to me. Is there a simpler way to implement hierarchical facets?
minimum match and not matched words / term frequency in query result
Hi, I have a dismax query with a minimum-match setting; this allows some terms to be missing from query results. I would like to give feedback to the user, highlighting the words that were not matched. It would be interesting also to show the words with a very low frequency. For instance, searching for "purple pendrive", I would highlight that the results ignore the term "purple", because we don't have any. Can you suggest how to approach the problem? I was thinking about the debugQuery output, but since I will not get details about all the results I will probably miss something. I am trying to write a new SearchComponent, but I don't know how to get term frequency data from a ResponseBuilder object... I am new to Solr/Lucene programming. Thanks a lot
Solr 3.6 parsing and extraction files
Could someone possibly provide me with a list of jars that I need to extract from the apache-solr-3.6.0.tgz file to enable the parsing and remote streaming of office style documents? I assume (for a multicore configuration) they would go into ./tomcat/webapps/solr/WEB-INF/lib - correct? Thanks - Tod
Re: pushing updates to solr from postgresql
Hi Richard, One thing to think about here is what you will do when Solr is unavailable to take a new document, for whatever reason. If you send docs to Solr from PG, docs either get indexed or not, so you may have to catch errors and then mark documents in PG as not indexed. You may want to keep track of the initial and/or last index attempt and the total number of indexing attempts (new DB columns), and you will probably want to use DIH to pick up unindexed documents from PG and get them indexed. Also keep in mind that sending docs to Solr one by one will not be as efficient as sending batches of them, or as efficient as getting a batch of them via DIH. If your data volume is low this likely won't be a problem, but if it is high or growing, you'll want to keep this in mind.

Otis Performance Monitoring SaaS for Solr - http://sematext.com/spm/solr-performance-monitoring/index.html

From: Welty, Richard rwe...@ltionline.com To: solr-user@lucene.apache.org Sent: Wednesday, April 18, 2012 10:48 AM Subject: pushing updates to solr from postgresql

i have a setup right this instant where the dataimporthandler is being used to pull data for an index from a postgresql server. i'd like to switch over to push, and am looking for some validation of my approach. i have perl installed as an untrusted language on my postgresql server and am planning to set up triggers on the tables where insert/update/delete operations should cause an update of the relevant solr indexes. the trigger functions will build xml in the format for UpdateXmlMessages and notify Solr via http requests. is this sensible, or am i missing something easier? also, does anyone have any thoughts about coordinating initial indexing/full reindexing via dataimporthandler with the trigger based push operations? thanks, richard
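To illustrate the batching point: a single <add> message can carry many <doc> elements, so the triggers (or a periodic sweeper reading a "not yet indexed" flag) can group rows rather than POSTing one document per request. Field names here are illustrative:

<add>
  <doc><field name="id">1</field></doc>
  <doc><field name="id">2</field></doc>
  <doc><field name="id">3</field></doc>
</add>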
Re: Options for automagically Scaling Solr (without needing distributed index/replication) in a Hadoop environment
The main point being made is that established NoSQL solutions (e.g., Cassandra, HBase, et al.) solved the update problem, among many other scalability issues, several years ago. If an update is being performed and it is not known where the record exists, the update capability of the system is inefficient. In addition, in a production system, the mere possibility of losing data or of inaccurate updates is usually a red flag.

On Wed, Apr 18, 2012 at 6:40 AM, Lukáš Vlček lukas.vl...@gmail.com wrote: AFAIK it can not. You can only add new shards by creating a new index and you will then need to index new data into that new index. Index aliases are useful mainly for searching part. So it means that you need to plan for this when you implement your indexing logic. On the other hand the query logic does not need to change as you only add new indices and give them all the same alias. I am not an expert on this but I think that index splitting and re-sharding can be expensive for [near] real-time search system and the point is that you can probably use different techniques to support your large scale needs. Index aliasing and routing in elasticsearch can help a lot in supporting various large scale data scenarios, check the following thread in ES ML for some examples: https://groups.google.com/forum/#!msg/elasticsearch/49q-_AgQCp8/MRol0t9asEcJ Just to sum it up, the fact that elasticsearch does have fixed number of shards per index and does not support resharding and index splitting does not mean you can not scale your data easily. (I was not following this whole thread in every detail. So may be you may have specific needs that can be solved only by splitting or resharding, in such case I would recommend you to ask on ES ML with further questions, I do not want to run into system X vs system Y flame here...) Regards, Lukas

On Wed, Apr 18, 2012 at 2:22 PM, Jason Rutherglen jason.rutherg...@gmail.com wrote: I'm curious how on the fly updates are handled as a new shard is added to an alias. Eg, how does the system know to which shard to send an update?

On Tue, Apr 17, 2012 at 4:00 PM, Lukáš Vlček lukas.vl...@gmail.com wrote: Hi, speaking about ES I think it would be fair to mention that one has to specify number of shards upfront when the index is created - that is correct, however, it is possible to give index one or more aliases which basically means that you can add new indices on the fly and give them same alias which is then used to search against. Given that you can add/remove indices, nodes and aliases on the fly I think there is a way how to handle growing data set with ease. If anyone is interested such scenario has been discussed in detail in ES mail list. Regards, Lukas

On Tue, Apr 17, 2012 at 2:42 AM, Jason Rutherglen jason.rutherg...@gmail.com wrote: One of big weaknesses of Solr Cloud (and ES?) is the lack of the ability to redistribute shards across servers. Meaning, as a single shard grows too large, splitting the shard, while live updates. How do you plan on elastically adding more servers without this feature? Cassandra and HBase handle elasticity in their own ways. Cassandra has successfully implemented the Dynamo model and HBase uses the traditional BigTable 'split'. Both systems are complex though are at a singular level of maturity. Also Cassandra [successfully] implements multiple data center support, is that available in SC or ES?
On Thu, Apr 12, 2012 at 7:23 PM, Otis Gospodnetic otis_gospodne...@yahoo.com wrote: Hello Ali, I'm trying to setup a large scale *Crawl + Index + Search *infrastructure using Nutch and Solr/Lucene. The targeted scale is *5 Billion web pages*, crawled + indexed every *4 weeks, *with a search latency of less than 0.5 seconds. That's fine. Whether it's doable with any tech will depend on how much hardware you give it, among other things. Needless to mention, the search index needs to scale to 5Billion pages. It is also possible that I might need to store multiple indexes -- one for crawled content, and one for ancillary data that is also very large. Each of these indices would likely require a logically distributed and replicated index. Yup, OK. However, I would like for such a system to be homogenous with the Hadoop infrastructure that is already installed on the cluster (for the crawl). In other words, I would much prefer if the replication and distribution of the Solr/Lucene index be done automagically on top of Hadoop/HDFS, instead of using another scalability framework (such as SolrCloud). In addition, it would be ideal if this environment was flexible enough to be dynamically scaled based on the size requirements of the index and the search traffic at the time (i.e. if it is deployed on an Amazon cluster, it should be easy enough to
[Job] Search Engineer Lead at Sematext International
Hello, If you've always wanted a full-time job working with Solr, ElasticSearch, or Lucene, we have a position that is all about that, offers a path to team leadership, and will expose a person to a healthy mixture of engineering and business. If you are interested, please send your resume to j...@sematext.com . Otis

Sematext International is looking for a strong Search Engineer with interest and ability to interact with clients, and with the potential to build and lead local and/or remote development teams. By "client-facing" we really mean primarily email, phone, Skype. A person in this role needs to be able to:

* design large scale search systems
* have solid knowledge of either Solr or ElasticSearch or both
* efficiently troubleshoot performance, relevance, and other search-related issues
* speak and interact with clients

Pluses - beyond pure engineering:

* ability and desire to expand and lead development/consulting teams
* ability to think both business and engineering
* ability to build products based on observed client needs
* ability to present in public, at meetups, conferences, etc.
* ability to contribute to blog.sematext.com
* active participation in online search communities
* attention to detail
* desire to share knowledge and teach
* positive attitude, humor, agility

Location:
* New York

Travel:
* Minimal

Relevant pointers:
* http://sematext.com/about/jobs.html
* http://sematext.com/about/jobs.html#advantages
* http://sematext.com/engineering/index.html
solr stats component
Hello, I am using the stats component and I wanted help with range-like functionality (as in the facet component). To be more clear: we would like functionality similar to facet.range (i.e., with a gap and so on) for the statistics component. That is, with one call we would like the stats component to return stats only for a specified range, broken down into several buckets (based on the gap). We know that this functionality is not available in Solr, but wanted to see if there's any indirect way of doing it. Any thoughts would be highly appreciated. Thanks
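One indirect approach, at the cost of one request per bucket: issue the same stats call several times, each with a range filter query carving out a bucket. The field name and ranges below are hypothetical:

...&stats=true&stats.field=price&fq=price:[0 TO 99]
...&stats=true&stats.field=price&fq=price:[100 TO 199]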
Maximum Open Cursors using JdbcDataSource and cacheImpl
After upgrading from 3.5.0 to 3.6.0 we have noticed that when we use a cacheImpl on a nested JdbcDataSource entity, the database runs out of cursors. It does not matter what transactionIsolation, autoCommit, or holdability setting we use. I have only been using Solr for a few months, but after looking at EntityProcessorBase, DIHCacheSupport, and JdbcDataSource.ResultSetIterator, it may be that the ResultSet or Statement is never closed. In EntityProcessorBase.getNext(), if there is no cacheSupport, it likely immediately closes the resources it was using; whereas with caching it might leave them open, because the rowIterator is never set to null. Since it has a reference to the resultSet and stmt, it holds onto them and neither is ever closed.

On a related note, there appear to be other possible leaks in JdbcDataSource.ResultSetIterator. The close() method attempts to close both the resultSet and the stmt; however, if it fails closing the resultSet, it will not close the stmt. They should probably be wrapped in separate try/catch blocks. It will also not close the stmt or resultSet if the ResultSetIterator throws an exception in its constructor. In my experience one cannot count on the closing of the connection to clean up those resources consistently.

2012-04-18 12:02:22,017 ERROR [org.apache.solr.handler.dataimport.DataImporter] Full Import failed:java.lang.RuntimeException: java.lang.RuntimeException: org.apache.solr.handler.dataimport.DataImportHandlerException: Unable to execute query: select distinct DISPLAY_NAME from dimension where dimension.DIMENSION_ID = 'M' Processing Document # 11
    at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:264)
    at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:375)
    at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:445)
    at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:426)
Caused by: java.lang.RuntimeException: org.apache.solr.handler.dataimport.DataImportHandlerException: Unable to execute query: select distinct DISPLAY_NAME from dimension where dimension.DIMENSION_ID = 'M' Processing Document # 11
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:621)
    at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:327)
    at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:225)
    ... 3 more
Caused by: org.apache.solr.handler.dataimport.DataImportHandlerException: Unable to execute query: select distinct DISPLAY_NAME from dimension where dimension.DIMENSION_ID = 'M' Processing Document # 11
    at org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:72)
    at org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.<init>(JdbcDataSource.java:253)
    at org.apache.solr.handler.dataimport.JdbcDataSource.getData(JdbcDataSource.java:210)
    at org.apache.solr.handler.dataimport.JdbcDataSource.getData(JdbcDataSource.java:39)
    at org.apache.solr.handler.dataimport.SqlEntityProcessor.initQuery(SqlEntityProcessor.java:59)
    at org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:73)
    at org.apache.solr.handler.dataimport.EntityProcessorWrapper.pullRow(EntityProcessorWrapper.java:330)
    at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:296)
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:683)
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:709)
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:709)
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:619)
    ... 5 more
Caused by: java.sql.SQLException: ORA-01000: maximum open cursors exceeded
    at oracle.jdbc.driver.DatabaseError.throwSqlException(DatabaseError.java:112)
    at oracle.jdbc.driver.T4CTTIoer.processError(T4CTTIoer.java:331)
    at oracle.jdbc.driver.T4CTTIoer.processError(T4CTTIoer.java:288)
    at oracle.jdbc.driver.T4C8Oall.receive(T4C8Oall.java:745)
    at oracle.jdbc.driver.T4CStatement.doOall8(T4CStatement.java:210)
    at oracle.jdbc.driver.T4CStatement.executeForDescribe(T4CStatement.java:804)
    at oracle.jdbc.driver.OracleStatement.executeMaybeDescribe(OracleStatement.java:1049)
    at oracle.jdbc.driver.T4CStatement.executeMaybeDescribe(T4CStatement.java:845)
    at oracle.jdbc.driver.OracleStatement.doExecuteWithTimeout(OracleStatement.java:1146)
    at
RE: Maximum Open Cursors using JdbcDataSource and cacheImpl
Keith, Can you supply your data-config.xml ?

James Dyer E-Commerce Systems Ingram Content Group (615) 213-4311

-Original Message- From: Keith Naas [mailto:keithn...@dswinc.com] Sent: Wednesday, April 18, 2012 11:43 AM To: solr-user@lucene.apache.org Subject: Maximum Open Cursors using JdbcDataSource and cacheImpl

After upgrading from 3.5.0 to 3.6.0 we have noticed that when we use a cacheImpl on a nested JdbcDataSource entity, the database runs out of cursors. [...]
Re: SOLR 4 / Date Query: Spurious Results: Is it me or ... ?
Thanks for clarifying. I figured out the (terms=-1) issue; it was my fault. I attempted to truncate the index in my test case setup by issuing a delete query, and I think the subsequent commit might not have taken effect by the time the index queries that followed started. -- View this message in context: http://lucene.472066.n3.nabble.com/SOLR-4-Date-Query-Spurious-Results-Is-it-me-or-tp3918636p3920652.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: hierarchical faceting?
It looks like TextField is the problem. This fixed it:

<fieldType name="text_path" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.PathHierarchyTokenizerFactory" delimiter="/"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>

I am assuming the text_path fields won't include whitespace characters.

?q=colors:red/pink == Doc2 (Doc1, which has colors = red, isn't included!)

Is there a tokenizer that tokenizes the whole string as one token? I tried to extend Tokenizer myself but it fails:

public class AsIsTokenizer extends Tokenizer {
    @Override
    public boolean incrementToken() throws IOException {
        return true; // or false;
    }
}

On Wed, Apr 18, 2012 at 11:33 AM, sam ” skyn...@gmail.com wrote: Yah, that's exactly what PathHierarchyTokenizer does.

<fieldType name="text_path" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.PathHierarchyTokenizerFactory"/>
  </analyzer>
</fieldType>

I think I have a query time tokenizer that tokenizes at /

?q=colors:red == Doc1, Doc2
?q=colors:redfoobar ==
?q=colors:red/foobarasdfoaijao == Doc1, Doc2

On Wed, Apr 18, 2012 at 11:10 AM, Darren Govoni dar...@ontrenet.com wrote: Put the parent term in all the child documents at index time and then re-issue the facet query when you expand the parent using the parent's term. Works perfect. On Wed, 2012-04-18 at 10:56 -0400, sam ” wrote: I have hierarchical colors:

<field name="colors" type="text_path" indexed="true" stored="true" multiValued="true"/>

text_path is TextField with PathHierarchyTokenizerFactory as tokenizer. Given these two documents, Doc1: red Doc2: red/pink I want the result to be the following:

?fq=red == Doc1, Doc2
?fq=red/pink == Doc2

But, with PathHierarchyTokenizer, Doc1 is included for the query:

?fq=red/pink == Doc1, Doc2

How can I query for hierarchical facets? http://wiki.apache.org/solr/HierarchicalFaceting describes facet.prefix.. But it looks too cumbersome to me. Is there a simpler way to implement hierarchical facets?
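On the single-token question above: Solr ships solr.KeywordTokenizerFactory, which emits the entire field value as one token, so no custom Tokenizer should be needed. A minimal query-side sketch, under sam's own assumption that the paths contain no whitespace:

<analyzer type="query">
  <tokenizer class="solr.KeywordTokenizerFactory"/>
</analyzer>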
Can you suggest a method or pattern to consistently promote a document with any query?
Hi, folks, Perhaps I'm overlooking an obvious solution to a common desire... I'd like to return a specific document with every query, as the first result. As well, I'd like to have that document be the first result in a *:* query. I'm looking into index time boosting using the boost attribute on the appropriate doc. I haven't tested this yet, and I'm not sure this would do anything for the *:* queries. Thanks for any suggested reading or patterns... Best, Chris -- chris_war...@yahoo.com
Re: Can you suggest a method or pattern to consistently promote a document with any query?
Chris - Take a look - QueryElevationComponent http://wiki.apache.org/solr/QueryElevationComponent -Jeevanandam On Apr 18, 2012, at 10:46 PM, Chris Warner wrote: Hi, folks, Perhaps I'm overlooking an obvious solution to a common desire... I'd like to return a specific document with every query, as the first result. As well, I'd like to have that document be the first result in a *:* query. I'm looking into index time boosting using the boost attribute on the appropriate doc. I haven't tested this yet, and I'm not sure this would do anything for the *:* queries. Thanks for any suggested reading or patterns... Best, Chris -- chris_war...@yahoo.com
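For reference, the component reads its rules from an elevate.xml file in the conf directory; a minimal sketch, with hypothetical query text and document id:

<elevate>
  <query text="promoted query">
    <doc id="DOC-1"/>
  </query>
</elevate>

Each query element lists the documents to pin to the top of the results for that query text.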
Re: Can you suggest a method or pattern to consistently promote a document with any query?
Chris, I haven't checked if Elevate Component has an easy way to push a specific doc for *all* queries, but have a look http://wiki.apache.org/solr/QueryElevationComponent Otis Performance Monitoring SaaS for Solr - http://sematext.com/spm/solr-performance-monitoring/index.html - Original Message - From: Chris Warner chris_war...@yahoo.com To: solr-user@lucene.apache.org solr-user@lucene.apache.org Cc: Sent: Wednesday, April 18, 2012 1:16 PM Subject: Can you suggest a method or pattern to consistently promote a document with any query? Hi, folks, Perhaps I'm overlooking an obvious solution to a common desire... I'd like to return a specific document with every query, as the first result. As well, I'd like to have that document be the first result in a *:* query. I'm looking into index time boosting using the boost attribute on the appropriate doc. I haven't tested this yet, and I'm not sure this would do anything for the *:* queries. Thanks for any suggested reading or patterns... Best, Chris -- chris_war...@yahoo.com
Re: Can you suggest a method or pattern to consistently promote a document with any query?
Thanks, Jeevanandam and Otis, I'll take another look at Elevate. My first attempts did not yield success, as I was not able to find a way to elevate a document with a *:* query. Perhaps I'll try a * query to see what happens. Cheers, Chris - Original Message - From: Jeevanandam Madanagopal je...@myjeeva.com To: solr-user@lucene.apache.org; Chris Warner chris_war...@yahoo.com Cc: Sent: Wednesday, April 18, 2012 10:21 AM Subject: Re: Can you suggest a method or pattern to consistently promote a document with any query? Chris - Take a look - QueryElevationComponent http://wiki.apache.org/solr/QueryElevationComponent -Jeevanandam On Apr 18, 2012, at 10:46 PM, Chris Warner wrote: Hi, folks, Perhaps I'm overlooking an obvious solution to a common desire... I'd like to return a specific document with every query, as the first result. As well, I'd like to have that document be the first result in a *:* query. I'm looking into index time boosting using the boost attribute on the appropriate doc. I haven't tested this yet, and I'm not sure this would do anything for the *:* queries. Thanks for any suggested reading or patterns... Best, Chris --
Re: Can you suggest a method or pattern to consistently promote a document with any query?
That is not a useful test. Users don't look for *:*. Test with real queries. wunder On Apr 18, 2012, at 10:27 AM, Chris Warner wrote: Thanks, Jeevanandam and Otis, I'll take another look at Elevate. My first attempts did not yield success, as I was not able to find a way to elevate a document with a *:* query. Perhaps I'll try a * query to see what happens. Cheers, Chris - Original Message - From: Jeevanandam Madanagopal je...@myjeeva.com To: solr-user@lucene.apache.org; Chris Warner chris_war...@yahoo.com Cc: Sent: Wednesday, April 18, 2012 10:21 AM Subject: Re: Can you suggest a method or pattern to consistently promote a document with any query? Chris - Take a look - QueryElevationComponent http://wiki.apache.org/solr/QueryElevationComponent -Jeevanandam On Apr 18, 2012, at 10:46 PM, Chris Warner wrote: Hi, folks, Perhaps I'm overlooking an obvious solution to a common desire... I'd like to return a specific document with every query, as the first result. As well, I'd like to have that document be the first result in a *:* query. I'm looking into index time boosting using the boost attribute on the appropriate doc. I haven't tested this yet, and I'm not sure this would do anything for the *:* queries. Thanks for any suggested reading or patterns... Best, Chris -- -- Walter Underwood wun...@wunderwood.org
Re: Can you suggest a method or pattern to consistently promote a document with any query?
Browsing all documents and all facets, skipper. Cheers, Chris - Original Message - From: Walter Underwood wun...@wunderwood.org To: solr-user@lucene.apache.org Cc: Sent: Wednesday, April 18, 2012 10:29 AM Subject: Re: Can you suggest a method or pattern to consistently promote a document with any query? That is not a useful test. Users don't look for *:*. Test with real queries. wunder On Apr 18, 2012, at 10:27 AM, Chris Warner wrote: Thanks, Jeevanandam and Otis, I'll take another look at Elevate. My first attempts did not yield success, as I was not able to find a way to elevate a document with a *:* query. Perhaps I'll try a * query to see what happens. Cheers, Chris - Original Message - From: Jeevanandam Madanagopal je...@myjeeva.com To: solr-user@lucene.apache.org; Chris Warner chris_war...@yahoo.com Cc: Sent: Wednesday, April 18, 2012 10:21 AM Subject: Re: Can you suggest a method or pattern to consistently promote a document with any query? Chris - Take a look - QueryElevationComponent http://wiki.apache.org/solr/QueryElevationComponent -Jeevanandam On Apr 18, 2012, at 10:46 PM, Chris Warner wrote: Hi, folks, Perhaps I'm overlooking an obvious solution to a common desire... I'd like to return a specific document with every query, as the first result. As well, I'd like to have that document be the first result in a *:* query. I'm looking into index time boosting using the boost attribute on the appropriate doc. I haven't tested this yet, and I'm not sure this would do anything for the *:* queries. Thanks for any suggested reading or patterns... Best, Chris -- -- Walter Underwood wun...@wunderwood.org
Re: hierarchical faceting?
I don't use any of that stuff in my app, so not sure how it works. I just manage my taxonomy outside of Solr at index time and don't need any special fields or tokenizers. I use a string field type, insert the proper field at index time, and query it normally. Nothing special required. On Wed, 2012-04-18 at 13:00 -0400, sam ” wrote: It looks like TextField is the problem. This fixed it:

<fieldType name="text_path" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.PathHierarchyTokenizerFactory" delimiter="/"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>

I am assuming the text_path fields won't include whitespace characters.

?q=colors:red/pink == Doc2 (Doc1, which has colors = red, isn't included!)

Is there a tokenizer that tokenizes the whole string as one token? I tried to extend Tokenizer myself but it fails:

public class AsIsTokenizer extends Tokenizer {
    @Override
    public boolean incrementToken() throws IOException {
        return true; // or false;
    }
}

On Wed, Apr 18, 2012 at 11:33 AM, sam ” skyn...@gmail.com wrote: Yah, that's exactly what PathHierarchyTokenizer does.

<fieldType name="text_path" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.PathHierarchyTokenizerFactory"/>
  </analyzer>
</fieldType>

I think I have a query time tokenizer that tokenizes at /

?q=colors:red == Doc1, Doc2
?q=colors:redfoobar ==
?q=colors:red/foobarasdfoaijao == Doc1, Doc2

On Wed, Apr 18, 2012 at 11:10 AM, Darren Govoni dar...@ontrenet.com wrote: Put the parent term in all the child documents at index time and then re-issue the facet query when you expand the parent using the parent's term. Works perfect. On Wed, 2012-04-18 at 10:56 -0400, sam ” wrote: I have hierarchical colors:

<field name="colors" type="text_path" indexed="true" stored="true" multiValued="true"/>

text_path is TextField with PathHierarchyTokenizerFactory as tokenizer. Given these two documents, Doc1: red Doc2: red/pink I want the result to be the following:

?fq=red == Doc1, Doc2
?fq=red/pink == Doc2

But, with PathHierarchyTokenizer, Doc1 is included for the query:

?fq=red/pink == Doc1, Doc2

How can I query for hierarchical facets? http://wiki.apache.org/solr/HierarchicalFaceting describes facet.prefix.. But it looks too cumbersome to me. Is there a simpler way to implement hierarchical facets?
Re: Can you suggest a method or pattern to consistently promote a document with any query?
Thanks to those who responded. After a more thorough reading of the wiki, I see the need for forceElevation=true in the elevate query. Cheers, Chris - Original Message - From: Otis Gospodnetic otis_gospodne...@yahoo.com To: solr-user@lucene.apache.org solr-user@lucene.apache.org; Chris Warner chris_war...@yahoo.com Cc: Sent: Wednesday, April 18, 2012 10:23 AM Subject: Re: Can you suggest a method or pattern to consistently promote a document with any query? Chris, I haven't checked if Elevate Component has an easy way to push a specific doc for *all* queries, but have a look http://wiki.apache.org/solr/QueryElevationComponent Otis Performance Monitoring SaaS for Solr - http://sematext.com/spm/solr-performance-monitoring/index.html - Original Message - From: Chris Warner chris_war...@yahoo.com To: solr-user@lucene.apache.org solr-user@lucene.apache.org Cc: Sent: Wednesday, April 18, 2012 1:16 PM Subject: Can you suggest a method or pattern to consistently promote a document with any query? Hi, folks, Perhaps I'm overlooking an obvious solution to a common desire... I'd like to return a specific document with every query, as the first result. As well, I'd like to have that document be the first result in a *:* query. I'm looking into index time boosting using the boost attribute on the appropriate doc. I haven't tested this yet, and I'm not sure this would do anything for the *:* queries. Thanks for any suggested reading or patterns... Best, Chris -- chris_war...@yahoo.com
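As an illustration of the parameter Chris mentions, assuming the /elevate handler from the example solrconfig.xml (query text and sort field are hypothetical), forceElevation keeps the elevated document on top even when an explicit sort is applied:

http://localhost:8983/solr/elevate?q=foo&sort=price+asc&enableElevation=true&forceElevation=true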
Date granularity
A query search on a particular date returns 1 valid result (as expected). How can I alter the granularity of the search, for example, to match everything on that particular DAY? Reading through various docs, I attempted to append /DAY, but this doesn't seem to work (in fact I get 0 results back when querying). What am I neglecting? Thanks -- View this message in context: http://lucene.472066.n3.nabble.com/Date-granularity-tp3920890p3920890.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: hierarchical faceting?
The PathHierarchyTokenizerFactory is intended for file paths and therefore assumes that all documents should be indexed with all of the paths to their parent folders, but you are trying to use it for a taxonomy, so you can't simply use the PathHierarchyTokenizerFactory. Use the analysis page (http://localhost:8983/solr/admin/analysis.jsp) so that you can see what's happening with the content both at index and query time:

Field (Type): text_path
Field value (Index): red/pink
Field value (Query): red/pink

You'd notice that the result of both is identical, therefore explaining why both documents are retrieved:

Index Analyzer: red red/pink
Query Analyzer: red red/pink

Carlos -Original Message- From: Darren Govoni [mailto:dar...@ontrenet.com] Sent: Wednesday, April 18, 2012 8:10 AM To: solr-user@lucene.apache.org Subject: Re: hierarchical faceting? Put the parent term in all the child documents at index time and then re-issue the facet query when you expand the parent using the parent's term. Works perfect. On Wed, 2012-04-18 at 10:56 -0400, sam ” wrote: I have hierarchical colors:

<field name="colors" type="text_path" indexed="true" stored="true" multiValued="true"/>

text_path is TextField with PathHierarchyTokenizerFactory as tokenizer. Given these two documents, Doc1: red Doc2: red/pink I want the result to be the following:

?fq=red == Doc1, Doc2
?fq=red/pink == Doc2

But, with PathHierarchyTokenizer, Doc1 is included for the query:

?fq=red/pink == Doc1, Doc2

How can I query for hierarchical facets? http://wiki.apache.org/solr/HierarchicalFaceting describes facet.prefix.. But it looks too cumbersome to me. Is there a simpler way to implement hierarchical facets?
Suggester
Using Solr 3.6, I am trying to get suggestions for phrases. I managed to get prefix suggestions, but not suggestions for the middle of a phrase. Can this be achieved with the built-in Solr suggest, or do I need to create a special core for this purpose? Thanks in advance.
Re: Can you suggest a method or pattern to consistently promote a document with any query?
Chris - If you have defined 'last-components' in the search handler, forceElevation=true may not be required. It gets invoked in the search life cycle:

<arr name="last-components">
  <str>elevator</str>
</arr>

-Jeevanandam On Apr 18, 2012, at 11:37 PM, Chris Warner wrote: Thanks to those who responded. A more thorough reading of the wiki and I see the need for forceElevation=true in the elevate query. Cheers, Chris - Original Message - From: Otis Gospodnetic otis_gospodne...@yahoo.com To: solr-user@lucene.apache.org solr-user@lucene.apache.org; Chris Warner chris_war...@yahoo.com Cc: Sent: Wednesday, April 18, 2012 10:23 AM Subject: Re: Can you suggest a method or pattern to consistently promote a document with any query? Chris, I haven't checked if Elevate Component has an easy way to push a specific doc for *all* queries, but have a look http://wiki.apache.org/solr/QueryElevationComponent Otis Performance Monitoring SaaS for Solr - http://sematext.com/spm/solr-performance-monitoring/index.html - Original Message - From: Chris Warner chris_war...@yahoo.com To: solr-user@lucene.apache.org solr-user@lucene.apache.org Cc: Sent: Wednesday, April 18, 2012 1:16 PM Subject: Can you suggest a method or pattern to consistently promote a document with any query? Hi, folks, Perhaps I'm overlooking an obvious solution to a common desire... I'd like to return a specific document with every query, as the first result. As well, I'd like to have that document be the first result in a *:* query. I'm looking into index time boosting using the boost attribute on the appropriate doc. I haven't tested this yet, and I'm not sure this would do anything for the *:* queries. Thanks for any suggested reading or patterns... Best, Chris -- chris_war...@yahoo.com
Re: Solr file size limit?
On 4/18/2012 6:17 AM, Bram Rongen wrote: I'm using Solr 3.5 on a dedicated Ubuntu 10.04 box with 3TB of disk space and 16GB of memory. I've tried using the Sun JRE and OpenJDK, both resulting in the same problem. Indexing works great until my .fdt file reaches the size of 4.9GB/5217987319b. At this point, when Solr starts merging, it just keeps on merging, starting over and over.. Java is using all the available memory even though Xmx is set at 8G. When I restart Solr everything looks fine until merging is triggered. Whenever it hangs, the server load averages 3, searching is possible but slow, the Solr admin interface is reachable, but sending new documents leads to a time-out. Solr 3.5 works a little differently than previous versions (it MMaps all the index files), so if you look at the memory usage as reported by the OS, it's going to look all wrong. I've got my max heap set to 8192M, but this is what top looks like:

Mem:  64937704k total, 58876376k used,  6061328k free,   379400k buffers
Swap:  8388600k total,    77844k used,  8310756k free, 47080172k cached

  PID USER     PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
22798 ncindex  20   0 75.6g  21g  12g S  1.0 34.3 14312:55 java

If you add up the 47GB it says it's using for the disk cache, the 6GB that it says is free, and the 21GB it says that Java has resident, you end up with considerably more than the 64GB total RAM the machine has, even if you include the 77MB of swap that's used. You can use the jstat command to get a better idea of how much RAM Java really is using:

jstat -gc -t <pid> 5000

Add up the S0C, S1C, EC, OC, and PC columns. The alignment is often wrong on this output, so you'll have to count the columns. If I do this for my system, I end up with 8462972 KB. Alternatively, if you have a GUI installed on the server or you have set up remote JMX, you can use JConsole to very easily get a correct number. The extra memory reported by the OS is not really being used; it is a side effect of the memory mapping used by the Lucene indexes. I've tried using several different settings for MergePolicy and started reindexing a couple of times but the behavior stays the same. My current solrconf.xml can be found here: http://pastebin.com/NXDT0B8f. I'm unable to find errors in the log, which makes it really difficult to debug.. Could anyone point me in the right direction? A mergeFactor of 4 is extremely low and will result in very frequent merging. The default is 10. I use a value of 36, but that is unusually high. Looking at one of my indexes on that machine, the largest fdt file is 7657412 KB; the other three are tiny - 9880, 12160, and 28 KB. That index was recently optimized. The total index size is over 20GB. I have three indexes that size running in different cores on that machine. You're definitely not running into any limits as far as Solr is concerned. You might be running into I/O issues. Are you relying on autoCommit, or explicitly committing your updates and waiting for the commit to finish before doing more updates? When there is segment merging, commits can take a really long time. If you are using autoCommit, or not waiting for manual commits to finish, it might get bad enough that one commit has not yet finished when another is ready to take place. I don't know what this would actually do, but it would not be a good situation. How have you created your 3TB of disk space? If you are using RAID5 or RAID6, you can run into very serious and unavoidable performance problems with writes.
If it is a single disk, it may not provide enough IOPS for good performance. My servers also have 3TB of disk space, using six 1TB SATA drives in RAID10. The worst-case scenario for your merges is equivalent to an optimize. An optimize of one of my 20GB indexes takes 15 minutes even on RAID10, so I optimize only one large index per day, which means each large index gets optimized every six days. I hope this helps, but I'll be happy to try and offer more, within my skill set. Thanks, Shawn
Difference between Search result from Admin console and solr/browse
I have imported my XML documents from an Oracle database and indexed them. When I search *:* in the *admin console* I do get results. My XML format is not close to what Solr expects, but still, when I search for any word that is part of my XML document, Solr displays the whole XML document. For example, if I search for the word voicemail, Solr displays the XML documents that have the word voicemail. Now when I go to solr/browse and give *:* I do see something, but each result is like below (no data); even if I search for the same word voicemail I get the below. Can somebody please advise!

Price:
Features:
In Stock

There are only two things I can think of; one is the settings in solrconfig.xml (like below):

<requestHandler name="/browse" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <str name="wt">velocity</str>
    <str name="v.template">browse</str>
    <str name="v.layout">layout</str>
    <str name="title">Solritas</str>
    <str name="df">text</str>
    <str name="defType">edismax</str>
    <str name="q.alt">*:*</str>
    <str name="rows">10</str>
    <str name="fl">*,score</str>
    <str name="mlt.qf">
      text^0.5 features^1.0 name^1.2 sku^1.5 id^10.0 manu^1.1 cat^1.4
    </str>
    <str name="mlt.fl">text,features,name,sku,id,manu,cat</str>
    <int name="mlt.count">3</int>
    <str name="qf">
      text^0.5 features^1.0 name^1.2 sku^1.5 id^10.0 manu^1.1 cat^1.4
    </str>

-- View this message in context: http://lucene.472066.n3.nabble.com/Difference-between-Search-result-from-Admin-console-and-solr-browse-tp3921323p3921323.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solr file size limit?
On 4/18/2012 6:17 AM, Bram Rongen wrote: I've been using Solr for a very short time now and I'm stuck. I'm trying to index a drupal website consisting of 1.2 million smaller nodes and 300k larger nodes (~400kb avg).. A followup to my previous reply: Your ramBufferSizeMB is only 32, the default in the example config. I have seen recommendations indicating that going beyond 128MB is not usually helpful. With such large input documents, that may not apply to you - try setting it to 512 or 1024. That will result in far fewer index segments being created. They will be larger, so merges will be much less frequent but take longer. Thanks, Shawn
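For reference, a sketch of the setting Shawn suggests; in Solr 3.x it sits in the indexDefaults (or mainIndex) section of solrconfig.xml, with 512 here following his suggested value:

<ramBufferSizeMB>512</ramBufferSizeMB>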
Re: Populating a filter cache by means other than a query
Great question. The set could be in the millions. I oversimplified the use case somewhat to protect the innocent :-}. If a user is querying a large set of documents (for the sake of argument, let's say it's high tens of millions but could be in the small billions), they want to potentially mark a result set, or a subset of those docs, with a label/tag and use that label/tag later. Now let's throw in that it's a multi-tenant system and we don't want to keep re-indexing documents to add these tags. Really what I would want to do is to execute a query filtering by this labeled set; the server fetches the labeled set out of local cache, or over the wire, or off disk, and then incorporates it by one means or another as a filter (docset, or hashtable in the hit collector). Personally I think the dictionary approach wouldn't be a good one. It may produce the most optimal filter mechanism, but it will cost a bunch to construct the OpenBitSet. In a prior company I built a more generic version of this, for not only filtering but for sorting, aggregate stats, etc. We didn't use Solr. I was curious if there was any methodology for plugging in such a scheme without taking a branch of Solr and hacking at it. This was a multi-tenant system where we were producing aggregate graphs, filtering and ranking by things such as entity-level sentiment, so we produced a rather generic solution here that, as you pointed out, reinvented perhaps some things that smell similar. It was about 7B docs and was multi-tenant. Users were able to override these features on a document level, which was necessary so their counts, sorts, etc. worked correctly. Seeing how long it took me to build and debug that, if I can take something close off the shelf... well, you know the rest of the story :-} C On Apr 18, 2012, at 4:38 AM, Erick Erickson wrote: I guess my question is what advantage are you trying to get here? At the start, this feels like an XY problem. How are you intending to use the fq after you've built it? Because if there's any way to just create an fq clause, Solr will take care of it for you. Caching it, autowarming it when searchers are re-opened, etc. Otherwise, you're going to be re-inventing a bunch of stuff it seems to me, you'll have to intercept the queries coming in in order to apply the filter from the cache, etc. Which also may be another way of asking How big is this set of document IDs? If it's in the 100s, I'd just go with an fq. If it's more than that, I'd index some kind of set identifier that you could create for your fqs. And if this is gibberish, ignore me G.. Best Erick On Tue, Apr 17, 2012 at 4:34 PM, Chris Collins ch...@geekychris.com wrote: Hi, I am a long time Lucene user but new to Solr. I would like to use something like the filterCache, but build such a cache not from a query but from custom code. I guess I will ask my question by using techniques and vocab I am familiar with.
Not sure it's actually the right way, so I apologize if it's just the wrong approach. The scenario is that I would like to filter a result set by a set of labeled documents; I will call that set L. L contains app-specific document IDs that are indexed as literals in the Lucene field myid. I would imagine I could build an OpenBitSet by enumerating the termdocs and looking for the intersecting ids in my label set. Now I have my bitset that I assume I could use in a filter. Another approach would be to implement a hits collector, compute a fieldcache from that myid field, and look for the intersection in a hashtable of L at scoring time, throwing out results that are not contained in the hashtable. Of course I am working within the confines / concepts that Solr has laid out. Without going completely off the reservation, is there a neat way of doing such a thing with Solr? Glad to clarify if my question makes absolutely no sense. Best C
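As a rough sketch of the termdocs enumeration Chris describes, against the Lucene 3.x API (the field name myid is from his message; the class and method names are hypothetical):

import java.io.IOException;
import java.util.Set;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.util.OpenBitSet;

public final class LabelSetFilterBuilder {
    // Build a bitset of Lucene doc ids whose "myid" term is in the label set L.
    // The resulting OpenBitSet could then back a Filter or be cached per label.
    public static OpenBitSet build(IndexReader reader, Set<String> labelSet)
            throws IOException {
        OpenBitSet bits = new OpenBitSet(reader.maxDoc());
        TermDocs termDocs = reader.termDocs();
        try {
            for (String id : labelSet) {
                termDocs.seek(new Term("myid", id)); // field holding app-specific ids
                while (termDocs.next()) {
                    bits.set(termDocs.doc());
                }
            }
        } finally {
            termDocs.close();
        }
        return bits;
    }
}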
RE: Changing precisionStep without a re-index
In case anyone tries to do this... If you facet on a TrieField and change the precisionStep to 0, you'll need to re-index. Changing precisionStep to 0 changes the prefix returned by TrieField.getMainValuePrefix(FieldType), which then causes facets with a value of 0 to be returned. -Michael
Re: Date granularity
you could use a filter query like: fq=datefield:[NOW/DAY-1DAY TO NOW/DAY+1DAY] *replace datefield with your field that contains the time info On Wed, Apr 18, 2012 at 11:11 AM, vybe3142 vybe3...@gmail.com wrote: A query search on a particular date: returns 1valid result (as expected). How can I alter the granularity of the search for example , to all matches on the particular DAY? Reading through various docs, I attempt to append /DAY but this doesn't seem to work (in fact I get 0 results back when querying). What am I neglecting? Thanks -- View this message in context: http://lucene.472066.n3.nabble.com/Date-granularity-tp3920890p3920890.html Sent from the Solr - User mailing list archive at Nabble.com.
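Note that the range above spans from the start of yesterday through today. If the goal is matches on one specific day only, a single-day window is the usual pattern (field name assumed, as above):

fq=datefield:[NOW/DAY TO NOW/DAY+1DAY]

or, anchored to an explicit date rather than NOW:

fq=datefield:[2012-04-18T00:00:00Z TO 2012-04-18T00:00:00Z+1DAY]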
Re: Difference between Search result from Admin console and solr/browse
Hi, The /browse Request Handler is built to showcase the xml documents in solr/example/exampledata and if you want to use it for your own data and schema you must modify the templates in solr/example/conf/velocity/ to display whatever you want to display. Given that you use an unmodified example schmema, you should be able to get more or less the same results as in Admin console (which uses the Lucene query parser on default field text ootb) by querying for text:voicemail. If you then click the enable debug link at the bottom of the page and then click the toggle all fields links below each result hit, you will see what is contained in each and every field. What you probably *should* do is to transform your oracle XMLs into XML that corresponds with Solr's schema, and you should tweak your schema and Velocity templates to match what you'd like to output in the reults. A simple way to prototype transforms is to write an XSL and using the XSLTUpdateRequestHandler at solr/update/xslt instead of the XML handler. See http://wiki.apache.org/solr/XsltUpdateRequestHandler -- Jan Høydahl, search solution architect Cominvent AS - www.cominvent.com Solr Training - www.solrtraining.com On 18. apr. 2012, at 22:49, srini wrote: I have imported my xml documents from oracle database and indexed them. When I search *:* in *admin console *I do get results. My xml format is not close to what solr expects. but still when I search for any word that is part of my xml document Solr displays whole xml document. for example if I search for word voicemail solr displays xml documents that has word voicemail Now when I go to solr/browse and give *:* I do see some thing but each result is like below (no data) even if i search for same word voicemail I am getting below. Can some body !!please Advice! Price: Features: In Stock there are only two things I can think off, one is settings in solrconfig.xml(like below). requestHandler name=/browse class=solr.SearchHandler lst name=defaults str name=echoParamsexplicit/str str name=wtvelocity/str str name=v.templatebrowse/str str name=v.layoutlayout/str str name=titleSolritas/str str name=dftext/str str name=defTypeedismax/str str name=q.alt*:*/str str name=rows10/str str name=fl*,score/str str name=mlt.qf text^0.5 features^1.0 name^1.2 sku^1.5 id^10.0 manu^1.1 cat^1.4 /str str name=mlt.fltext,features,name,sku,id,manu,cat/str int name=mlt.count3/int str name=qf text^0.5 features^1.0 name^1.2 sku^1.5 id^10.0 manu^1.1 cat^1.4 /str -- View this message in context: http://lucene.472066.n3.nabble.com/Difference-between-Search-result-from-Admin-console-and-solr-browse-tp3921323p3921323.html Sent from the Solr - User mailing list archive at Nabble.com.
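To illustrate the XSLT route Jan mentions: the stylesheet goes in the core's conf/xslt/ directory and is selected with the tr parameter. A sketch with hypothetical file names:

curl "http://localhost:8983/solr/update/xslt?tr=oracle2solr.xsl&commit=true" \
  -H "Content-Type: text/xml" --data-binary @oracle-records.xml

Here oracle2solr.xsl would transform the Oracle XML into Solr's add/doc/field update format before indexing.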
Re: minimum match and not matched words / term frequency in query result
Hi, Which query terms match may of course vary from document to document, so it would be hard to globally print non-matching terms. But for each individual document match, you could deduce which terms do not match by enumerating the terms that DO match - using the explain output, for instance. -- Jan Høydahl, search solution architect Cominvent AS - www.cominvent.com Solr Training - www.solrtraining.com On 18. apr. 2012, at 17:34, giovanni.bricc...@banzai.it wrote: Hi I have a dismax query with a minimum match setting; this allows some terms to be missing in query results. I would like to give feedback to the user, highlighting the not-matched words. It would be interesting also to show the words with a very low frequency. For instance, searching for purple pendrive I would highlight that the results ignore the term purple, because we don't have any. Can you suggest how to approach the problem? I was thinking about the debugQuery output, but since I will not get details about all the results I will probably miss something. I am trying to write a new SearchComponent but I don't know how to get term frequency data from a ResponseBuilder object... I am new to solr/lucene programming. Thanks a lot
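For example, a dismax request that tolerates missing terms and exposes per-document explain output might look like this (parameter values are illustrative):

http://localhost:8983/solr/select?defType=dismax&qf=text&mm=1&q=purple+pendrive&debugQuery=on

Here mm=1 requires only one of the query clauses to match, and debugQuery=on adds the explain section from which the matching terms for each hit can be read off.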
Re: Solr 3.6 parsing and extraction files
Hi, I suppose you want to POST office docs into Solr for text extraction using the Extracting RequestHandler (SolrCell). Have you read this page? http://wiki.apache.org/solr/ExtractingRequestHandler You basically need all libs provided by contrib/extraction. You can see in the example solr/conf/solrconfig.xml which lib ../ directives are included near the top of the file, this should give you a hint of how to configure your own solrconfig.xml depending on where you put those libs. -- Jan Høydahl, search solution architect Cominvent AS - www.cominvent.com Solr Training - www.solrtraining.com On 18. apr. 2012, at 17:36, Tod wrote: Could someone possibly provide me with a list of jars that I need to extract from the apache-solr-3.6.0.tgz file to enable the parsing and remote streaming of office style documents? I assume (for a multicore configuration) they would go into ./tomcat/webapps/solr/WEB-INF/lib - correct? Thanks - Tod
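As a sketch, the directives near the top of the example solrconfig.xml look like the following; the relative paths assume the stock example layout and should point at wherever you actually put the contrib jars:

<lib dir="../../contrib/extraction/lib" />
<lib dir="../../dist/" regex="apache-solr-cell-\d.*\.jar" />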
Re: Populating a filter cache by means other than a query
Pesky users. Life would be so much easier if they'd just leave devs alone G Right. Well, you can certainly create your own SearchComponent and attach your custom filter at that point, note how I'm skimping on the details here. From left field, you might create a custom FunctionQuery that returns 0 in the case of excluded documents. Since that gets multiplied into the score, the resulting score is 0. Returning 1 for docs that should be kept wouldn't change the score. But other than that, I'll leave it to the folks in the code. Chris, you there? G.. Best Erick On Wed, Apr 18, 2012 at 5:14 PM, Chris Collins ch...@geekychris.com wrote: Great question. The set could be in the millions. I over simplified the use case somewhat to protect the innocent :-}. If a user is querying a large set of documents (for the sake of argument lets say its high tens of millions but could be in the small billions), they want to potentially mark a result set or subset of those docs with a label/tag and use that label /tag later. Now lets throw in its multi tenant system and we dont want to keep re-indexing documents to add these tags. Really what I would want todo is to execute a query filtering by this labeled set, the server fetches the labeled set out of local cache or over the wire or off disk and then incorporates it by one means or another as a filter (docset or hashtable in the hitcollector). Personally I think the dictionary approach wouldnt be a good one. It may produce the most optimal filter mechanism but will cost a bunch to construct the OpenBitSet. In a prior company I built a more generic version of this for not only filtering but for sorting, aggregate stats, etc. We didn't use Solr. I was curious if there was any methodology for plugging in such a scheme without taking a branch of solr and hacking at it. This was a multi tenant system where we were producing aggregate graphs, filtering and ranking by things such as entity level sentiment so we produced a rather generic solution here that as you pointed out reinvented perhaps some things that smell similar. It was about 7B docs and was multi tenant. Users were able to overide these features on a document level which was necessary so their counts, sorts etc worked correctly. Saying how long it took me to build and debug it if I can take something close off the shelf.well you know the rest of the story :-} C On Apr 18, 2012, at 4:38 AM, Erick Erickson wrote: I guess my question is what advantage are you trying to get here? At the start, this feels like an XY problem. How are you intending to use the fq after you've built it? Because if there's any way to just create an fq clause, Solr will take care of it for you. Caching it, autowarming it when searchers are re-opened, etc. Otherwise, you're going to be re-inventing a bunch of stuff it seems to me, you'll have to intercept the queries coming in in order to apply the filter from the cache, etc. Which also may be another way of asking How big is this set of document IDs? If it's in the 100s, I'd just go with an fq. If it's more than that, I'd index some kind of set identifier that you could create for your fqs. And if this is gibberish, ignore me G.. Best Erick On Tue, Apr 17, 2012 at 4:34 PM, Chris Collins ch...@geekychris.com wrote: Hi, I am a long time Lucene user but new to solr. I would like to use something like the filterCache but build a such a cache not from a query but custom code. I guess I will ask my question by using techniques and vocab I am familiar with. 
Not sure its actually the right way so I appologize if its just the wrong approach. The scenario is that I would like to filter a result set by a set of labeled documents, I will call that set L. L contains app specific document IDs that are indexed as literals in the lucenefield myid. I would imagine I could build a OpenBitSet from enumerating the termdocs and look for the intersecting ids in my label set. Now I have my bitset that I assume I could use in a filter. Another approach would be to implement a hits collector, compute a fieldcache from that myid field and look for the intersection in a hashtable of L at scoring time, throwing out results that are not contained in the hashtable. Of course I am working within the confines / concepts that SOLR has layed out. Without going completely off the reservation is their a neat way of doing such a thing with SOLR? Glad to clarify if my question makes absolutely no sense. Best C
Re: Multiple document structure
Solr does not enforce anything about documents conforming to the schema except:

1. a field specified in a doc must be present in the schema
2. any field in the schema with required="true" must be present in the doc

Additionally, there is no penalty for NOT putting all the fields defined in the schema into a particular document. What this means: just create your schema with all the fields you'll need for both types of documents, probably along with a type field to distinguish the two (a sketch follows below). Now just index the separate document types in the same index. Best Erick On Wed, Apr 18, 2012 at 9:28 AM, Gora Mohanty g...@mimirtech.com wrote: On 18 April 2012 10:05, abhijit bashetti bashettiabhi...@rediffmail.com wrote: Hi, Is it possible to have 2 document structures in solr? [...] Do not think so, but why do you need it? Use two separate indices, either in a multi-core setup, or in separate Solr instances. Regards, Gora
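A minimal sketch of the type-field idea Erick describes (field and value names are hypothetical):

<field name="doctype" type="string" indexed="true" stored="true"/>

At query time, restrict results to one document structure with a filter query, e.g. fq=doctype:book.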
Re: Date granularity
If Peter's suggestion doesn't work, please post the results of adding debugQuery=on to your query. The date math stuff is sensitive to spaces, for instance and it's impossible to tell whether you're making a simple error like that without seeing what you're actually doing. Best Erick On Wed, Apr 18, 2012 at 6:46 PM, Peter Markey sudoma...@gmail.com wrote: you could use a filter query like: fq=datefield:[NOW/DAY-1DAY TO NOW/DAY+1DAY] *replace datefield with your field that contains the time info On Wed, Apr 18, 2012 at 11:11 AM, vybe3142 vybe3...@gmail.com wrote: A query search on a particular date: returns 1valid result (as expected). How can I alter the granularity of the search for example , to all matches on the particular DAY? Reading through various docs, I attempt to append /DAY but this doesn't seem to work (in fact I get 0 results back when querying). What am I neglecting? Thanks -- View this message in context: http://lucene.472066.n3.nabble.com/Date-granularity-tp3920890p3920890.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solr Core not able access latest data indexed by multiple server.
I think you're trying to do something that you shouldn't. The trunk SolrCloud stuff will address this issue, but for the 3.x code line, having multiple servers opening up a shared index and writing to it will produce unpredictable results. This is really bad practice. You'd be far ahead setting up one of these machines as a master and the other as a slave, and always indexing to the master. Best Erick On Wed, Apr 18, 2012 at 1:17 AM, Paresh Modi pm...@asite.com wrote: Hi, I am using Solr multicore approach in my app. we have two different servers (ServerA1 and ServerA2) for load balancing, both the server accessing the same index repository and request will go to any server as per load balance algorithm. Problem occurs in following way [Note that both the servers accessing the same physical location(index)]. - ADD TO INDEX request for File1 go to ServerA1 for core CR1, core CR1 loaded in ServerA1 and indexing done. - ADD TO INDEX request for File2 go to ServerA2 for core CR1, core CR1 loaded in ServerA2 and indexing done. - SEARCH request for File2 go to ServerA1, now here core CR1 is already loaded so it directly access the index but File2 added by ServerA2 is not found in core loaded by ServerA1. So this is the problem, File2 indexed by core CR1 loaded in ServerA2 is not available in core CR1 loaded by ServerA1. I have searched and found that the solution to this problem is reload the CORE. when you reload the core, it will have latest indexed data. but reloading the Core for every request is very heavy and time consuming process. Please let me know if anyone has any solution for this. Waiting for your expert advice. Thanks Paresh -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-Core-not-able-access-latest-data-indexed-by-multiple-server-tp3919113p3919113.html Sent from the Solr - User mailing list archive at Nabble.com.
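A rough sketch of the master/slave wiring Erick suggests, using solr.ReplicationHandler in each server's solrconfig.xml (the URL, core name, and poll interval are placeholders drawn from the question). On the master, the server all indexing goes to:

<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <str name="replicateAfter">commit</str>
    <str name="confFiles">schema.xml,stopwords.txt</str>
  </lst>
</requestHandler>

On the slave, which then serves searches from its own replicated copy instead of a shared directory:

<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <str name="masterUrl">http://ServerA1:8983/solr/CR1/replication</str>
    <str name="pollInterval">00:00:60</str>
  </lst>
</requestHandler>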
Re: Problems with edismax parser and solr3.6
Happened to see that Jan confirms this as a bug, see: https://issues.apache.org/jira/browse/SOLR-3377 On Wed, Apr 18, 2012 at 11:00 AM, Bernd Fehling bernd.fehl...@uni-bielefeld.de wrote: I just looked through my logs of Solr 3.6 and saw several 0 hits which were not seen with Solr 3.5. While tracing this down, it turned out that edismax doesn't like queries of type ...q=(text:ide)... any more. If there are parentheses around the query term, edismax fails with Solr 3.6. Can anyone confirm this and give me feedback? Bernd
Re: Problems with edismax parser and solr3.6
Hi, Thanks for reporting this. I've created a bug ticket for this at https://issues.apache.org/jira/browse/SOLR-3377 -- Jan Høydahl, search solution architect Cominvent AS - www.cominvent.com Solr Training - www.solrtraining.com On 18. apr. 2012, at 17:00, Bernd Fehling wrote: I just looked through my logs of Solr 3.6 and saw several 0 hits which were not seen with Solr 3.5. While tracing this down, it turned out that edismax doesn't like queries of type ...q=(text:ide)... any more. If there are parentheses around the query term, edismax fails with Solr 3.6. Can anyone confirm this and give me feedback? Bernd