Re: How partial are partial updates
On Thu, Mar 26, 2015 at 12:23 PM, kennyk ke...@ontoforce.com wrote: Does Solr have to reindex the whole document and not just the modified fields?

Yep, you are right.

If so, can you give me an idea of the amount (factor) of speed gained by partial re-indexing?

The cost is exactly the same as full indexing, and a little worse, because the stored fields have to be read back first. There is a notion of true field updates in Lucene, but it doesn't update the inverted index, nor is it available in Solr. -- Sincerely yours Mikhail Khludnev Principal Engineer, Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com
How partial are partial updates
Hi all, I have a question. Here https://cwiki.apache.org/confluence/display/solr/Updating+Parts+of+Documents I read that "Solr supports several modifiers that atomically update values of a document. This allows updating only specific fields," and that "All original source fields must be stored for field modifiers to work correctly". And here https://wiki.apache.org/solr/Atomic_Updates even more explicitly: "Internally Solr re-adds the document to the index with the updated fields." Does Solr have to reindex the whole document and not just the modified fields? If so, can you give me an idea of the amount (factor) of speed gained by partial re-indexing?
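For reference, an atomic update goes through the normal update handler; only the payload is partial. A minimal sketch, assuming a core named collection1 with an "id" uniqueKey and a stored "price" field (both placeholder names, not from this thread):

  curl 'http://localhost:8983/solr/collection1/update?commit=true' \
    -H 'Content-Type: application/json' \
    --data-binary '[{"id":"doc1","price":{"set":99}}]'

As the quoted wiki text says, Solr still fetches the stored fields and re-indexes the whole document internally; only the client-side payload is smaller.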
RE: German Compound Splitter words.fst causing problems.
Thanks for the tip Markus. We are using this filter to decompound German words. Update: I am on the path to victory. The words.fst file is actually built by the plugin; however, there is a basic input/output file format mismatch (at the byte level) that doesn't occur with 4.0. As soon as you try to use Lucene core 4.1 with this particular plugin, it breaks with the same error I was getting. The FST code in Lucene says clearly that there is no guaranteed backward compatibility, so there you have it. I'm probably going to need to incorporate some older code from Lucene and/or figure out how to make the plugin work with the new Lucene code. -Chris.

-Original Message- From: Markus Jelsma [mailto:markus.jel...@openindex.io] Sent: Wednesday, March 25, 2015 6:15 PM To: solr-user@lucene.apache.org Subject: RE: German Compound Splitter words.fst causing problems.

Hello Chris - I don't know the token filter you mention, but I would like to recommend Lucene's HyphenationCompoundWordTokenFilter. It works reasonably well if you provide the hyphenation rules and a dictionary. It has some flaws, such as decompounding to irrelevant subwords, overlapping subwords, or subwords that do not form the whole compound word (minus genitives), but these can be fixed. Markus

-Original message- From: Chris Morley ch...@depahelix.com Sent: Wednesday 25th March 2015 17:59 To: solr-user@lucene.apache.org Subject: German Compound Splitter words.fst causing problems.

Hello, Chris Morley here, of Wayfair.com. I am working on the German compound-splitter by Dawid Weiss. I tried to upgrade the words.fst file that comes with the German compound-splitter using Solr 3.5, but it doesn't work. Below is the IndexNotFoundException that I get.

cmorley@Caracal01:~/Work/oss/git/apache-solr-3.5.0$ java -cp lucene/build/lucene-core-3.5-SNAPSHOT.jar org.apache.lucene.index.IndexUpgrader wordsFst
Exception in thread "main" org.apache.lucene.index.IndexNotFoundException: org.apache.lucene.store.MMapDirectory@/home/cmorley/Work/oss/git/apache-solr-3.5.0/wordsFst lockFactory=org.apache.lucene.store.NativeFSLockFactory@201a755e
at org.apache.lucene.index.IndexUpgrader.upgrade(IndexUpgrader.java:118)
at org.apache.lucene.index.IndexUpgrader.main(IndexUpgrader.java:85)

The reason I'm attempting this at all is the answer here, http://stackoverflow.com/questions/25450865/migrate-solr-1-4-index-files-to-4-7, which says to do the upgrade in a two-step process, first using Solr 3.5 and then the latest Solr version (4.10.3). When I try this and run the unit tests for my modified German compound-splitter, I get this same type of error. The thing is, this is an FST, not an index, which is a little confusing. The reason I'm following this answer, though, is because I'm getting that exact same message when trying to build the (modified) project with maven, at the point at which it tries to load in words.fst. Below.

[main] ERROR com.wayfair.lucene.analysis.de.compound.GermanCompoundSplitter - Format version is not supported (resource: com.wayfair.lucene.analysis.de.compound.InputStreamDataInput@79a66240): 0 (needs to be between 3 and 4). This version of Lucene only supports indexes created with release 3.0 and later. Failed to initialize static data structures for German compound splitter.

Thanks, -Chris.
Installing the auto-phrase-tokenfilter
Hello, I am looking to install the auto-phrase-tokenfilter from https://github.com/LucidWorks/auto-phrase-tokenfilter. Can anyone point me to some documentation on how to do this? Thanks Luis Martinez
Running test cases with ant
Hello, I am trying to run my test cases in Solr using ant. I am using the command below:

ant test -Dtestcase=Test -Dtests.leaveTemporary=true

Now, here I have my own custom schema and solrconfig. On running the above command in the Solr directory, it builds the project again, which overrides my schema.xml and solrconfig.xml. Because of this my test case fails, since it is not able to find the customized schema and config. Let me know any suggestions. Thanks Mrinali
Re: Running test cases with ant
On 3/26/2015 6:40 AM, Mrinali Agarwal wrote: I am trying to run my test cases in Solr using ant. I am using the command below: ant test -Dtestcase=Test -Dtests.leaveTemporary=true. Now, here I have my own custom schema and solrconfig. On running the above command in the Solr directory, it builds the project again, which overrides my schema.xml and solrconfig.xml. Because of this my test case fails, since it is not able to find the customized schema and config. Let me know any suggestions.

Take a look at org.apache.solr.search.TestLFUCache for an example of a test that loads a custom solrconfig. The custom config is here:

solr/core/src/test-files/solr/collection1/conf/solrconfig-caching.xml

The code in TestLFUCache.java that uses that config is:

@BeforeClass
public static void beforeClass() throws Exception {
  initCore("solrconfig-caching.xml", "schema.xml");
}

Thanks, Shawn
Different methods of sending documents to Solr
Hi All, I am trying to post data into Solr using the curl command. Could anybody tell me the difference between the following two methods?

Method 1:

curl "http://localhost:8983/solr/update/extract?literal.id=doc1&commit=true" -F myfile=@tutorial.html

The -F flag instructs curl to POST data using the Content-Type multipart/form-data and supports the uploading of binary files.

Method 2:

curl "http://localhost:8983/solr/update/extract?literal.id=doc1&defaultField=text&commit=true" --data-binary @tutorial.html -H 'Content-type:text/html'

Consider my situation: I want to post many different content-types of files into Solr. Which method should I choose? Thank you so much. Sincerely, Xiaoha
Re: i'm a newb: questions about schema.xml
Yes, this is the correct page, which will tell you more about this managed-schema thing in Solr 5.0.0. I got stuck on this for quite a while previously too. Regards, Edwin

On 27 March 2015 at 08:20, Mark Bramer mbra...@esri.com wrote: Pretty sure I found what I am looking for: https://cwiki.apache.org/confluence/display/solr/Managed+Schema+Definition+in+SolrConfig I noticed the managed-schema file, and a couple of Google searches with that finally landed me at that link. Interesting that the file is hidden from the Files list in the Admin UI. Thanks!

-Original Message- From: Mark Bramer Sent: Thursday, March 26, 2015 7:42 PM To: 'solr-user@lucene.apache.org' Subject: RE: i'm a newb: questions about schema.xml

Hi Shawn, Definitely helpful to know about the instance and files stuff in Admin. I'm not running cloud, so I looked in the /conf directory but there's no schema.xml. Here's what's in my core's Files: currency.xml elevate.xml lang params.json protwords.txt solrconfig.xml stopwords.txt synonyms.txt

and echoed by ls -l:

-rw-r--r-- 1 root root 3974 Feb 15 11:38 currency.xml
-rw-r--r-- 1 root root 1348 Feb 15 11:38 elevate.xml
drwxr-xr-x 2 root root 4096 Mar 23 10:46 lang
-rw-r--r-- 1 root root 29733 Mar 23 18:04 managed-schema
-rw-r--r-- 1 root root 308 Feb 15 11:38 params.json
-rw-r--r-- 1 root root 873 Feb 15 11:38 protwords.txt
-rw-r--r-- 1 root root 60591 Feb 15 11:38 solrconfig.xml
-rw-r--r-- 1 root root 781 Feb 15 11:38 stopwords.txt
-rw-r--r-- 1 root root 1119 Feb 15 11:38 synonyms.txt

-Original Message- From: Shawn Heisey [mailto:apa...@elyograg.org] Sent: Thursday, March 26, 2015 7:28 PM To: solr-user@lucene.apache.org Subject: Re: i'm a newb: questions about schema.xml

On 3/26/2015 4:57 PM, Mark Bramer wrote: I'm a Solr newb. I've been poking around for several days on my own test instance, and also online at the info available. But one thing just isn't jiving and I can't put my finger on why. I've searched many many times but I don't see what I'm looking for, so I'm thinking perhaps I have a fundamental semantic misunderstanding of something somewhere. Everywhere I read, everyone talks about schema.xml and how important it is. I fully get what it's for, but I don't get where it is, how it's used (by me), how I edit it, and how I create new indexes once I've edited it. I've installed, and am successfully running, Solr 5.0.0 on Linux. I've followed the widely recommended quick start at http://lucene.apache.org/solr/quickstart.html. I get through it fine, I post a bunch of stuff, I use the web UI to query for, and see, data I would expect to see. Should I now have a schema.xml file somewhere that is somehow connected to my new index? If so, where is it? Was it present from install or did it get created when I made my first core (bin/solr create -c ati_docs)?

[root@machine solr-5.0.0]# find -name schema.xml
./example/example-DIH/solr/tika/conf/schema.xml
./example/example-DIH/solr/rss/conf/schema.xml
./example/example-DIH/solr/solr/conf/schema.xml
./example/example-DIH/solr/db/conf/schema.xml
./example/example-DIH/solr/mail/conf/schema.xml
./server/solr/configsets/basic_configs/conf/schema.xml
./server/solr/configsets/sample_techproducts_configs/conf/schema.xml
[root@machine solr-5.0.0]#

Is it the one in /configsets/basic_configs/conf? Is that the default one? If I want to 'modify' schema.xml to do some different indexing/analyzing, how do I start? Make a copy of that schema.xml, move it somewhere else and modify it? If so, how do I create a new index using this schema.xml? Or am I running in schemaless mode? I don't think I am, because it appears that I would have to specifically state this as a command line parameter, i.e. bin/solr start -e schemaless. What fundamentals am I missing? I'm coming to Solr from Elasticsearch, and I've already recognized some differences. Is my ES background clouding my grasp of Solr fundamentals?

Hopefully you know what core you are using, so you can go to the admin UI and find it in the Core Selector dropdown list. Assuming you can do that, you will find yourself looking at the Overview tab for that core.

https://cwiki.apache.org/confluence/display/solr/Using+the+Solr+Administration+User+Interface

Once you are looking at the core overview, in the upper right corner of your browser window is a section called Instance ... which has an entry that is ALSO called Instance. Inside the directory indicated by that field, you should have a conf directory. The config and schema for that index are found in that conf directory. If you're running SolrCloud, then you can forget everything I just said ... the active configs will be found within the zookeeper database, and you can use the Cloud->Tree tab in the admin UI to find your collections and see which configName is linked to each one. You'll want to become familiar with the zkcli script in server/scripts/cloud-scripts.
solr server datetime
Is it possible to retrieve the server datetime?
Re: How to create a core by API?
On Thu, Mar 26, 2015 at 1:31 PM, Erick Erickson erickerick...@gmail.com wrote: Hmmm, looks like I stand corrected. I haven't kept complete track there; looks like this one didn't stick in my head.

I'm not saying you're wrong. The configSet parameter doesn't work at all in my setup, so you might be right... I'm just wondering where that's documented. I thought Solr documentation was rough back in the 1.6 days, but wow... it's gotten shockingly bad in Solr 5.

As far as the docs are concerned, all patches welcome!

What kind of patch do you mean? Isn't all the documentation maintained on Confluence? -- Mark E. Haase 202-815-0201
Re: How to create a core by API?
Okay, thanks for the feedback. I'll admit that I do find the cloud vs non-cloud deployment options a constant source of confusion, not the least of which is due to the name. If I run a single Solr instance on EC2, that's not cloud, but if I run a few instances with ZK on my local LAN, that is cloud. Mmmkay.

I can't imagine why the API documentation wouldn't mention that the API can't actually do the thing it's supposed to do (create a core). What's the purpose of having an HTTP API if I'm expected to already have write access to the host's file system to use it? Maybe it's intended as a private API? It should only be used by Solr itself, e.g. `solr create -c foo` uses the Cores Admin API to do some (but not all) of its work. But if that's the case, then the API docs should say that. From an API consumer's point of view, I'm not really interested in being forced to learn the history of the project to use the API. The whole point of creating APIs is to abstract out details that the caller doesn't need to know, and yet this API requires an understanding of Solr's internal file structure and history of the project? Yikes.

On Thu, Mar 26, 2015 at 12:56 PM, Erick Erickson erickerick...@gmail.com wrote: Ok, you're being confused by cloud, non-cloud and all that kind of stuff. Configsets are SolrCloud only, so forget them, since you specified it's not SolrCloud. bq: surely the HTTP API doesn't require the caller to create a directory and copy files first, does it. In fact, yes. The thing to remember here is that you're using a much older approach that had its roots in the pre-cloud days. The problem is: how do you ensure that the configurations are on the node you're creating the core on? The whole configsets discussion is an attempt to solve that in SolrCloud by putting the configs in a place any Solr instance can find them, namely Zookeeper. But in non-cloud mode, there's no central repository. You could be firing the query from node X and creating the core on node Y. So Solr expects the config files to already be in place; you have to manually copy them to node Y anyway, so why not copy them to the place they'll be needed? The scripts assume that you're running on the same node you're running the scripts on, for quick-start purposes. Best, Erick

On Thu, Mar 26, 2015 at 9:24 AM, Mark E. Haase meha...@gmail.com wrote: I can't get the Core Admin API to work. I have a brand new installation of Solr 5.0.0 (in non-cloud mode). I installed using the installation script (a nice addition!) with default options, so I have Solr in /opt/solr and its data in /var/solr. Here's what I'm trying:

curl 'http://localhost:8983/solr/admin/cores?action=CREATE&name=new_core'

But I get this error: Error CREATEing SolrCore 'new_core': Unable to create core [new_core] Caused by: Can't find resource 'solrconfig.xml' in classpath or '/var/solr/data/new_core/conf'. Solr isn't even creating /var/solr/data/new_core, which I guess is the root of the problem. But /var/solr is owned by the solr user and I can do `sudo -u solr mkdir /var/solr/data/new_core` just fine. So why isn't Solr making this directory? I see that 'instanceDir' is required, but I don't get an error message if I *don't* use it, so I'm not sure how required it actually is. I'm also not sure if it's supposed to be a full path or a relative path or what, so here are a couple of other guesses at the correct incantation:

curl 'http://localhost:8983/solr/admin/cores?action=CREATE&name=new_core&instanceDir=new_core'
curl 'http://localhost:8983/solr/admin/cores?action=CREATE&name=new_core&instanceDir=/var/solr/data/new_core'

These both return the same error message as my first try, so no dice... FWIW, I get the same error message even if I try doing this with the Solr Admin GUI, so I'm really puzzled. Is the GUI supposed to work? I found a thread on Stack Overflow about this same problem (http://stackoverflow.com/a/28945428/122763) that suggests using configSet. Okay, the installer put some config sets in /opt/solr/server/solr/configsets, and the 'basic_config' config set has a solrconfig.xml in it, so maybe that would solve my solrconfig.xml error? If I compare the HTTP API to the `solr create -c foo` script, it appears that the script creates the instance directory and copies in conf files *before* it calls the HTTP API... surely the HTTP API doesn't require the caller to create a directory and copy files first, does it? -- Mark E. Haase -- Mark E. Haase 202-815-0201
Re: How to create a core by API?
Erick, are you sure that configSets don't apply to single-node Solr instances? https://cwiki.apache.org/confluence/display/solr/Config+Sets I don't see anything about SolrCloud there. Also, configSet is a documented argument to the Core Admin API: https://cwiki.apache.org/confluence/display/solr/CoreAdmin+API#CoreAdminAPI-CREATE And one of the few things [I thought] I knew about cloud vs non-cloud setups was that the Collections API is for cloud and the Cores API is for non-cloud, right? So why would the non-cloud API take a cloud-only argument?

On Thu, Mar 26, 2015 at 1:16 PM, Mark E. Haase meha...@gmail.com wrote: Okay, thanks for the feedback. I'll admit that I do find the cloud vs non-cloud deployment options a constant source of confusion, not the least of which is due to the name. If I run a single Solr instance on EC2, that's not cloud, but if I run a few instances with ZK on my local LAN, that is cloud. Mmmkay. I can't imagine why the API documentation wouldn't mention that the API can't actually do the thing it's supposed to do (create a core). What's the purpose of having an HTTP API if I'm expected to already have write access to the host's file system to use it? Maybe it's intended as a private API? It should only be used by Solr itself, e.g. `solr create -c foo` uses the Cores Admin API to do some (but not all) of its work. But if that's the case, then the API docs should say that. From an API consumer's point of view, I'm not really interested in being forced to learn the history of the project to use the API. The whole point of creating APIs is to abstract out details that the caller doesn't need to know, and yet this API requires an understanding of Solr's internal file structure and history of the project? Yikes.

On Thu, Mar 26, 2015 at 12:56 PM, Erick Erickson erickerick...@gmail.com wrote: Ok, you're being confused by cloud, non-cloud and all that kind of stuff. Configsets are SolrCloud only, so forget them, since you specified it's not SolrCloud. bq: surely the HTTP API doesn't require the caller to create a directory and copy files first, does it. In fact, yes. The thing to remember here is that you're using a much older approach that had its roots in the pre-cloud days. The problem is: how do you ensure that the configurations are on the node you're creating the core on? The whole configsets discussion is an attempt to solve that in SolrCloud by putting the configs in a place any Solr instance can find them, namely Zookeeper. But in non-cloud mode, there's no central repository. You could be firing the query from node X and creating the core on node Y. So Solr expects the config files to already be in place; you have to manually copy them to node Y anyway, so why not copy them to the place they'll be needed? The scripts assume that you're running on the same node you're running the scripts on, for quick-start purposes. Best, Erick

On Thu, Mar 26, 2015 at 9:24 AM, Mark E. Haase meha...@gmail.com wrote: I can't get the Core Admin API to work. I have a brand new installation of Solr 5.0.0 (in non-cloud mode). I installed using the installation script (a nice addition!) with default options, so I have Solr in /opt/solr and its data in /var/solr. Here's what I'm trying:

curl 'http://localhost:8983/solr/admin/cores?action=CREATE&name=new_core'

But I get this error: Error CREATEing SolrCore 'new_core': Unable to create core [new_core] Caused by: Can't find resource 'solrconfig.xml' in classpath or '/var/solr/data/new_core/conf'. Solr isn't even creating /var/solr/data/new_core, which I guess is the root of the problem. But /var/solr is owned by the solr user and I can do `sudo -u solr mkdir /var/solr/data/new_core` just fine. So why isn't Solr making this directory? I see that 'instanceDir' is required, but I don't get an error message if I *don't* use it, so I'm not sure how required it actually is. I'm also not sure if it's supposed to be a full path or a relative path or what, so here are a couple of other guesses at the correct incantation:

curl 'http://localhost:8983/solr/admin/cores?action=CREATE&name=new_core&instanceDir=new_core'
curl 'http://localhost:8983/solr/admin/cores?action=CREATE&name=new_core&instanceDir=/var/solr/data/new_core'

These both return the same error message as my first try, so no dice... FWIW, I get the same error message even if I try doing this with the Solr Admin GUI, so I'm really puzzled. Is the GUI supposed to work? I found a thread on Stack Overflow about this same problem (http://stackoverflow.com/a/28945428/122763) that suggests using configSet. Okay, the installer put some config sets in /opt/solr/server/solr/configsets, and the 'basic_config' config set has a solrconfig.xml in it, so maybe that would solve my solrconfig.xml error? If I compare the HTTP API to the `solr create -c foo` script, it appears that the script creates the instance directory and copies in conf files *before* it calls the HTTP API... surely the HTTP API doesn't require the caller to create a directory and copy files first, does it?
Re: Solr Monitoring - Stored Stats?
Have a look at the admin UI, plugins/stats. I've just spent the time to re-implement it in AngularJS, so I know the functionality is there - twice :-) You can "watch for changes" - it pulls in a reference XML, and posts that back to the server, which only reports back changes. Dunno if that gives you what you are after? Upayavira

On Thu, Mar 26, 2015, at 03:15 PM, Matt Kuiper wrote: Erick, Shawn, Thanks for your responses. I figured this was the case, just wanted to check to be sure. I have used Zabbix to configure JMX points to monitor over time, but it was a bit of work to get configured. We are looking to create a simple dashboard of a few stats over time. Looks like the easiest approach will be to make an app that calls for these stats at a regular interval and then indexes the results to Solr, and then we will be able to query over desired time frames... Thanks, Matt

-Original Message- From: Erick Erickson [mailto:erickerick...@gmail.com] Sent: Wednesday, March 25, 2015 10:30 AM To: solr-user@lucene.apache.org Subject: Re: Solr Monitoring - Stored Stats?

Matt: Not really. There are a bunch of third-party log analysis tools that give much of this information (not everything exposed by JMX is in the log files, though). Not quite sure whether things like Nagios, Zabbix and the like have this kind of stuff built in; it seems like a natural extension of those kinds of tools, though. Not much help here... Erick

On Wed, Mar 25, 2015 at 8:26 AM, Matt Kuiper matt.kui...@issinc.com wrote: Hello, I am familiar with the JMX points that Solr exposes to allow for monitoring of statistics like QPS, numdocs, average query time... I am wondering if there is a way to configure Solr to automatically store the value of these stats over time (for a given time interval), and then allow a user to query a stat over a time range. So for the QPS stat, the query might return a set that includes the QPS value for each hour in the time range specified. Thanks, Matt
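For a polling app like the one Matt describes, the same statistics behind the admin UI's plugins/stats screen are also reachable over plain HTTP via the mbeans handler, so the collector can be a simple cron job; a sketch, with the core name as a placeholder:

  curl 'http://localhost:8983/solr/collection1/admin/mbeans?stats=true&wt=json&indent=true'

Each poll returns the current values (request counts, numDocs, average query times, etc.), which can then be timestamped and indexed back into Solr for time-range queries.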
Uneven index distribution using composite router
Hi, I'm using a three-level composite router in a Solr Cloud environment, primarily for multi-tenancy and field collapsing. The format is language!topic!url. An example would be:

ENU!12345!www.testurl.com/enu/doc1
GER!12345!www.testurl.com/ger/doc2
CHS!67890!www.testurl.com/chs/doc3

The Solr Cloud cluster contains 2 shards, each having 3 replicas. After indexing around 10 million documents, I'm observing that the index size in shard 1 is around 60gb while shard 2 is 15gb. So the bulk of the data is getting indexed in shard 1. Since 60% of the documents are English, I expect the index size to be higher on one shard, but the difference seems a little too high. The idea is to make sure that all ENU!12345 documents are routed to one shard so that distributed field collapsing works. Is there something I can do differently here to make a better distribution? Any pointers will be appreciated. Regards, Shamik
How to create a core by API?
I can't get the Core Admin API to work. I have a brand new installation of Solr 5.0.0 (in non-cloud mode). I installed using the installation script (a nice addition!) with default options, so I have Solr in /opt/solr and its data in /var/solr. Here's what I'm trying:

curl 'http://localhost:8983/solr/admin/cores?action=CREATE&name=new_core'

But I get this error: Error CREATEing SolrCore 'new_core': Unable to create core [new_core] Caused by: Can't find resource 'solrconfig.xml' in classpath or '/var/solr/data/new_core/conf'

Solr isn't even creating /var/solr/data/new_core, which I guess is the root of the problem. But /var/solr is owned by the solr user and I can do `sudo -u solr mkdir /var/solr/data/new_core` just fine. So why isn't Solr making this directory? I see that 'instanceDir' is required, but I don't get an error message if I *don't* use it, so I'm not sure how required it actually is. I'm also not sure if it's supposed to be a full path or a relative path or what, so here are a couple of other guesses at the correct incantation:

curl 'http://localhost:8983/solr/admin/cores?action=CREATE&name=new_core&instanceDir=new_core'
curl 'http://localhost:8983/solr/admin/cores?action=CREATE&name=new_core&instanceDir=/var/solr/data/new_core'

These both return the same error message as my first try, so no dice... FWIW, I get the same error message even if I try doing this with the Solr Admin GUI, so I'm really puzzled. Is the GUI supposed to work? I found a thread on Stack Overflow about this same problem (http://stackoverflow.com/a/28945428/122763) that suggests using configSet. Okay, the installer put some config sets in /opt/solr/server/solr/configsets, and the 'basic_config' config set has a solrconfig.xml in it, so maybe that would solve my solrconfig.xml error? If I compare the HTTP API to the `solr create -c foo` script, it appears that the script creates the instance directory and copies in conf files *before* it calls the HTTP API... surely the HTTP API doesn't require the caller to create a directory and copy files first, does it? -- Mark E. Haase
Re: Applying Tokenizers and Filters to CopyFields
Glad it worked out... Looking back, I can't believe I didn't mention adding debug=query to the URL. That would have shown you exactly what the parsed query looked like, and you'd have seen right off that it wasn't searching against the field you thought it was. It's one of the first things I do when queries don't return what I expect. Glad it's working for you! Erick

On Thu, Mar 26, 2015 at 8:24 AM, Michael Della Bitta michael.della.bi...@appinions.com wrote: Glad you are sorted out! Michael Della Bitta Senior Software Engineer o: +1 646 532 3062 appinions inc. "The Science of Influence Marketing" 18 East 41st Street New York, NY 10017 t: @appinions https://twitter.com/Appinions | g+: plus.google.com/appinions https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts w: appinions.com http://www.appinions.com/

On Thu, Mar 26, 2015 at 10:09 AM, Martin Wunderlich martin...@gmx.net wrote: Thanks so much, Erick and Michael, for all the additional explanation. The crucial information in the end turned out to be the one about the Default Search Field ("df"). In solrconfig.xml this parameter was set to point to the original text, which is why the expanded queries didn't work. When I set the df parameter to one of the fields with the expanded text, the search works fine. I have also removed the copyField declarations. It's all working as expected now. Thanks again for the help. Cheers, Martin

On 25.03.2015 at 23:43, Erick Erickson erickerick...@gmail.com wrote: Martin: Perhaps this would help:

indexed=true, stored=true: the field can be searched. The raw input (not analyzed in any way) can be shown to the user in the results list.
indexed=true, stored=false: the field can be searched. However, the field can't be returned in the results list with the document.
indexed=false, stored=true: the field cannot be searched, but the contents can be returned in the results list with the document. There are some use-cases where this is desirable behavior.
indexed=false, stored=false: the entire field is thrown out; it's just as if you didn't send the field to be indexed at all.

And one other thing: the copyField gets the _raw_ data, not the analyzed data. Let's say you have two fields, src and dst. Copying from src to dst in schema.xml is identical to:

<add>
  <doc>
    <field name="src">original text</field>
    <field name="dst">original text</field>
  </doc>
</add>

that is, copyField directives are not chained. Also, watch out for your query syntax. Michael's comments are spot-on; I'd just add this:

http://localhost:8983/solr/windex/select?q=Sprache&fq=original&wt=json&indent=true

is kind of odd. Let's assume you mean qf rather than fq. That _only_ matters if your query parser is edismax; it'll be ignored in this case, I believe. You'd want something like q=src:Sprache or q=dst:Sprache or even

http://localhost:8983/solr/windex/select?q=Sprache&df=src
http://localhost:8983/solr/windex/select?q=Sprache&df=dst

where df is "default field" and the search is applied against that field in the absence of a field qualification like my first two examples. Best, Erick

On Wed, Mar 25, 2015 at 2:52 PM, Michael Della Bitta michael.della.bi...@appinions.com wrote: I agree the terminology is possibly a little confusing. Stored refers to values that are stored verbatim. You can retrieve them verbatim. Analysis does not affect stored values. Indexed values are tokenized/transformed and stored inverted. You can't recover the literal analyzed version (at least, not easily). If what you really want is to store and retrieve case-folded versions of your data as well as the original, you need to use something like an UpdateRequestProcessor, which I personally am less familiar with.

On Wed, Mar 25, 2015 at 5:28 PM, Martin Wunderlich martin...@gmx.net wrote: So, the pre-processing steps are applied under <analyzer type="index">. And this point is not quite clear to me: Assuming that I have a simple case-folding step applied to the target of the copyField: how or where are the lower-case tokens stored, if the text isn't added to the index? How is the query supposed to retrieve the lower-case version? (Sorry if this sounds like a naive question, but I have a feeling that I am missing something really basic here.)

Michael Della Bitta Senior Software Engineer o: +1 646 532 3062 appinions inc. "The Science of Influence Marketing" 18 East 41st Street New York, NY 10017 t: @appinions https://twitter.com/Appinions | g+: plus.google.com/appinions https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts w: appinions.com http://www.appinions.com/
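To make Erick's debug=query tip concrete, the check from this thread would look something like this (core and field names taken from the messages above):

  curl 'http://localhost:8983/solr/windex/select?q=Sprache&df=src&debug=query&wt=json&indent=true'

The parsedquery entry in the debug section of the response shows exactly which field the term was searched against, which is what exposed the df problem here.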
Re: How to create a core by API?
Hmmm, looks like I stand corrected. I haven't kept complete track there, looks like this one didn't stick in my head. As far as the docs are concerned, all patches welcome! Best, Erick On Thu, Mar 26, 2015 at 10:26 AM, Mark E. Haase meha...@gmail.com wrote: Erick, are you sure that configSets don't apply to single-node Solr instances? https://cwiki.apache.org/confluence/display/solr/Config+Sets I don't see anything about Solr cloud there. Also, configSet is a documented argument to the Core Admin API: https://cwiki.apache.org/confluence/display/solr/CoreAdmin+API#CoreAdminAPI-CREATE And one of the few things [I thought] I knew about cloud vs non cloud setups was the Collections API is for cloud and Cores API is for non cloud, right? So why would the non-cloud API take a cloud-only argument? On Thu, Mar 26, 2015 at 1:16 PM, Mark E. Haase meha...@gmail.com wrote: Okay, thanks for the feedback. I'll admit that I do find the cloud vs non-cloud deployment options a constant source of confusion, not the least of which is due to the name. If I run a single Solr instance on EC2, that's not cloud, but if I run a few instances with ZK on my local LAN, that is cloud. Mmmkay. I can't imagine why the API documentation wouldn't mention that the API can't actually do the thing it's supposed to do (create a core). What's the purpose of having an HTTP API if I'm expected to already have write access to the host's file system to use it? Maybe its intended as private API? It should only be used by Solr itself, e.g. `solr create -c foo` uses the Cores Admin API to do some (but not all) of its work. But if that's the case, then the API docs should say that. From an API consumer's point of view, I'm not really interested in being forced to learn the history of the project to use the API. The whole point of creating APIs is to abstract out details that the caller doesn't need to know, and yet this API requires an understanding of Solr's internal file structure and history of the project? Yikes. On Thu, Mar 26, 2015 at 12:56 PM, Erick Erickson erickerick...@gmail.com wrote: Ok, you're being confused by cloud, non cloud and all that kinda stuff Configsets are SolrCloud only, so forget them since you specified it's not SolrCloud. bq: surely the HTTP API doesn't require the caller to create a directory and copy files first, does it In fact, yes. The thing to remember here is that you're using a much older approach that had its roots in the pre-cloud days. The problem is how do you insure that the configurations are on the node you're creating the core on? The whole configsets discussion is an attempt to solve that in SolrCloud by putting the configs in a place any Solr instance can find them, namely Zookeeper. But in non-cloud, there's no central repository. You could be firing the query from node X and creating the core on node Y. So Solr expects the config files to already be in place; you have to manually copy them to node Y anyway, why not copy them to the place they'll be needed? The scripts make an assumption that you're running on the same node you're running the scripts for quick-start purposes. Best, Erick On Thu, Mar 26, 2015 at 9:24 AM, Mark E. Haase meha...@gmail.com wrote: I can't get the Core Admin API to work. I have a brand new installation of Solr 5.0.0 (in non-cloud mode). I installed using the installation script (a nice addition!) with default options, so I have Solr in /opt/solr and its data in /var/solr. 
Here's what I'm trying: curl ' http://localhost:8983/solr/admin/cores?action=CREATEname=new_core ' But I get this error: Error CREATEing SolrCore 'new_core': Unable to create core [new_core] Caused by: Can't find resource 'solrconfig.xml' in classpath or '/var/solr/data/new_core/conf' Solr isn't even creating /var/solr/data/new_core, which I guess is the root of the problem. But /var/solr is owned by the solr user and I can do `sudo -u solr mkdir /var/solr/data/new_core` just fine. So why isn't Solr making this directory? I see that 'instanceDir' is required, but I don't get an error message if I *don't* use it, so I'm not sure how required it actually is. I'm also not sure if its supposed to be a full path or a relative path or what, so here are a couple of other guesses at the correct incantation: curl ' http://localhost:8983/solr/admin/cores?action=CREATEname=new_coreinstanceDir=new_core ' curl ' http://localhost:8983/solr/admin/cores?action=CREATEname=new_coreinstanceDir=/var/solr/data/new_core ' These both return the same error message as my first try, so no dice... FWIW, I get the same error message even if I try doing this with the Solr Admin GUI so I'm really puzzled. Is the GUI supposed to work? I found a thread on Stack Overflow about this same problem ( http://stackoverflow.com/a/28945428/122763) that suggests using
Re: How to create a core by API?
Ok, you're being confused by cloud, non cloud and all that kinda stuff Configsets are SolrCloud only, so forget them since you specified it's not SolrCloud. bq: surely the HTTP API doesn't require the caller to create a directory and copy files first, does it In fact, yes. The thing to remember here is that you're using a much older approach that had its roots in the pre-cloud days. The problem is how do you insure that the configurations are on the node you're creating the core on? The whole configsets discussion is an attempt to solve that in SolrCloud by putting the configs in a place any Solr instance can find them, namely Zookeeper. But in non-cloud, there's no central repository. You could be firing the query from node X and creating the core on node Y. So Solr expects the config files to already be in place; you have to manually copy them to node Y anyway, why not copy them to the place they'll be needed? The scripts make an assumption that you're running on the same node you're running the scripts for quick-start purposes. Best, Erick On Thu, Mar 26, 2015 at 9:24 AM, Mark E. Haase meha...@gmail.com wrote: I can't get the Core Admin API to work. I have a brand new installation of Solr 5.0.0 (in non-cloud mode). I installed using the installation script (a nice addition!) with default options, so I have Solr in /opt/solr and its data in /var/solr. Here's what I'm trying: curl 'http://localhost:8983/solr/admin/cores?action=CREATEname=new_core ' But I get this error: Error CREATEing SolrCore 'new_core': Unable to create core [new_core] Caused by: Can't find resource 'solrconfig.xml' in classpath or '/var/solr/data/new_core/conf' Solr isn't even creating /var/solr/data/new_core, which I guess is the root of the problem. But /var/solr is owned by the solr user and I can do `sudo -u solr mkdir /var/solr/data/new_core` just fine. So why isn't Solr making this directory? I see that 'instanceDir' is required, but I don't get an error message if I *don't* use it, so I'm not sure how required it actually is. I'm also not sure if its supposed to be a full path or a relative path or what, so here are a couple of other guesses at the correct incantation: curl ' http://localhost:8983/solr/admin/cores?action=CREATEname=new_coreinstanceDir=new_core ' curl ' http://localhost:8983/solr/admin/cores?action=CREATEname=new_coreinstanceDir=/var/solr/data/new_core ' These both return the same error message as my first try, so no dice... FWIW, I get the same error message even if I try doing this with the Solr Admin GUI so I'm really puzzled. Is the GUI supposed to work? I found a thread on Stack Overflow about this same problem ( http://stackoverflow.com/a/28945428/122763) that suggests using configSet. Okay, the installer put some configs sets in /opt/solr/server /opt/solr/server/solr/configsets, and the 'basic_config' config set has a solrconfig.xml in it, so maybe that would solve my solrconfig.xml error? If I compare the HTTP API to the `solr create -c foo` script, it appears that the script creates the instance directory and copies in conf files *before *it calls the HTTP API... surely the HTTP API doesn't require the caller to create a directory and copy files first, does it? -- Mark E. Haase
Re: Uneven index distribution using composite router
Right, when you take over routing, making sure the distribution is even is now your responsibility. Your assumption is that the amount of _text_ in each doc is roughly the same between your three languages; have you verified this? And are you doing anything like copyFields that kick in on one shard but not the others (e.g. if you have text_en fields you might be copying them to text_en_all but not doing so with text_ger to text_ger_all)? That's totally a shot in the dark, though. Best, Erick

On Thu, Mar 26, 2015 at 10:26 AM, Shamik Bandopadhyay sham...@gmail.com wrote: Hi, I'm using a three-level composite router in a Solr Cloud environment, primarily for multi-tenancy and field collapsing. The format is language!topic!url. An example would be:

ENU!12345!www.testurl.com/enu/doc1
GER!12345!www.testurl.com/ger/doc2
CHS!67890!www.testurl.com/chs/doc3

The Solr Cloud cluster contains 2 shards, each having 3 replicas. After indexing around 10 million documents, I'm observing that the index size in shard 1 is around 60gb while shard 2 is 15gb. So the bulk of the data is getting indexed in shard 1. Since 60% of the documents are English, I expect the index size to be higher on one shard, but the difference seems a little too high. The idea is to make sure that all ENU!12345 documents are routed to one shard so that distributed field collapsing works. Is there something I can do differently here to make a better distribution? Any pointers will be appreciated. Regards, Shamik
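One way to measure the skew directly, rather than inferring it from index size on disk, is to compare per-shard document counts; shards.info=true reports numFound for each shard (the collection name below is a placeholder):

  curl 'http://localhost:8983/solr/collection1/select?q=*:*&rows=0&shards.info=true&wt=json&indent=true'

If the document counts are close but the on-disk sizes are not, the imbalance is in document size (e.g. the copyField scenario Erick describes) rather than in the routing itself.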
Re: How to create a core by API?
Go to the comments section and add any corrections you'd like; that'll get bubbled up. Best, Erick

On Thu, Mar 26, 2015 at 10:45 AM, Mark E. Haase meha...@gmail.com wrote: On Thu, Mar 26, 2015 at 1:31 PM, Erick Erickson erickerick...@gmail.com wrote: Hmmm, looks like I stand corrected. I haven't kept complete track there; looks like this one didn't stick in my head. I'm not saying you're wrong. The configSet parameter doesn't work at all in my setup, so you might be right... I'm just wondering where that's documented. I thought Solr documentation was rough back in the 1.6 days, but wow... it's gotten shockingly bad in Solr 5. As far as the docs are concerned, all patches welcome! What kind of patch do you mean? Isn't all the documentation maintained on Confluence? -- Mark E. Haase 202-815-0201
Re: How to create a core by API?
On 3/26/2015 10:24 AM, Mark E. Haase wrote: I can't get the Core Admin API to work. I have a brand new installation of Solr 5.0.0 (in non-cloud mode). I installed using the installation script (a nice addition!) with default options, so I have Solr in /opt/solr and its data in /var/solr. Here's what I'm trying:

curl 'http://localhost:8983/solr/admin/cores?action=CREATE&name=new_core'

But I get this error: Error CREATEing SolrCore 'new_core': Unable to create core [new_core] Caused by: Can't find resource 'solrconfig.xml' in classpath or '/var/solr/data/new_core/conf'

The error message tells you what is wrong. The CoreAdmin API requires that the instanceDir already exist, with a conf directory inside it that contains solrconfig.xml, schema.xml, and any other necessary config files. If you want completely from-scratch creation without any existing filesystem layout, you will need to run SolrCloud, which keeps config files in the zookeeper database. At that point you would be using the Collections API.

If you go to Core Admin in the admin UI and click the Add Core button, you will see the following note: "instanceDir and dataDir need to exist before you can create the core". This message is not quite accurate -- the dataDir (defaulting to ${instanceDir}/data) will be created if it does not already exist, provided the user running Solr has the required permissions to create it. The message also doesn't say anything about the conf directory or the two required XML files.

Thanks, Shawn
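Putting Shawn's answer together with the layout from this thread, the non-cloud sequence would look roughly like this (an untested sketch; paths assume the 5.0 installer defaults mentioned above):

  sudo -u solr mkdir -p /var/solr/data/new_core
  sudo -u solr cp -r /opt/solr/server/solr/configsets/basic_configs/conf /var/solr/data/new_core/conf
  curl 'http://localhost:8983/solr/admin/cores?action=CREATE&name=new_core&instanceDir=/var/solr/data/new_core'

That is, the conf directory with solrconfig.xml and schema.xml goes in place first, and only then is CREATE called.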
Performance json vs javabin
Has anyone done performance tests between JSON and javabin? The scale tipped towards javabin when compared to XML (https://issues.apache.org/jira/browse/SOLR-486). I am curious to know whether it is the same with JSON when the load is 600 requests per minute, for example. Thanks,
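I'm not aware of published numbers for javabin vs. JSON specifically, but since the response writer is just the wt parameter, a crude first measurement for a given setup is easy (core name and query are placeholders):

  time curl -s 'http://localhost:8983/solr/collection1/select?q=*:*&rows=1000&wt=javabin' -o /dev/null
  time curl -s 'http://localhost:8983/solr/collection1/select?q=*:*&rows=1000&wt=json' -o /dev/null

This only measures serialization and transfer; in SolrJ, which uses javabin by default, client-side parsing cost would also matter at 600 requests per minute.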
Re: Replacing a group of documents (Delete/Insert) without a query on the index ever showing an empty list (Docs)
On 3/26/2015 9:53 AM, Russell Taylor wrote: I have an index which is made up of groups of documents, where each group is defined by a field called keyField (keyField:A). I need to delete all the keyField:A documents and replace them with a brand new set, without the index ever returning zero documents on a query. At the moment I deleteByQuery keyField:A and then insert a SolrInputDocument list via SolrJ into my index. I have a small time period where somebody doing q=keyField:A can be returned an empty list. FYI: the keyField group might be just 100 documents or up to 10 million.

As long as you don't have any commits with openSearcher=true happening between the delete and the insert, that would work ... but why go through the manual delete if you don't have to? If you define a suitable uniqueKey field in your schema, simply indexing a new document with the same value in the uniqueKey field as an existing document will delete the old document. https://wiki.apache.org/solr/UniqueKey

Thanks, Shawn
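In other words, with a uniqueKey defined, re-posting the replacement batch is itself the delete. A sketch, assuming an "id" uniqueKey and that the replacement documents reuse the same ids (field and core names here are hypothetical):

  curl 'http://localhost:8983/solr/collection1/update?commit=true' \
    -H 'Content-Type: application/json' \
    --data-binary '[{"id":"A-1","keyField":"A","body":"replacement"},
                    {"id":"A-2","keyField":"A","body":"replacement"}]'

Any old ids that are not reused by the new set would still need an explicit delete, issued before the next commit so that no searcher ever sees the gap.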
delta import on changes in entity within a document
I have the following data-config:

<document name="locations">
  <entity pk="id" name="location"
          query="select * from locations WHERE isapproved='true'"
          deltaImportQuery="select * from locations WHERE updatedate &lt; getdate() AND isapproved='true' AND id='${dataimporter.delta.id}'"
          deltaQuery="select id from locations where isapproved='true' AND updatedate &gt; '${dataimporter.last_index_time}'">
    <entity name="offerdetails"
            query="SELECT title as offer_title, ISNULL(img,'') as offer_thumb, id as offer_id, startdate as offer_startdate, enddate as offer_enddate, description as offer_description, updatedate as offer_updatedate FROM offers WHERE objectid=${location.id}">
    </entity>
  </entity>
</document>

Now, when the object in the [locations] table is updated, my delta import (/dataimport?command=delta-import) query works perfectly. But when an offer is updated in the [offers] table, this is not seen by the delta-import command. Is there a way to delta-import only the updated offers for the respective location if an offer is updated? And then without: a. having to fully import ALL locations, or b. having to update this single location and then do a regular delta-import?
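DIH does have a mechanism aimed at this case: give the child entity its own deltaQuery (to find changed offers) plus a parentDeltaQuery (to map each changed offer back to its parent location's pk). A hedged, untested sketch, with column names taken from the config above:

<entity name="offerdetails"
        pk="offer_id"
        query="SELECT title as offer_title, ISNULL(img,'') as offer_thumb, id as offer_id, startdate as offer_startdate, enddate as offer_enddate, description as offer_description, updatedate as offer_updatedate FROM offers WHERE objectid=${location.id}"
        deltaQuery="SELECT id, objectid FROM offers WHERE updatedate &gt; '${dataimporter.last_index_time}'"
        parentDeltaQuery="SELECT id FROM locations WHERE id=${offerdetails.objectid}">
</entity>

Note this still re-imports the whole location document for each changed offer (DIH always rebuilds the full parent row), but it avoids both a full import and a manual per-location update.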
Re: Uneven index distribution using composite router
Thanks for your reply Erick. In my case, I've 14 languages, out of which 50% of the documents belong to English. German and CHS will probably constitute another 25%. I'm not using copyField; rather, each language has its own dedicated fields, such as title_enu, text_enu, title_ger, text_ger, etc. Since I know the language prior to index time, this works for me. I've added one more sample key in the example:

ENU!12345!www.testurl.com/enu/doc1
ENU!12345!www.testurl.com/enu/doc10
GER!12345!www.testurl.com/ger/doc2
CHS!67890!www.testurl.com/chs/doc3

As you can see, there are 2 documents in English having the same topic id (12345). I added the topic id as part of the key to make sure that they reside in the same shard, in order to make field collapsing work on topic id. I can perhaps remove the composite key and only have language and url, something like ENU!www.testurl.com/enu/doc1, but that'll probably not solve the distribution issue. You mentioned "when you take over routing, making sure the distribution is even is now your responsibility". I'm wondering, what's the best practice to make that happen? I can move away from the composite router and manually assign a group of languages to a dedicated shard, both at index and query time, but I'm not sure keeping a map is an efficient way of dealing with it.
Re: How to create a core by API?
On Thu, Mar 26, 2015 at 1:45 PM, Mark E. Haase meha...@gmail.com wrote: I'm not saying you're wrong. The configSet parameter doesn't work at all in my set up, so you might be right... I'm just wondering where that's documented.

Trying on current trunk, I got it to work:

/opt/code/lusolr_trunk/solr$ curl -XPOST 'http://localhost:8983/solr/admin/cores?action=CREATE&name=demo3&instanceDir=demo3&configSet=basic_configs'
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">0</int><int name="QTime">769</int></lst><str name="core">demo3</str>
</response>

Although I'm not thrilled with a different parameter name for cloud vs non-cloud. I come from the camp that believes that overloading is both natural and easily understood (e.g. I don't find foo + bar and 1.5 + 2.5 both using + confusing). -Yonik
Re: Build index from Oracle, adding fields
On 27/03/2015 12:42, Shawn Heisey wrote: If that's not practical, then the only real option you have is to drop back to one entity, and build a single SELECT statement (using JOIN and some form of CONCAT) that will gather all the information from all the tables at the same time, and combine multiple values together into one SQL result field with some kind of delimiter. Then you can use the RegexTransformer's splitBy functionality to turn the concatenated data back into multiple values for your multi-valued field. Database servers tend to be REALLY good at JOIN operations, so the database would be doing the heavy lifting.

I did try that, in fact (and do it with one of my other indexes). However, with this index the sub-select can return 200 rows of 200 characters, and that blows up in Oracle, as the concatenated field is over 4000 characters long (and the work-around for that is to use CLOBs, but those have their own performance problems). Currently I am doing this by exporting a CSV file and processing it with a C program, and then reading the CSV with Solr :( -- Cheers Jules.
Re: i'm a newb: questions about schema.xml
This is key: managed-schema. You've managed to get things started with the managed schema. Therefore, you need to use the REST API to add/subtract/multiply/divide. This is different from schemaless, although it _is_ related. And they're both different from having a schema.xml to edit. Or start over _without_ a managed schema; not quite sure how you started that way in the first place ;). You may have used bin/solr start -e schemaless when you started and maybe forgot? Here's a place to start: https://cwiki.apache.org/confluence/display/solr/Managed+Schema+Definition+in+SolrConfig Best, Erick

On Thu, Mar 26, 2015 at 4:41 PM, Mark Bramer mbra...@esri.com wrote: Hi Shawn, Definitely helpful to know about the instance and files stuff in Admin. I'm not running cloud, so I looked in the /conf directory but there's no schema.xml. Here's what's in my core's Files: currency.xml elevate.xml lang params.json protwords.txt solrconfig.xml stopwords.txt synonyms.txt

and echoed by ls -l:

-rw-r--r-- 1 root root 3974 Feb 15 11:38 currency.xml
-rw-r--r-- 1 root root 1348 Feb 15 11:38 elevate.xml
drwxr-xr-x 2 root root 4096 Mar 23 10:46 lang
-rw-r--r-- 1 root root 29733 Mar 23 18:04 managed-schema
-rw-r--r-- 1 root root 308 Feb 15 11:38 params.json
-rw-r--r-- 1 root root 873 Feb 15 11:38 protwords.txt
-rw-r--r-- 1 root root 60591 Feb 15 11:38 solrconfig.xml
-rw-r--r-- 1 root root 781 Feb 15 11:38 stopwords.txt
-rw-r--r-- 1 root root 1119 Feb 15 11:38 synonyms.txt

-Original Message- From: Shawn Heisey [mailto:apa...@elyograg.org] Sent: Thursday, March 26, 2015 7:28 PM To: solr-user@lucene.apache.org Subject: Re: i'm a newb: questions about schema.xml

On 3/26/2015 4:57 PM, Mark Bramer wrote: I'm a Solr newb. I've been poking around for several days on my own test instance, and also online at the info available. But one thing just isn't jiving and I can't put my finger on why. I've searched many many times but I don't see what I'm looking for, so I'm thinking perhaps I have a fundamental semantic misunderstanding of something somewhere. Everywhere I read, everyone talks about schema.xml and how important it is. I fully get what it's for, but I don't get where it is, how it's used (by me), how I edit it, and how I create new indexes once I've edited it. I've installed, and am successfully running, Solr 5.0.0 on Linux. I've followed the widely recommended quick start at http://lucene.apache.org/solr/quickstart.html. I get through it fine, I post a bunch of stuff, I use the web UI to query for, and see, data I would expect to see. Should I now have a schema.xml file somewhere that is somehow connected to my new index? If so, where is it? Was it present from install or did it get created when I made my first core (bin/solr create -c ati_docs)?

[root@machine solr-5.0.0]# find -name schema.xml
./example/example-DIH/solr/tika/conf/schema.xml
./example/example-DIH/solr/rss/conf/schema.xml
./example/example-DIH/solr/solr/conf/schema.xml
./example/example-DIH/solr/db/conf/schema.xml
./example/example-DIH/solr/mail/conf/schema.xml
./server/solr/configsets/basic_configs/conf/schema.xml
./server/solr/configsets/sample_techproducts_configs/conf/schema.xml
[root@machine solr-5.0.0]#

Is it the one in /configsets/basic_configs/conf? Is that the default one? If I want to 'modify' schema.xml to do some different indexing/analyzing, how do I start? Make a copy of that schema.xml, move it somewhere else and modify it? If so, how do I create a new index using this schema.xml? Or am I running in schemaless mode? I don't think I am, because it appears that I would have to specifically state this as a command line parameter, i.e. bin/solr start -e schemaless. What fundamentals am I missing? I'm coming to Solr from Elasticsearch, and I've already recognized some differences. Is my ES background clouding my grasp of Solr fundamentals?

Hopefully you know what core you are using, so you can go to the admin UI and find it in the Core Selector dropdown list. Assuming you can do that, you will find yourself looking at the Overview tab for that core.

https://cwiki.apache.org/confluence/display/solr/Using+the+Solr+Administration+User+Interface

Once you are looking at the core overview, in the upper right corner of your browser window is a section called Instance ... which has an entry that is ALSO called Instance. Inside the directory indicated by that field, you should have a conf directory. The config and schema for that index are found in that conf directory. If you're running SolrCloud, then you can forget everything I just said ... the active configs will be found within the zookeeper database, and you can use the Cloud->Tree tab in the admin UI to find your collections and see which configName is linked to each one. You'll want to become familiar with the zkcli script in server/scripts/cloud-scripts.
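The REST API Erick mentions is the Schema API; with a mutable managed schema, adding a field is an HTTP POST rather than a file edit. A sketch, using the core name from this thread and a hypothetical field:

  curl -X POST -H 'Content-type:application/json' --data-binary '{
    "add-field": {"name":"title_txt", "type":"text_general", "indexed":true, "stored":true}
  }' 'http://localhost:8983/solr/ati_docs/schema'

The change is persisted back into the managed-schema file that appears in the directory listing above.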
RE: i'm a newb: questions about schema.xml
Pretty sure I found what I am looking for: https://cwiki.apache.org/confluence/display/solr/Managed+Schema+Definition+in+SolrConfig I noticed the managed-schema file, and a couple of Google searches with that finally landed me at that link. Interesting that the file is hidden from the Files list in the Admin UI. Thanks! -Original Message- From: Mark Bramer Sent: Thursday, March 26, 2015 7:42 PM To: 'solr-user@lucene.apache.org' Subject: RE: i'm a newb: questions about schema.xml snip
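For anyone else who lands here: a minimal sketch (untested) of going back to a classic, hand-edited schema on a non-cloud core, as that wiki page describes. Rename managed-schema to schema.xml in the core's conf directory and set the schema factory in solrconfig.xml:

<schemaFactory class="ClassicIndexSchemaFactory"/>

With ManagedIndexSchemaFactory, which the configset used by bin/solr create in 5.0 sets up by default, the schema lives in the managed-schema file and is meant to be edited through the Schema API instead, which is why no schema.xml shows up in the Files list.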
Re: Custom TokenFilter
Hi Erick, For me, this ClassCastException is caused by the wrong use of TokenFilter. In the fieldType declaration (schema.xml), I've put: <tokenizer class="com.tamingtext.texttamer.solr.SentenceTokenizerFactory"/> And instead of extending TokenizerFactory in my class, I extend TokenFilterFactory, like this: public class SentenceTokenizerFactory extends TokenFilterFactory So when Solr tries to load my class, it expects a TokenizerFactory but gets a TokenFilterFactory. Regards, Andry
On Thursday, March 26, 2015 at 4:13 AM, Erick Erickson erickerick...@gmail.com wrote: Thanks for letting us know the resolution, the problem was bugging me Erick
On Wed, Mar 25, 2015 at 4:21 PM, Test Test andymish...@yahoo.fr wrote: Re, Finally, I think I found where this problem comes from. I didn't extend the right class: instead of a Tokenizer, I'm using a TokenFilter. Erick, thanks for your replies. Regards.
On Wednesday, March 25, 2015 at 11:55 PM, Test Test andymish...@yahoo.fr wrote: Re, I have tried to remove all the redundant jar files. Then I relaunched it, but it blocked directly on the same issue. It's very strange. Regards,
On Wednesday, March 25, 2015 at 11:31 PM, Erick Erickson erickerick...@gmail.com wrote: Wait, you didn't put, say, lucene-core-4.10.2.jar into your contrib/tamingtext/dependency directory, did you? That means you have Lucene (and Solr and SolrJ and ...) in your classpath twice, since they're _already_ in your classpath by default because you're running Solr. All your jars should be in your aggregate classpath exactly once. Having them in twice would explain the cast exception. You do not need these in the tamingtext/dependency subdirectory, just the things that are _not_ in Solr already. Best, Erick
On Wed, Mar 25, 2015 at 12:21 PM, Test Test andymish...@yahoo.fr wrote: Re, Sorry about the image. So, all my dependency jars are in the listing below: - commons-cli-2.0-mahout.jar - commons-compress-1.9.jar - commons-io-2.4.jar - commons-logging-1.2.jar - httpclient-4.4.jar - httpcore-4.4.jar - httpmime-4.4.jar - junit-4.10.jar - log4j-1.2.17.jar - lucene-analyzers-common-4.10.2.jar - lucene-benchmark-4.10.2.jar - lucene-core-4.10.2.jar - mahout-core-0.9.jar - noggit-0.5.jar - opennlp-maxent-3.0.3.jar - opennlp-tools-1.5.3.jar - slf4j-api-1.7.9.jar - slf4j-simple-1.7.10.jar - solr-solrj-4.10.2.jar I have put them into a specific directory (contrib/tamingtext/dependency), and the jar containing my class into another directory (contrib/tamingtext/lib). I added these paths in solrconfig.xml: <lib dir="../../../contrib/tamingtext/lib" regex=".*\.jar" /> <lib dir="../../../contrib/tamingtext/dependency" regex=".*\.jar" /> Thanks in advance. Regards.
On Wednesday, March 25, 2015 at 8:18 PM, Test Test andymish...@yahoo.fr wrote: Re, Sorry about the image. So, all my dependency jars are in the listing below: snip Thanks in advance. Regards.
On Wednesday, March 25, 2015 at 5:12 PM, Erick Erickson erickerick...@gmail.com wrote: Images don't come through the mailing list, can't see your image. Whether or not all the jars in the directory you're working on are consistent is the least of your problems. Are the libs to be found in any _other_ place specified on your classpath? Best, Erick
On Wed, Mar 25, 2015 at 12:36 AM, Test Test andymish...@yahoo.fr wrote: Thanks Erick, I'm working on Solr 4.10.2 and all my dependency jars seem to be compatible with this version. [image: Inline image] I can't figure out which one causes this issue. Thanks. Regards,
On Tuesday, March 24, 2015 at 11:45 PM, Erick Erickson erickerick...@gmail.com wrote: bq: 13 more Caused by: java.lang.ClassCastException: class com.tamingtext.texttamer.solr. This usually means you have jar files from different versions of Solr in your classpath. Best, Erick
On Tue, Mar 24, 2015 at 2:38 PM, Test Test andymish...@yahoo.fr wrote: Hi there, I'm trying to create my own TokenizerFactory (from the Taming Text book). After setting schema.xml and adding the path in solrconfig.xml, I start Solr. I have
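For anyone else who hits this: the schema element has to match the factory's base class. A minimal sketch (untested; the text_sentence name is invented, the factory class comes from this thread): a TokenizerFactory subclass belongs in the tokenizer element, while a TokenFilterFactory subclass belongs in a filter element after a real tokenizer:

<fieldType name="text_sentence" class="solr.TextField">
  <analyzer>
    <!-- any real tokenizer; the custom class below is a filter, not a tokenizer -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="com.tamingtext.texttamer.solr.SentenceTokenizerFactory"/>
  </analyzer>
</fieldType>

Declaring a TokenFilterFactory subclass inside <tokenizer .../>, as in the original schema.xml, is exactly what produces the ClassCastException Andry saw.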
SolrCloud -- Blocking access to administration commands while keeping the solr internal communication
Hello there, There are many blogs discussing this issue, but it is hard to find whether anyone has managed to resolve it. We have many nodes in the SolrCloud; implementing the iptables restriction would fill the iptables with many rules, which will affect performance. We are using 4.3.10, on Tomcat 5.
Re: i'm a newb: questions about schema.xml
On 3/26/2015 4:57 PM, Mark Bramer wrote: I'm a Solr newb. snip What fundamentals am I missing? I'm coming to Solr from Elasticsearch, and I've already recognized some differences. Is my ES background clouding my grasp of Solr fundamentals? Hopefully you know what core you are using, so you can go to the admin UI and find it in the Core Selector dropdown list. Assuming you can do that, you will find yourself looking at the Overview tab for that core. https://cwiki.apache.org/confluence/display/solr/Using+the+Solr+Administration+User+Interface Once you are looking at the core overview, in the upper right corner of your browser window is a section called Instance ... which has an entry that is ALSO called Instance. Inside the directory indicated by that field, you should have a conf directory. The config and schema for that index are found in that conf directory. If you're running SolrCloud, then you can forget everything I just said ... the active configs will be found within the zookeeper database, and you can use the Cloud-Tree tab in the admin UI to find your collections and see which configName is linked to each one. You'll want to become familiar with the zkcli script in server/scripts/cloud-scripts. https://cwiki.apache.org/confluence/display/solr/Command+Line+Utilities Whether it is SolrCloud or not, you can always LOOK at your configs right in the admin UI -- click on the Files tab after you select the core from the selector. Thanks, Shawn
Re: Build index from Oracle, adding fields
On 3/26/2015 5:19 PM, Julian Perry wrote: I have an index with, say, 10 fields. I load that index directly from Oracle - data-config.xml using JDBC. snip There is incremental loading - but I think that replaces whole rows rather than updating individual fields. Or maybe it does do both? If those child tables do not have a large number of entries, you can configure caching on the inner entities so that the information doesn't need to actually be requested from the database server. If there are a large number of entries, then that may not be possible due to memory constraints. https://wiki.apache.org/solr/DataImportHandler#CachedSqlEntityProcessor If that's not practical, then the only real option you have is to drop back to one entity, and build a single SELECT statement (using JOIN and some form of CONCAT) that will gather all the information from all the tables at the same time, and combine multiple values together into one SQL result field with some kind of delimiter. Then you can use the RegexTransformer's splitBy functionality to turn the concatenated data back into multiple values for your multi-valued field. Database servers tend to be REALLY good at JOIN operations, so the database would be doing the heavy lifting. https://wiki.apache.org/solr/DataImportHandler#RegexTransformer Solr does have an equivalent concept to SQL's UPDATE, but there are enough caveats to using it that it may not be a good option: https://wiki.apache.org/solr/Atomic_Updates Thanks, Shawn
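In case it helps to see it concretely, a rough sketch of that single-entity approach for the document element of data-config.xml (untested; table and column names are invented, and LISTAGG needs Oracle 11gR2 or later). The SQL glues the child rows into one pipe-delimited string, and RegexTransformer's splitBy cuts it back into multiple values:

<entity name="main" transformer="RegexTransformer"
        query="SELECT m.id, m.title,
                      (SELECT LISTAGG(c.tag, '|') WITHIN GROUP (ORDER BY c.tag)
                         FROM child_tags c WHERE c.main_id = m.id) AS tags
                 FROM main_table m">
  <field column="id" name="id"/>
  <field column="title" name="title"/>
  <!-- splitBy takes a regex, hence the escaped pipe -->
  <field column="tags" name="tags" splitBy="\|"/>
</entity>

One query, one pass over the main table, and the database does the heavy lifting.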
RE: i'm a newb: questions about schema.xml
Hi Shawn, Definitely helpful to know about the instance and files stuff in Admin. I'm not running cloud, so I looked in the /conf directory, but there's no schema.xml. Here's what's in my core's Files: currency.xml elevate.xml lang params.json protwords.txt solrconfig.xml stopwords.txt synonyms.txt and echoed by ls -l:
-rw-r--r-- 1 root root  3974 Feb 15 11:38 currency.xml
-rw-r--r-- 1 root root  1348 Feb 15 11:38 elevate.xml
drwxr-xr-x 2 root root  4096 Mar 23 10:46 lang
-rw-r--r-- 1 root root 29733 Mar 23 18:04 managed-schema
-rw-r--r-- 1 root root   308 Feb 15 11:38 params.json
-rw-r--r-- 1 root root   873 Feb 15 11:38 protwords.txt
-rw-r--r-- 1 root root 60591 Feb 15 11:38 solrconfig.xml
-rw-r--r-- 1 root root   781 Feb 15 11:38 stopwords.txt
-rw-r--r-- 1 root root  1119 Feb 15 11:38 synonyms.txt
-Original Message- From: Shawn Heisey [mailto:apa...@elyograg.org] Sent: Thursday, March 26, 2015 7:28 PM To: solr-user@lucene.apache.org Subject: Re: i'm a newb: questions about schema.xml snip
i'm a newb: questions about schema.xml
Hello, I'm a Solr newb. I've been poking around for several days on my own test instance, and also online at the info that's available. But one thing just isn't jiving and I can't put my finger on why. I've searched many many times but I don't see what I'm looking for, so I'm thinking perhaps I have a fundamental semantic misunderstanding of something somewhere. Everywhere I read, everyone talks about schema.xml and how important it is. I fully get what it's for, but I don't get where it is, how it's used (by me), how I edit it, and how I create new indexes once I've edited it. I've installed, and am successfully running, Solr 5.0.0 on Linux. I've followed the widely recommended-by-all quick start at: http://lucene.apache.org/solr/quickstart.html. I get through it fine, I post a bunch of stuff, I use the web UI to query for, and see, data I would expect to see. Should I now have a schema.xml file somewhere that is somehow connected to my new index? If so, where is it? Was it present from install or did it get created when I made my first core (bin/solr create -c ati_docs)?
[root@machine solr-5.0.0]# find -name schema.xml
./example/example-DIH/solr/tika/conf/schema.xml
./example/example-DIH/solr/rss/conf/schema.xml
./example/example-DIH/solr/solr/conf/schema.xml
./example/example-DIH/solr/db/conf/schema.xml
./example/example-DIH/solr/mail/conf/schema.xml
./server/solr/configsets/basic_configs/conf/schema.xml
./server/solr/configsets/sample_techproducts_configs/conf/schema.xml
[root@machine solr-5.0.0]#
Is it the one in /configsets/basic_configs/conf? Is that the default one? If I want to 'modify' schema.xml to do some different indexing/analyzing, how do I start? Make a copy of that schema.xml, move it somewhere else and modify it? If so, how do I create a new index using this schema.xml? Or am I running in schemaless mode? I don't think I am, because it appears that I would have to specifically state this as a command line parameter, i.e. bin/solr start -e schemaless What fundamentals am I missing? I'm coming to Solr from Elasticsearch, and I've already recognized some differences. Is my ES background clouding my grasp of Solr fundamentals? Thanks for any help. Mark Bramer | Technical Team Lead, DC Services Esri | 8615 Westwood Center Dr | Vienna, VA 22182 | USA T 703 506 9515 x8017 | mbra...@esri.com | esri.com
Re: SolrCloud -- Blocking access to administration commands while keeping the solr internal communication
On 3/26/2015 3:38 PM, Oded Sofer wrote: There are many blogs discussing this issue, but it is hard to find whether anyone has managed to resolve it. We have many nodes in the SolrCloud; implementing the iptables restriction would fill the iptables with many rules, which will affect performance. We are using 4.3.10, on Tomcat 5. Because Solr is a webapp, it relies on software outside itself to provide network and protocol (HTTP) communication. In your case, that software is Tomcat. For others, it is Jetty, JBoss, Weblogic, or one of several other possibilities. This means that there are many things that are impossible (or extremely difficult) for Solr to handle within its own code. Security is one of them. This is one of the major reasons that Solr will become a true application at some point in the future. When Solr can control the network and the HTTP server, we will be able to restrict access to the admin UI separately from access to the query interface, the update interface, replication, etc. As far as your iptables rule list ... are your Solr servers contained within discrete IP address blocks that could be added to the rule list as subnets instead of individual addresses? Ideally you will handle complicated access controls on edge firewalls or as ACLs on internal routing devices, not at the host level. Thanks, Shawn
Build index from Oracle, adding fields
Hi I have looked and cannot see any clear answers to this on the Interwebs. I have an index with, say, 10 fields. I load that index directly from Oracle - data-config.xml using JDBC. I can load 10 million rows very quickly. This direct way of loading from Oracle straight into SOLR is fantastic - really efficient and saves writing loads of import/export code (e.g. via a CSV file). Of those 10 fields - two of them (set to multiValued) come from a separate table and there are anything from 1 to 10 rows per row from the main table. I can use a nested entity to extract the child rows for each of the 10m rows in the main table - but then SOLR generates 10m separate SQL calls - and the load time goes from a few minutes to several days. On smaller tables - just a few thousand rows - I can use a second nested entity with a JDBC call - but not for very large tables. Could I load the data in two steps: 1) load the main 10m rows 2) load into the existing index by adding the data from a second SQL call into fields for each existing row (i.e. an UPDATE instead of an INSERT). I don't know what syntax/option might achieve that. There is incremental loading - but I think that replaces whole rows rather than updating individual fields. Or maybe it does do both? Any other techniques that would be fast/efficient? Help! -- Cheers Jules.
Re: Solr Monitoring - Stored Stats?
Matt, SPM will give you all that out of the box with alerts, anomaly detection etc. See http://sematext.com/spm Otis On Mar 25, 2015, at 11:26, Matt Kuiper matt.kui...@issinc.com wrote: Hello, I am familiar with the JMX points that Solr exposes to allow for monitoring of statistics like QPS, numdocs, Average Query Time... I am wondering if there is a way to configure Solr to automatically store the value of these stats over time (for a given time interval), and then allow a user to query a stat over a time range. So for the QPS stat, the query might return a set that includes the QPS value for each hour in the time range specified. Thanks, Matt
Re: Data indexing is going too slow on single shard Why?
Great, thanks Shawn... As you said - **For 204GB of data per server, I recommend at least 128GB of total RAM, preferably 256GB**. Therefore, if I have 204GB of data on a single server/shard, then 256GB is preferred, with which searching will be fast and never slow down. Is that right? On Wed, Mar 25, 2015 at 9:50 PM, Shawn Heisey apa...@elyograg.org wrote: On 3/25/2015 8:42 AM, Nitin Solanki wrote: Server configuration: 8 CPUs. 32 GB RAM O.S. - Linux snip are running. Java heap set to 4096 MB in Solr. While indexing, snip *Currently*, I have 1 shard with 2 replicas using SOLR CLOUD. Data Size: 102G solr/node1/solr/wikingram_shard1_replica2 102G solr/node2/solr/wikingram_shard1_replica1 If both of those are on the same machine, I'm guessing that you're running two Solr instances on that machine, so there's 8GB of RAM used for Java. That means you have about 24 GB of RAM left for caching ... and 200GB of index data to cache. 24GB is not enough to cache 200GB of index. If there is only one Solr instance (leaving 28GB for caching) with 102GB of data on the machine, it still might not be enough. See that SolrPerformanceProblems wiki page I linked in my earlier email. For 102GB of data per server, I recommend at least 64GB of total RAM, preferably 128GB. For 204GB of data per server, I recommend at least 128GB of total RAM, preferably 256GB. Thanks, Shawn
Re: Data indexing is going too slow on single shard Why?
On 3/26/2015 12:03 AM, Nitin Solanki wrote: Great, thanks Shawn... As you said - **For 204GB of data per server, I recommend at least 128GB of total RAM, preferably 256GB**. Therefore, if I have 204GB of data on a single server/shard, then 256GB is preferred, with which searching will be fast and never slow down. Is that right? Obviously I cannot guarantee it, but I think it's extremely likely that with that much memory, performance will be very good. One other possibility, which is discussed on that wiki page I linked, is that your Java heap is almost exhausted and large amounts of time are spent in garbage collection. If you increase the heap from 4GB to 5GB and see performance get better, then that would be confirmed. There would be less memory available for caching, but constant garbage collection would be a much greater problem than the disk cache being too small. Thanks, Shawn
Re: Applying Tokenizers and Filters to CopyFields
Thanks so much, Erick and Michael, for all the additional explanation. The crucial piece of information in the end turned out to be the one about the Default Search Field („df“). In solrconfig.xml, this parameter was set to point to the original text, which is why the expanded queries didn't work. When I set the df parameter to one of the fields with the expanded text, the search works fine. I have also removed the copyField declarations. It's all working as expected now. Thanks again for the help. Cheers, Martin
On 25.03.2015 at 23:43, Erick Erickson erickerick...@gmail.com wrote: Martin: Perhaps this would help:
indexed=true, stored=true: the field can be searched. The raw input (not analyzed in any way) can be shown to the user in the results list.
indexed=true, stored=false: the field can be searched. However, the field can't be returned in the results list with the document.
indexed=false, stored=true: the field cannot be searched, but the contents can be returned in the results list with the document. There are some use-cases where this is desirable behavior.
indexed=false, stored=false: the entire field is thrown out; it's just as if you didn't send the field to be indexed at all.
And one other thing, the copyField gets the _raw_ data, not the analyzed data. Let's say you have two fields, src and dst. Copying from src to dst in schema.xml is identical to <add> <doc> <field name="src">original text</field> <field name="dst">original text</field> </doc> </add> that is, copyField directives are not chained. Also, watch out for your query syntax. Michael's comments are spot-on, I'd just add this: http://localhost:8983/solr/windex/select?q=Sprache&fq=original&wt=json&indent=true is kind of odd. Let's assume you mean qf rather than fq. That _only_ matters if your query parser is edismax; it'll be ignored in this case, I believe. You'd want something like q=src:Sprache or q=dst:Sprache or even http://localhost:8983/solr/windex/select?q=Sprache&df=src http://localhost:8983/solr/windex/select?q=Sprache&df=dst where df is default field and the search is applied against that field in the absence of a field qualification like my first two examples. Best, Erick
On Wed, Mar 25, 2015 at 2:52 PM, Michael Della Bitta michael.della.bi...@appinions.com wrote: I agree the terminology is possibly a little confusing. Stored refers to values that are stored verbatim. You can retrieve them verbatim. Analysis does not affect stored values. Indexed values are tokenized/transformed and stored inverted. You can't recover the literal analyzed version (at least, not easily). If what you really want is to store and retrieve case-folded versions of your data as well as the original, you need to use something like an UpdateRequestProcessor, which I personally am less familiar with.
On Wed, Mar 25, 2015 at 5:28 PM, Martin Wunderlich martin...@gmx.net wrote: So, the pre-processing steps are applied under <analyzer type="index">. And this point is not quite clear to me: Assuming that I have a simple case-folding step applied to the target of the copyField: How or where are the lower-case tokens stored, if the text isn't added to the index? How is the query supposed to retrieve the lower-case version? (sorry if this sounds like a naive question, but I have a feeling that I am missing something really basic here).
Michael Della Bitta Senior Software Engineer o: +1 646 532 3062 appinions inc.
“The Science of Influence Marketing” 18 East 41st Street New York, NY 10017 t: @appinions https://twitter.com/Appinions | g+: plus.google.com/appinions https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts w: appinions.com http://www.appinions.com/
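For reference, a minimal sketch of the df change Martin describes (the field name text_expanded is invented for illustration): in solrconfig.xml, set df in the /select handler's defaults:

<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <!-- queries with no explicit field are parsed against this field -->
    <str name="df">text_expanded</str>
  </lst>
</requestHandler>

After that, a bare query like q=Sprache is searched against text_expanded unless the request names a field explicitly.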
RE: Solr Monitoring - Stored Stats?
Erick, Shawn, Thanks for your responses. I figured this was the case, just wanted to check to be sure. I have used Zabbix to configure JMX points to monitor over time, but it was a bit of work to get configured. We are looking to create a simple dashboard of a few stats over time. Looks like the easiest approach will be to make an app that calls for these stats at a regular interval and then indexes the results to Solr; then we will be able to query over desired time frames... Thanks, Matt -Original Message- From: Erick Erickson [mailto:erickerick...@gmail.com] Sent: Wednesday, March 25, 2015 10:30 AM To: solr-user@lucene.apache.org Subject: Re: Solr Monitoring - Stored Stats? Matt: Not really. There's a bunch of third-party log analysis tools that give much of this information (not everything exposed by JMX is in the log files, of course). Not quite sure whether things like Nagios, Zabbix and the like have this kind of stuff built in; seems like a natural extension of those kinds of tools though. Not much help here... Erick On Wed, Mar 25, 2015 at 8:26 AM, Matt Kuiper matt.kui...@issinc.com wrote: Hello, I am familiar with the JMX points that Solr exposes to allow for monitoring of statistics like QPS, numdocs, Average Query Time... snip
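One more option, for what it's worth: the same statistics are exposed over HTTP by the mbeans handler, which is easy for a polling app to hit on a schedule without wiring up JMX (host, port, and core name below are placeholders):

http://localhost:8983/solr/corename/admin/mbeans?stats=true&wt=json

The response carries the stats as JSON, ready to be timestamped and indexed into a separate stats core.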
Re: Applying Tokenizers and Filters to CopyFields
Glad you are sorted out! Michael Della Bitta Senior Software Engineer o: +1 646 532 3062 appinions inc. “The Science of Influence Marketing” 18 East 41st Street New York, NY 10017 t: @appinions https://twitter.com/Appinions | g+: plus.google.com/appinions https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts w: appinions.com http://www.appinions.com/ On Thu, Mar 26, 2015 at 10:09 AM, Martin Wunderlich martin...@gmx.net wrote: snip
Replacing a group of documents (Delete/Insert) without a query on the index ever showing an empty list (Docs)
Hi, I have an index which is made up of groups of documents; each group is defined by a field called keyField (e.g. keyField:A). I need to delete all the keyField:A documents and replace them with a brand new set, without the index ever returning zero documents on a query. At the moment I deleteByQuery keyField:A and then insert a SolrInputDocument list via SolrJ into my index. I have a small time period where somebody doing q=keyField:A can be returned an empty list. FYI: the keyField group might be just 100 documents or up to 10 million. Any help much appreciated. Thanks, Russ. Index example docs: [ { keyField:A ... lastField:xyz }, { keyField:A ... lastField:xyz }, { keyField:B ... lastField:xyz }, { keyField:A ... lastField:xyz }, { keyField:B ... lastField:xyz } ]
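One pattern that might close that window, sketched here (untested; the field values are invented): avoid committing between the delete and the adds, since searchers only see changes when a commit opens a new searcher. Assuming autoSoftCommit is disabled (or its interval is longer than the whole reload takes), send the delete without a commit:

<delete><query>keyField:A</query></delete>

then the replacement documents:

<add>
  <doc>
    <field name="id">A-1</field>
    <field name="keyField">A</field>
    <field name="lastField">xyz</field>
  </doc>
</add>

and only then a single <commit/>. Queries keep returning the old keyField:A documents right up to the commit, at which point they flip to the new set in one step.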