Solr performance improved under heavy load
I run a small SolrCloud cluster (4.5) of 3 nodes, with 3 collections of 3 shards each. Total index size per node is about 20GB, with about 70M documents. Under regular traffic (27-50 rpm) the performance is OK and response time ranges from 100 to 500ms. But when I start loading (overwriting) the 70M documents again via curl + CSV, performance drastically improves: I see 6ms response times (screenshot attached). So I am curious about this; intuitively Solr should perform better under low traffic and slow down as traffic goes up. What is the reason for this? More efficient memory management with more data? -- Thanks, -Utkarsh
Denormalize or use multivalued field for nested data?
I have to modify a schema so I can attach nested per-store pricing information to a product. For example: 10010137332: { title: iPad 64gb, description: iPad 64gb with retina, pricing: { merchantid64354: { locationid643: USD|600, locationid6436: USD|600 }, merchantid343: { locationid1345: USD|600, locationid4353: USD|600 } } } The suggestion all over the internet is to denormalize it. In my case, I would end up with total number of fields = total locations with a price, which is about 100k. I don't think having 100k fields for 60M products is a good idea. Are there any better ways of handling this? I am trying to figure out multiValued fields, but as far as I understand them, they can only be used as a flag and cannot be used to get a value associated with a key. Based on this answer, Solr 4.5+ supports nested documents: http://stackoverflow.com/a/5585891/231917 but I am currently on 4.4. -- Thanks, -Utkarsh
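For what it's worth, one middle ground between full denormalization (100k fields) and nested documents is to encode each merchant/location price as a delimited string inside a single multiValued field and decode it client-side. A minimal sketch; the field layout, names, and delimiter are assumptions, and note that this only supports key lookup, not server-side range queries or sorting on price:

```python
# Sketch: flatten nested per-store pricing into values for one multiValued
# string field instead of one dynamic field per location. Names and the "|"
# delimiter are hypothetical; each value encodes merchant, location,
# currency, and price.

def encode_pricing(pricing):
    """Flatten {merchant: {location: (currency, price)}} into strings."""
    values = []
    for merchant_id, locations in pricing.items():
        for location_id, (currency, price) in locations.items():
            values.append("%s|%s|%s|%s" % (merchant_id, location_id, currency, price))
    return values

def decode_price(values, merchant_id, location_id):
    """Look up the price for one merchant/location from the stored values."""
    prefix = "%s|%s|" % (merchant_id, location_id)
    for v in values:
        if v.startswith(prefix):
            currency, price = v[len(prefix):].split("|")
            return currency, float(price)
    return None

pricing = {"merchantid64354": {"locationid643": ("USD", 600),
                               "locationid6436": ("USD", 600)}}
doc_field = encode_pricing(pricing)   # would go into a multiValued string field
print(decode_price(doc_field, "merchantid64354", "locationid643"))  # ('USD', 600.0)
```

The trade-off: the index stays small and the schema fixed, but any filtering or sorting on the encoded price has to happen in the application after retrieval.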
Investigating performance issues in solr cloud
I see a sudden drop in throughput once every 3-4 days. The downtime lasts about 2-6 minutes and things stabilize after that. But I am not sure what is causing the problem. I have 3 shards with 20GB of data on each shard. Solr dashboard: http://i.imgur.com/6RWT2Dj.png New Relic graphs for a window of about 4 hours around the downtime: http://i.imgur.com/9vhKiB2.png The JVM memory graph looks normal: http://i.imgur.com/pAycgdC.png I thought it was GC pauses, but then it should show up in the New Relic data. How can I go about investigating this problem? I am running Solr 4.4.0 and don't see a strong reason to upgrade yet. -- Thanks, -Utkarsh
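One cheap way to investigate pauses like this, independent of the APM, is to turn on the JVM's own GC logging. A sketch of the extra flags for the HotSpot Java 6/7 JVMs discussed in this thread; the heap size, log path, and abbreviated start command are placeholders:

```shell
# Sketch: add HotSpot GC logging to the Solr start command so multi-second
# stop-the-world pauses show up with timestamps, even if New Relic misses
# them. -Xmx and the log path are placeholders; merge these flags into the
# full start command from this thread.
java -Xmx9g \
     -verbose:gc \
     -XX:+PrintGCDetails \
     -XX:+PrintGCDateStamps \
     -Xloggc:/var/log/solr/gc.log \
     -jar start.jar
# Then look for long pauses around the outage window, e.g.:
#   grep "Full GC" /var/log/solr/gc.log
```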
Re: Investigating performance issues in solr cloud
Lots of questions indeed :) 1. Total virtual machines: 3 2. Replication factor: 0 (don't have any replicas yet) 3. Each machine has 1 shard with 20GB of data, so the data for a collection is spread across 3 machines, totaling 60GB 4. Start solr: java -Xmx1m -javaagent:newrelic/newrelic.jar -Dsolr.clustering.enabled=true -Dsolr.solr.home=multicore -Djetty.class.path=lib/ext/* -Dbootstrap_conf=true -DnumShards=3 -DzkHost=localhost:2181 -jar start.jar 5. Yes, all machines have 24GB RAM and a 9GB heap. A separate ZK process is running on these machines. 6. top screenshot: http://i.imgur.com/g6w9Bim.png Thanks! On Tue, Apr 8, 2014 at 4:48 PM, Shawn Heisey s...@elyograg.org wrote: On 4/8/2014 5:30 PM, Utkarsh Sengar wrote: I see a sudden drop in throughput once every 3-4 days. The downtime lasts about 2-6 minutes and things stabilize after that. But I am not sure what is causing the problem. I have 3 shards with 20GB of data on each shard. Solr dashboard: http://i.imgur.com/6RWT2Dj.png New Relic graphs for a window of about 4 hours around the downtime: http://i.imgur.com/9vhKiB2.png The JVM memory graph looks normal: http://i.imgur.com/pAycgdC.png I thought it was GC pauses, but then it should show up in the New Relic data. How can I go about investigating this problem? I am running Solr 4.4.0 and don't see a strong reason to upgrade yet. Lots of questions: How many total machines? What is your replicationFactor? Does each machine have one shard replica and therefore 20GB of total index data, or if you add up all the index directories for the cores on each machine, is there more than 20GB of data? What options are you passing to your JVM when you start the servlet container that runs Solr? The dashboard says that this machine has 24GB of RAM and a 9GB heap. Is this the case for all machines? Is there any software other than Solr on the machine? If it's a linux/unix machine, can you run top, press shift-M to sort by memory, and grab a screenshot? 
If it's a Windows machine, a similar list should be available in the task manager, but it must include all processes for all users on the whole machine, and it would be best if it showed virtual memory as well as private. Thanks, Shawn -- Thanks, -Utkarsh
Re: Investigating performance issues in solr cloud
1. I am using the Oracle JVM: user@host:~$ java -version java version "1.6.0_45" Java(TM) SE Runtime Environment (build 1.6.0_45-b06) Java HotSpot(TM) 64-Bit Server VM (build 20.45-b01, mixed mode) 2. I will try out jHiccup and your GC settings. 3. Yes, I am running the ZK instances in an ensemble. I didn't know I need to pass all the ZK instances to a single Solr node; I will try it out right now. Does this mean that in a large cluster, say 50 Solr nodes and 10 ZK nodes, I would need to pass all 10 ZK nodes to -DzkHost of all 50 Solr processes? What is the reasoning behind this? Thanks, -Utkarsh On Tue, Apr 8, 2014 at 5:37 PM, Shawn Heisey s...@elyograg.org wrote: On 4/8/2014 6:00 PM, Utkarsh Sengar wrote: Lots of questions indeed :) 1. Total virtual machines: 3 2. Replication factor: 0 (don't have any replicas yet) 3. Each machine has 1 shard with 20GB of data, so the data for a collection is spread across 3 machines, totaling 60GB 4. Start solr: java -Xmx1m -javaagent:newrelic/newrelic.jar -Dsolr.clustering.enabled=true -Dsolr.solr.home=multicore -Djetty.class.path=lib/ext/* -Dbootstrap_conf=true -DnumShards=3 -DzkHost=localhost:2181 -jar start.jar 5. Yes, all machines have 24GB RAM and a 9GB heap. A separate ZK process is running on these machines. 6. top screenshot: http://i.imgur.com/g6w9Bim.png A followup question: What vendor and version of JVM are you running? Excellent choices include very recent Java 6 releases from Oracle, Oracle Java 7u25, and whatever OpenJDK version corresponds to Oracle 7u25. Good choices include most versions of Oracle Java 7, Oracle Java 6, and OpenJDK 7. The latest versions of Oracle Java 7 (from 7u40 to 7u51) have known bugs that affect Solr. OpenJDK 6 and commercial Java versions from non-Oracle vendors like IBM are very bad choices, because they have known serious bugs. I don't know much about the Zing JVM, but it is probably a good choice. If you are running Zing, then what I'm saying below about GC pauses will not apply. 
Solr 4.8 will require Java 7, so if you plan to upgrade that far, be sure you're not using Java 6 at all. One possible problem that I always investigate first is whether or not there's enough RAM to cache the index effectively. The 14GB of RAM in your disk cache is not a perfect setup for a 20GB index, but it should be plenty. The fact that you still have 4GB of RAM free on your top screenshot is further evidence that you do have plenty of disk cache. No need to pursue that any further. Garbage collection pauses are however a likely problem here. I have some personal experience with this problem. Because you're using the default collector and have 7GB heap allocated, I can almost guarantee that this is a problem, even if New Relic isn't showing it. A program called jHiccup *will* show the problem. http://www.azulsystems.com/jHiccup These are my GC settings. They work very well and are not specific to a certain heap size, although I am sure that the config can be improved: http://wiki.apache.org/solr/ShawnHeisey#GC_Tuning Regarding zookeeper: Are you running all three of your ZK instances in a redundant ensemble, where the config on each of them knows about all of them? You should definitely be doing this. If you are, then your zkHost parameter for Solr needs to reflect that: -DzkHost=host1:2181,host2:2181,host3:2181 Using only localhost:2181 could cause problems, and they could look like the problems you are seeing. Thanks, Shawn -- Thanks, -Utkarsh
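Shawn's zkHost advice, applied to the start command from earlier in the thread, would look roughly like this. The ZK hostnames are placeholders; the remaining flags are taken from the thread:

```shell
# Sketch: same start command as before, but with the full ZooKeeper ensemble
# in -DzkHost. Every Solr node gets the same complete list, so it can fail
# over to the surviving ZK nodes if one goes down; ZK client load is light,
# so even a large Solr cluster can share one ensemble string.
java -javaagent:newrelic/newrelic.jar \
     -Dsolr.clustering.enabled=true \
     -Dsolr.solr.home=multicore \
     -Djetty.class.path=lib/ext/* \
     -Dbootstrap_conf=true \
     -DnumShards=3 \
     -DzkHost=zk1:2181,zk2:2181,zk3:2181 \
     -jar start.jar
```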
Re: implement relevency
Hi Rashmi, Relevancy needs some kind of training data, which can lead to a chicken-and-egg problem. If you don't have that training set, then you need to come up with it or train manually (provide some seed). Our existing search had two years' worth of clickstream data, i.e. we know that when someone searched for ipod, they clicked on a UPC which was an iPod 4th gen or an iPod 5th gen 32GB, etc. So we used that data to build an internal lookup table of millions of queries, which looks something like this: ipod 32gb - music^1000, apple^1000, 32gb^991, 8gb^800 We wrote an algorithm which computes the keyword relevancy score, which is used as the boost value. Now, when a query like ipod 32gb comes in, we look it up in this table, get the boost values, and query Solr with them. We are happy with the results. Our use case was product search (title+description) over about 60M documents; I am not sure how this approach will work with a different use case. Thanks, -Utkarsh On Tue, Jan 28, 2014 at 9:22 AM, tamanjit.bin...@yahoo.co.in tamanjit.bin...@yahoo.co.in wrote: You may also want to look here http://wiki.apache.org/solr/SolrRelevancyFAQ -- View this message in context: http://lucene.472066.n3.nabble.com/implement-relevency-tp4113964p4113983.html Sent from the Solr - User mailing list archive at Nabble.com. -- Thanks, -Utkarsh
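The lookup-table approach described above can be sketched roughly like this. The table contents, field name, and fallback behavior are illustrative; the real system computes the per-keyword scores from clickstream data:

```python
# Sketch of lookup-table boosting: map a user query to per-keyword boosts
# mined offline, then build a boosted Solr q parameter. The table and the
# "title" field are hypothetical stand-ins.

boost_table = {
    "ipod 32gb": {"music": 1000, "apple": 1000, "32gb": 991, "8gb": 800},
}

def boosted_query(user_query, field="title"):
    boosts = boost_table.get(user_query)
    if not boosts:
        # No training data for this query: fall back to a plain field query.
        return "%s:(%s)" % (field, user_query)
    clauses = ["%s:%s^%d" % (field, kw, b) for kw, b in sorted(boosts.items())]
    return " ".join(clauses)

print(boosted_query("ipod 32gb"))
# title:32gb^991 title:8gb^800 title:apple^1000 title:music^1000
```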
Re: Complex nested structure in solr
Bumping this one, with an update: After thinking about this, I think getting rid of lat/lon will simplify things a bit. The new query pattern: Input: keyword=ipod, merchantId=922, locationId=81,82 Output: list of UPCs for ipod which exist inside stores 81 and 82, which are owned by merchant 922. Also, based on some previous answers, flattening this using dynamic fields will create a lot of fields in my case (one field for every merchantId), and then I can use a multiValued field to store locationIds per merchant. But is there a cleaner way of implementing this? Example: upc,merchantid_922,merchantid_800 892828282,[81,82], 922932932,,[22,23] http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201208.mbox/%3CCAEFAe-Hew1CKk=EyqACFUTKqGHExXZLSHtyrgym09aYQVJf=t...@mail.gmail.com%3E Thanks, -Utkarsh On Fri, Jan 24, 2014 at 12:05 PM, Utkarsh Sengar utkarsh2...@gmail.com wrote: Hi guys, I have to load extra metadata into an existing collection. This is what I am looking for: For a UPC: store availability by merchantId per location (which has lat/lon). My query pattern will be: given a keyword, find all available products for a merchantId around the given lat/lon. Example: Input: keyword=ipod, merchantId=922, lat/lon=28.222,82.333 Output: list of UPCs which match the criteria. So how should I go about doing it? Any suggestions? -- Thanks, -Utkarsh -- Thanks, -Utkarsh
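A sketch of the flattening described above. The merchantid_<id>_ss field-name pattern is an assumption; it just needs to match a multiValued dynamic-field rule (such as *_ss) in the schema:

```python
# Sketch: flatten merchant/location availability into one dynamic multiValued
# field per merchant, as discussed in this thread. Field names and IDs are
# illustrative only.

def to_solr_doc(upc, availability):
    """availability: {merchant_id: [location_id, ...]} -> flat Solr doc."""
    doc = {"upc": upc}
    for merchant_id, locations in availability.items():
        doc["merchantid_%s_ss" % merchant_id] = [str(l) for l in locations]
    return doc

doc = to_solr_doc("892828282", {922: [81, 82]})
print(doc)
# Query pattern would then be something like:
#   q=ipod&fq=merchantid_922_ss:(81 OR 82)
```

The downside stays the same as in the thread: the number of distinct field names grows with the number of merchants, though not with the (much larger) number of locations.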
Complex nested structure in solr
Hi guys, I have to load extra meta data to an existing collection. This is what I am looking for: For a UPC: Store availability by merchantId per location (which has lat/lon) My query pattern will be: Given a keyword, find all available products for a merchantId around the given lat/lon. Example: Input: keyword=ipod, merchantId=922,lat/lon=28.222,82.333 Output: List of UPCs which match the criteria So how should I go about doing it? Any suggestions? -- Thanks, -Utkarsh
shard merged into another shard as replica
I am not sure what happened: I updated the merchant collection and then restarted all the Solr machines. This is what I see right now: http://i.imgur.com/4bYuhaq.png The merchant collection looks fine, but the deals and prodinfo collections should have a total of 3 shards each. Somehow shard1 has become a replica of shard2. This is running in production, so how can I fix it without dumping the whole ZK data? -- Thanks, -Utkarsh
Re: shard merged into another shard as replica
solr 4.4.0 On Wed, Jan 22, 2014 at 3:12 PM, Mark Miller markrmil...@gmail.com wrote: What version of Solr are you running? - Mark On Jan 22, 2014, 5:42:30 PM, Utkarsh Sengar utkarsh2...@gmail.com wrote: I am not sure what happened: I updated the merchant collection and then restarted all the Solr machines. This is what I see right now: http://i.imgur.com/4bYuhaq.png The merchant collection looks fine, but the deals and prodinfo collections should have a total of 3 shards each. Somehow shard1 has become a replica of shard2. This is running in production, so how can I fix it without dumping the whole ZK data? -- Thanks, -Utkarsh -- Thanks, -Utkarsh
Re: shard merged into another shard as replica
Thanks Mark. I tried updating clusterstate manually and things went haywire :). So to fix it, I had to take 30s-1min of downtime, where I stopped Solr and ZK, deleted the /zookeeper_data/version-2 directory, and restarted everything again. I have automated these commands via Fabric, so I was able to recover from the downtime easily. Thanks, -Utkarsh On Wed, Jan 22, 2014 at 3:18 PM, Mark Miller markrmil...@gmail.com wrote: Hopefully an issue that has been fixed then. We should look into that. You should be able to fix it by directly modifying the clusterstate.json in ZooKeeper. Remember to back it up first! There are a variety of tools you can use to work with ZooKeeper - I like the eclipse plug-in that you can google for. Many, many SolrCloud bug fixes (we are about to release 4.6.1) since 4.4, so you might consider an upgrade if possible at some point soon. - Mark On Jan 22, 2014, 6:14:10 PM, Utkarsh Sengar utkarsh2...@gmail.com wrote: solr 4.4.0 On Wed, Jan 22, 2014 at 3:12 PM, Mark Miller markrmil...@gmail.com wrote: What version of Solr are you running? - Mark On Jan 22, 2014, 5:42:30 PM, Utkarsh Sengar utkarsh2...@gmail.com wrote: I am not sure what happened: I updated the merchant collection and then restarted all the Solr machines. This is what I see right now: http://i.imgur.com/4bYuhaq.png The merchant collection looks fine, but the deals and prodinfo collections should have a total of 3 shards each. Somehow shard1 has become a replica of shard2. This is running in production, so how can I fix it without dumping the whole ZK data? -- Thanks, -Utkarsh -- Thanks, -Utkarsh -- Thanks, -Utkarsh
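Mark's "back it up first" advice is cheap to follow; a sketch using the zkCli.sh that ships with ZooKeeper (the install path and ZK address are placeholders):

```shell
# Sketch: dump clusterstate.json from ZooKeeper before hand-editing it.
# Note that 3.4-era zkCli prints node stat metadata after the data, so trim
# the output before restoring from this file.
/opt/zookeeper/bin/zkCli.sh -server localhost:2181 get /clusterstate.json \
    > /tmp/clusterstate.backup.json
```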
Trigger event on change of a field in a document
I am experimenting with implementing a price-drop feature. Can I register certain fields of a document and trigger some sort of event if the values in those fields change? For example: 1. The price of itemX is $10 2. Say the price changes to $17 or $5 (increases or decreases) when the new data loads. 3. Trigger an event to take an action on that change, like sending out an email. I believe this is somewhat similar to, but not the same as, the percolator feature in Elasticsearch. -- Thanks, -Utkarsh
Re: Trigger event on change of a field in a document
Thanks! I think I will explore how to implement it outside Solr. -Utkarsh On Fri, Dec 27, 2013 at 3:20 PM, Otis Gospodnetic otis.gospodne...@gmail.com wrote: And if you really really really wanted that in Solr then have a look at UpdateRequestProcessors. Otis Solr ElasticSearch Support http://sematext.com/ On Dec 27, 2013 6:19 PM, Otis Gospodnetic otis.gospodne...@gmail.com wrote: Hi, This sounds like it would be best implemented outside the search engine. Otis Solr ElasticSearch Support http://sematext.com/ On Dec 27, 2013 4:29 PM, Utkarsh Sengar utkarsh2...@gmail.com wrote: I am experimenting with implementing a price-drop feature. Can I register certain fields of a document and trigger some sort of event if the values in those fields change? For example: 1. The price of itemX is $10 2. Say the price changes to $17 or $5 (increases or decreases) when the new data loads. 3. Trigger an event to take an action on that change, like sending out an email. I believe this is somewhat similar to, but not the same as, the percolator feature in Elasticsearch. -- Thanks, -Utkarsh -- Thanks, -Utkarsh
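A sketch of the "outside Solr" approach Otis suggests: diff the incoming prices against the previously indexed ones before each load, and fire a callback on changes. The storage and notification mechanics below are placeholders:

```python
# Sketch: detect price changes in the indexing pipeline, before documents
# reach Solr. old_prices would come from the previous load (or a key-value
# store); on_change is whatever notification hook you want (email, queue...).

def detect_price_changes(old_prices, new_prices, on_change):
    for item_id, new_price in new_prices.items():
        old_price = old_prices.get(item_id)
        if old_price is not None and old_price != new_price:
            on_change(item_id, old_price, new_price)

events = []
detect_price_changes(
    {"itemX": 10.0},
    {"itemX": 5.0, "itemY": 20.0},           # itemY is new, so no event
    lambda i, o, n: events.append((i, o, n)),  # e.g. send an email here
)
print(events)  # [('itemX', 10.0, 5.0)]
```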
Re: What is the difference between attorney:(Roger Miller) and attorney:Roger Miller
Also, attorney:(Roger Miller) is the same as attorney:Roger Miller, right? Or is the whole term Roger Miller run against attorney? Thanks, -Utkarsh On Tue, Nov 19, 2013 at 12:42 PM, Rafał Kuć r@solr.pl wrote: Hello! In the first one, the two terms 'Roger' and 'Miller' are both run against the attorney field. In the second, the 'Roger' term is run against the attorney field and the 'Miller' term is run against the default search field. -- Regards, Rafał Kuć Performance Monitoring * Log Analytics * Search Analytics Solr Elasticsearch Support * http://sematext.com/ We got different results for these two queries. The first one returned 115 records and the second returned 179 records. Thanks, Fudong -- Thanks, -Utkarsh
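An easy way to see this for yourself is debugQuery. A sketch against a local instance (host and collection name are placeholders):

```shell
# Sketch: debugQuery=true makes Solr echo the parsed query, which shows
# exactly which field each term was run against.
curl 'http://localhost:8983/solr/collection1/select?q=attorney:(Roger+Miller)&debugQuery=true&rows=0'
curl 'http://localhost:8983/solr/collection1/select?q=attorney:Roger+Miller&debugQuery=true&rows=0'
# Compare the "parsedquery" entry of the two responses: in the second one,
# the Miller term is shown under the default search field, not attorney.
```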
Re: High disk IO during UpdateCSV
Bumping this one again, any suggestions? On Tue, Nov 12, 2013 at 3:58 PM, Utkarsh Sengar utkarsh2...@gmail.com wrote: Hello, I load data from CSV into Solr via UpdateCSV. There are about 50M documents with 10 columns each. The index size is about 15GB and I am using a 3-node distributed Solr cluster. While loading the data, the disk IO goes to 100%. If the load balancer in front of Solr hits the machine which is doing the processing, then the request times out; but in general, requests to all the machines become slow. I have attached a screenshot of the disk I/O and CPU usage. Is there a fix in Solr which can throttle the load, or is it maybe due to the MergePolicy? How can I debug Solr to get the exact cause? -- Thanks, -Utkarsh -- Thanks, -Utkarsh
Re: High disk IO during UpdateCSV
Hi Michael, I am using Solr Cloud 4.5, and UpdateCSV loads data to one of these nodes. Attachment: http://i.imgur.com/1xmoNtt.png Thanks, -Utkarsh On Wed, Nov 13, 2013 at 8:33 AM, Michael Della Bitta michael.della.bi...@appinions.com wrote: Utkarsh, Your screenshot didn't come through. I don't think this list allows attachments. Maybe put it up on imgur or something? I'm a little unclear on whether you're using Solr in Cloud mode, or with a single master. Michael Della Bitta Applications Developer o: +1 646 532 3062 | c: +1 917 477 7906 appinions inc. "The Science of Influence Marketing" 18 East 41st Street New York, NY 10017 t: @appinions https://twitter.com/Appinions | g+: plus.google.com/appinions https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts w: appinions.com http://www.appinions.com/ On Wed, Nov 13, 2013 at 11:22 AM, Utkarsh Sengar utkarsh2...@gmail.com wrote: Bumping this one again, any suggestions? On Tue, Nov 12, 2013 at 3:58 PM, Utkarsh Sengar utkarsh2...@gmail.com wrote: Hello, I load data from CSV into Solr via UpdateCSV. There are about 50M documents with 10 columns each. The index size is about 15GB and I am using a 3-node distributed Solr cluster. While loading the data, the disk IO goes to 100%. If the load balancer in front of Solr hits the machine which is doing the processing, then the request times out; but in general, requests to all the machines become slow. I have attached a screenshot of the disk I/O and CPU usage. Is there a fix in Solr which can throttle the load, or is it maybe due to the MergePolicy? How can I debug Solr to get the exact cause? -- Thanks, -Utkarsh -- Thanks, -Utkarsh -- Thanks, -Utkarsh
Re: High disk IO during UpdateCSV
Thanks guys! I will start by splitting the file into chunks of 5M documents (10 chunks), and reduce the chunk size if needed. Thanks, -Utkarsh On Wed, Nov 13, 2013 at 9:08 AM, Walter Underwood wun...@wunderwood.org wrote: Don't load 50M documents in one shot. Break it up into reasonable chunks (100K?) with commits at each point. You will have a bottleneck somewhere, usually disk or CPU. Yours appears to be disk. If you get faster disks, it might become the CPU. wunder On Nov 13, 2013, at 8:22 AM, Utkarsh Sengar utkarsh2...@gmail.com wrote: Bumping this one again, any suggestions? On Tue, Nov 12, 2013 at 3:58 PM, Utkarsh Sengar utkarsh2...@gmail.com wrote: Hello, I load data from CSV into Solr via UpdateCSV. There are about 50M documents with 10 columns each. The index size is about 15GB and I am using a 3-node distributed Solr cluster. While loading the data, the disk IO goes to 100%. If the load balancer in front of Solr hits the machine which is doing the processing, then the request times out; but in general, requests to all the machines become slow. I have attached a screenshot of the disk I/O and CPU usage. Is there a fix in Solr which can throttle the load, or is it maybe due to the MergePolicy? How can I debug Solr to get the exact cause? -- Thanks, -Utkarsh -- Thanks, -Utkarsh -- Walter Underwood wun...@wunderwood.org -- Thanks, -Utkarsh
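Walter's chunking advice can be sketched like this. The splitting logic below keeps the CSV header on every chunk so each one is a valid /update/csv payload on its own; the chunk size and posting step are placeholders:

```python
# Sketch: split a big CSV into header-carrying chunks so no single
# /update/csv request has to index everything at once. Each chunk would then
# be POSTed with commit=true (or commitWithin) before sending the next.

def chunk_csv(lines, chunk_size):
    """Yield lists of lines, each beginning with the header row."""
    header, rows = lines[0], lines[1:]
    for i in range(0, len(rows), chunk_size):
        yield [header] + rows[i:i + chunk_size]

lines = ["id,name"] + ["%d,doc%d" % (i, i) for i in range(10)]
chunks = list(chunk_csv(lines, 4))
print(len(chunks))  # 3 (chunks of 4, 4, and 2 rows)
# Each chunk would then go to e.g.
#   curl 'http://host/solr/coll1/update/csv?commit=true' -H 'Content-Type: text/csv' --data-binary @chunk.csv
```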
High disk IO during UpdateCSV
Hello, I load data from CSV into Solr via UpdateCSV. There are about 50M documents with 10 columns each. The index size is about 15GB and I am using a 3-node distributed Solr cluster. While loading the data, the disk IO goes to 100%. If the load balancer in front of Solr hits the machine which is doing the processing, then the request times out; but in general, requests to all the machines become slow. I have attached a screenshot of the disk I/O and CPU usage. Is there a fix in Solr which can throttle the load, or is it maybe due to the MergePolicy? How can I debug Solr to get the exact cause? -- Thanks, -Utkarsh
Re: Stop/Restart Solr
We use this to start/stop solr: Start: java -Dsolr.clustering.enabled=true -Dsolr.solr.home=multicore -Djetty.class.path=lib/ext/* -Dbootstrap_conf=true -DnumShards=3 -DSTOP.PORT=8079 -DSTOP.KEY=some_value -jar start.jar Stop: java -Dsolr.solr.home=multicore -Dbootstrap_conf=true -DnumShards=3 -DSTOP.PORT=8079 -DSTOP.KEY=some_value -jar start.jar --stop Thanks, -Utkarsh On Tue, Oct 22, 2013 at 10:09 AM, Raheel Hasan raheelhasan@gmail.comwrote: ok fantastic... thanks a lot guyz On Tue, Oct 22, 2013 at 10:00 PM, François Schiettecatte fschietteca...@gmail.com wrote: Yago has the right command to search for the process, that will get you the process ID specifically the first number on the output line, then do 'kill ###', if that fails 'kill -9 ###'. François On Oct 22, 2013, at 12:56 PM, Raheel Hasan raheelhasan@gmail.com wrote: its CentOS... and using jetty with solr here.. On Tue, Oct 22, 2013 at 9:54 PM, François Schiettecatte fschietteca...@gmail.com wrote: A few more specifics about the environment would help, Windows/Linux/...? Jetty/Tomcat/...? François On Oct 22, 2013, at 12:50 PM, Yago Riveiro yago.rive...@gmail.com wrote: If you are asking about if solr has a way to restart himself, I think that the answer is no. If you lost control of the remote machine someone will need to go and restart the machine ... You can try use a kvm or other remote control system -- Yago Riveiro Sent with Sparrow (http://www.sparrowmailapp.com/?sig) On Tuesday, October 22, 2013 at 5:46 PM, François Schiettecatte wrote: If you are on linux/unix, use the kill command. François On Oct 22, 2013, at 12:42 PM, Raheel Hasan raheelhasan@gmail.com (mailto: raheelhasan@gmail.com) wrote: Hi, is there a way to stop/restart java? I lost control over it via SSH and connection was closed. But the Solr (start.jar) is still running. thanks. -- Regards, Raheel Hasan -- Regards, Raheel Hasan -- Regards, Raheel Hasan -- Thanks, -Utkarsh
Re: Check if dynamic columns exists and query else ignore
Bumping this one, any suggestions? It looks like if() and exists() are meant to solve this problem, but I am using them in the wrong way. -Utkarsh On Thu, Oct 17, 2013 at 1:16 PM, Utkarsh Sengar utkarsh2...@gmail.com wrote: I am trying to do this: if (US_offers_i exists): fq=US_offers_i:[1 TO *] else: fq=offers_count:[1 TO *] Where: US_offers_i is a dynamic field containing an int, and offers_count is a static field containing an int. I have tried this so far but it doesn't work: http://solr_server/solr/col1/select? q=iphone+5s fq=if(exist(US_offers_i),US_offers_i:[1 TO *], offers_count:[1 TO *]) Also, is there a heavy performance penalty for this condition? I am planning to use this for all my queries. -- Thanks, -Utkarsh -- Thanks, -Utkarsh
Re: Check if dynamic columns exists and query else ignore
Thanks Chris! That worked! I had overengineered my query! Thanks, -Utkarsh On Fri, Oct 18, 2013 at 12:02 PM, Chris Hostetter hossman_luc...@fucit.org wrote: : I trying to do this: : : if (US_offers_i exists): :fq=US_offers_i:[1 TO *] : else: :fq=offers_count:[1 TO *] if() and exists() are functions, so you would have to explicitly use them in a function context (ie: {!func} parser, or {!frange} parser), and to use those nested queries inside of functions you'd need to use the query() function. But nothing about your problem description suggests that you really need to worry about this. If a document doesn't contain US_offers_i, then US_offers_i:[1 TO *] won't match that document, and neither will US_offers_i:[* TO *] -- so you can implement the logic you describe with a simple query... fq=(US_offers_i:[1 TO *] (offers_count:[1 TO *] -US_offers_i:[* TO *])) Which you can read as: match docs with 1 or more US offers, or docs that have 1 or more offers but no US offer field at all. : Also, is there a heavy performance penalty for this condition? I am : planning to use this for all my queries. Any logic that you do at query time which can be precomputed into a specific field in your index will *always* make the queries faster (at the expense of a little more time spent indexing and a little more disk used). If you know in advance that you are frequently going to want to restrict on this type of logic, then unless you index docs more often than you search them, you should almost certainly index a has_offers boolean field that captures this logic. -Hoss -- Thanks, -Utkarsh
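Hoss's single filter really does reproduce the if/else intent, and a tiny simulation makes that easy to check. Plain Python dicts stand in for documents here, with a missing key modeling a document that lacks the dynamic field:

```python
# Sketch: simulate the boolean filter
#   fq=(US_offers_i:[1 TO *] (offers_count:[1 TO *] -US_offers_i:[* TO *]))
# against in-memory "documents" to confirm it matches the original
# "if field exists use US_offers_i, else use offers_count" logic.

def matches(doc):
    us = doc.get("US_offers_i")          # None models a doc without the field
    total = doc.get("offers_count", 0)
    return (us is not None and us >= 1) or (total >= 1 and us is None)

docs = [
    {"US_offers_i": 2, "offers_count": 0},  # matches via US offers
    {"US_offers_i": 0, "offers_count": 5},  # has the field but no US offers
    {"offers_count": 3},                    # no field: matches via offers_count
    {"offers_count": 0},                    # matches nothing
]
print([matches(d) for d in docs])  # [True, False, True, False]
```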
Check if dynamic columns exists and query else ignore
I am trying to do this: if (US_offers_i exists): fq=US_offers_i:[1 TO *] else: fq=offers_count:[1 TO *] Where: US_offers_i is a dynamic field containing an int, and offers_count is a static field containing an int. I have tried this so far but it doesn't work: http://solr_server/solr/col1/select? q=iphone+5s fq=if(exist(US_offers_i),US_offers_i:[1 TO *], offers_count:[1 TO *]) Also, is there a heavy performance penalty for this condition? I am planning to use this for all my queries. -- Thanks, -Utkarsh
Re: Using split in updateCSV for SolrCloud 4.4
Interestingly this URL by Jack works: 1. curl 'http://localhost/solr/prodinfo/update/csv?commit=true&f.merchantList.split=true&f.merchantList.separator=%3A&f.merchantList.encapsulator=%22&stream.contentType=text/csv&stream.file=/tmp/test.csv' But this doesn't (i.e. it doesn't split the column): 2. curl 'http://localhost/solr/prodinfo/update/csv?commit=true&f.merchantList.split=true&f.merchantList.separator=%3A&f.merchantList.encapsulator=%22&escape=\&stream.contentType=text/csv&stream.file=/data/dump/catalog.txt' The only difference was escape=\; I added that to Jack's example and it didn't work there either. So the culprit was escape=\, though I am not sure why. Thanks, -Utkarsh On Thu, Oct 10, 2013 at 6:11 PM, Yonik Seeley ysee...@gmail.com wrote: Perhaps try adding echoParams=all to check that all of the input params are being parsed as expected. -Yonik On Thu, Oct 10, 2013 at 8:10 PM, Utkarsh Sengar utkarsh2...@gmail.com wrote: Didn't help. This is the complete data: https://gist.github.com/utkarsh2012/6927649 (see the merchantList column). I tried this URL: curl 'http://localhost/solr/coll1/update/csv?commit=true&f.merchantList.split=true&f.merchantList.separator=%3A&f.merchantList.encapsulator=%22&escape=\&stream.contentType=text/csv&stream.file=/data/dump/log_20130101' Can this be a bug in the UpdateCSV split function? Thanks, -Utkarsh On Thu, Oct 10, 2013 at 3:11 PM, Jack Krupansky j...@basetechnology.com wrote: Using the standard Solr example for Solr 4.5, the following works, splitting the features CSV field into multiple values: curl "http://localhost:8983/solr/update/csv?commit=true&f.features.split=true&f.features.separator=%3A&f.features.encapsulator=%22" -H "Content-Type: text/csv" -d ' id,name,features doc-1,doc1,feat1:feat2' You may need to add stream.contentType=text/csv to your command. 
-- Jack Krupansky -Original Message- From: Utkarsh Sengar Sent: Thursday, October 10, 2013 4:51 PM To: solr-user@lucene.apache.org Subject: Using split in updateCSV for SolrCloud 4.4 Hello, I am trying to use split (http://wiki.apache.org/solr/UpdateCSV#split) while loading some CSV data via UpdateCSV. This is the field: <field name="merchantList" type="string" indexed="true" stored="true" multiValued="true" omitNorms="true" termVectors="false" termPositions="false" termOffsets="false"/> This is the column in the CSV (merchantList): ...values,16179:10950,...values... This is the URL I call: http://localhost/solr/coll1/update/csv?commit=true&f.merchantList.split=true&f.merchantList.separator=%3A&f.merchantList.encapsulator=&escape=\&stream.file=/data/dump/log_20130101' Currently when I load the data, I see this: merchantList: [16179:10950], But I want this: merchantList: [16179,10950], This example is an int but I have intentionally kept it as a string, since some values can also be strings. Any suggestions where I am going wrong? -- Thanks, -Utkarsh -- Thanks, -Utkarsh
Using split in updateCSV for SolrCloud 4.4
Hello, I am trying to use split (http://wiki.apache.org/solr/UpdateCSV#split) while loading some CSV data via UpdateCSV. This is the field: <field name="merchantList" type="string" indexed="true" stored="true" multiValued="true" omitNorms="true" termVectors="false" termPositions="false" termOffsets="false"/> This is the column in the CSV (merchantList): ...values,16179:10950,...values... This is the URL I call: http://localhost/solr/coll1/update/csv?commit=true&f.merchantList.split=true&f.merchantList.separator=%3A&f.merchantList.encapsulator=&escape=\&stream.file=/data/dump/log_20130101' Currently when I load the data, I see this: merchantList: [16179:10950], But I want this: merchantList: [16179,10950], This example is an int but I have intentionally kept it as a string, since some values can also be strings. Any suggestions where I am going wrong? -- Thanks, -Utkarsh
Re: Using split in updateCSV for SolrCloud 4.4
Didn't help. This is the complete data: https://gist.github.com/utkarsh2012/6927649 (see the merchantList column). I tried this URL: curl 'http://localhost/solr/coll1/update/csv?commit=true&f.merchantList.split=true&f.merchantList.separator=%3A&f.merchantList.encapsulator=%22&escape=\&stream.contentType=text/csv&stream.file=/data/dump/log_20130101' Can this be a bug in the UpdateCSV split function? Thanks, -Utkarsh On Thu, Oct 10, 2013 at 3:11 PM, Jack Krupansky j...@basetechnology.com wrote: Using the standard Solr example for Solr 4.5, the following works, splitting the features CSV field into multiple values: curl "http://localhost:8983/solr/update/csv?commit=true&f.features.split=true&f.features.separator=%3A&f.features.encapsulator=%22" -H "Content-Type: text/csv" -d ' id,name,features doc-1,doc1,feat1:feat2' You may need to add stream.contentType=text/csv to your command. -- Jack Krupansky -Original Message- From: Utkarsh Sengar Sent: Thursday, October 10, 2013 4:51 PM To: solr-user@lucene.apache.org Subject: Using split in updateCSV for SolrCloud 4.4 Hello, I am trying to use split (http://wiki.apache.org/solr/UpdateCSV#split) while loading some CSV data via UpdateCSV. This is the field: <field name="merchantList" type="string" indexed="true" stored="true" multiValued="true" omitNorms="true" termVectors="false" termPositions="false" termOffsets="false"/> This is the column in the CSV (merchantList): ...values,16179:10950,...values... 
This is the URL I call: http://localhost/solr/coll1/update/csv?commit=true&f.merchantList.split=true&f.merchantList.separator=%3A&f.merchantList.encapsulator=&escape=\&stream.file=/data/dump/log_20130101' Currently when I load the data, I see this: merchantList: [16179:10950], But I want this: merchantList: [16179,10950], This example is an int but I have intentionally kept it as a string, since some values can also be strings. Any suggestions where I am going wrong? -- Thanks, -Utkarsh -- Thanks, -Utkarsh
Re: Using split in updateCSV for SolrCloud 4.4
@Jack I just noticed in your example that feat1:feat2 is not in an encapsulator. Was that a typo or intentional? You are passing f.features.encapsulator=%22 but don't have quotes around feat1:feat2. I think the request should look like:

curl "http://localhost:8983/solr/update/csv?commit=true&f.features.split=true&f.features.separator=%3A&f.features.encapsulator=%22" -H 'Content-Type: text/csv' -d '
id,name,features
doc-1,doc1,"feat1:feat2"'

Thanks,
-Utkarsh

On Thu, Oct 10, 2013 at 5:10 PM, Utkarsh Sengar <utkarsh2...@gmail.com> wrote:

> Didn't help. This is the complete data: https://gist.github.com/utkarsh2012/6927649 (see the merchantList column). Can this be a bug in the UpdateCSV split function?

--
Thanks,
-Utkarsh
Re: Some text not indexed in solr4.4
@Furkan Yes, I have run a commit; other text is searchable. Not sure what you mean there about MultiPhraseQuery. It is mentioned in the context of SynonymFilterFactory, RemoveDuplicatesTokenFilterFactory and PositionFilterFactory. Which part are you referring to?

@Jason I get this response (I have a multi-core setup) by hitting this URL:
http://SOLR_SERVER/solr/prodinfo/terms?terms.fl=text&terms.prefix=dc

<response><lst name="responseHeader"><int name="status">0</int><int name="QTime">0</int></lst><lst name="terms"><lst name="text"/></lst></response>

Not sure how to interpret this response; I get the same empty result for any prefix like a, b, iph etc.

My guess is this is happening due to the WordDelimiterFilterFactory here: https://gist.github.com/utkarsh2012/6167128#file-schema-xml-L16. What do you think? Is dc44 somehow delimited at query time? The example here (http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory) says:

- Split on letter-number transitions (can be turned off - see splitOnNumerics parameter): "SD500" -> "SD", "500"

I will test it out and update this thread with my findings.

Thanks,
-Utkarsh

On Tue, Sep 17, 2013 at 5:10 PM, Jason Hellman <jhell...@innoventsolutions.com> wrote:

> Utkarsh,
>
> Check to see if the value is actually indexed into the field by using the Terms request handler:
>
> http://localhost:8983/solr/terms?terms.fl=text&terms.prefix=d
>
> (adjust the prefix to whatever you're looking for). This should get you going in the right direction.
>
> Jason
>
> On Sep 17, 2013, at 2:20 PM, Utkarsh Sengar <utkarsh2...@gmail.com> wrote:
>
>> I have a copyField called allText with type text_general: https://gist.github.com/utkarsh2012/6167128#file-schema-xml-L68
>> I have ~100 documents which have the text "dyson" and "dc44" or "dc41" etc. But when I search for allText:"dyson dc44" I get no results.
Re: Some text not indexed in solr4.4
WordDelimiterFilterFactory was the culprit. Removing it fixed the problem.

Thanks,
-Utkarsh

On Tue, Sep 24, 2013 at 12:17 PM, Utkarsh Sengar <utkarsh2...@gmail.com> wrote:

> My guess is this is happening due to the WordDelimiterFilterFactory here: https://gist.github.com/utkarsh2012/6167128#file-schema-xml-L16. I will test it out and update this thread with my findings.
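The culprit behavior can be illustrated. Below is a rough Python approximation (not the real Lucene filter) of WordDelimiterFilterFactory's default split on letter-number transitions; with this filter on only one side of the analysis chain, index and query produce different terms for dc44, so the phrase never matches:

```python
import re

def word_delimiter_parts(token):
    # Crude sketch of WordDelimiterFilterFactory's default behavior
    # (splitOnNumerics=1): break on letter<->number transitions.
    return re.findall(r"[A-Za-z]+|[0-9]+", token)

print(word_delimiter_parts("DC44"))   # ['DC', '44']
print(word_delimiter_parts("SD500"))  # ['SD', '500']
```

Either apply the same chain on both the index and query side, or (as done above) drop the filter entirely.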
Re: Dynamic row sizing for documents via UpdateCSV
Yeah, I think the only way to go about it is via SolrJ. The csv file is generated by a pig job which computes the data to be loaded into solr.

I think this is what I will end up doing: load all the possible columns in the csv, with a value of 0 where the value doesn't exist for a specific record. I was just trying to avoid that and find an optimal solution with UpdateCSV.

Thanks,
-Utkarsh

On Tue, Sep 17, 2013 at 5:43 AM, Erick Erickson <erickerick...@gmail.com> wrote:

> Well, it's reasonably easy if you have empty columns, in the same order, for _all_ of the possible dynamic fields, but I really doubt you are that fortunate... It's especially ugly in that you have the different dynamic fields scattered around.
>
> How is the csv file generated? Could you force every row to have _all_ the possible columns in the same order, with spaces or something in the columns that are empty? Otherwise I'd think about parsing them externally and using, say, SolrJ to transmit the individual records to Solr.
>
> Best,
> Erick
>
> On Mon, Sep 16, 2013 at 2:47 PM, Utkarsh Sengar <utkarsh2...@gmail.com> wrote:
>
>> I am using UpdateCSV to load data into solr, and some rows need dynamic fields (like ca_count_i or ny_count_i) that other rows don't have. Is there any way to pass a different column set for each row?

--
Thanks,
-Utkarsh
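The "pad every row with all possible columns" approach above can be sketched in a few lines. A hypothetical Python fragment (column names taken from the examples in this thread) that unions the dynamic columns and writes 0 into the gaps, producing a rectangular CSV that UpdateCSV can load:

```python
import csv
import io

# Rows as a pig job might emit them: each with its own dynamic fields.
rows = [
    {"userid": "john8322", "name": "John", "age": "32", "location": "CA",
     "ca_count_i": "7"},
    {"userid": "tom22", "name": "Tom", "age": "30", "location": "NY",
     "ny_count_i": "981", "oh_count_i": "11"},
]

# Union of every column seen across all rows: static fields first,
# then the dynamic ones in a stable (sorted) order.
static = ["userid", "name", "age", "location"]
dynamic = sorted({k for r in rows for k in r} - set(static))

out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=static + dynamic, restval="0")
writer.writeheader()
writer.writerows(rows)   # restval fills the missing columns with "0"
print(out.getvalue())
```

`restval="0"` is what implements the "0 where the value doesn't exist" idea; swap in an empty string if the field should simply be absent.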
Some text not indexed in solr4.4
I have a copyField called allText with type text_general: https://gist.github.com/utkarsh2012/6167128#file-schema-xml-L68

I have ~100 documents which have the text "dyson" and "dc44" or "dc41" etc. For example:

title: "Dyson DC44 Animal Digital Slim Cordless Vacuum"
description: "The DC44 Animal is the new Dyson Digital Slim vacuum cleaner, the cordless machine that doesn't lose suction. It has been engineered for floor to ceiling cleaning. DC44 Animal has a detachable long-reach wand which is balanced for floor to ceiling cleaning. The motorized floor tool has twice the power of the DC35 floor tool to drive the bristles deeper into the carpet pile with more force. It attaches to the wand or directly to the machine for cleaning awkward spaces. The brush bar has carbon fiber filaments for removing fine dust from hard floors. DC44 Animal has a run time of 20 minutes or 8 minutes on Boost mode. Powered by the Dyson digital motor, DC44 Animal has a fade-free nickel manganese cobalt battery and Root Cyclone technology for constant powerful suction."
UPC: 0879957006362

The documents are indexed. Analysis says it's indexed: http://i.imgur.com/O52ino1.png

But when I search for allText:"dyson dc44" I get no results; response: http://pastie.org/8334220

Any suggestions about the problem? I am out of ideas about how to debug this.

--
Thanks,
-Utkarsh
Re: Some text not indexed in solr4.4
To add to it, I see the exact same problem with the queries: nikon d7100, nikon d5100, samsung ps-we450 etc.

Thanks,
-Utkarsh

On Tue, Sep 17, 2013 at 2:20 PM, Utkarsh Sengar <utkarsh2...@gmail.com> wrote:

> I have ~100 documents which have the text "dyson" and "dc44" or "dc41" etc. But when I search for allText:"dyson dc44" I get no results. Any suggestions about the problem?
Dynamic row sizing for documents via UpdateCSV
Hello,

I am using UpdateCSV to load data into solr. Currently I load this schema with a static set of columns:

userid,name,age,location
john8322,John,32,CA
tom22,Tom,30,NY

But now I have a usecase where john8322 might have a state-specific dynamic field, for example:

userid,name,age,location,ca_count_i
john8322,John,32,CA,7

And tom22 might have different dynamic fields:

userid,name,age,location,ny_count_i,oh_count_i
tom22,Tom,30,NY,981,11

So, is it possible to pass a different column set for each row, something like this?

john8322,John,32,CA,ca_count_i:7
tom22,Tom,30,NY,ny_count_i:981,oh_count_i:11

I understand that the above syntax is not possible, but is there any other way of solving this problem?

--
Thanks,
-Utkarsh
Re: What does it mean when a shard is down in solr4.4?
Bumping this one, any suggestions? I am sure this is solrcloud 101, but I couldn't find documentation anywhere.

Thanks,
-Utkarsh

On Wed, Aug 28, 2013 at 2:37 PM, Utkarsh Sengar <utkarsh2...@gmail.com> wrote:

> At times when I rebuild the index, say collectionA on nodeA (shard1) via UpdateCSV, the Cloud status page says that collectionA on nodeA (shard1) is down. What does it really mean when a shard goes down, and how can I recover from that state?
What does it mean when a shard is down in solr4.4?
I have a 3 node solrcloud cluster with 3 shards for each collection/core. At times when I rebuild the index, say collectionA on nodeA (shard1) via UpdateCSV, the Cloud status page says that collectionA on nodeA (shard1) is down.

Observations:
1. Other collections on nodeA work.
2. collectionA on nodeB and nodeC works.
3. nodeA's solr admin is accessible too.

So my questions are:
1. What does it really mean when a shard goes down?
2. How can I recover from that state?

Solr cloud screenshot: http://i.imgur.com/2TgKXiC.png

--
Thanks,
-Utkarsh
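One place to look is the cluster state that the Cloud status page renders, stored in ZooKeeper as /clusterstate.json in this Solr generation. A Python sketch, assuming the general shape of a 4.x clusterstate document (fetch the real one via the admin UI's Cloud > Tree view or zkcli.sh); it lists every replica whose state is not "active":

```python
import json

# A fragment shaped like a SolrCloud 4.x /clusterstate.json
# (structure assumed; collection/node names are illustrative).
clusterstate = json.loads("""
{"collectionA": {"shards": {
  "shard1": {"replicas": {"core_node1": {"state": "down"}}},
  "shard2": {"replicas": {"core_node2": {"state": "active"}}}
}}}
""")

def unhealthy_replicas(state):
    """Return (collection, shard, replica, state) for replicas not active."""
    bad = []
    for coll, cdata in state.items():
        for shard, sdata in cdata["shards"].items():
            for name, rdata in sdata["replicas"].items():
                if rdata.get("state") != "active":
                    bad.append((coll, shard, name, rdata.get("state")))
    return bad

print(unhealthy_replicas(clusterstate))
```

A "down" state here means the replica registered in ZooKeeper but is not serving; "recovering" means it is replaying updates or pulling the index from the leader, which commonly follows a heavy reindex.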
Re: No documents found for some queries with special chars like m&m
Thanks for the info.

1. http://SERVER/solr/prodinfo/select?q=o%27reilly&wt=json&indent=true&debugQuery=true returns:

{
  "responseHeader":{
    "status":0,
    "QTime":16,
    "params":{
      "debugQuery":"true",
      "indent":"true",
      "q":"o'reilly",
      "wt":"json"}},
  "response":{"numFound":0,"start":0,"maxScore":0.0,"docs":[]
  },
  "debug":{
    "rawquerystring":"o'reilly",
    "querystring":"o'reilly",
    "parsedquery":"MultiPhraseQuery(allText:\"o'reilly (reilly oreilly)\")",
    "parsedquery_toString":"allText:\"o'reilly (reilly oreilly)\"",
    "QParser":"LuceneQParser",
    "explain":{}
  }
}

2. Analysis gives this: http://i.imgur.com/IPEiiEQ.png
I assume this means the tokens are the same for o'reilly.

3. I tried escaping the ', it doesn't help:
http://SERVER/solr/prodinfo/select?q=o\%27reilly&wt=json&indent=true

I will add WordDelimiterFilterFactory to the index chain and see if it fixes the problem.

Thanks,
-Utkarsh

On Mon, Aug 26, 2013 at 3:15 PM, Erick Erickson <erickerick...@gmail.com> wrote:

> First thing to do is attach &debug=query to your queries and look at the parsed output.
>
> Second thing to do is look at the admin/analysis page and see what happens at index and query time to things like o'reilly. You have WordDelimiterFilterFactory configured in your query but not your index analysis chain. My bet on that is that you're getting different tokens at query and index time...
>
> Third thing is that you need to escape the & character. It's probably being interpreted as a delimiter on the URL and Solr ignores params it doesn't understand.
>
> Best
> Erick
>
> On Mon, Aug 26, 2013 at 5:08 PM, Utkarsh Sengar <utkarsh2...@gmail.com> wrote:
>
>> Some of the queries (not all) with special chars return no documents, e.g. q=m&m and q=o'reilly. What's wrong with o'reilly, and how can I make the query m&m work?
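On the escaping point: the safe way to see what the URL should carry is to percent-encode the query value programmatically rather than hand-escaping. A small Python sketch (the SERVER host above is a placeholder; this only shows the encoding, not a live request):

```python
from urllib.parse import quote, urlencode

# Percent-encode the query value itself; %27 is the apostrophe,
# and an "&" inside a value would become %26 instead of being
# mistaken for a parameter delimiter.
q = "o'reilly"
print(quote(q))                                          # o%27reilly
print(urlencode({"q": q, "wt": "json", "indent": "true"}))
```

Backslash-escaping (q=o\'reilly) belongs to the Lucene query syntax, not to URL encoding; the two layers are independent, which is why the backslash alone didn't change anything here.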
Re: No documents found for some queries with special chars like m&m
Yup, the query o'reilly worked after adding WDF to the index analyzer. However, m&m or m\&m still doesn't work. Field analysis for m&m says:

ST: m, m
WDF: m, m
ST: m, m
WDF: m, m

So essentially the & is ignored during both index and query. My guess is the standard tokenizer is the problem. As the documentation says (http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.StandardTokenizerFactory):

Example: "I.B.M. 8.5 can't!!!" ==> ALPHANUM: "I.B.M.", NUM: "8.5", ALPHANUM: "can't"

The & char will be ignored, I guess.

*So, my question is:* Is there a way I can make m&m index as one string AND also keep StandardTokenizerFactory, since I need it for other searches?

Thanks,
-Utkarsh

On Tue, Aug 27, 2013 at 11:44 AM, Utkarsh Sengar <utkarsh2...@gmail.com> wrote:

> I will add WordDelimiterFilterFactory to the index chain and see if it fixes the problem.
Re: No documents found for some queries with special chars like m&m
> Use a different tokenizer, possibly one of the regex ones. Fake it with phrase queries. Take a really good look at the various filter combinations. It's possible that WhitespaceTokenizer and WordDelimiterFilterFactory might be able to do good things.

Will try to play with these two options.

> Clearly define whether this is capability that you really need.

Yes, this is a needed feature. Some of our queries are at&t, h&m, m&m; returning an empty response is not the best experience.

I also tried:

<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" preserveOriginal="1" types="wdfftypes.txt"/>

with wdfftypes.txt:

& => ALPHA
\u0026 => ALPHA
$ => DIGIT
% => DIGIT
. => DIGIT
\u002C => DIGIT

But it didn't work.

Thanks,
-Utkarsh

On Tue, Aug 27, 2013 at 3:07 PM, Erick Erickson <erickerick...@gmail.com> wrote:

> bq: Is there a way I can make m&m index as one string AND also keep StandardTokenizerFactory since I need it for other searches.
>
> In a word, no. You get one and only one tokenizer per field. But there are lots of options:
> - Use a different tokenizer, possibly one of the regex ones.
> - Fake it with phrase queries.
> - Take a really good look at the various filter combinations. It's possible that WhitespaceTokenizer and WordDelimiterFilterFactory might be able to do good things.
> - Clearly define whether this is capability that you really need. This last is my recurring plea to insure that the effort is of real benefit to the user, and not just something someone noticed that's actually only useful 0.001% of the time.
>
> Best
> Erick
No documents found for some queries with special chars like m&m
Some of the queries (not all) with special chars return no documents. Example:

Queries returning no documents:
q=m&m (this can be explained: when I search for "m m", no documents are returned either)
q=o'reilly (when I search for "o reilly", I get documents back)

Queries returning documents:
q=helloworld (document matched is "Hello World: A Life in Ham Radio")

My questions are:
1. What's wrong with o'reilly? What changes do I need in my field type?
2. How can I make the query m&m work? My index has a bunch of M&M's docs like "M&M's Milk Chocolate Candy Coated Peanuts 19.2 oz" and "M and Ms Chocolate Candies - Peanut - 1 Bag (42 oz)".

Field type:

<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishMinimalStemFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" preserveOriginal="1"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishMinimalStemFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

--
Thanks,
-Utkarsh
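To illustrate why q=m&m behaves exactly like "m m": StandardTokenizer treats & as punctuation, so it never reaches the index. A crude Python stand-in (not the real tokenizer, just an approximation of its output for simple ASCII input):

```python
import re

def standard_tokenizer_sketch(text):
    # Rough stand-in for StandardTokenizerFactory: keep runs of
    # letters/digits, allowing internal apostrophes and periods,
    # and drop everything else (including "&").
    return [t.lower()
            for t in re.findall(r"[A-Za-z0-9]+(?:['.][A-Za-z0-9]+)*", text)]

print(standard_tokenizer_sketch("M&M's"))    # ['m', "m's"]
print(standard_tokenizer_sketch("m m"))      # ['m', 'm']
print(standard_tokenizer_sketch("o'reilly")) # ["o'reilly"]
```

Note the apostrophe survives inside a token (which is why o'reilly is a different problem than m&m), while the & simply disappears on both the index and query sides.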
Re: loading solr from Pig?
That's a good point; we load data from pig to solr every day.

1. What we do: a pig job creates a csv dump, scps it over to a solr node, and the UpdateCSV request handler loads the data into solr. A complete rebuild of the index for about 50M documents (20GB) takes 20 mins (the pig job, which pulls and processes data in cassandra, plus the UpdateCSV load).

2. Alternate way: another way I explored was writing a pig UDF which POSTs to solr. But batched http posts were slower than a CSV load for a full index rebuild (and that was an important usecase for us).

These might not be the best practices; I would like to know how others are handling this problem.

Thanks,
-Utkarsh

On Wed, Aug 21, 2013 at 11:29 AM, geeky2 <gee...@hotmail.com> wrote:

> Hello All,
>
> Is anyone loading Solr from a Pig script / process?
>
> I was talking to another group in our company and they have standardized on MongoDB instead of Solr - apparently there is very good support between MongoDB and Pig, allowing users to stream data directly from a Pig process into MongoDB.
>
> Does solr have anything like this as well?
>
> thx
> mark
>
> --
> View this message in context: http://lucene.472066.n3.nabble.com/loading-solr-from-Pig-tp4085933.html
> Sent from the Solr - User mailing list archive at Nabble.com.
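The batching in the UDF approach can be sketched briefly. A hypothetical Python fragment (the collection URL is illustrative) that chunks documents into JSON payloads for the /update handler; this amortizes HTTP overhead, though as noted above it was still slower than one bulk CSV load for a full rebuild:

```python
import json

def batches(docs, size):
    """Yield fixed-size chunks so each POST to /update carries many docs."""
    for i in range(0, len(docs), size):
        yield docs[i:i + size]

docs = [{"id": str(n), "name": "doc%d" % n} for n in range(10)]
payloads = [json.dumps(chunk) for chunk in batches(docs, 4)]
print(len(payloads))  # 3 batches: 4 + 4 + 2 docs

# Each payload would be POSTed to e.g. http://localhost:8983/solr/coll1/update
# with Content-Type: application/json, followed by a single commit at the end
# rather than one commit per batch.
```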
Re: What filter to use to search with spaces omitted/included between words?
Thanks Tamanjit and Erick. I tried out the filters, most of the usecases work except q=bestbuy. As mentioned by Erick, that is a hard one to crack. I am looking into DictionaryCompoundWordTokenFilterFactory but compound words like these: http://www.manythings.org/vocabulary/lists/a/words.php?f=compound_words and generic english words, it won't cover my need of custom compound words of store names like BestBuy, WalMart or CirtuitCity. Thanks, -Utkarsh On Tue, Aug 20, 2013 at 4:43 AM, Jack Krupansky j...@basetechnology.comwrote: You could either have a synonym filter to replace bestbuy with best buy or use DictionaryCompoundWordTokenFil**terFactory to do the same. See: http://lucene.apache.org/core/**4_4_0/analyzers-common/org/** apache/lucene/analysis/**compound/**DictionaryCompoundWordTokenFil** terFactory.htmlhttp://lucene.apache.org/core/4_4_0/analyzers-common/org/apache/lucene/analysis/compound/DictionaryCompoundWordTokenFilterFactory.html There are some examples in my book, but they are for German compound words since that was the original primary intent for this filter. But it should work for any words since it is a simple dictionary. -- Jack Krupansky -Original Message- From: Erick Erickson Sent: Tuesday, August 20, 2013 7:21 AM To: solr-user@lucene.apache.org Subject: Re: What filter to use to search with spaces omitted/included between words? Also consider WordDelimterFilterFactory, which will break up the tokens on upper/lower case transitions. to get relevance, consider edismax-style query parsers and adding automatic phrase generation (with boosts usually). This one will be a problem: q=bestbuy There's no good generic way to get this to split up. One possibility is to use synonyms if the list is known, but otherwise there's no information here to distinguish it from legitimate words. edgeNgrams work on _tokens_, not words so I doubt they would help in this case either since there is only one token. 
Best, Erick

On Tue, Aug 20, 2013 at 3:16 AM, tamanjit.bin...@yahoo.co.in wrote: Additionally, if you don't want results like q=best and result=bestbuy, you can use <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\W+" replacement=""/> to actually replace whitespaces with nothing. http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#CharFilterFactories -- View this message in context: http://lucene.472066.n3.nabble.com/What-filter-to-use-to-search-with-spaces-omitted-included-between-words-tp4085576p4085601.html Sent from the Solr - User mailing list archive at Nabble.com.

-- Thanks, -Utkarsh
Re: What filter to use to search with spaces omitted/included between words?
Let me take that back, this actually works. q=bestbuy matches Best Buy and documents are returned.

<fieldType name="rl_keywords" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

I was using <tokenizer class="solr.StandardTokenizerFactory"/>; replacing it with <tokenizer class="solr.KeywordTokenizerFactory"/> did the trick. Not sure how it worked. The field value I am searching is Best Buy, but when I search for bestbuy, it returns a result. Thanks, -Utkarsh

On Tue, Aug 20, 2013 at 4:48 PM, Utkarsh Sengar utkarsh2...@gmail.com wrote: Thanks Tamanjit and Erick. I tried out the filters; most of the usecases work except q=bestbuy. As mentioned by Erick, that is a hard one to crack. I am looking into DictionaryCompoundWordTokenFilterFactory, but it covers compound words like these: http://www.manythings.org/vocabulary/lists/a/words.php?f=compound_words and generic English words; it won't cover my need for custom compound words of store names like BestBuy, WalMart or CircuitCity. Thanks, -Utkarsh On Tue, Aug 20, 2013 at 4:43 AM, Jack Krupansky j...@basetechnology.com wrote: You could either have a synonym filter to replace bestbuy with best buy or use DictionaryCompoundWordTokenFilterFactory to do the same.
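In case it helps anyone puzzling over why the tokenizer switch worked: with KeywordTokenizerFactory the whole field value stays a single token, and WordDelimiterFilterFactory with catenateWords=1 then both splits it on case/whitespace boundaries and emits the glued-together form. A rough Python sketch of the net effect (an illustration only, not Solr's actual analysis code; the regex is a simplification of the real delimiter rules):

```python
import re

def analyze(value):
    """Sketch of KeywordTokenizer + WordDelimiterFilter
    (generateWordParts=1, catenateWords=1, preserveOriginal=1)
    followed by LowerCaseFilter."""
    token = value  # KeywordTokenizer: the whole field value is one token
    # WordDelimiterFilter: split on non-alphanumerics and case transitions
    parts = re.findall(r'[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+', token)
    terms = set(parts)            # generateWordParts=1: "Best", "Buy"
    terms.add(''.join(parts))     # catenateWords=1: "Best"+"Buy" -> "BestBuy"
    terms.add(token)              # preserveOriginal=1: "Best Buy"
    return {t.lower() for t in terms}  # LowerCaseFilter

# The indexed terms for "Best Buy" include the catenated "bestbuy",
# which is why q=bestbuy now matches:
print(analyze("Best Buy"))
```

The same catenation is what lets q=circuitcity match CircuitCity without wildcards.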
-- Thanks, -Utkarsh
What filter to use to search with spaces omitted/included between words?
I have a field which consists of a store name. How can I make sure that these queries return relevant results when searched against this column:

*Example1: Best Buy*
q=best (tokenizer filter makes this work)
q=bestbuy
q=buy (tokenizer filter makes this work)
q=best buy (lower case filter makes this work)
q=Best Buy (this should work)

*Example2: CircuitCity*
q=circuit (adding * will fix it, but if I append it to every query, it creates a lot of noise too)
q=CircuitCity (this should work)
q=city (adding * will fix it, but if I append it to every query, it creates a lot of noise too)
q=circuit city
q=Circuit City

-- Thanks, -Utkarsh
Load a list of values in a solr field and query over its items
Hello, Is it possible to load a list in a solr field and query for items in that list?

example_core1:
document1: FieldName=user_ids Value=8,6,1,9,3,5,7
           FieldName=allText Value=text to be searched over with title and description
document2: FieldName=user_ids Value=8738,624623,7272,82272,733
           FieldName=allText Value=more text for document2

Query: allText:hello fq=user_ids:8,8738
Result: all documents which have the text hello in allText and userId=8

If this is not possible, what is a better way to solve this problem? -- Thanks, -Utkarsh
Re: Load a list of values in a solr field and query over its items
Never mind, got my answer here: http://stackoverflow.com/a/5800830/231917

<field name="tags">tag1</field>
<field name="tags">tag2</field>
...
<field name="tags">tagn</field>

Once you have all the values indexed, you can search or filter results by any value. E.g. you can find all documents with tag1 using a query like q=tags:tag1, or use the tags to filter out results like q=query&fq=tags:tag1. Thanks! -Utkarsh

On Wed, Aug 14, 2013 at 11:57 AM, Utkarsh Sengar utkarsh2...@gmail.com wrote: Thanks Aloke! So a multivalued field assumes: 1. if data is inserted in this form: 8738,624623,7272,82272,733, there are 5 unique values separated by a comma (or any other separator)? 2. And a filter query can be applied over it? Thanks, -Utkarsh

On Wed, Aug 14, 2013 at 11:45 AM, Aloke Ghoshal alghos...@gmail.com wrote: Should work once you set up both fields as multiValued (http://wiki.apache.org/solr/SchemaXml#Common_field_options).

-- Thanks, -Utkarsh
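A quick sketch of how a query against such a multivalued field could be built. The host and core name are placeholders; the OR inside the fq is my assumption for matching either of the two user ids from the example:

```python
from urllib.parse import urlencode

# Hypothetical host/core; user_ids is the multivalued field from the
# example above. fq ORs the wanted ids so a doc matches if it carries
# any one of them.
base = "http://localhost:8983/solr/example_core1/select"
params = {
    "q": "allText:hello",
    "fq": "user_ids:(8 OR 8738)",  # matches docs tagged with either id
    "wt": "json",
}
url = base + "?" + urlencode(params)
print(url)
```

If the data arrives as comma-separated strings via the CSV handler, the field still has to be split into individual values at index time (e.g. with the CSV update handler's per-field split/separator options) for a filter like this to work.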
Re: Suggest aka autocomplete request handler with solr 4.4
Hi Chris, You were right, appl was matched to application. So, I created a new type without the stemmer. New type:

<fieldType name="text_spell" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" preserveOriginal="1"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

Which has a field:

<field name="spellText" type="text_spell" indexed="true" stored="false" multiValued="true" omitNorms="true" termVectors="false" termPositions="false" termOffsets="false"/>

Which is a copyField target:

<copyField source="title" dest="spellText"/>
<copyField source="description" dest="spellText"/>
<copyField source="category" dest="spellText"/>
<copyField source="brand" dest="spellText"/>
<copyField source="subtitle" dest="spellText"/>

Although this is my problem now: when I run this query:
http://SOLR_SERVER/solr/prodinfo/spell?q=delll&spellcheck=true&spellcheck.collate=true&spellcheck.build=true

I get this response:

<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">9</int>
  </lst>
  <str name="command">build</str>
  <result name="response" numFound="0" start="0" maxScore="0.0"/>
  <lst name="spellcheck">
    <lst name="suggestions">
      <bool name="correctlySpelled">false</bool>
    </lst>
  </lst>
</response>

It knows the term is incorrect, but I don't get any suggestions back. What can be wrong here?
Thanks, -Utkarsh

On Thu, Aug 8, 2013 at 7:19 AM, Vinícius vinicius.remi...@gmail.com wrote: if correctSpelled is true, then appl was found in solr index. In this case, maybe the EnglishMinimalStemFilterFactory filter in text_general fieldType is messing your suggestion.

-- Thanks, -Utkarsh
Re: Suggest aka autocomplete request handler with solr 4.4
Jack/Chris,

1. This is my complete schema.xml: https://gist.github.com/utkarsh2012/6167128/raw/1d5ac6520b666435cd040b5cc6dcb434cdfd7925/schema.xml More specifically, allText is of type text_general, which has a LowerCaseFilterFactory at index time.

2. allText has values: http://solr_server/solr/prodinfo/terms?terms.fl=allText&terms.limit=100&indent=true returns a lot of values. I have never used the /terms request handler before, but it is very slow.

3. When I try this query: http://solr_server/solr/prodinfo/spell?q=appl&spellcheck=true&spellcheck.collate=true&spellcheck.build=true I get documents back which match the query appl. But my expectation is to get the spell-corrected keywords back, like apple, AND the documents with the keyword apple. Response from the above query:

<result>
  <doc>...</doc>
  <doc>...</doc>
  ...
</result>
<lst name="spellcheck">
  <lst name="suggestions">
    <bool name="correctlySpelled">true</bool>
  </lst>
</lst>

Thanks, -Utkarsh

On Mon, Aug 5, 2013 at 4:56 PM, Chris Hostetter hossman_luc...@fucit.org wrote:

: Where allText is a copy field which indexes all the content I have in
: document title, description etc.

what does the fieldType of allText look like?

: I have reindexed my data after adding this config (i.e. loading the whole
: dataset again via UpdateCSV), also tried to reload the core via http.

did you note the comments on that page regarding spellcheck.build? NOTE: currently implemented Lookup-s keep their data in memory, so unlike spellchecker data this data is discarded on core reload and not available until you invoke the build command, either explicitly or implicitly via commit. -Hoss

-- Thanks, -Utkarsh
Re: Suggest aka autocomplete request handler with solr 4.4
Bumping this one, is this feature maintained anymore? Thanks, -Utkarsh

On Fri, Aug 2, 2013 at 2:27 PM, Utkarsh Sengar utkarsh2...@gmail.com wrote: I am trying to get autocorrect and suggest feature work on my solr 4.4 setup. ...

-- Thanks, -Utkarsh
Suggest aka autocomplete request handler with solr 4.4
I am trying to get the autocorrect and suggest features to work on my solr 4.4 setup. As recommended here: http://wiki.apache.org/solr/Suggester, this is my solrconfig: http://apaste.info/eBPr Where allText is a copy field which indexes all the content I have in a document (title, description etc.). I am trying to use it like this: http://solr_server/solr/core1/suggest?q=appl and I expect to see apple, but I get this response:

<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">0</int>
  </lst>
</response>

I have reindexed my data after adding this config (i.e. loading the whole dataset again via UpdateCSV), and also tried to reload the core via http. So, I have 2 questions:

1. Is there a better way to reindex from the solr admin panel directly, without actually going through the process of loading the data again?
2. Any suggestions on what I am missing with the suggest request handler?

-- Thanks, -Utkarsh
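For intuition, the Suggester is conceptually a prefix lookup over indexed terms; Solr's actual Lookup implementations (TST/FST based) are far more compact, live in memory, and have to be (re)built via the build command after a core reload. A toy Python sketch of the idea only:

```python
import bisect

class PrefixSuggester:
    """Toy prefix lookup, illustrating what the Suggester does
    conceptually. Not Solr code; no weights, no fuzziness."""
    def __init__(self, terms):
        self.terms = sorted(set(terms))  # "build" step: sorted dictionary

    def suggest(self, prefix, count=5):
        # Binary-search to the first term >= prefix, then scan forward
        # while terms still share the prefix.
        i = bisect.bisect_left(self.terms, prefix)
        out = []
        while i < len(self.terms) and self.terms[i].startswith(prefix):
            out.append(self.terms[i])
            if len(out) == count:
                break
            i += 1
        return out

s = PrefixSuggester(["apple", "apply", "application", "iphone"])
print(s.suggest("appl"))  # -> ['apple', 'application', 'apply']
```

The "dictionary" here corresponds to the terms of the configured field (allText above), which is why an unbuilt or empty lookup returns an empty response.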
Re: Sort top N results in solr after boosting
Thanks guys! Will play around with the function query. Thanks, -Utkarsh

On Tue, Jul 30, 2013 at 10:50 AM, Chris Hostetter hossman_luc...@fucit.org wrote:

: bq: I am also trying to figure out if I can place
: extra dimensions to the solr score which takes other attributes into
: consideration

To re-iterate Erick's point, you should definitely look at using things like the {!boost} qparser combined with function queries that take into account pre-computed numeric data based on your domain knowledge to *augment* the scoring you get from text relevancy -- that is likely to prove far superior to taking some arbitrary cut-off of the top N documents and then sorting based on your domain knowledge...

https://people.apache.org/~hossman/ac2012eu/
https://www.youtube.com/watch?v=AosaVoBk8ok&list=PLsj1Ri57ZE94lISrJuy7W8COc2RNFC1Fl&index=2

-Hoss

-- Thanks, -Utkarsh
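A sketch of what a {!boost} query along the lines Hoss suggests might look like, using the attachment_count field from this thread. The host/core and the exact boost function are placeholders, not a recommendation; log(sum(attachment_count,1)) is just one plausible way to fold the count into the score without zeroing it out:

```python
from urllib.parse import urlencode

# {!boost b=...} multiplies the text-relevancy score of the wrapped
# query by the function value, so attachment-rich docs rise without a
# hard top-N cut. Hypothetical field/host names.
params = {
    "q": "{!boost b=log(sum(attachment_count,1))}iphone 5",
    "wt": "json",
}
query_string = urlencode(params)
print("http://solr_server/solr/core1/select?" + query_string)
```

Unlike the client-side re-sort of the top N, this blends the signal into every document's score.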
Re: monitor jvm heap size for solrcloud
We have been using newrelic (they have a free plan too) and it gives all the needed info, like: jvm heap usage in eden space, survivor space and old gen, garbage collection info, detailed info about the solr requests and their response times, error rates etc. I highly recommend using newrelic to monitor your solr cluster: http://blog.newrelic.com/2010/05/11/got-apache-solr-search-server-use-rpm-to-monitor-troubleshoot-and-tune-solr-operations/ Thanks, -Utkarsh

On Fri, Jul 26, 2013 at 2:38 PM, SolrLover bbar...@gmail.com wrote: I have used JMX with SOLR before.. http://docs.lucidworks.com/display/solr/Using+JMX+with+Solr -- View this message in context: http://lucene.472066.n3.nabble.com/monitor-jvm-heap-size-for-solrcloud-tp4080713p4080725.html Sent from the Solr - User mailing list archive at Nabble.com.

-- Thanks, -Utkarsh
Re: Sort top N results in solr after boosting
I agree with your comment on separating noise from the actual relevant results. My approach to separating relevant results from noise is not algorithmic but an absolute measure, i.e. the top 5 or top 10 results will always be relevant (at least the probability is higher). But again, that kind of simple sort can be done by the client too. The current relevant results are purely based off PMIs calculated using the clickstream data.

I am also trying to figure out if I can add extra dimensions to the solr score which take other attributes into consideration, i.e. extending the way solr computes the score with attachment_count (more attachments, more important), confidence (a stronger source has higher confidence) etc. Is there a way I can have my own custom scoring function which extends (and does not overwrite) solr's scores?

Thanks, -Utkarsh

On Wed, Jul 24, 2013 at 7:35 PM, Erick Erickson erickerick...@gmail.com wrote: You can certainly just include the attachment count in the response and have the app apply the secondary sort. But that doesn't separate the noise as you say. How would you identify noise? If you don't have an algorithmic way to do that, I don't know how you'd manage to separate the signal from the noise. Best, Erick

-- Thanks, -Utkarsh
Sort top N results in solr after boosting
I have a solr query which has a bunch of boost params for relevancy. This search works fine and returns the most relevant documents as per the user query. For example, if a user searches for iphone 5, keywords like apple, wifi etc. are boosted. I get these keywords from external training. The top 10-20 results are iphone 5 phones, followed by iphone cases and other noise. But I also have a field in the schema called attachment_count. I need to sort the top N results I get after the boost based on this field.

Example: I want to sort the top 5 documents based on attachment_count on the boosted result (which are relevant for the user).

1. iphone 5 32gb, attachment_count=0
2. iphone 5 16gb, attachment_count=5
3. iphone 5 32gb, attachment_count=10
4. iphone 4gs, attachment_count=3
5. iphone 4, attachment_count=1
...
11. iphone 5 case, attachment_count=100

Expected result:

1. iphone 5 32gb, attachment_count=10
2. iphone 5 16gb, attachment_count=5
3. iphone 4gs, attachment_count=3
4. iphone 4, attachment_count=1
5. iphone 5 32gb, attachment_count=0
...
11. iphone 5 case, attachment_count=100

Is this possible using a function query? I am not sure how the results will look, but I want to try it out. -- Thanks, -Utkarsh
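Whatever the function-query route turns out to look like, the literal "sort only the top N" behaviour is straightforward to do client-side. A minimal Python sketch using the example above (doc titles/counts taken from the post):

```python
def resort_top_n(docs, n=5, key="attachment_count"):
    """Re-sort only the first n relevance-ranked docs by the given
    field (descending), leaving the tail order untouched."""
    head = sorted(docs[:n], key=lambda d: d[key], reverse=True)
    return head + docs[n:]

docs = [
    {"title": "iphone 5 32gb", "attachment_count": 0},
    {"title": "iphone 5 16gb", "attachment_count": 5},
    {"title": "iphone 5 32gb", "attachment_count": 10},
    {"title": "iphone 4gs", "attachment_count": 3},
    {"title": "iphone 4", "attachment_count": 1},
    {"title": "iphone 5 case", "attachment_count": 100},
]
print([d["attachment_count"] for d in resort_top_n(docs)])
# -> [10, 5, 3, 1, 0, 100]
```

Note the noisy "iphone 5 case" (count 100) stays outside the re-sorted head, which is exactly the hard cut-off Hoss cautions against elsewhere in this thread.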
Re: How to use joins in solr 4.3.1
)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
    at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:560)
    at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
    at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1072)
    at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:382)
    at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
    at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1006)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
    at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
    at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
    at org.eclipse.jetty.server.Server.handle(Server.java:365)
    at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:485)
    at org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)
    at org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:937)
    at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:998)
    at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:856)
    at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:240)
    at org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
    at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
    at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
    at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
    at java.lang.Thread.run(Thread.java:662)

84343363 [qtp2012387303-17] ERROR org.apache.solr.core.SolrCore – org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Server at http://x:8983/solr/location returned non ok status:500, message:Server Error
    at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:372)
    at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:180)
    at org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:156)
    at org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:119)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
    at java.util.concurrent.FutureTask.run(FutureTask.java:138)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
    at java.util.concurrent.FutureTask.run(FutureTask.java:138)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
    at java.lang.Thread.run(Thread.java:662)

84343364 [qtp2012387303-17] INFO org.apache.solr.core.SolrCore – [location] webapp=/solr path=/select params={indent=true&q=*:*&_=1373999505886&wt=xml&fq={!join+from%3Dkey+to%3DmerchantId+fromIndex%3Dmerchant}} status=500 QTime=185

84343365 [qtp2012387303-17] ERROR org.apache.solr.servlet.SolrDispatchFilter – null:org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Server at http://x:8983/solr/location returned non ok status:500, message:Server Error
    at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:372)
    at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:180)
    at org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:156)
    at org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:119)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
    at java.util.concurrent.FutureTask.run(FutureTask.java:138)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
    at java.util.concurrent.FutureTask.run(FutureTask.java:138)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
    at java.lang.Thread.run(Thread.java:662)

Thanks, -Utkarsh

On Tue, Jul 16, 2013 at 5:24 AM, Erick Erickson erickerick...@gmail.com wrote: Not quite sure what's the problem with the second, but the first is: q=: That just isn't legal, try q=*:* As for the second, are there any other errors in the solr log? Sometimes what's returned in the response packet does not include the true source of the problem. Best, Erick

On Mon, Jul 15, 2013 at 7:40 PM, Utkarsh Sengar utkarsh2...@gmail.com
Re: How to use joins in solr 4.3.1
Found this post: http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201302.mbox/%3CCAB_8Yd82aqq=oY6dBRmVjG7gvBBewmkZGF9V=fpne4xgkbu...@mail.gmail.com%3E

And based on the answer, I modified my query: localhost:8983/solr/location/select?fq={!join from=key to=merchantId fromIndex=merchant}*:* I don't see any errors, but my original problem still persists: no documents are returned. The two fields on which I am trying to join are:

Merchant: <field name="merchantId" type="string" indexed="true" stored="true" multiValued="false"/>
Location: <field name="merchantId" type="string" indexed="false" stored="true" multiValued="false"/>

Thanks, -Utkarsh

On Tue, Jul 16, 2013 at 11:39 AM, Utkarsh Sengar utkarsh2...@gmail.com wrote: Looks like the JoinQParserPlugin is throwing an NPE. Query: localhost:8983/solr/location/select?q=*:*&fq={!join from=key to=merchantId fromIndex=merchant}

84343345 [qtp2012387303-16] ERROR org.apache.solr.core.SolrCore – java.lang.NullPointerException
    at org.apache.solr.search.JoinQuery.hashCode(JoinQParserPlugin.java:580)
    at org.apache.solr.search.QueryResultKey.<init>(QueryResultKey.java:50)
    at org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:1274)
    at org.apache.solr.search.SolrIndexSearcher.search(SolrIndexSearcher.java:457)
    at org.apache.solr.handler.component.QueryComponent.process(QueryComponent.java:410)
    at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:208)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1816)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:656)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:359)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155)
    at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1307)
    at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:453)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
    at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:560)
    at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
    at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1072)
    at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:382)
    at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
    at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1006)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
    at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
    at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
    at org.eclipse.jetty.server.Server.handle(Server.java:365)
    at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:485)
    at org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)
    at org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:937)
    at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:998)
    at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:856)
    at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:240)
    at org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
    at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
    at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
    at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
    at java.lang.Thread.run(Thread.java:662)

84343350 [qtp2012387303-16] INFO org.apache.solr.core.SolrCore – [location] webapp=/solr path=/select params={distrib=false&wt=javabin&version=2&rows=10&df=allText&fl=key,score&shard.url=x:8983/solr/location/&NOW=1373999694930&start=0&q=*:*&_=1373999505886&isShard=true&fq={!join+from%3Dkey+to%3DmerchantId+fromIndex%3Dmerchant}&fsv=true} status=500 QTime=6

84343351 [qtp2012387303-16] ERROR org.apache.solr.servlet.SolrDispatchFilter – null:java.lang.NullPointerException
    at org.apache.solr.search.JoinQuery.hashCode(JoinQParserPlugin.java:580)
    at org.apache.solr.search.QueryResultKey.<init>(QueryResultKey.java:50)
    at org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:1274)
    at org.apache.solr.search.SolrIndexSearcher.search(SolrIndexSearcher.java:457
How to use joins in solr 4.3.1
Hello, I am trying to join data between two cores: merchant and location. This is my query:

http://_server_.com:8983/solr/location/select?q={!join from=merchantId to=merchantId fromIndex=merchant}walgreens

Ref: http://wiki.apache.org/solr/Join

The merchant core has documents for the query walgreens with a merchantId of 1. A simple query http://_server_.com:8983/solr/location/select?q=walgreens returns documents called walgreens with merchantId=1. The location core has documents with merchantId=1 too. But my join query returns no documents. This is the response I get:

{
  "responseHeader":{
    "status":0,
    "QTime":5,
    "params":{
      "debugQuery":"true",
      "indent":"true",
      "q":"{!join from=merchantId to=merchantId fromIndex=merchant}walgreens",
      "wt":"json"}},
  "response":{"numFound":0,"start":0,"maxScore":0.0,"docs":[]},
  "debug":{
    "rawquerystring":"{!join from=merchantId to=merchantId fromIndex=merchant}walgreens",
    "querystring":"{!join from=merchantId to=merchantId fromIndex=merchant}walgreens",
    "parsedquery":"JoinQuery({!join from=merchantId to=merchantId fromIndex=merchant}allText:walgreens)",
    "parsedquery_toString":"{!join from=merchantId to=merchantId fromIndex=merchant}allText:walgreens",
    "QParser":"",
    "explain":{}}}

Any suggestions? -- Thanks, -Utkarsh
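For reference, a sketch of the shape of the query the rest of this thread converges on: put the join in an fq and keep a real main query (q=*:*). The host is a placeholder; key/merchantId are the from/to fields quoted later in the thread, and allText:walgreens is taken from the parsed query above:

```python
from urllib.parse import urlencode

# Cross-core join: select location docs whose merchantId matches the
# key of merchant docs that match allText:walgreens. Host is a
# placeholder.
params = {
    "q": "*:*",
    "fq": "{!join from=key to=merchantId fromIndex=merchant}allText:walgreens",
    "wt": "json",
}
url = "http://localhost:8983/solr/location/select?" + urlencode(params)
print(url)
```

One thing worth checking (general Lucene/Solr join behaviour, not something this thread states): the to field has to be indexed, so a location-side merchantId declared with indexed="false" would make the join match nothing.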
Re: How to use joins in solr 4.3.1
I have also tried these queries (as per this SO answer: http://stackoverflow.com/questions/12665797/is-solr-4-0-capable-of-using-join-for-multiple-core ):

1. http://_server_.com:8983/solr/location/select?q=:&fq={!join from=merchantId to=merchantId fromIndex=merchant}walgreens

And I get this:

{ responseHeader:{ status:400, QTime:1, params:{ indent:true, q::, wt:json, fq:{!join from=merchantId to=merchantId fromIndex=merchant}walgreens}}, error:{ msg:org.apache.solr.search.SyntaxError: Cannot parse ':': Encountered ':' at line 1, column 0. Was expecting one of: NOT ... + ... - ... BAREOPER ... ( ... * ... QUOTED ... TERM ... PREFIXTERM ... WILDTERM ... REGEXPTERM ... [ ... { ... LPARAMS ... NUMBER ... TERM ... * ..., code:400}}

2. http://_server_.com:8983/solr/location/select?q=walgreens&fq={!join from=merchantId to=merchantId fromIndex=merchant}

And I get this:

{ responseHeader:{ status:500, QTime:5, params:{ indent:true, q:walgreens, wt:json, fq:{!join from=merchantId to=merchantId fromIndex=merchant}}}, error:{ msg:Server at http://_SERVER_:8983/solr/location returned non ok status:500, message:Server Error, trace:org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Server at http://_SERVER_:8983/solr/location returned non ok status:500, message:Server Error
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:372)
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:180)
at org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:156)
at org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:119)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
at java.lang.Thread.run(Thread.java:662), code:500}}

Thanks, -Utkarsh

On Mon, Jul 15, 2013 at 4:27 PM, Utkarsh Sengar utkarsh2...@gmail.com wrote:

-- Thanks, -Utkarsh
Re: Improving performance to return 2000+ documents
Thanks Erick/Jagdish. Just to give some background on my queries:

1. All my queries are unique. A query can be "ipod" or "ipod 8gb" (but each is unique). There are about 1.2M in total. So I assume setting a high queryResultCache, queryResultWindowSize and queryResultMaxDocsCached won't help.

2. I have these cache settings:

<documentCache class="solr.LRUCache" size="1" initialSize="1" autowarmCount="0" cleanupThread="true"/>
// My understanding is, documentCache will help me the most because solr will cache retrieved documents.
// Stats for documentCache: http://apaste.info/hknh

<queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0" cleanupThread="true"/>
// Default, since my queries are unique.

<filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="0"/>
// Not sure how I can use filterCache, so I am keeping it at the default.

<enableLazyFieldLoading>true</enableLazyFieldLoading>
<queryResultWindowSize>100</queryResultWindowSize>
<queryResultMaxDocsCached>100</queryResultMaxDocsCached>

I think the question can also be framed as: how can I optimize solr response time for a 50M product catalog, for unique queries which retrieve 2000 documents in one go? I looked at writing a solr search component, but writing a proxy around solr was easier, so I went ahead with that approach.

Thanks, -Utkarsh

On Sun, Jun 30, 2013 at 6:54 PM, Jagdish Nomula jagd...@simplyhired.com wrote:
Solrconfig.xml has got entries which you can tweak for your use case. One of them is queryresultwindowsize. You can try using the value of 2000 and see if it helps improve performance. Please make sure you have enough memory allocated for queryresultcache. A combination of sharding and distribution of workload (requesting 2000/number of shards) with an aggregator would be a good way to maximize performance.
Thanks, Jagdish

On Sun, Jun 30, 2013 at 6:48 PM, Erick Erickson erickerick...@gmail.com wrote:
50M documents, depending on a bunch of things, may not be unreasonable for a single node, only testing will tell. But the question I have is whether you should be using standard Solr queries for this or building a custom component that goes at the base Lucene index and does the right thing. Or even re-indexing your entire corpus periodically to add this kind of data. FWIW, Erick

On Sun, Jun 30, 2013 at 2:00 PM, Utkarsh Sengar utkarsh2...@gmail.com wrote:
Thanks Erick/Peter. This is an offline process, used by a relevancy engine implemented around solr. The engine computes boost scores for related keywords based on clickstream data, i.e. say the clickstream has: ipad=upc1,upc2,upc3. I query solr with the keyword "ipad" (to get 2000 documents) and then make 3 individual queries for upc1, upc2 and upc3 (which are fast). The data is then used to compute keywords related to "ipad" with their boost values. So I cannot really replace that, since I need full-text search over my dataset to retrieve the top 2000 documents. I tried paging: I retrieve 500 solr documents 4 times (0-500, 500-1000, ...), but don't see any improvement.

Some questions:
1. Maybe the JVM size might help? This is what I see in the dashboard: Physical Memory 76.2%, Swap Space NaN% (don't have any swap space, running on AWS EBS), File Descriptor Count 4.7%, JVM-Memory 73.8%. Screenshot: http://i.imgur.com/aegKzP6.png
2. Will reducing the shards from 3 to 1 improve performance? (Maybe increase the RAM from 30 to 60GB.) The problem I will face in that case will be fitting 50M documents on 1 machine.

Thanks, -Utkarsh

On Sat, Jun 29, 2013 at 3:58 PM, Peter Sturge peter.stu...@gmail.com wrote:
Hello Utkarsh, This may or may not be relevant for your use-case, but the way we deal with this scenario is to retrieve the top N documents 5, 10, 20 or 100 at a time (user selectable). We can then page the results, changing the start parameter to return the next set. This allows us to 'retrieve' millions of documents - we just do it at the user's leisure, rather than make them wait for the whole lot in one go. This works well because users very rarely want to see ALL 2000 (or whatever number) documents at one time - it's simply too much to take in at one time. If your use-case involves an automated or offline procedure (e.g. running a report or some data-mining op), then presumably it doesn't matter so much if it takes a bit longer (as long as it returns in some reasonable time). Have you looked at doing paging on the client side - this will hugely speed up your search time. HTH Peter

On Sat, Jun 29, 2013 at 6:17 PM, Erick
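Peter's start/rows paging suggestion comes down to simple offset bookkeeping on the client. A small sketch of generating the page windows (plain Python, no Solr client assumed; each pair maps to the start=... and rows=... request parameters):

```python
def page_params(total, page_size):
    """Yield (start, rows) pairs covering `total` documents in page_size chunks."""
    for start in range(0, total, page_size):
        yield start, min(page_size, total - start)

# Fetching 2000 documents as four pages of 500, as tried earlier in the thread:
pages = list(page_params(2000, 500))
print(pages)  # [(0, 500), (500, 500), (1000, 500), (1500, 500)]
```

Note that with plain start/rows, deep offsets still cost each shard work proportional to start + rows, which is consistent with the observation in the thread that paging alone did not improve the total time for 2000 documents.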
Re: Improving performance to return 2000+ documents
Thanks Erick/Peter. This is an offline process, used by a relevancy engine implemented around solr. The engine computes boost scores for related keywords based on clickstream data, i.e. say the clickstream has: ipad=upc1,upc2,upc3. I query solr with the keyword "ipad" (to get 2000 documents) and then make 3 individual queries for upc1, upc2 and upc3 (which are fast). The data is then used to compute keywords related to "ipad" with their boost values. So I cannot really replace that, since I need full-text search over my dataset to retrieve the top 2000 documents. I tried paging: I retrieve 500 solr documents 4 times (0-500, 500-1000, ...), but don't see any improvement.

Some questions:
1. Maybe the JVM size might help? This is what I see in the dashboard: Physical Memory 76.2%, Swap Space NaN% (don't have any swap space, running on AWS EBS), File Descriptor Count 4.7%, JVM-Memory 73.8%. Screenshot: http://i.imgur.com/aegKzP6.png
2. Will reducing the shards from 3 to 1 improve performance? (Maybe increase the RAM from 30 to 60GB.) The problem I will face in that case will be fitting 50M documents on 1 machine.

Thanks, -Utkarsh

On Sat, Jun 29, 2013 at 3:58 PM, Peter Sturge peter.stu...@gmail.com wrote:
Hello Utkarsh, This may or may not be relevant for your use-case, but the way we deal with this scenario is to retrieve the top N documents 5, 10, 20 or 100 at a time (user selectable). We can then page the results, changing the start parameter to return the next set. This allows us to 'retrieve' millions of documents - we just do it at the user's leisure, rather than make them wait for the whole lot in one go. This works well because users very rarely want to see ALL 2000 (or whatever number) documents at one time - it's simply too much to take in at one time. If your use-case involves an automated or offline procedure (e.g. running a report or some data-mining op), then presumably it doesn't matter so much if it takes a bit longer (as long as it returns in some reasonable time).
Have you looked at doing paging on the client side - this will hugely speed up your search time. HTH Peter

On Sat, Jun 29, 2013 at 6:17 PM, Erick Erickson erickerick...@gmail.com wrote:
Well, depending on how many docs get served from the cache the time will vary. But this is just ugly, if you can avoid this use-case it would be a Good Thing. Problem here is that each and every shard must assemble the list of 2,000 documents (just ID and sort criteria, usually score). Then the node serving the original request merges the sub-lists to pick the top 2,000. Then the node sends another request to each shard to get the full document. Then the node merges this into the full list to return to the user. Solr really isn't built for this use-case, is it actually a compelling situation? And having your document cache set at 1M is kinda high if you have very big documents. FWIW, Erick

On Fri, Jun 28, 2013 at 8:44 PM, Utkarsh Sengar utkarsh2...@gmail.com wrote:
Also, I don't see a consistent response time from solr, I ran ab again and I get this:

ubuntu@ip-10-149-6-68:~$ ab -c 10 -n 500 "http://x.amazonaws.com:8983/solr/prodinfo/select?q=allText:huggies%20diapers%20size%201&rows=2000&wt=json"
Benchmarking x.amazonaws.com (be patient)
Completed 100 requests
Completed 200 requests
Completed 300 requests
Completed 400 requests
Completed 500 requests
Finished 500 requests

Server Software:
Server Hostname: x.amazonaws.com
Server Port: 8983
Document Path: /solr/prodinfo/select?q=allText:huggies%20diapers%20size%201&rows=2000&wt=json
Document Length: 1538537 bytes
Concurrency Level: 10
Time taken for tests: 10.858 seconds
Complete requests: 500
Failed requests: 8 (Connect: 0, Receive: 0, Length: 8, Exceptions: 0)
Write errors: 0
Total transferred: 769297992 bytes
HTML transferred: 769268492 bytes
Requests per second: 46.05 [#/sec] (mean)
Time per request: 217.167 [ms] (mean)
Time per request: 21.717 [ms] (mean, across all concurrent requests)
Transfer rate: 69187.90 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.3      0       2
Processing:   110  215  72.0    190     497
Waiting:       91  180  70.5    152     473
Total:        112  216  72.0    191     497

Percentage of the requests served within a certain time (ms)
  50%    191
  66%    225
  75%    252
  80%    272
  90%    319
  95%    364
  98%    420
  99%    453
 100%    497 (longest request)

Sometimes it takes a lot of time, sometimes it's pretty quick. Thanks, -Utkarsh

On Fri, Jun 28, 2013 at 5:39 PM, Utkarsh Sengar utkarsh2...@gmail.com wrote: Hello, I have
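Erick's caution about a 1M-entry documentCache can be sanity-checked against the numbers in this very thread. A back-of-envelope sketch - it assumes a cached document costs at least its serialized response size, with the per-doc size derived from the ab run's Document Length; the actual Java heap cost will be higher due to object overhead:

```python
# Per-document size estimated from the ab output above:
response_bytes = 1538537      # "Document Length" for a rows=2000 response
docs_per_response = 2000
avg_doc_bytes = response_bytes / docs_per_response   # ~769 bytes/doc

# Lower bound on heap for a 1M-entry documentCache:
cache_entries = 1_000_000
est_heap_bytes = cache_entries * avg_doc_bytes
print(round(avg_doc_bytes), round(est_heap_bytes / 2**30, 2))  # roughly 0.7 GB, before JVM overhead
```

Against the 7GB heap mentioned in the thread, that lower bound is already significant once per-object overhead (often several times the raw payload) is added.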
Improving performance to return 2000+ documents
Hello, I have a usecase where I need to retrieve the top 2000 documents matching a query. What are the parameters (in the query, solrconfig, schema) I should look at to improve this?

I have 45M documents in a 3-node solrcloud 4.3.1 cluster with 3 shards, with 30GB RAM, 8 vCPU and 7GB JVM heap size. I have documentCache:

<documentCache class="solr.LRUCache" size="100" initialSize="100" autowarmCount="0"/>

allText is a copyField. This is the result I get:

ubuntu@ip-10-149-6-68:~$ ab -c 10 -n 500 "http://x.amazonaws.com:8983/solr/prodinfo/select?q=allText:huggies%20diapers%20size%201&rows=2000&wt=json"
Benchmarking x.amazonaws.com (be patient)
Completed 100 requests
Completed 200 requests
Completed 300 requests
Completed 400 requests
Completed 500 requests
Finished 500 requests

Server Software:
Server Hostname: x.amazonaws.com
Server Port: 8983
Document Path: /solr/prodinfo/select?q=allText:huggies%20diapers%20size%201&rows=2000&wt=json
Document Length: 1538537 bytes
Concurrency Level: 10
Time taken for tests: 35.999 seconds
Complete requests: 500
Failed requests: 21 (Connect: 0, Receive: 0, Length: 21, Exceptions: 0)
Write errors: 0
Non-2xx responses: 2
Total transferred: 766221660 bytes
HTML transferred: 766191806 bytes
Requests per second: 13.89 [#/sec] (mean)
Time per request: 719.981 [ms] (mean)
Time per request: 71.998 [ms] (mean, across all concurrent requests)
Transfer rate: 20785.65 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.6      0       8
Processing:     9  717 2339.6    199   12611
Waiting:        9  635 2233.6    164   12580
Total:          9  718 2339.6    199   12611

Percentage of the requests served within a certain time (ms)
  50%    199
  66%    236
  75%    263
  80%    281
  90%    548
  95%    838
  98%  12475
  99%  12545
 100%  12611 (longest request)

-- Thanks, -Utkarsh
Re: Improving performance to return 2000+ documents
Also, I don't see a consistent response time from solr, I ran ab again and I get this:

ubuntu@ip-10-149-6-68:~$ ab -c 10 -n 500 "http://x.amazonaws.com:8983/solr/prodinfo/select?q=allText:huggies%20diapers%20size%201&rows=2000&wt=json"
Benchmarking x.amazonaws.com (be patient)
Completed 100 requests
Completed 200 requests
Completed 300 requests
Completed 400 requests
Completed 500 requests
Finished 500 requests

Server Software:
Server Hostname: x.amazonaws.com
Server Port: 8983
Document Path: /solr/prodinfo/select?q=allText:huggies%20diapers%20size%201&rows=2000&wt=json
Document Length: 1538537 bytes
Concurrency Level: 10
Time taken for tests: 10.858 seconds
Complete requests: 500
Failed requests: 8 (Connect: 0, Receive: 0, Length: 8, Exceptions: 0)
Write errors: 0
Total transferred: 769297992 bytes
HTML transferred: 769268492 bytes
Requests per second: 46.05 [#/sec] (mean)
Time per request: 217.167 [ms] (mean)
Time per request: 21.717 [ms] (mean, across all concurrent requests)
Transfer rate: 69187.90 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.3      0       2
Processing:   110  215  72.0    190     497
Waiting:       91  180  70.5    152     473
Total:        112  216  72.0    191     497

Percentage of the requests served within a certain time (ms)
  50%    191
  66%    225
  75%    252
  80%    272
  90%    319
  95%    364
  98%    420
  99%    453
 100%    497 (longest request)

Sometimes it takes a lot of time, sometimes it's pretty quick. Thanks, -Utkarsh

On Fri, Jun 28, 2013 at 5:39 PM, Utkarsh Sengar utkarsh2...@gmail.com wrote:

-- Thanks, -Utkarsh
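For anyone comparing the two ab runs in this thread, the summary figures are related by simple arithmetic. A sketch using the faster run's numbers (tiny differences from ab's printout come from ab rounding the reported total time):

```python
# Numbers from the 10.858 s ab run above:
total_time_s = 10.858
requests = 500
concurrency = 10

rps = requests / total_time_s                                  # "Requests per second"
mean_per_request_ms = total_time_s / requests * concurrency * 1000  # "Time per request (mean)"
mean_across_all_ms = total_time_s / requests * 1000            # "(mean, across all concurrent requests)"

print(round(rps, 2), round(mean_per_request_ms, 2), round(mean_across_all_ms, 2))
```

This arithmetic also shows why the first run's 719.981 ms mean coexists with a 199 ms median: a handful of ~12-second outliers dominate the mean, which points at intermittent stalls (GC pauses, cache misses) rather than uniformly slow queries.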
Updating solrconfig and schema.xml for solrcloud in multicore setup
Hello, I am trying to update schema.xml for a core in a multicore setup, and this is what I do to update it. I have 3 nodes in my solr cluster:

1. Pick node1 and manually update schema.xml
2. Restart node1 with -Dbootstrap_conf=true:
java -Dsolr.solr.home=multicore -DnumShards=3 -Dbootstrap_conf=true -DzkHost=localhost:2181 -DSTOP.PORT=8079 -DSTOP.KEY=mysecret -jar start.jar
3. Restart the other 2 nodes using this command (without -Dbootstrap_conf=true, since these should pull from zk):
java -Dsolr.solr.home=multicore -DnumShards=3 -DzkHost=localhost:2181 -DSTOP.PORT=8079 -DSTOP.KEY=mysecret -jar start.jar

But when I do that, node1 displays all of my cores and the other 2 nodes display just one core. Then I found this: http://mail-archives.apache.org/mod_mbox/lucene-dev/201205.mbox/%3cbb7ad9bf-389b-4b94-8c1b-bbfc4028a...@gmail.com%3E which says bootstrap_conf is used for a multicore setup. But if I use bootstrap_conf for every node, then I will have to manually update schema.xml (or any config file) everywhere. That does not sound like an efficient way of managing configuration, right? -- Thanks, -Utkarsh
Re: Updating solrconfig and schema.xml for solrcloud in multicore setup
But when I launch a solr instance without -Dbootstrap_conf=true, just one core is launched and I cannot see the other core. This behavior is the same as Mark's reply here: http://mail-archives.apache.org/mod_mbox/lucene-dev/201205.mbox/%3cbb7ad9bf-389b-4b94-8c1b-bbfc4028a...@gmail.com%3E

- bootstrap_conf: you pass it true and it reads solr.xml and uploads the conf set for each SolrCore it finds, gives the conf set the name of the collection and associates each collection with the same named config set. So the first just lets you bootstrap one collection easily... but what if you start with a multi-core, multi-collection setup that you want to bootstrap into SolrCloud? And they don't share a common config set? That's what the second command is for. You can set up 30 local SolrCores in solr.xml and then just bootstrap all 30 different config sets up and have them fully linked with each collection just by passing bootstrap_conf=true.

Note: I am using -Dbootstrap_conf=true and not -Dbootstrap_confdir.

Thanks, -Utkarsh

On Tue, Jun 25, 2013 at 2:14 AM, Jan Høydahl jan@cominvent.com wrote:
Hi, The -Dbootstrap_confdir option is really only meant for a first-time bootstrap for your development environment, not for serious use. Once you got your config into ZK you should modify the config directly in ZK. There are many tools (also 3rd party) for this. But your best choice is probably zkCli shipping with Solr. See http://wiki.apache.org/solr/SolrCloud#Command_Line_Util This means you will NOT need to start Solr with -Dboostrap_confdir at all. -- Jan Høydahl, search solution architect, Cominvent AS - www.cominvent.com

25. juni 2013 kl. 10:29 skrev Utkarsh Sengar utkarsh2...@gmail.com:

-- Thanks, -Utkarsh
Re: Updating solrconfig and schema.xml for solrcloud in multicore setup
Yes, I have tried zkCli and it works. But I also need to restart solr after the schema change, right? I tried to reload the core, but I think there is an open bug where a core reload is successful but a shard goes down for that core. I just tried it out, i.e. I tried to reload a core after a config change via zkCli and a shard went down. Since I am not able to reload a core, I am restarting the whole solr process to make the change.

Thanks, -Utkarsh

On Tue, Jun 25, 2013 at 2:46 AM, Jan Høydahl jan@cominvent.com wrote:
Hi, As I understand, your initial bootstrap works ok (boostrap_conf). What you want help with is *changing* the config on a live system. That's when you are encouraged to use zkCli and don't mess with trying to let Solr bootstrap things - after all it's not a bootstrap anymore, it's a change. Did you try updating schema.xml for a specific collection using zkCli? Any issues? -- Jan Høydahl, search solution architect, Cominvent AS - www.cominvent.com

25. juni 2013 kl. 11:24 skrev Utkarsh Sengar utkarsh2...@gmail.com:

-- Thanks, -Utkarsh
Re: Updating solrconfig and schema.xml for solrcloud in multicore setup
I believe I am hitting this bug: https://issues.apache.org/jira/browse/SOLR-4805. I am using solr 4.3.1.

-Utkarsh

On Tue, Jun 25, 2013 at 2:56 AM, Utkarsh Sengar utkarsh2...@gmail.com wrote:

-- Thanks, -Utkarsh
Re: solrcloud 4.3.1 - stability and failure scenario questions
Thanks! 1. shards.tolerant=true works, shouldn't this parameter be default? 2. Regarding zk, yes it should be outside the solr nodes and I am evaluating what difference does it make. 3. Regarding usecase: Daily queries will be about 100k to 200k, not much. The total data to be indexed is about 45M documents with a total size of 20GB. 3 nodes (sharded and RAM of 30GB each) with 3 replicas sounds like an overkill for this? Thanks, -Utkarsh On Sat, Jun 22, 2013 at 8:53 PM, Shalin Shekhar Mangar shalinman...@gmail.com wrote: Use shards.tolerant=true to return documents that are available in the shards that are still alive. Typically people setup ZooKeeper outside of Solr so that solr nodes can be added/removed easily independent of ZooKeeper plus it isolates ZK from large GC pauses due to Solr's garbage. See http://wiki.apache.org/hadoop/ZooKeeper/FAQ#A7 Depending on you use-case, 2-3 replicas might be okay. We don't have enough information to answer that question. On Sat, Jun 22, 2013 at 10:40 PM, Utkarsh Sengar utkarsh2...@gmail.com wrote: Thanks Anshum. Sure, creating a replica will make it failure resistant, but death of one shard should not make the whole cluster unusable. 1/3rd of the keys hosted in the killed shard should be unavailable but others should be available. Right? Also, any suggestions on the recommended size of zk and solr cluster size and configuration? Example: 3 shards with 3 replicas and 3 zk processes running on the same solr mode sounds acceptable? (Total of 6 VMs) Thanks, -Utkarsh On Jun 22, 2013, at 4:20 AM, Anshum Gupta ans...@anshumgupta.net wrote: You need to have at least 1 replica from each shard for the SolrCloud setup to work for you. When you kill 1 shard, you essentially are taking away 1/3 of the range of shard key. On Sat, Jun 22, 2013 at 4:31 PM, Utkarsh Sengar utkarsh2...@gmail.com wrote: Hello, I am testing a 3 node solrcloud cluster with 3 shards. 3 zk nodes are running in a different process in the same machines. 
I wanted to know the recommended size of a solrcloud cluster (min zk nodes?) This is the SolrCloud dump: https://gist.github.com/utkarsh2012/5840455 And, I am not sure if I am hitting this frustrating bug or this is just a configuration error from my side. When I kill any *one* of the nodes, the whole cluster stops responding and I get this request when I query any one of the two alive nodes. { responseHeader:{ status:503, QTime:2, params:{ indent:true, q:*:*, wt:json}}, error:{ msg:no servers hosting shard: , code:503}} I see this exception: 952399 [qtp516992923-74] ERROR org.apache.solr.core.SolrCore – org.apache.solr.common.SolrException: no servers hosting shard: at org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:149) at org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:119) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918) at java.lang.Thread.run(Thread.java:662) -- Thanks, -Utkarsh -- Anshum Gupta http://www.anshumgupta.net -- Regards, Shalin Shekhar Mangar. -- Thanks, -Utkarsh
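As a sketch of what the shards.tolerant=true suggestion looks like in practice, the snippet below builds a select URL carrying the parameter so a SolrCloud query returns results from whichever shards are still alive instead of failing outright. The host and the collection name are placeholders for your own setup.

```python
from urllib.parse import urlencode

def build_query_url(base_url, q, tolerant=True):
    """Build a Solr select URL; shards.tolerant=true asks SolrCloud to
    answer from the shards that are still alive instead of returning a
    503 when any shard has no live server."""
    params = {"q": q, "wt": "json"}
    if tolerant:
        params["shards.tolerant"] = "true"
    return base_url.rstrip("/") + "/select?" + urlencode(params)

# 'collection1' and localhost:8983 are placeholders for your own cluster.
url = build_query_url("http://localhost:8983/solr/collection1", "*:*")
```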
solrcloud 4.3.1 - stability and failure scenario questions
Hello, I am testing a 3 node solrcloud cluster with 3 shards. 3 zk nodes are running as separate processes on the same machines. I wanted to know the recommended size of a solrcloud cluster (min zk nodes?). This is the SolrCloud dump: https://gist.github.com/utkarsh2012/5840455 And I am not sure if I am hitting this frustrating bug or this is just a configuration error on my side. When I kill any *one* of the nodes, the whole cluster stops responding and I get this response when I query either of the two alive nodes. { responseHeader:{ status:503, QTime:2, params:{ indent:true, q:*:*, wt:json}}, error:{ msg:no servers hosting shard: , code:503}} I see this exception: 952399 [qtp516992923-74] ERROR org.apache.solr.core.SolrCore – org.apache.solr.common.SolrException: no servers hosting shard: at org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:149) at org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:119) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918) at java.lang.Thread.run(Thread.java:662) -- Thanks, -Utkarsh
Re: solrcloud 4.3.1 - stability and failure scenario questions
Just to be clear: when I say I killed a node, I mean I only killed the solr process on that node. zk on all 3 nodes was still running. Thanks, -Utkarsh On Sat, Jun 22, 2013 at 4:01 AM, Utkarsh Sengar utkarsh2...@gmail.com wrote: Hello, I am testing a 3 node solrcloud cluster with 3 shards. 3 zk nodes are running in a different process in the same machines. I wanted to know the recommended size of a solrcloud cluster (min zk nodes?) This is the SolrCloud dump: https://gist.github.com/utkarsh2012/5840455 And, I am not sure if I am hitting this frustrating bug or this is just a configuration error from my side. When I kill any *one* of the nodes, the whole cluster stops responding and I get this response when I query any one of the two alive nodes. { responseHeader:{ status:503, QTime:2, params:{ indent:true, q:*:*, wt:json}}, error:{ msg:no servers hosting shard: , code:503}} I see this exception: 952399 [qtp516992923-74] ERROR org.apache.solr.core.SolrCore – org.apache.solr.common.SolrException: no servers hosting shard: at org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:149) at org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:119) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918) at java.lang.Thread.run(Thread.java:662) -- Thanks, -Utkarsh -- Thanks, -Utkarsh
Re: solrcloud 4.3.1 - stability and failure scenario questions
Thanks Anshum. Sure, creating a replica will make it failure resistant, but the death of one shard should not make the whole cluster unusable. 1/3rd of the keys hosted on the killed shard should be unavailable but the others should be available. Right? Also, any suggestions on the recommended zk and solr cluster size and configuration? Example: 3 shards with 3 replicas and 3 zk processes running on the same solr nodes sounds acceptable? (Total of 6 VMs) Thanks, -Utkarsh On Jun 22, 2013, at 4:20 AM, Anshum Gupta ans...@anshumgupta.net wrote: You need to have at least 1 replica from each shard for the SolrCloud setup to work for you. When you kill 1 shard, you essentially are taking away 1/3 of the range of the shard key. On Sat, Jun 22, 2013 at 4:31 PM, Utkarsh Sengar utkarsh2...@gmail.com wrote: Hello, I am testing a 3 node solrcloud cluster with 3 shards. 3 zk nodes are running in a different process in the same machines. I wanted to know the recommended size of a solrcloud cluster (min zk nodes?) This is the SolrCloud dump: https://gist.github.com/utkarsh2012/5840455 And, I am not sure if I am hitting this frustrating bug or this is just a configuration error from my side. When I kill any *one* of the nodes, the whole cluster stops responding and I get this response when I query any one of the two alive nodes. 
{ responseHeader:{ status:503, QTime:2, params:{ indent:true, q:*:*, wt:json}}, error:{ msg:no servers hosting shard: , code:503}} I see this exception: 952399 [qtp516992923-74] ERROR org.apache.solr.core.SolrCore – org.apache.solr.common.SolrException: no servers hosting shard: at org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:149) at org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:119) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918) at java.lang.Thread.run(Thread.java:662) -- Thanks, -Utkarsh -- Anshum Gupta http://www.anshumgupta.net
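Anshum's point about losing 1/3 of the shard key range can be illustrated with a toy sketch. Solr actually routes documents by MurmurHash3 over per-shard hash ranges; the simple polynomial hash below is only a stand-in, used to show why one dead shard with no replica makes roughly a third of the keys unreachable.

```python
def hash_str(s):
    # deterministic toy hash; real Solr uses MurmurHash3 over hash ranges
    h = 0
    for ch in s:
        h = (h * 31 + ord(ch)) & 0xFFFFFFFF
    return h

def shard_for(doc_id, num_shards=3):
    # simplified routing: each id maps to one of num_shards buckets
    return hash_str(doc_id) % num_shards

# with shard 0 down and no replica, every id that routes there is lost
down_shard = 0
ids = [f"doc{i}" for i in range(900)]
unreachable = sum(1 for i in ids if shard_for(i) == down_shard)
fraction = unreachable / len(ids)   # roughly 1/num_shards
```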
Re: Solr4 cluster setup for high performance reads
Thanks for the update guys, I am working on the suggestions you shared. One last question about the solrcloud setup. What is the recommended cluster size for solrcloud? I have 3 nodes of solr and 3 nodes of ZK (running on the same machines, but in a different JVM). And after 2-3 days I notice that zk reports one node as down, but everything is fine on that machine. And then I get this error when I query any node: no servers hosting shard: solr. This definitely has to do with my setup; even if one node goes down, the whole cluster should not start barfing. Suggestions? Thanks, -Utkarsh On Thu, Jun 13, 2013 at 7:28 PM, Shawn Heisey s...@elyograg.org wrote: On 6/13/2013 7:51 PM, Utkarsh Sengar wrote: Sure, I will reduce the count and see how it goes. The problem I have is, after such a change, I need to reindex everything again, which again is slow and takes time (40-60 hours). There should be no need to reindex after changing most things in solrconfig.xml. Changing cache sizes does not require it. Most of the time, reindexing is only required after changing schema.xml, but there are a few changes you can make to schema that don't require it. Some queries are really bad, like this one: http://explain.solr.pl/explains/bzy034qi How can this be improved? I understand that there is something horribly wrong here, but not sure what points to look at (Been using solr for the last 20 days). You are using a *LOT* of query clauses against your allText field in that boost query. I assume that allText is your largest field. I'm not really sure, but based on what we're seeing here, I bet that a bq parameter doesn't get cached. With some additional RAM available, this might not be such a big problem. The query is simple, although it uses edismax. I have shared an explain query above. Other than the query, this is my performance stats: iostat -m 5 result: http://apaste.info/hjNV top result: http://apaste.info/jlHN You've got a pretty well-sustained iowait around ten percent. 
You are I/O bound. You need more total RAM. With indexing only happening once a day, that doesn't sound like it's a factor. If you are also having problems with garbage collection because your heap is a little bit too small, that makes all the other problems worse. For the initial training, I will hit solr 1.3M times and request 2000 documents in each query. By the current speed (just one machine), it will take me ~20 days to do the initial training. This is really mystifying. There is no need to send a million plus queries to warm your index. A few dozen or a few hundred queries should be all you need, and you don't need 2000 docs returned per query. Go with ten rows, or maybe a few dozen rows at most. Because you're using SSD, I'm not sure you need warming queries at all. Thanks, Shawn -- Thanks, -Utkarsh
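Shawn's warming advice above (a few dozen representative queries with small row counts, not 1.3M requests for 2000 docs each) can be sketched as a small warm-up loop. The host, collection name, and sample terms below are placeholders; in practice the terms should be sampled from real production query logs.

```python
from urllib.parse import urlencode

# placeholder terms -- sample these from production query logs instead
WARMING_TERMS = ["ipod", "ipad 64gb", "retina", "apple", "music player"]

def warming_urls(base_url, terms, rows=10):
    """Build a small set of warm-up request URLs; a few dozen queries
    with ~10 rows each is enough to populate Solr's caches."""
    urls = []
    for term in terms:
        params = {"q": term, "rows": rows, "wt": "json"}
        urls.append(base_url.rstrip("/") + "/select?" + urlencode(params))
    return urls

urls = warming_urls("http://localhost:8983/solr/collection1", WARMING_TERMS)
```

Each URL would then be fetched once at startup (e.g. with urllib); the point is the small count and small `rows`, not the exact terms.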
Re: Running solr cloud
Looks like zk does not contain the configuration called: collection1. You can use zkCli.sh to see what's inside the configs zk node. You can manually push a config via zkCli's upconfig (not very sure how it works). Try adding this arg: -Dbootstrap_conf=true in place of -Dbootstrap_confdir=./solr/collection1/conf and start solr. This might push the config to zk. bootstrap_conf uploads the index configuration files for all the cores to zk. Thanks, -Utkarsh On Tue, Jun 18, 2013 at 4:49 AM, Daniel Mosesson daniel.moses...@ipreo.com wrote: I cannot seem to get the default cloud setup to work properly. What I did: Downloaded the binaries, extracted. Made the pwd example Ran: java -Dbootstrap_confdir=./solr/collection1/conf -Dcollection.configName=myconf -DzkRun -DnumShards=2 -jar start.jar And got the error message: Caused by: org.apache.solr.common.cloud.ZooKeeperException: Specified config does not exist in ZooKeeper:collection1 Which caused follow up messages, etc. What am I doing wrong here? Windows 7 pro -- Thanks, -Utkarsh
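For reference, a sketch of building the zkCli upconfig invocation mentioned above. The script path and the zkhost value are assumptions that depend on your installation (localhost:9983 is the embedded-zk default when Solr itself listens on 8983); adjust both before running.

```python
# Hedged sketch: construct the zkcli.sh command that pushes a config
# directory into ZooKeeper under a named configset. Both the script
# location and the zkhost are installation-dependent placeholders.
def upconfig_cmd(confdir, confname, zkhost="localhost:9983"):
    return [
        "cloud-scripts/zkcli.sh",
        "-zkhost", zkhost,
        "-cmd", "upconfig",
        "-confdir", confdir,
        "-confname", confname,
    ]

cmd = upconfig_cmd("./solr/collection1/conf", "myconf")
# the list is ready to pass to subprocess.run(cmd) on the Solr host
```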
Solr4 cluster setup for high performance reads
Hello, I am evaluating solr for indexing a product catalog of about 45M items. The catalog mainly contains title and description, which take most of the space (other attributes are brand, category, price, etc.). The data is stored in cassandra and I am using datastax's solr (DSE 3.0.2) which handles incremental updates. The column family I am indexing is about 50GB in size and solr.data's size is about 15GB for now. *Points of interest in solr config/schema:* 1. schema.xml has a copyField called allText which merges title and description. 2. solrconfig has the following config: <directoryFactory name="DirectoryFactory" class="${solr.directoryFactory:solr.MMapDirectoryFactory}"/> <indexConfig> <filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="0"/> <queryResultCache class="solr.LRUCache" size="100" initialSize="100" autowarmCount="10"/> <documentCache class="solr.LRUCache" size="5000" initialSize="500" autowarmCount="0"/> *Relevancy:* Now, the default text matching does not suit our search needs, so I have implemented a wrapper around the Solr API which adds boost queries to the default solr query. For example: Original query: ipod Final query: allText:ipod^1000, allText:apple^1000, allText:music^950, etc. So as you can see, I construct a new query based on related keywords and assign a score to those keywords based on relevance. This approach looks good and the results look relevant. But I am having issues with *Solr performance*. *Problems:* The initial training pulls 2000 documents from solr to find the most probable matches and calculates a score (PMI/NPMI). This query is extremely slow. Also, a regular query takes 3-4 seconds. I am currently running solr on just one VM with 12GB RAM and 8GB of heap space allocated to solr; the block storage is an SSD. What is the suggested setup for this usecase? My guess is that setting up 4 solr nodes will help, but what is the suggested RAM/heap for this kind of data? 
And what is the recommended configuration (solrconfig.xml) where I *need to speed up reads*? Also, is there a way I can debug what is going on inside solr? As you can see, my queries are not that complex, so I don't need to debug my queries but rather debug solr itself and see the troubled pieces in it. Also, I am new to solr, so is there anything else I missed to share which would help debug the problem? -- Thanks, -Utkarsh
Re: Solr4 cluster setup for high performance reads
Otis, Shawn, thanks for the reply. You can find my schema.xml and solrconfig.xml here: https://gist.github.com/utkarsh2012/5778811 To answer your questions: Those are massive caches. Rethink their size. More specifically, plug in some monitoring tool and see what you are getting out of them. Just today I looked at one of Sematext's clients' caches - 200K entries, 0 evictions == needless waste of JVM heap. So lower those numbers and increase only if you are getting evictions. Sure, I will reduce the count and see how it goes. The problem I have is, after such a change, I need to reindex everything again, which again is slow and takes time (40-60 hours). debugQuery=true output will tell you something about timings, etc. Some queries are really bad, like this one: http://explain.solr.pl/explains/bzy034qi How can this be improved? I understand that there is something horribly wrong here, but not sure what points to look at (Been using solr for the last 20 days). consider edismax and qf param instead of that field copy stuff, info on zee Wiki Related back to my last point, how can such a query be improved? Maybe using qf? back to monitoring - what is your bottleneck? The query looks simplistic. Is it IO? Memory? CPU? Share some graphs and let's look. The query is simple, although it uses edismax. I have shared an explain query above. Other than the query, this is my performance stats: iostat -m 5 result: http://apaste.info/hjNV top result: http://apaste.info/jlHN How often do you index and commit, and how many documents each time? This is done by datastax's dse. I assume it is configurable via solrconfig.xml. The updates to cassandra are daily but not all the documents are updated. What is your query rate? For the initial training, I will hit solr 1.3M times and request 2000 documents in each query. At the current speed (just one machine), it will take me ~20 days to do the initial training. 
Thanks, -Utkarsh On Thu, Jun 13, 2013 at 6:25 PM, Shawn Heisey s...@elyograg.org wrote: On 6/13/2013 5:53 PM, Utkarsh Sengar wrote: *Problems:* The initial training pulls 2000 documents from solr to find the most probable matches and calculates score (PMI/NPMI). This query is extremely slow. Also, a regular query also takes 3-4 seconds. I am running solr currently on just one VM with 12GB RAM and 8GB of Heap space is allocated to solr, the block storage is an SSD. Normally, I would say that you should have as much RAM as your heap size plus your index size, so with your 8GB heap and 15GB index, you'd want 24GB total RAM. With SSD, that requirement should not be quite so high, but you might want to try 16GB or more. Solr works much better on bare metal than it does on virtual machines. I suspect that what might be happening here is that your heap is just a little bit too small for the combination of your index size (both document count and disk space), how you use Solr, and your config, so your JVM is constantly doing garbage collections. What is the suggested setup for this usecase? My guess is, setting up 4 solr nodes will help, but what is the suggested RAM/heap for this kind of data? And what are the recommended configuration (solrconfig.xml) where I *need to speed up reads*? http://wiki.apache.org/solr/SolrPerformanceProblems http://wiki.apache.org/solr/SolrPerformanceFactors Heap size requirements are hard to predict. I can tell you that it's highly unlikely that you will need cache sizes as large as you have configured. Start with the defaults and only increase them (by small amounts) if your hitratio is not high enough. If increasing the size doesn't increase hitratio, there may be another problem. Also, is there a way I can debug what is going on with solr internally? As you can see, my queries are not that complex, so I don't need to debug my queries but just debug solr and see the troubled pieces in it. 
If you add debugQuery=true to your URL, Solr will give you a lot of extra information in the response. One of the things that would be important here is seeing how much time is spent in various components. Also, I am new to solr, so is there anything else I missed to share which would help debug the problem? Sharing the entire config, schema, examples of all fields from your indexed documents, and examples of your full queries would help. http://apaste.info How often do you index and commit, and how many documents each time? What is your query rate? Thanks, Shawn -- Thanks, -Utkarsh
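To act on the debugQuery=true advice programmatically, the timing section of the response can be ranked by component to find where time goes. A sketch follows; the nested dict mirrors the shape of Solr's debug/timing output, but the component names and numbers here are made up for illustration.

```python
# made-up response fragment shaped like Solr's "debug"/"timing" section
sample = {
    "debug": {
        "timing": {
            "time": 320.0,
            "process": {
                "time": 300.0,
                "query": {"time": 250.0},
                "facet": {"time": 40.0},
                "highlight": {"time": 10.0},
            },
        }
    }
}

def slowest_components(resp, top=3):
    """Rank the per-component timings in a debugQuery response,
    slowest first, skipping the aggregate 'time' entry."""
    process = resp["debug"]["timing"]["process"]
    comps = [(name, v["time"]) for name, v in process.items()
             if isinstance(v, dict)]
    return sorted(comps, key=lambda kv: kv[1], reverse=True)[:top]

worst = slowest_components(sample)
```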
Not able to see newly added copyField in the response (indexing is 80% complete)
Hello, I updated my schema to use a copyField and have triggered a reindex; 80% of the reindexing is complete. But when I query the data, I don't see myNewCopyFieldName being returned with the documents. Is there something wrong with my schema or do I need to wait for the indexing to complete to see the new copyField? This is my schema (redacted the actual names):
<fields>
  <field name="key" type="string" indexed="true" stored="true"/>
  <field name="1" type="string" indexed="true" stored="true"/>
  <field name="2" type="string" indexed="true" stored="true"/>
  <field name="3" type="string" indexed="false" stored="true"/>
  <field name="4" type="string" indexed="true" stored="true"/>
  <field name="5" type="string" indexed="true" stored="true"/>
  <field name="6" type="custom_type" indexed="true" stored="true"/>
  <field name="7" type="text_general" indexed="true" stored="true"/>
  <field name="8" type="string" indexed="true" stored="true"/>
  <field name="9" type="text_general" indexed="true" stored="true"/>
  <field name="10" type="text_general" indexed="true" stored="true"/>
  <field name="11" type="string" indexed="true" stored="true"/>
  <field name="12" type="string" indexed="true" stored="true"/>
  <field name="13" type="string" indexed="true" stored="true"/>
  <field name="myNewCopyFieldName" type="text_general" indexed="true" stored="true" multiValued="true"/>
</fields>
<defaultSearchField>4</defaultSearchField>
<uniqueKey>key</uniqueKey>
<copyField source="1" dest="myNewCopyFieldName"/>
<copyField source="2" dest="myNewCopyFieldName"/>
<copyField source="3" dest="myNewCopyFieldName"/>
<copyField source="4" dest="myNewCopyFieldName"/>
<copyField source="6" dest="myNewCopyFieldName"/>
Where:
<fieldType name="custom_type" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
and
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
-- Thanks, -Utkarsh
Re: Not able to see newly added copyField in the response (indexing is 80% complete)
Thanks Shawn. Find my answers below. On Thu, May 2, 2013 at 2:34 PM, Shawn Heisey s...@elyograg.org wrote: On 5/2/2013 3:13 PM, Utkarsh Sengar wrote: Hello, I updated my schema to use a copyField and have triggered a reindex, 80% of the reindexing is complete. Although when I query the data, I don't see myNewCopyFieldName being returned with the documents. Is there something wrong with my schema or do I need to wait for the indexing to complete to see the new copyField? After making sure that you restarted Solr (or reloaded the core) after changing your schema, there are two things to mention: Yes, I restarted solr and also did a reload. 1) Using stored=true with a copyField doesn't make any sense, because you already have the individual values stored with the source fields. I haven't done any testing, but Solr might ignore stored=true on copyField fields. Ah, I see, I didn't know about this. If it's not stored then it makes sense. Need to verify this though. 2) If I'm wrong about how Solr behaves with stored=true on a copyField, then a soft commit (4.x and later) or a hard commit with openSearcher=true would be required to see changes from indexing. Have you committed your updates yet? I am using Solr 4.x and soft commit is enabled. So I assume the commit happened. I see this in my solr admin: - lastModified: less than a minute ago - version: 453962 - numDocs: 26413743 - maxDoc: 28322675 - current: - indexing: yes So, lastModified = less than a minute ago means the change was committed, right? Thanks, Shawn -- Thanks, -Utkarsh
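Rather than inferring visibility from lastModified, an explicit (soft) commit removes any doubt about autocommit timing before re-querying. A minimal sketch of building that update URL; the host and core name are placeholders.

```python
from urllib.parse import urlencode

def commit_url(base_url, soft=True):
    """Build a Solr /update URL that issues an explicit commit; a soft
    commit makes recent updates visible to searchers without the full
    cost of a hard commit (supported in Solr 4.x)."""
    params = {"commit": "true", "wt": "json"}
    if soft:
        params["softCommit"] = "true"
    return base_url.rstrip("/") + "/update?" + urlencode(params)

url = commit_url("http://localhost:8983/solr/collection1")
```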
How to recover from Error opening new searcher when machine crashed while indexing
Solr 4.0 was indexing data and the machine crashed. Any suggestions on how to recover my index since I don't want to delete my data directory? When I try to start it again, I get this error: ERROR 12:01:46,493 Failed to load Solr core: xyz.index1 ERROR 12:01:46,493 Cause: ERROR 12:01:46,494 Error opening new searcher org.apache.solr.common.SolrException: Error opening new searcher at org.apache.solr.core.SolrCore.init(SolrCore.java:701) at org.apache.solr.core.SolrCore.init(SolrCore.java:564) at org.apache.solr.core.CassandraCoreContainer.load(CassandraCoreContainer.java:213) at com.datastax.bdp.plugin.SolrCorePlugin.activateImpl(SolrCorePlugin.java:66) at com.datastax.bdp.plugin.PluginManager$PluginInitializer.call(PluginManager.java:161) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918) at java.lang.Thread.run(Thread.java:662) Caused by: org.apache.solr.common.SolrException: Error opening new searcher at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1290) at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1402) at org.apache.solr.core.SolrCore.init(SolrCore.java:675) ... 
9 more Caused by: org.apache.lucene.index.IndexNotFoundException: no segments* file found in NRTCachingDirectory(org.apache.lucene.store.NIOFSDirectory@/media/SSD/data/solr.data/rlcatalogks.prodinfo/index lockFactory=org.apache.lucene.store.NativeFSLockFactory@d7581b; maxCacheMB=48.0 maxMergeSizeMB=4.0): files: [_73ne_nrm.cfs, _73ng_Lucene40_0.tip, _73nh_nrm.cfs, _73ng_Lucene40_0.tim, _73nf.fnm, _73n5_Lucene40_0.frq, _73ne.fdt, _73nh.fdx, _73ne_nrm.cfe, _73ne.fdx, _73ne_Lucene40_0.tim, _73ne.si, _73ni.fnm, _73nh_Lucene40_0.prx, _73ni.fdt, _73n5.si, _73ne_Lucene40_0.tip, _73nf_Lucene40_0.frq, _73nf_Lucene40_0.prx, _73nf_nrm.cfe, _73ne_Lucene40_0.frq, _73ng_Lucene40_0.prx, _73nf_Lucene40_0.tip, _73n5.fdx, _73ng_Lucene40_0.frq, _73ng.fnm, _73ni.fdx, _73n5.fnm, _73nf_Lucene40_0.tim, _73ni.si, _73n5.fdt, _73nf_nrm.cfs, _73nh_nrm.cfe, _73ni_Lucene40_0.frq, _73ng.fdx, _73ne_Lucene40_0.prx, _73nh.fnm, _73nh_Lucene40_0.tip, _73nh_Lucene40_0.tim, _73nh.si, _73n5_Lucene40_0.tip, _73ni_Lucene40_0.prx, _73n5_Lucene40_0.tim, _73nf.si, _73ng_nrm.cfe, _73n5_Lucene40_0.prx, _392j_42f.del, _73ng.fdt, _73ng.si, _73ni_nrm.cfe, _73n5_nrm.cfe, _73ni_nrm.cfs, _73nf.fdx, _73ni_Lucene40_0.tip, _73n5_nrm.cfs, _73ni_Lucene40_0.tim, _73nf.fdt, _73ne.fnm, _73nh.fdt, _73nh_Lucene40_0.frq, _73ng_nrm.cfs] -- Thanks, -Utkarsh
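The root cause in the exception is the missing segments_N file: Lucene cannot open an index directory without one, and the long file listing in the trace contains none. A quick sketch that confirms that diagnosis, either against the directory itself or against a listing like the one above, before resorting to repair tools or restoring from a backup:

```python
import glob
import os

def has_segments_file(index_dir):
    # Lucene cannot open an index directory without a segments_N file
    return bool(glob.glob(os.path.join(index_dir, "segments*")))

def segments_in(filenames):
    # same check against a plain file listing (e.g. copied from a log)
    return [f for f in filenames if f.startswith("segments")]

# a few entries from the exception's file list -- no segments_N present
listed = ["_73ne_nrm.cfs", "_73ng_Lucene40_0.tip", "_392j_42f.del"]
missing = not segments_in(listed)
```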
Solr's physical memory and JVM memory
Hello, I have setup a solr4 instance (just one node) and I see this memory pattern: [image: Inline image 1] Physical memory is nearly full and JVM memory is ok. I have ~40M documents (where 1 document=1KB) indexed and in the production env I am planning to setup 2 solr cloud nodes. So I have 2 questions: 1. What is the recommended memory for those 2 nodes? 2. I am not sure what Physical memory means in the context of solr. My understanding is that physical memory is the actual RAM in my machine, and 'top' says that I have used just 4.6GB of 23.7GB. Why is Solr admin reporting that I have used 22.84GB out of 23.7GB? -- Thanks, -Utkarsh
Re: Solr's physical memory and JVM memory
My bad about the attachment, there you go: http://i.imgur.com/XKtw32K.png Thanks for the detailed answer, that helps a lot. Thanks, -Utkarsh On Tue, Apr 16, 2013 at 9:48 PM, Shawn Heisey s...@elyograg.org wrote: On 4/16/2013 10:01 PM, Otis Gospodnetic wrote: Not sure if it's just me, but I'm not seeing your inlined image. It's not just you. On Tue, Apr 16, 2013 at 7:52 PM, Utkarsh Sengar utkarsh2...@gmail.com wrote: So I have 2 questions: 1. What is the recommended memory for those 2 nodes? 2. I am not sure what Physical memory means in the context of solr. My understanding is that physical memory is the actual RAM in my machine, and 'top' says that I have used just 4.6GB of 23.7GB. Why is Solr admin reporting that I have used 22.84GB out of 23.7GB? Attachments don't work well on mailing lists. We can't see your image. Best to put the file on the Internet somewhere (like dropbox or another file sharing site) and include the public link. After you get an answer to your question, you can remove the file. Answers to your two questions: 1) A good rule of thumb is that you want to have enough RAM to equal or exceed the sum of two things: The amount of memory that your programs take (including the max heap setting you give to Solr), and the size of your Solr index(es) stored on that server. You may be able to get away with less memory than this, but you do want to have enough memory for a sizable chunk of your on-disk index. Example: If Solr is the only major program running on the machine, you give Solr a 4GB heap, and your index is 20GB, an ideal setup would have at least 24GB of RAM. 2) You are seeing the result of the way that all modern operating systems work. The extra memory that is not being currently used by programs is borrowed by the operating system to cache data from your disk into RAM, so that frequently accessed data will not have to be read from the disk. Reading from main memory is many orders of magnitude faster than reading from disk. 
The memory that is being used for the disk cache (in top it shows up as 'cached') is instantly made available to programs that request it. http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html 2a) Operating systems like Linux tell you the truth about the OS using excess memory for the disk cache. With the most basic information tools, Windows tells you a semi-lie and will report that memory as free. The newest versions of Windows seem to have gotten the hint and do include tools that will give you the true picture. 2b) For good performance, Solr is extremely reliant on having a big enough disk cache so that reads from disk are rare. This is the case for most other programs too, actually. Thanks, Shawn -- Thanks, -Utkarsh
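Shawn's rule of thumb from this thread is simple enough to write down: ideal RAM is at least the program heaps (including Solr's max heap) plus the on-disk index size, so the OS page cache can hold the whole index. A sketch, using the 4GB heap / 20GB index example from the answer above:

```python
def recommended_ram_gb(solr_heap_gb, index_size_gb, other_programs_gb=0):
    """Rule of thumb from the thread: RAM >= heap + index size (+ any
    other major programs on the box), so the page cache covers the index."""
    return solr_heap_gb + index_size_gb + other_programs_gb

# the example from the thread: 4GB heap + 20GB index -> at least 24GB RAM
ideal = recommended_ram_gb(4, 20)
```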
Getting started with solr 4.2 and cassandra
Hello, I am evaluating solr 4.2 and ElasticSearch (I am new to both) for a search API, where the data sits in cassandra. Getting started with elasticsearch is pretty straightforward and I was able to write an ES river (http://www.elasticsearch.org/guide/reference/river/) which pulls data from cassandra and indexes it in ES within a day. Now, I am trying to implement something similar with solr and compare both of them. Getting started with solr/example (http://lucene.apache.org/solr/4_2_0/tutorial.html) was pretty easy and an example solr instance works. But the example folder contains a whole bunch of stuff which I am not sure I need: http://pastebin.com/Gv660mRT . I am sure I don't need 53 directories and 527 files. So my questions are: 1. How can I get a bare-bones solr app up and running with a minimum set of configuration? (I will build over it when needed, taking reference from /example) 2. What is the best practice for running solr in production? Is an approach like this (jetty+nginx) recommended: http://sacharya.com/nginx-proxy-to-jetty-for-java-apps/ ? Once I am done setting up a simple solr instance: 3. What is the general practice to import data to solr? For now, I am writing a python script which will read data in bulk from cassandra and throw it to solr. -- Thanks, -Utkarsh
Re: Getting started with solr 4.2 and cassandra
Thanks for the reply. So DSE is one of the options and I am looking into that too. Although, before diving into solr+cassandra integration (which comes out of the box with DSE), I am just trying to setup a solr instance on my local machine without the bloat the example solr instance has to offer. Any suggestions about that? Thanks, -Utkarsh On Mon, Apr 1, 2013 at 4:00 PM, Jack Krupansky j...@basetechnology.com wrote: You might want to check out DataStax Enterprise, which actually integrates Cassandra and Solr. You keep the data in Cassandra, but as data is added, updated and deleted, the Solr index is automatically updated in parallel. You can add and update data and query using either the Cassandra API or the Solr API. See: http://www.datastax.com/what-we-offer/products-services/datastax-enterprise -- Jack Krupansky -Original Message- From: Utkarsh Sengar Sent: Monday, April 01, 2013 6:34 PM To: solr-user@lucene.apache.org Subject: Getting started with solr 4.2 and cassandra Hello, I am evaluating solr 4.2 and ElasticSearch (I am new to both) for a search API, where the data sits in cassandra. Getting started with elasticsearch is pretty straightforward and I was able to write an ES river (http://www.elasticsearch.org/guide/reference/river/) which pulls data from cassandra and indexes it in ES within a day. Now, I am trying to implement something similar with solr and compare both of them. Getting started with solr/example (http://lucene.apache.org/solr/4_2_0/tutorial.html) was pretty easy and an example solr instance works. But the example folder contains a whole bunch of stuff which I am not sure I need: http://pastebin.com/Gv660mRT . I am sure I don't need 53 directories and 527 files. So my questions are: 1. 
How can I get a bare-bones solr app up and running with a minimum set of configuration? (I will build over it when needed, taking reference from /example) 2. What is the best practice for running solr in production? Is an approach like this (jetty+nginx) recommended: http://sacharya.com/nginx-proxy-to-jetty-for-java-apps/ ? Once I am done setting up a simple solr instance: 3. What is the general practice to import data to solr? For now, I am writing a python script which will read data in bulk from cassandra and throw it to solr. -- Thanks, -Utkarsh -- Thanks, -Utkarsh
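For question 3, a sketch of the kind of bulk-import script described above: batch the documents read from cassandra and build JSON update requests for Solr's /update handler. The host and collection name are placeholders, the field names are illustrative, and the actual network send (urlopen) is left out so the sketch stays side-effect free.

```python
import json
from urllib.request import Request

def batches(docs, size=500):
    """Yield fixed-size slices so large imports go out in chunks."""
    for i in range(0, len(docs), size):
        yield docs[i:i + size]

def update_request(base_url, batch):
    """Build a POST to Solr's update handler with a JSON array of docs;
    commit=false defers commits to an explicit commit at the end."""
    body = json.dumps(batch).encode("utf-8")
    return Request(
        base_url.rstrip("/") + "/update?commit=false",
        data=body,
        headers={"Content-Type": "application/json"},
    )

# illustrative documents standing in for rows read from cassandra
docs = [{"key": str(i), "title": f"product {i}"} for i in range(1200)]
reqs = [update_request("http://localhost:8983/solr/collection1", b)
        for b in batches(docs)]
# each request would then be sent with urllib.request.urlopen(req),
# followed by one final explicit commit
```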