Re: Delete from Solr Cloud 4.0 index..
bq: Will docValues help with memory usage?

I'm still a bit fuzzy on all the ramifications of DocValues, but I somewhat doubt they'll result in index size savings. They _really_ help with loading the values for a field, but the end result is still the values in memory. People who know what they're talking about, _please_ correct this if I'm off base.

Sure, stored field compression will help with disk space, no question. I was mostly cautioning against extrapolating from disk size to memory requirements without taking this into account.

Best,
Erick

On Tue, May 7, 2013 at 6:46 AM, Annette Newton wrote:
> Hi Erick,
>
> Thanks for the tip.
>
> Will docValues help with memory usage? It seemed a bit complicated to set up.
>
> The index size saving was nice because that means that potentially I could use smaller provisioned IOP volumes which cost less...
>
> Thanks.
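For reference, enabling DocValues in Solr 4.2 and later is a per-field change in schema.xml. The sketch below is illustrative only and not taken from Annette's actual schema; the field name and type are hypothetical:

```xml
<!-- Hypothetical facet field. docValues="true" stores the field's
     values in column-oriented structures on disk, at the cost of
     some extra index size, as discussed in this thread. -->
<field name="customerRef" type="string" indexed="true" stored="false"
       docValues="true"/>
```

For a field used only for faceting and sorting, stored="false" keeps the stored-data files (*.fdt/*.fdx) from growing as well.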
Re: Delete from Solr Cloud 4.0 index..
Hi Erick,

Thanks for the tip.

Will docValues help with memory usage? It seemed a bit complicated to set up.

The index size saving was nice because that means that potentially I could use smaller provisioned IOP volumes which cost less...

Thanks.

On 3 May 2013 18:27, Erick Erickson wrote:
Re: Delete from Solr Cloud 4.0 index..
Anette:

Be a little careful with the index size savings; they really don't mean much for _searching_. The stored field compression significantly reduces the size on disk, but only for the stored data, which is only accessed when returning the top N docs. In terms of how many docs you can fit on your hardware, it's pretty irrelevant.

The *.fdt and *.fdx files in your index directory contain the stored data, so when looking at the effects of various options (including compression), you can pretty much ignore these files.

FWIW,
Erick

On Fri, May 3, 2013 at 2:03 AM, Annette Newton wrote:
Re: Delete from Solr Cloud 4.0 index..
On 5/3/2013 3:22 AM, Annette Newton wrote:
> One question Shawn - did you ever get any costings around Zing? Did you trial it?

I never did do a trial. I asked them for a cost and they didn't have an immediate answer; they wanted to do a phone call and get a lot of information about my setup. The price apparently has a lot of variance based on the specific environment, so I didn't pursue it, figuring that the cost would be higher than my superiors are willing to pay.

The only information I could find about the cost of Zing was a very recent Register article that had this to say:

"Azul is similarly cagey about what a supported version of the Zing JVM costs, and only says that Zing costs around what a supported version of an Oracle, IBM, or Red Hat JVM will run enterprises and that it has an annual subscription model for Zing pricing. You can't easily get pricing for Oracle, IBM, or Red Hat JVMs, of course, so the comparison is accurate but perfectly useless."

http://www.theregister.co.uk/2013/04/08/azul_systems_zing_lmax_exchange/

Thanks,
Shawn
Re: Delete from Solr Cloud 4.0 index..
One question Shawn - did you ever get any costings around Zing? Did you trial it?

Thanks.

On 3 May 2013 10:03, Annette Newton wrote:
Re: Delete from Solr Cloud 4.0 index..
Thanks Shawn.

I have played around with Soft Commits before and didn't seem to have any improvement, but with the current load testing I am doing I will give it another go.

I have researched docValues and came across the fact that it would increase the index size. With the upgrade to 4.2.1 the index size has reduced by approx 33%, which is pleasing, and I don't really want to lose that saving.

We do use the facet.enum method - which works really well, but I will verify that we are using that in every instance; we have numerous developers working on the product and maybe one or two have slipped through.

Right from the first I upped the zkClientTimeout to 30 as I wanted to give extra time for any network blips that we experience on AWS. We only seem to drop communication on a full garbage collection though.

I am coming to the conclusion that we need to have more shards to cope with the writes, so I will play around with adding more shards and see how I go.

I appreciate you having a look over our setup and the advice.

Thanks again.

Netty.

On 2 May 2013 23:17, Shawn Heisey wrote:

--
Annette Newton
Database Administrator
ServiceTick Ltd

T: +44 (0)1603 618326

Seebohm House, 2-4 Queen Street, Norwich, England NR2 4SQ
www.servicetick.com
*www.sessioncam.com*

--
*This message is confidential and is intended to be read solely by the addressee. The contents should not be disclosed to any other person or copies taken unless authorised to do so. If you are not the intended recipient, please notify the sender and permanently delete this message. As Internet communications are not secure ServiceTick accepts neither legal responsibility for the contents of this message nor responsibility for any change made to this message after it was forwarded by the original author.*
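Shawn's hard/soft commit suggestion translates into a solrconfig.xml fragment along these lines. This is a sketch only; the interval values are illustrative assumptions, not recommendations from the thread:

```xml
<updateHandler class="solr.DirectUpdateHandler2">
  <!-- Hard commit: flush to stable storage regularly, but with
       openSearcher=false, skip the expensive searcher reopen. -->
  <autoCommit>
    <maxTime>60000</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>
  <!-- Soft commit: make new documents visible to searches on a
       shorter interval, without the cost of a hard commit. -->
  <autoSoftCommit>
    <maxTime>15000</maxTime>
  </autoSoftCommit>
</updateHandler>
```

With autowarming disabled, as in Annette's config, the soft-commit searcher reopen stays comparatively cheap.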
Re: Delete from Solr Cloud 4.0 index..
On 5/2/2013 4:24 AM, Annette Newton wrote:
> Hi Shawn,
>
> Thanks so much for your response. We basically are very write intensive and write throughput is pretty essential to our product. Reads are sporadic and actually functioning really well.
>
> We write on average (at the moment) 8-12 batches of 35 documents per minute. But we really will be looking to write more in the future, so need to work out scaling of solr and how to cope with more volume.
>
> Schema (I have changed the names):
> http://pastebin.com/x1ry7ieW
>
> Config:
> http://pastebin.com/pqjTCa7L

This is very clean. There's probably more you could remove/comment, but generally speaking I couldn't find any glaring issues. In particular, you have disabled autowarming, which is a major contributor to commit speed problems.

The first thing I think I'd try is increasing zkClientTimeout to 30 or 60 seconds. You can use the startup commandline or solr.xml; I would probably use the latter. Here's a solr.xml fragment that uses a system property or a 15 second default:

  zkClientTimeout="${zkClientTimeout:15000}" hostPort="${jetty.port:}"
  hostContext="solr"

General thoughts - these changes might not help this particular issue: You've got autoCommit with openSearcher=true. This is a hard commit. If it were me, I would set that up with openSearcher=false and either do explicit soft commits from my application or set up autoSoftCommit with a shorter timeframe than autoCommit.

This might simply be a scaling issue, where you'll need to spread the load wider than four shards. I know that there are financial considerations with that, and they might not be small, so let's leave that alone for now.

The memory problems might be a symptom/cause of the scaling issue I just mentioned. You said you're using facets, which can be a real memory hog even with only a few of them. Have you tried facet.method=enum to see how it performs? You'd need to switch to it exclusively, never go with the default of fc. You could put that in the defaults or invariants section of your request handler(s).

Another way to reduce memory usage for facets is to use disk-based docValues on version 4.2 or later for the facet fields, but this will increase your index size, and your index is already quite large. Depending on your index contents, the increase may be small or large.

Something to just mention: It looks like your solrconfig.xml has hard-coded absolute paths for dataDir and updateLog. This is fine if you'll only ever have one core/collection on each server, but it'll be a disaster if you have multiples. I could be wrong about how these get interpreted in SolrCloud -- they might actually be relative despite starting with a slash.

Thanks,
Shawn
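Only the attributes of Shawn's solr.xml fragment survived the list archive; the enclosing markup was stripped. A sketch of the legacy Solr 4.x solr.xml it likely belonged to -- the adminPath, defaultCoreName, and core entries here are assumptions, not part of the original message:

```xml
<solr persistent="true">
  <!-- zkClientTimeout reads the zkClientTimeout system property,
       falling back to 15000 ms (15 s) when it is not set. -->
  <cores adminPath="/admin/cores" defaultCoreName="collection1"
         zkClientTimeout="${zkClientTimeout:15000}"
         hostPort="${jetty.port:}" hostContext="solr">
    <core name="collection1" instanceDir="collection1"/>
  </cores>
</solr>
```

Starting Solr with -DzkClientTimeout=30000 would then raise the timeout to 30 seconds without editing the file.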
Re: Delete from Solr Cloud 4.0 index..
Hi Shawn, Thanks so much for your response. We basically are very write intensive and write throughput is pretty essential to our product. Reads are sporadic and actually is functioning really well. We write on average (at the moment) 8-12 batches of 35 documents per minute. But we really will be looking to write more in the future, so need to work out scaling of solr and how to cope with more volume. Schema (I have changed the names) : http://pastebin.com/x1ry7ieW Config: http://pastebin.com/pqjTCa7L As you can see we haven't played around much with caches and such. I am now load testing on 4.2.1 and will be re-indexing our data so now is really the time to make any tweeks we can to get the throughput we want. We query based mostly on the latest documents added and use facet to populate drop downs for distinct values which the selection then gets added to the basic query of: rows=20&df=text&fl=Id,EP,ExP,PC,UTCTime,CIp,Br,OS,LU&start=0&q=UTCTime:[2013-04-25T23:00:00Z+TO+2013-05-02T22:00:00Z]+AND+H:(https\:\/\/.com)&sort=UTCTime+desc So we will add further fields onto the above, typically users are adding only 1 or 2 further restrictions. Facet queries will be the same as the above, we always restrict by the date and the customer reference. Hope this is enough information to be going on with. Again thanks for your help. Netty. On 1 May 2013 17:31, Shawn Heisey wrote: > On 5/1/2013 8:42 AM, Annette Newton wrote: > >> It was a single delete with a date range query. We have 8 machines each >> with 35GB memory, 10GB is allocated to the JVM. Garbage collection has >> always been a problem for us with the heap not clearing on Full garbage >> collection. I don't know what is being held in memory and refuses to be >> collected. >> >> I have seen your java heap configuration on previous posts and it's very >> like ours except that we are not currently using LargePages (I don't know >> how much difference that has made to your memory usage). 
>> >> We have tried various configurations around Java including the G1 >> collector >> (which was awful) but all settings seem to leave the old generation at >> least 50% full, so it quickly fills up again. >> >> -Xms10240M -Xmx10240M -XX:+UseConcMarkSweepGC -XX:+UseParNewGC >> -XX:+CMSParallelRemarkEnabled -XX:NewRatio=2 -XX:+CMSScavengeBeforeRemark >> -XX:CMSWaitDuration=5000 -XX:+CMSClassUnloadingEnabled >> -XX:**CMSInitiatingOccupancyFraction**=80 -XX:+** >> UseCMSInitiatingOccupancyOnly >> >> If I could only figure out what keeps the heap to the current level I feel >> we would be in a better place with solr. >> > > With a single delete request, it was probably the commit that was very > slow and caused the problem, not the delete itself. This has been my > experience with my large indexes. > > My attempts with the G1 collector were similarly awful. The idea seems > sound on paper, but Oracle needs to do some work in making it better for > large heaps. Because my GC tuning was not very disciplined, I do not know > how much impact UseLargePages is having. > > Your overall RAM allocation should be good. If these machines aren't > being used for other software, then you have 24-25GB of memory available > for caching your index, which should be very good with 26GB of index for > that machine. > > Looking over your message history, I see that you're using Amazon EC2. > Solr performs much better on bare metal, although the EC2 instance you're > using is probably very good. > > SolrCloud is optimized for machines that are on the same Ethernet LAN. > Communication between EC2 VMs (especially if they are not located in nearby > data centers) will have some latency and a potential for dropped packets. > I'm going to proceed with the idea that EC2 and virtualization are not the > problems here. > > I'm not really surprised to hear that with an index of your size that so > much of a 10GB heap is retained. 
> There may be things that could reduce your memory usage, so could you
> share your solrconfig.xml and schema.xml with a paste site that does XML
> highlighting (pastie.org being a good example), and give us an idea of
> how often you update and commit? Feel free to search/replace sensitive
> information, as long as that work is consistent and you don't entirely
> remove it. Armed with that information, we can have a discussion about
> your needs and how to achieve them.
>
> Do you know how long cache autowarming is taking? The cache statistics
> should tell you how long it took on the last commit.
>
> Some examples of typical real-world queries would be helpful too. Examples
> should be relatively complex for your setup, but not worst-case. An
> example query for my setup that meets this requirement would probably be
> 4-10KB in size ... some of them are 20KB!
>
> Not really related - a question about one of your old messages that never
> seemed to get resolved: Are you still seeing a lot of CLOSE_WAIT
> connections in your TCP table? A later message from you mentioned 4.2.1,
> so I'm wondering specifically about that version.
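The date-range search Annette describes earlier in the thread can be assembled as a plain parameter map. A minimal Python sketch, where the host, collection name ("collection1"), and the H field value are assumptions, not taken from the thread:

```python
from urllib.parse import urlencode

# Sketch of the query shape described in the thread: latest documents in a
# date window plus a host/customer restriction, sorted newest-first.
# The H field value and the collection URL below are hypothetical.
params = {
    "q": "UTCTime:[2013-04-25T23:00:00Z TO 2013-05-02T22:00:00Z]"
         " AND H:(some\\:host)",
    "sort": "UTCTime desc",
    "start": 0,
    "rows": 20,
    "df": "text",
    "fl": "Id,EP,ExP,PC,UTCTime,CIp,Br,OS,LU",
}
query_string = urlencode(params)
url = "http://localhost:8983/solr/collection1/select?" + query_string
print(url)
```

Further user-selected restrictions would simply be appended to `q` (or sent as filter queries) in the same way.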
Re: Delete from Solr Cloud 4.0 index..
On 5/1/2013 8:42 AM, Annette Newton wrote: It was a single delete with a date range query. We have 8 machines each with 35GB memory, 10GB is allocated to the JVM. Garbage collection has always been a problem for us with the heap not clearing on Full garbage collection. I don't know what is being held in memory and refuses to be collected. I have seen your java heap configuration on previous posts and it's very like ours except that we are not currently using LargePages (I don't know how much difference that has made to your memory usage). We have tried various configurations around Java including the G1 collector (which was awful) but all settings seem to leave the old generation at least 50% full, so it quickly fills up again. -Xms10240M -Xmx10240M -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:+CMSParallelRemarkEnabled -XX:NewRatio=2 -XX:+CMSScavengeBeforeRemark -XX:CMSWaitDuration=5000 -XX:+CMSClassUnloadingEnabled -XX:CMSInitiatingOccupancyFraction=80 -XX:+UseCMSInitiatingOccupancyOnly If I could only figure out what keeps the heap to the current level I feel we would be in a better place with solr. With a single delete request, it was probably the commit that was very slow and caused the problem, not the delete itself. This has been my experience with my large indexes. My attempts with the G1 collector were similarly awful. The idea seems sound on paper, but Oracle needs to do some work in making it better for large heaps. Because my GC tuning was not very disciplined, I do not know how much impact UseLargePages is having. Your overall RAM allocation should be good. If these machines aren't being used for other software, then you have 24-25GB of memory available for caching your index, which should be very good with 26GB of index for that machine. Looking over your message history, I see that you're using Amazon EC2. Solr performs much better on bare metal, although the EC2 instance you're using is probably very good. 
SolrCloud is optimized for machines that are on the same Ethernet LAN. Communication between EC2 VMs (especially if they are not located in nearby data centers) will have some latency and a potential for dropped packets. I'm going to proceed with the idea that EC2 and virtualization are not the problems here.

I'm not really surprised to hear that with an index of your size that so much of a 10GB heap is retained. There may be things that could reduce your memory usage, so could you share your solrconfig.xml and schema.xml with a paste site that does XML highlighting (pastie.org being a good example), and give us an idea of how often you update and commit? Feel free to search/replace sensitive information, as long as that work is consistent and you don't entirely remove it. Armed with that information, we can have a discussion about your needs and how to achieve them.

Do you know how long cache autowarming is taking? The cache statistics should tell you how long it took on the last commit.

Some examples of typical real-world queries would be helpful too. Examples should be relatively complex for your setup, but not worst-case. An example query for my setup that meets this requirement would probably be 4-10KB in size ... some of them are 20KB!

Not really related - a question about one of your old messages that never seemed to get resolved: Are you still seeing a lot of CLOSE_WAIT connections in your TCP table? A later message from you mentioned 4.2.1, so I'm wondering specifically about that version.

Thanks,
Shawn
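Shawn's "24-25GB available for caching" figure follows from simple arithmetic on the numbers given in the thread. A back-of-the-envelope sketch (the 1GB OS allowance is an assumption, not from the thread):

```python
# Rough memory budget for one of the nodes described above.
total_ram_gb = 35   # physical RAM per machine (from the thread)
java_heap_gb = 10   # -Xms10240M / -Xmx10240M (from the thread)
os_overhead_gb = 1  # allowance for the OS and other processes (assumption)

# Whatever the JVM heap and OS do not take stays available to the OS page
# cache, which is what actually caches the Lucene index files on disk.
page_cache_gb = total_ram_gb - java_heap_gb - os_overhead_gb
index_size_gb = 26  # index size per machine (from the thread)

print(page_cache_gb)   # about 24 GB left for caching
print(page_cache_gb / index_size_gb)  # most of the index fits in cache
```

This is why Shawn treats the RAM allocation as healthy: nearly the entire 26GB index can live in the page cache, so searches rarely touch disk.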
Re: Delete from Solr Cloud 4.0 index..
Hi Shawn,

Thanks for the reply. It was a single delete with a date range query. We have 8 machines each with 35GB memory, 10GB is allocated to the JVM. Garbage collection has always been a problem for us, with the heap not clearing on full garbage collection. I don't know what is being held in memory and refuses to be collected.

I have seen your Java heap configuration on previous posts and it's very like ours, except that we are not currently using LargePages (I don't know how much difference that has made to your memory usage). We have tried various configurations around Java including the G1 collector (which was awful) but all settings seem to leave the old generation at least 50% full, so it quickly fills up again.

-Xms10240M -Xmx10240M -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:+CMSParallelRemarkEnabled -XX:NewRatio=2 -XX:+CMSScavengeBeforeRemark -XX:CMSWaitDuration=5000 -XX:+CMSClassUnloadingEnabled -XX:CMSInitiatingOccupancyFraction=80 -XX:+UseCMSInitiatingOccupancyOnly

If I could only figure out what keeps the heap to the current level I feel we would be in a better place with solr.

Thanks.

On 1 May 2013 14:40, Shawn Heisey wrote:
> On 5/1/2013 3:39 AM, Annette Newton wrote:
> > We have a 4 shard - 2 replica solr cloud setup, each with about 26GB of
> > index. A total of 24,000,000 documents. We issued a rather large delete
> > yesterday morning to reduce that size by about half; this resulted in
> > the loss of all shards while the delete was taking place, but when it
> > had apparently finished, as soon as we started writing again we
> > continued to lose shards.
> >
> > We have also issued much smaller deletes and lost shards, but before
> > they have always come back ok. This time we couldn't keep them online.
> > We ended up rebuilding our cloud setup and switching over to it.
> >
> > Is there a better process for deleting documents? Is this expected
> > behaviour?
>
> How was the delete composed?
> Was it a single request with a simple query, or was it a huge list of IDs
> or a huge query? Was it millions of individual delete queries? All of
> those should be fine, but the last option is the hardest on Solr,
> especially if you are doing a lot of commits at the same time. You might
> need to increase the zkTimeout value on your startup commandline or in
> solr.xml.
>
> How many machines do your eight SolrCloud replicas live on? How much RAM
> do they have? How much of that memory is allocated to the Java heap?
>
> Assuming that your SolrCloud is living on eight separate machines that
> each have a 26GB index, I hope that you have 16 to 32 GB of RAM on each
> of those machines, and that a large chunk of that RAM is not allocated
> to Java or any other program. If you don't, then it will be very
> difficult to get good performance out of Solr, especially for index
> commits. If you have multiple 26GB shards per machine, you'll need even
> more free memory. The free memory is used to cache your index files.
>
> Another possible problem here is Java garbage collection pauses. If you
> have a large max heap and don't have a tuned GC configuration, then the
> only way to fix this is to reduce your heap and/or to tune Java's
> garbage collection.
>
> Thanks,
> Shawn

--
Annette Newton
Database Administrator
ServiceTick Ltd
T: +44 (0)1603 618326
Seebohm House, 2-4 Queen Street, Norwich, England NR2 4SQ
www.servicetick.com
*www.sessioncam.com*

--
*This message is confidential and is intended to be read solely by the addressee. The contents should not be disclosed to any other person or copies taken unless authorised to do so. If you are not the intended recipient, please notify the sender and permanently delete this message. As Internet communications are not secure ServiceTick accepts neither legal responsibility for the contents of this message nor responsibility for any change made to this message after it was forwarded by the original author.*
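The zkTimeout increase Shawn suggests would, in the legacy solr.xml format used by the 4.x line, be an attribute on the `<cores>` element. A sketch only; the surrounding attributes and core name here are illustrative, and the exact layout should be checked against your own solr.xml:

```xml
<!-- Sketch: legacy-format solr.xml with the ZooKeeper client timeout
     raised to 30 seconds. adminPath/hostPort/core values are illustrative. -->
<solr persistent="true">
  <cores adminPath="/admin/cores"
         hostPort="8983"
         zkClientTimeout="30000">
    <core name="collection1" instanceDir="collection1" />
  </cores>
</solr>
```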
Re: Delete from Solr Cloud 4.0 index..
On 5/1/2013 3:39 AM, Annette Newton wrote:
> We have a 4 shard - 2 replica solr cloud setup, each with about 26GB of
> index. A total of 24,000,000 documents. We issued a rather large delete
> yesterday morning to reduce that size by about half; this resulted in the
> loss of all shards while the delete was taking place, but when it had
> apparently finished, as soon as we started writing again we continued to
> lose shards.
>
> We have also issued much smaller deletes and lost shards, but before they
> have always come back ok. This time we couldn't keep them online. We
> ended up rebuilding our cloud setup and switching over to it.
>
> Is there a better process for deleting documents? Is this expected
> behaviour?

How was the delete composed? Was it a single request with a simple query, or was it a huge list of IDs or a huge query? Was it millions of individual delete queries? All of those should be fine, but the last option is the hardest on Solr, especially if you are doing a lot of commits at the same time. You might need to increase the zkTimeout value on your startup commandline or in solr.xml.

How many machines do your eight SolrCloud replicas live on? How much RAM do they have? How much of that memory is allocated to the Java heap?

Assuming that your SolrCloud is living on eight separate machines that each have a 26GB index, I hope that you have 16 to 32 GB of RAM on each of those machines, and that a large chunk of that RAM is not allocated to Java or any other program. If you don't, then it will be very difficult to get good performance out of Solr, especially for index commits. If you have multiple 26GB shards per machine, you'll need even more free memory. The free memory is used to cache your index files.

Another possible problem here is Java garbage collection pauses. If you have a large max heap and don't have a tuned GC configuration, then the only way to fix this is to reduce your heap and/or to tune Java's garbage collection.

Thanks,
Shawn
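For reference, the "single request with a simple query" option is one XML body posted to the update handler. A Python sketch that only builds and prints the request; the date cutoff and collection URL are hypothetical, and actually sending it requires a running Solr node:

```python
# Sketch: a single delete-by-query request body for Solr 4.x.
# The cutoff date and the update URL below are hypothetical examples.
cutoff = "2013-04-01T00:00:00Z"
payload = "<delete><query>UTCTime:[* TO %s]</query></delete>" % cutoff
update_url = "http://localhost:8983/solr/collection1/update?commit=true"

# Posting would look something like this (needs the 'requests' package
# and a live node, so it is left as a comment):
#   requests.post(update_url, data=payload,
#                 headers={"Content-Type": "text/xml"})
print(payload)
```

Note the `commit=true` here triggers an immediate hard commit; as Shawn points out above, with an index this size the commit is often the expensive part, not the delete itself.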
Delete from Solr Cloud 4.0 index..
We have a 4 shard - 2 replica solr cloud setup, each with about 26GB of index. A total of 24,000,000 documents. We issued a rather large delete yesterday morning to reduce that size by about half; this resulted in the loss of all shards while the delete was taking place, but when it had apparently finished, as soon as we started writing again we continued to lose shards.

We have also issued much smaller deletes and lost shards, but before they have always come back ok. This time we couldn't keep them online. We ended up rebuilding our cloud setup and switching over to it.

Is there a better process for deleting documents? Is this expected behaviour?

Thanks very much.

--
Annette Newton
Database Administrator
ServiceTick Ltd