Hi, We are using Solr 7.1.0 to index a database of addresses. We have found that our index size increases massively when we add one extra field to the index, even though that field is stored and not indexed, and doesn’t contain a lot of data. When this occurs, we also observe a significant increase in response times and CPU usage on the Solr server.
When we run an index load without the problematic field present, the Solr index size is 5.5GB. When we add the field into the index, the size grows to 13.3GB. The field itself is a maximum of 46 characters in length and on average is 19 characters. We have ~14,000,000 rows in total to index of which only ~200,000 have this field present at all (i.e. not null in database). Given that we don’t want to index the field, only store it I would have thought (perhaps naively) that the storage increase would be approximately 200,000 * 19 = 3.8M bytes = 3.6MB rather than the 7.5GB we are seeing. Some further background on what we are doing: - We are using the Solr 7.1.0 docker image for our Solr server - We are importing the data from an Oracle table using JDBC and the standard dataimport request handler - As we want to push the docker image to AWS ECR which only accepts docker layers of a maximum of 10GB, we load the index in four separate imports, stopping Solr gracefully in between each load - Our index contains 48 fields in total - The problematic field is created through the API as follows: curl -X POST -H 'Content-type:application/json' --data-binary '{ "add-field":{ "name":"buildingName", "type":"string", "stored":true, "indexed":false } }' http://localhost:8983/solr/address/schema I have also tried using SolrText instead of string, but that doesn't make a noticeable difference. It also makes a difference how many records are loaded. If I only load 1,000,000 records (that have a proportionate number of building names) then the size of the index with and without buildingName is about the same (~1GB). Is there some sort of limit that I'm not aware of that we are hitting, either number of fields or size of data? Is there some kind of corrupt data that I need to look for in the buildingName field that could cause this (it's just a varchar2(46) field in Oracle)? Thanks for your assistance, David David Howe Java Domain Architect Postal Systems Level 16, 111 Bourke Street Melbourne VIC 3000 T 0391067904 M 0424036591 E david.h...@auspost.com.au W auspost.com.au W startrack.com.au Australia Post is committed to providing our customers with excellent service. If we can assist you in any way please telephone 13 13 18 or visit our website. The information contained in this email communication may be proprietary, confidential or legally professionally privileged. It is intended exclusively for the individual or entity to which it is addressed. You should only read, disclose, re-transmit, copy, distribute, act in reliance on or commercialise the information if you are authorised to do so. Australia Post does not represent, warrant or guarantee that the integrity of this email communication has been maintained nor that the communication is free of errors, virus or interference. If you are not the addressee or intended recipient please notify us by replying direct to the sender and then destroy any electronic or paper copy of this message. Any views expressed in this email communication are taken to be those of the individual sender, except where the sender specifically attributes those views to Australia Post and is authorised to do so. Please consider the environment before printing this email.