Hi,

We are using Solr 7.1.0 to index a database of addresses.  We have found that 
our index size increases massively when we add one extra field to the index, 
even though that field is stored and not indexed, and doesn’t contain a lot of 
data.  When this occurs, we also observe a significant increase in response 
times and CPU usage on the Solr server.

When we run an index load without the problematic field present, the Solr index 
size is 5.5GB.  When we add the field into the index, the size grows to 13.3GB. 
 The field itself is a maximum of 46 characters in length and on average is 19 
characters. We have ~14,000,000 rows in total to index of which only ~200,000 
have this field present at all (i.e. not null in database).  Given that we 
don’t want to index the field, only store it I would have thought (perhaps 
naively) that the storage increase would be approximately 200,000 * 19 = 3.8M 
bytes = 3.6MB rather than the 7.5GB we are seeing.

Some further background on what we are doing:

- We are using the Solr 7.1.0 docker image for our Solr server
- We are importing the data from an Oracle table using JDBC and the standard 
dataimport request handler
- As we want to push the docker image to AWS ECR which only accepts docker 
layers of a maximum of 10GB, we load the index in four separate imports, 
stopping Solr gracefully in between each load
- Our index contains 48 fields in total
- The problematic field is created through the API as follows:

  curl -X POST -H 'Content-type:application/json' --data-binary '{
    "add-field":{
      "name":"buildingName",
      "type":"string",
      "stored":true,
      "indexed":false
    }
  }' http://localhost:8983/solr/address/schema

I have also tried using SolrText instead of string, but that doesn't make a 
noticeable difference.

It also makes a difference how many records are loaded.  If I only load 
1,000,000 records (that have a proportionate number of building names) then the 
size of the index with and without buildingName is about the same (~1GB).

Is there some sort of limit that I'm not aware of that we are hitting, either 
number of fields or size of data?  Is there some kind of corrupt data that I 
need to look for in the buildingName field that could cause this (it's just a 
varchar2(46) field in Oracle)?

Thanks for your assistance,

David

David Howe
Java Domain Architect
Postal Systems
Level 16, 111 Bourke Street Melbourne VIC 3000

T  0391067904

M  0424036591

E  david.h...@auspost.com.au

W  auspost.com.au
W  startrack.com.au

Australia Post is committed to providing our customers with excellent service. 
If we can assist you in any way please telephone 13 13 18 or visit our website.

The information contained in this email communication may be proprietary, 
confidential or legally professionally privileged. It is intended exclusively 
for the individual or entity to which it is addressed. You should only read, 
disclose, re-transmit, copy, distribute, act in reliance on or commercialise 
the information if you are authorised to do so. Australia Post does not 
represent, warrant or guarantee that the integrity of this email communication 
has been maintained nor that the communication is free of errors, virus or 
interference.

If you are not the addressee or intended recipient please notify us by replying 
direct to the sender and then destroy any electronic or paper copy of this 
message. Any views expressed in this email communication are taken to be those 
of the individual sender, except where the sender specifically attributes those 
views to Australia Post and is authorised to do so.

Please consider the environment before printing this email.

Reply via email to