Hello,

I've created an issue and a PR (https://issues.apache.org/jira/browse/SOLR-17050).

The models will also be smaller because they will be saved without newlines and without spaces after keys:
> ~ cat d.txt
> "x":{"y":{"z":{"foobar":42}}}
> ~ wc -c d.txt
> 30 d.txt

For example, the following model:

> ~ cat modelExamples/linear-model.json
> {
>   "class": "org.apache.solr.ltr.model.LinearModel",
>   "name": "6029760550880411648",
>   "features": [
>     {
>       "name": "title"
>     },
>     {
>       "name": "description"
>     },
>     {
>       "name": "keywords"
>     },
>     {
>       "name": "popularity",
>       "norm": {
>         "class": "org.apache.solr.ltr.norm.MinMaxNormalizer",
>         "params": {
>           "min": "0.0f",
>           "max": "10.0f"
>         }
>       }
>     },
>     {
>       "name": "text"
>     },
>     {
>       "name": "queryIntentPerson"
>     },
>     {
>       "name": "queryIntentCompany"
>     }
>   ],
>   "params": {
>     "weights": {
>       "title": 0.0000000000,
>       "description": 0.1000000000,
>       "keywords": 0.2000000000,
>       "popularity": 0.3000000000,
>       "text": 0.4000000000,
>       "queryIntentPerson": 0.1231231,
>       "queryIntentCompany": 0.12121211
>     }
>   }
> }
> ~ curl -XPUT 'http://localhost:8983/solr/small_models/schema/model-store' \
>     --data-binary "@modelExamples/linear-model.json" \
>     -H 'Content-type:application/json'
> {
>   "responseHeader":{
>     "status":0,
>     "QTime":23
>   }
> }

will be saved as

> curl http://localhost:8983/solr/small_models/admin/file\?wt\=json\&_\=1697901271281\&file\=_schema_model-store.json\&contentType\=application%2Fjson%3Bcharset%3Dutf-8
>
{"initArgs":{},"initializedOn":"2023-10-21T15:13:08.571Z","managedList":[{"name":"6029760550880411648","class":"org.apache.solr.ltr.model.LinearModel","store":"_DEFAULT_","features":[{"name":"title","norm":{"class":"org.apache.solr.ltr.norm.IdentityNormalizer"}},{"name":"description","norm":{"class":"org.apache.solr.ltr.norm.IdentityNormalizer"}},{"name":"keywords","norm":{"class":"org.apache.solr.ltr.norm.IdentityNormalizer"}},{"name":"popularity","norm":{"class":"org.apache.solr.ltr.norm.MinMaxNormalizer","params":{"min":"0.0","max":"10.0"}}},{"name":"text","norm":{"class":"org.apache.solr.ltr.norm.IdentityNormalizer"}},{"name":"queryIntentPerson","norm":{"class":"org.apache.solr.ltr.norm.IdentityNormalizer"}},{"name":"queryIntentCompany","norm":{"class":"org.apache.solr.ltr.norm.IdentityNormalizer"}}],"params":{"weights":{"title":0.0,"description":0.1,"keywords":0.2,"popularity":0.3,"text":0.4,"queryIntentPerson":0.1231231,"queryIntentCompany":0.12121211}}}]}

which is *42%* smaller than the same JSON indented by jq (977 vs. 1688 bytes):

> ~ echo '{"initArgs":{},"initializedOn":"2023-10-21T15:13:08.571Z","managedList":[{"name":"6029760550880411648","class":"org.apache.solr.ltr.model.LinearModel","store":"_DEFAULT_","features":[{"name":"title","norm":{"class":"org.apache.solr.ltr.norm.IdentityNormalizer"}},{"name":"description","norm":{"class":"org.apache.solr.ltr.norm.IdentityNormalizer"}},{"name":"keywords","norm":{"class":"org.apache.solr.ltr.norm.IdentityNormalizer"}},{"name":"popularity","norm":{"class":"org.apache.solr.ltr.norm.MinMaxNormalizer","params":{"min":"0.0","max":"10.0"}}},{"name":"text","norm":{"class":"org.apache.solr.ltr.norm.IdentityNormalizer"}},{"name":"queryIntentPerson","norm":{"class":"org.apache.solr.ltr.norm.IdentityNormalizer"}},{"name":"queryIntentCompany","norm":{"class":"org.apache.solr.ltr.norm.IdentityNormalizer"}}],"params":{"weights":{"title":0.0,"description":0.1,"keywords":0.2,"popularity":0.3,"text":0.4,"queryIntentPerson":0.1231231,"queryIntentCompany":0.12121211}}}]}' | jq | wc -c
> *1688*
> ~ echo '{"initArgs":{},"initializedOn":"2023-10-21T15:13:08.571Z","managedList":[{"name":"6029760550880411648","class":"org.apache.solr.ltr.model.LinearModel","store":"_DEFAULT_","features":[{"name":"title","norm":{"class":"org.apache.solr.ltr.norm.IdentityNormalizer"}},{"name":"description","norm":{"class":"org.apache.solr.ltr.norm.IdentityNormalizer"}},{"name":"keywords","norm":{"class":"org.apache.solr.ltr.norm.IdentityNormalizer"}},{"name":"popularity","norm":{"class":"org.apache.solr.ltr.norm.MinMaxNormalizer","params":{"min":"0.0","max":"10.0"}}},{"name":"text","norm":{"class":"org.apache.solr.ltr.norm.IdentityNormalizer"}},{"name":"queryIntentPerson","norm":{"class":"org.apache.solr.ltr.norm.IdentityNormalizer"}},{"name":"queryIntentCompany","norm":{"class":"org.apache.solr.ltr.norm.IdentityNormalizer"}}],"params":{"weights":{"title":0.0,"description":0.1,"keywords":0.2,"popularity":0.3,"text":0.4,"queryIntentPerson":0.1231231,"queryIntentCompany":0.12121211}}}]}' | wc -c
> *977*

Thanks,
Florin Babeş
0762680124
babesflo...@gmail.com

On Thu, 19 Oct 2023 at 16:27, Christine Poerschke (BLOOMBERG/ LONDON) <cpoersc...@bloomberg.net> wrote:

> Hello Florin.
>
> Of course, do feel free to open an issue and/or draft pull request and/or
> pull request.
>
> If the model is wrapped internally, it would be smaller than the original
> (since no two-space indentations) but slightly bigger than the compacted
> (with zero-space indentations) due to \" escaping for " characters.
>
> Illustration:
>
> $cat a.txt
> "x" : {
>   "y" : {
>     "z" : {
>       "foobar" : 42
>     }
>   }
> }
> $wc -c a.txt
> 62 a.txt
>
> $cat b.txt
> "x" : {
> "y" : {
> "z" : {
> "foobar" : 42
> }
> }
> }
> $wc -c b.txt
> 44 b.txt
>
> $cat c.txt
> \"x\" : { \"y\" : { \"z\" : { \"foobar\" : 42 } } }
> $wc -c c.txt
> 52 c.txt
> $
>
> From: users@solr.apache.org At: 10/18/23 20:40:24 UTC+1:00
> To: users@solr.apache.org
> Subject: Re: Zk big files issues and model store
>
> Thanks for the suggestion Matthias. I will look into this.
>
> Hello Christine. One of the concerns is the split nature, but also that
> if the file does not exist on disk when the replica reloads, the core
> will not load. Keeping the models in sync on each node can be quite
> complicated. For example, you have to reload the collection only after
> the main model is present on all nodes; if you do it before that, the
> replicas will be unusable. For now we would like to load models up to
> 100MB and that's why I explored this option.
> I made some modifications in the code but I haven't tested them yet.
> After I run the tests, I will come back with a PR. Can I open an issue
> for this?
>
> If the model were wrapped internally, wouldn't that be the same as
> saving it as compacted JSON? It will be approximately the same size
> and we will still need to load the decoded object in memory. To save
> space we could shorten the feature names to abbreviations, but that
> would complicate score debugging.
> I haven't looked yet into storing models in another format. Walter could
> have a point about Avro.
>
> Thanks for the suggestion Eric. I am not familiar with the
> /api/cluster/files endpoint. I will look into it.
>
>
> On Wed, 18 Oct 2023 at 01:47, Dmitri Maziuk <dmitri.maz...@gmail.com> wrote:
> >
> > On 10/17/23 13:20, Walter Underwood wrote:
> > >
> > > Gzipping the JSON can be a big win, especially if there are lots of
> > > repeated keys, like in state.json. Gzip has the advantage that some
> > > editors can natively unpack it.
> >
> > It may save you some transfer time, provided the transport subsystem
> > doesn't compress on the fly, but with JSON being an all-or-nothing format,
> > your problem's going to be RAM for the string representation plus RAM
> > for the decoded object representation, of the entire store.
> >
> > If you want it scalable, you want an "incremental" format like asn.1,
> > protocol buffers, or avro.
> >
> > Dima
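
The size comparisons discussed in this thread (indented vs. compact vs. escaped JSON, plus the gzip suggestion) can be reproduced with a small Python sketch. This is only an illustration and not part of the SOLR-17050 patch: the helper name `serialized_sizes` and the sample objects are hypothetical, and "wrapped" is modelled here, as an assumption, by embedding the compact JSON as an escaped JSON string value.

```python
import gzip
import json


def serialized_sizes(obj):
    """Return the UTF-8 byte size of several serializations of obj."""
    # Indented form, similar to what a pretty-printer produces.
    pretty = json.dumps(obj, indent=2)
    # Compact form: no newlines, no spaces after "," or ":".
    compact = json.dumps(obj, separators=(",", ":"))
    # Assumption: "wrapping" means storing the compact JSON as a string
    # value, so every " inside it is escaped as \".
    wrapped = json.dumps(compact)
    return {
        "pretty": len(pretty.encode("utf-8")),
        "compact": len(compact.encode("utf-8")),
        "wrapped": len(wrapped.encode("utf-8")),
        "compact_gzip": len(gzip.compress(compact.encode("utf-8"))),
    }


# Tiny nested object, in the spirit of the a.txt/b.txt/c.txt illustration.
example = {"x": {"y": {"z": {"foobar": 42}}}}
sizes = serialized_sizes(example)

# Hypothetical model with many repeated keys and class names, where gzip
# has something to work with.
repetitive = {
    "features": [
        {
            "name": f"feature{i}",
            "norm": {"class": "org.apache.solr.ltr.norm.IdentityNormalizer"},
        }
        for i in range(50)
    ]
}
big = serialized_sizes(repetitive)

for label, result in (("example", sizes), ("repetitive", big)):
    saving = 1 - result["compact"] / result["pretty"]
    print(label, result, f"compact is {saving:.0%} smaller than pretty")
```

Note that gzip carries a fixed overhead, so it only pays off on larger inputs with repeated keys, and it does not reduce the memory needed for the decoded object.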