Re: Zk big files issues and model store

Florin Babes Sat, 21 Oct 2023 08:20:50 -0700

Hello,

I've created an issue and a PR (
https://issues.apache.org/jira/browse/SOLR-17050).
The models will also be smaller because they will be saved without newlines
and without spaces after keys:


> ~ cat d.txt
> "x":{"y":{"z":{"foobar":42}}}
> ~ wc -c d.txt
>       30 d.txt
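
As a rough illustration of the saving (this is not the code from the PR, which is in Java), the Python sketch below serializes one nested object twice: once with two-space indentation and once with no whitespace between tokens. The sample object and the helper name `serialized_sizes` are assumptions made only for this example.

```python
import json


def serialized_sizes(obj):
    """Return (pretty_bytes, compact_bytes) for a JSON-serializable object."""
    # Pretty form: two-space indentation, one key per line.
    pretty = json.dumps(obj, indent=2)
    # Compact form: no newlines and no spaces after ',' or ':'.
    compact = json.dumps(obj, separators=(",", ":"))
    return len(pretty.encode("utf-8")), len(compact.encode("utf-8"))


# Hypothetical sample object, mirroring the shape used in d.txt.
sample = {"x": {"y": {"z": {"foobar": 42}}}}
pretty_size, compact_size = serialized_sizes(sample)
print(pretty_size, compact_size)
```

The compact form is always the smaller of the two, and the gap grows with nesting depth because each extra level adds more indentation to every line below it.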


For example the following model:

> ~ cat modelExamples/linear-model.json
> {
>   "class": "org.apache.solr.ltr.model.LinearModel",
>   "name": "6029760550880411648",
>   "features": [
>     {
>       "name": "title"
>     },
>     {
>       "name": "description"
>     },
>     {
>       "name": "keywords"
>     },
>     {
>       "name": "popularity",
>       "norm": {
>         "class": "org.apache.solr.ltr.norm.MinMaxNormalizer",
>         "params": {
>           "min": "0.0f",
>           "max": "10.0f"
>         }
>       }
>     },
>     {
>       "name": "text"
>     },
>     {
>       "name": "queryIntentPerson"
>     },
>     {
>       "name": "queryIntentCompany"
>     }
>   ],
>   "params": {
>     "weights": {
>       "title": 0.0000000000,
>       "description": 0.1000000000,
>       "keywords": 0.2000000000,
>       "popularity": 0.3000000000,
>       "text": 0.4000000000,
>       "queryIntentPerson": 0.1231231,
>       "queryIntentCompany": 0.12121211
>     }
>   }
> }
> ~ curl -XPUT 'http://localhost:8983/solr/small_models/schema/model-store'
> --data-binary "@modelExamples/linear-model.json" -H
> 'Content-type:application/json'
> {
>   "responseHeader":{
>     "status":0,
>     "QTime":23
>   }
> }

will be saved as

> curl
> http://localhost:8983/solr/small_models/admin/file\?wt\=json\&_\=1697901271281\&file\=_schema_model-store.json\&contentType\=application%2Fjson%3Bcharset%3Dutf-8
>
> {"initArgs":{},"initializedOn":"2023-10-21T15:13:08.571Z","managedList":[{"name":"6029760550880411648","class":"org.apache.solr.ltr.model.LinearModel","store":"_DEFAULT_","features":[{"name":"title","norm":{"class":"org.apache.solr.ltr.norm.IdentityNormalizer"}},{"name":"description","norm":{"class":"org.apache.solr.ltr.norm.IdentityNormalizer"}},{"name":"keywords","norm":{"class":"org.apache.solr.ltr.norm.IdentityNormalizer"}},{"name":"popularity","norm":{"class":"org.apache.solr.ltr.norm.MinMaxNormalizer","params":{"min":"0.0","max":"10.0"}}},{"name":"text","norm":{"class":"org.apache.solr.ltr.norm.IdentityNormalizer"}},{"name":"queryIntentPerson","norm":{"class":"org.apache.solr.ltr.norm.IdentityNormalizer"}},{"name":"queryIntentCompany","norm":{"class":"org.apache.solr.ltr.norm.IdentityNormalizer"}}],"params":{"weights":{"title":0.0,"description":0.1,"keywords":0.2,"popularity":0.3,"text":0.4,"queryIntentPerson":0.1231231,"queryIntentCompany":0.12121211}}}]}


which is 42% smaller than the same content pretty-printed (jq's default two-space indentation):




> ~ echo
> '{"initArgs":{},"initializedOn":"2023-10-21T15:13:08.571Z","managedList":[{"name":"6029760550880411648","class":"org.apache.solr.ltr.model.LinearModel","store":"_DEFAULT_","features":[{"name":"title","norm":{"class":"org.apache.solr.ltr.norm.IdentityNormalizer"}},{"name":"description","norm":{"class":"org.apache.solr.ltr.norm.IdentityNormalizer"}},{"name":"keywords","norm":{"class":"org.apache.solr.ltr.norm.IdentityNormalizer"}},{"name":"popularity","norm":{"class":"org.apache.solr.ltr.norm.MinMaxNormalizer","params":{"min":"0.0","max":"10.0"}}},{"name":"text","norm":{"class":"org.apache.solr.ltr.norm.IdentityNormalizer"}},{"name":"queryIntentPerson","norm":{"class":"org.apache.solr.ltr.norm.IdentityNormalizer"}},{"name":"queryIntentCompany","norm":{"class":"org.apache.solr.ltr.norm.IdentityNormalizer"}}],"params":{"weights":{"title":0.0,"description":0.1,"keywords":0.2,"popularity":0.3,"text":0.4,"queryIntentPerson":0.1231231,"queryIntentCompany":0.12121211}}}]}'
> | jq | wc -c
>     1688

> ~ echo
> '{"initArgs":{},"initializedOn":"2023-10-21T15:13:08.571Z","managedList":[{"name":"6029760550880411648","class":"org.apache.solr.ltr.model.LinearModel","store":"_DEFAULT_","features":[{"name":"title","norm":{"class":"org.apache.solr.ltr.norm.IdentityNormalizer"}},{"name":"description","norm":{"class":"org.apache.solr.ltr.norm.IdentityNormalizer"}},{"name":"keywords","norm":{"class":"org.apache.solr.ltr.norm.IdentityNormalizer"}},{"name":"popularity","norm":{"class":"org.apache.solr.ltr.norm.MinMaxNormalizer","params":{"min":"0.0","max":"10.0"}}},{"name":"text","norm":{"class":"org.apache.solr.ltr.norm.IdentityNormalizer"}},{"name":"queryIntentPerson","norm":{"class":"org.apache.solr.ltr.norm.IdentityNormalizer"}},{"name":"queryIntentCompany","norm":{"class":"org.apache.solr.ltr.norm.IdentityNormalizer"}}],"params":{"weights":{"title":0.0,"description":0.1,"keywords":0.2,"popularity":0.3,"text":0.4,"queryIntentPerson":0.1231231,"queryIntentCompany":0.12121211}}}]}'
> | wc -c
>      977
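
The 42% figure follows directly from the two byte counts above. A small Python check, where `percent_smaller` is a hypothetical helper name used only for this example:

```python
def percent_smaller(original_bytes, reduced_bytes):
    """Percentage by which reduced_bytes is smaller than original_bytes."""
    return 100.0 * (original_bytes - reduced_bytes) / original_bytes


# Byte counts reported by `wc -c` above: jq pretty-printed vs. as saved.
reduction = percent_smaller(1688, 977)
print(round(reduction, 1))
```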


Thank you,

Florin Babeş
0762680124
babesflo...@gmail.com


On Thu, 19 Oct 2023 at 16:27, Christine Poerschke (BLOOMBERG/ LONDON) <
cpoersc...@bloomberg.net> wrote:

> Hello Florin.
>
> Of course, do feel free to open an issue and/or draft pull request and/or
> pull request.
>
> If the model is wrapped internally, it would be smaller than the original
> (since no two-space indentations) but slightly bigger than the compacted
> (with zero-space indentations) due to \" escaping for " characters.
>
> Illustration:
>
> $cat a.txt
> "x" : {
>   "y" : {
>     "z" : {
>       "foobar" : 42
>     }
>   }
> }
> $wc -c a.txt
>       62 a.txt
>
> $cat b.txt
> "x" : {
> "y" : {
> "z" : {
> "foobar" : 42
> }
> }
> }
> $wc -c b.txt
>       44 b.txt
>
> $cat c.txt
> \"x\" : { \"y\" : { \"z\" : { \"foobar\" : 42 } } }
> $wc -c c.txt
>       52 c.txt
> $
>
> From: users@solr.apache.org At: 10/18/23 20:40:24 UTC+1:00 To:
> users@solr.apache.org
> Subject: Re: Zk big files issues and model store
>
> Thanks for the suggestion Matthias. I will look into this.
>
> Hello Christine. One of the concerns is the split nature, but also that
> if the file does not exist on disk when the replica reloads, the core
> will not load. Keeping the models in sync on each node can be quite
> complicated. For example, you have to reload the collection only
> after the main model is present on all nodes; if you do it before
> that, the replicas will be unusable. For now we would like to load
> models up to 100MB and that's why I explored this option.
> I made some modifications to the code but I haven't tested them yet.
> After I run the tests, I will come back with a PR. Can I open an issue
> for this?
>
> If the model were wrapped internally, wouldn't that be the same as
> saving it as compacted JSON? It would be approximately the same size
> and we would still need to load the decoded object into memory. To save
> space we could shorten the feature names to abbreviations, but that
> would complicate score debugging.
> I haven't yet looked into storing models in another format. Walter could
> have a point about Avro.
>
> Thanks for the suggestion Eric. I am not familiar with the
> /api/cluster/files endpoint. I will look into it.
>
>
> On Wed, 18 Oct 2023 at 01:47, Dmitri Maziuk <dmitri.maz...@gmail.com>
> wrote:
> >
> > On 10/17/23 13:20, Walter Underwood wrote:
> > >
> > > Gzipping the JSON can be a big win, especially if there are lots of
> > > repeated keys, like in state.json. Gzip has the advantage that some
> > > editors can natively unpack it.
> >
> > It may save you some transfer time, provided the transport subsystem
> > doesn't compress on the fly, but with JSON being an all-or-nothing format,
> > your problem's going to be RAM for the string representation plus RAM
> > for the decoded object representation, of the entire store.
> >
> > If you want it scalable, you want an "incremental" format like ASN.1,
> > Protocol Buffers, or Avro.
> >
> > Dima
> >
>
>
>
