[
https://issues.apache.org/jira/browse/SOLR-810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Noble Paul updated SOLR-810:
----------------------------
Description:
For storage purposes, javabin can be quite inefficient if we write one document
at a time, because the field names are written again for each document.
javabin can be as efficient as a format like Thrift or Protocol Buffers if we
do not pay the price of a string per name. We can easily achieve this using a
new type, KNOWN_STRING.
KNOWN_STRING is like an EXTERN_STRING, except that the strings are
preconfigured names held in a map of index -> string. The known string list
can probably have a version. The client must use the same or a newer version of
the known string list than the server, so that it can resolve every index the
server writes.
An example looks like:
{code}
1:responseHeader
2:QTime
3:status
{code}
A newer version of the string list can add a new string at a new index, but it
must never change the index of an existing string. This is similar to an IDL
file in Thrift or Protocol Buffers, but without any of those complexities.
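The append-only rule above can be checked mechanically when a new list version
is introduced. This is only a sketch: the class and method names are
hypothetical and not part of Solr.

```java
import java.util.Map;

// Hypothetical helper, not part of Solr: validates the append-only rule
// for two versions of a known string list (index -> string).
final class KnownStringLists {
    /**
     * Returns true if every index defined in the older list maps to the
     * same string in the newer list. The newer list may add new indexes.
     */
    static boolean isCompatibleUpgrade(Map<Integer, String> older,
                                       Map<Integer, String> newer) {
        for (Map.Entry<Integer, String> e : older.entrySet()) {
            // A missing or changed entry breaks readers that use the old list.
            if (!e.getValue().equals(newer.get(e.getKey()))) {
                return false;
            }
        }
        return true;
    }
}
```

A list that adds 4:params to the example above passes this check; a list that
reassigns index 2 to another name does not.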
So when an EXTERN_STRING is written, the writer first looks it up in the
KNOWN_STRING map. If it is present, it is written as a KNOWN_STRING instead of
an EXTERN_STRING, and the value written is the index.
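A minimal sketch of that write path, to make the saving concrete. It does not
reproduce javabin's actual wire encoding: the tag values, the fixed 4-byte
index and length, and the class name are all assumptions made for illustration,
and the fallback here is a plain inline string rather than the real
EXTERN_STRING handling.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not the javabin codec.
final class KnownStringWriter {
    static final int TAG_STRING = 1;        // assumed tag: inline UTF-8 string
    static final int TAG_KNOWN_STRING = 2;  // assumed tag: index into the known list

    private final Map<String, Integer> indexByName = new HashMap<>();

    // Takes the preconfigured index -> string list and inverts it for lookup.
    KnownStringWriter(Map<Integer, String> knownStrings) {
        for (Map.Entry<Integer, String> e : knownStrings.entrySet()) {
            indexByName.put(e.getValue(), e.getKey());
        }
    }

    // Writes the index if the name is known, otherwise the full string.
    void writeName(String name, DataOutputStream out) throws IOException {
        Integer index = indexByName.get(name);
        if (index != null) {
            out.writeByte(TAG_KNOWN_STRING);
            out.writeInt(index);
        } else {
            byte[] utf8 = name.getBytes(StandardCharsets.UTF_8);
            out.writeByte(TAG_STRING);
            out.writeInt(utf8.length);
            out.write(utf8);
        }
    }

    // Convenience for experiments: encodes one name into a byte array.
    byte[] encode(String name) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            writeName(name, out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }
}
```

With this layout a known name always costs 5 bytes regardless of its length,
while an unknown name costs 5 bytes plus its UTF-8 length. A variable-length
index would shrink the known case further.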
Another addition could be a zip string type. This is useful when javabin is
used for storing data. In storage, the performance cost of
serialization/deserialization may not be as important as the space itself.
This type may also have a minimum size for compression: only large strings
(say > 2KB?) may need to be compressed.
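A rough sketch of the zip string idea using java.util.zip. The class name and
the exact threshold are assumptions (the 2KB figure is still an open question
above), and how a compressed string would be tagged on the wire is left out.

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical sketch, not part of the javabin codec.
final class ZipStringSketch {
    // Assumed threshold, taken from the "say > 2KB?" suggestion.
    static final int MIN_COMPRESS_BYTES = 2048;

    // Only strings larger than the threshold are worth compressing.
    static boolean shouldCompress(byte[] utf8) {
        return utf8.length > MIN_COMPRESS_BYTES;
    }

    // Compresses the bytes with DEFLATE (zlib wrapper, default level).
    static byte[] deflate(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        while (!deflater.finished()) {
            int n = deflater.deflate(chunk);
            out.write(chunk, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    // Reverses deflate(); rejects truncated or corrupt input.
    static byte[] inflate(byte[] compressed) {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(chunk);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalArgumentException("truncated or unsupported data");
                }
                out.write(chunk, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("corrupt compressed data", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }
}
```

A writer would call shouldCompress on the UTF-8 bytes and fall back to the
normal string type for anything at or below the threshold, so that short
strings do not pay the compression overhead.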
was:
javabin can be as efficient as a format like Thrift or Protocol Buffers if we
do not pay the price of a string per name. We can easily achieve this using a
new type, KNOWN_STRING.
KNOWN_STRING is like an EXTERN_STRING, except that the strings are
preconfigured names held in a map of index -> string. The known string list
can probably have a version. The client must use the same or a newer version of
the known string list than the server, so that it can resolve every index the
server writes.
An example looks like:
{code}
1:responseHeader
2:QTime
3:status
{code}
A newer version of the string list can add a new string at a new index, but it
must never change the index of an existing string. This is similar to an IDL
file in Thrift or Protocol Buffers, but without any of those complexities.
So when an EXTERN_STRING is written, the writer first looks it up in the
KNOWN_STRING map. If it is present, it is written as a KNOWN_STRING instead of
an EXTERN_STRING, and the value written is the index.
> changes for javabin format
> --------------------------
>
> Key: SOLR-810
> URL: https://issues.apache.org/jira/browse/SOLR-810
> Project: Solr
> Issue Type: Improvement
> Reporter: Noble Paul