Thanks a lot, Todd, for your help. This really clarified the encoding part
that I was planning to implement.
On the slow reads, I will share more details soon.
On Tue, Oct 11, 2016 at 7:28 AM, Todd Lipcon <t...@cloudera.com> wrote:
> Hey Amit,
> Some responses below:
> On Mon, Oct 10, 2016 at 5:27 AM, Amit Adhau <amit.ad...@globant.com> wrote:
>> Hi Kudu Team,
>> I was testing Dictionary and Prefix encoding on Kudu tables.
>> To do so, I created two tables with the same structure and the same data.
>> I inserted 1 billion records into each table, with an average record size
>> of close to 1 KB.
>> I observed the following:
>> On-disk storage - I found a substantial size difference between the table
>> with encoded columns and the table without: the encoded table took much
>> less space than the non-encoded one.
> Yes, that's expected -- one of the most important purposes of encodings is
> to reduce data size on disk.
>> Scan performance - Queries against the table with encoded columns always
>> took less time than the same queries against the non-encoded table.
>> Can you please help me with the questions below?
>> 1. Scans on encoded columns take less time. Is this expected behavior?
> It's often the case, especially if the data is large enough that it isn't
> fitting in cache. There are some cases where it's not faster, though. For
> example, if you use bitshuffle encodings on integers, and the size of the
> column was small enough that it was fully cached, it would be faster to
> scan unencoded integers compared to encoded ones. That balance changes,
> though, if the data no longer fits in RAM, since the reduced IO cost (due
> to the encoding) offsets the increased CPU cost (due to having to decode in
> order to service the query).
> With dictionary compression of strings, however, it should basically
> always be the case that it's beneficial. This is especially true if you
> have predicates on the encoded columns ('WHERE' clauses in SQL
> terminology), and especially after v1.0 in which there were some
> optimizations in this area.
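To illustrate why predicates on dictionary-encoded strings can be cheap, here is a minimal Python sketch. It is not Kudu's implementation; the function names dict_encode and scan_equals are made up for this example. The idea shown is that the predicate value is looked up in the dictionary once, and the scan then compares small integer codes instead of strings.

```python
from typing import Dict, List, Tuple


def dict_encode(values: List[str]) -> Tuple[List[str], List[int]]:
    """Replace each string with an integer code that indexes a dictionary."""
    dictionary: List[str] = []
    index: Dict[str, int] = {}
    codes: List[int] = []
    for v in values:
        code = index.get(v)
        if code is None:
            # First time this value is seen: assign the next free code.
            code = len(dictionary)
            index[v] = code
            dictionary.append(v)
        codes.append(code)
    return dictionary, codes


def scan_equals(dictionary: List[str], codes: List[int], target: str) -> List[int]:
    """Return the row offsets whose value equals target.

    The string comparison happens once, against the dictionary; the
    per-row work is an integer comparison.
    """
    try:
        wanted = dictionary.index(target)
    except ValueError:
        return []  # value absent from the dictionary: no row can match
    return [row for row, code in enumerate(codes) if code == wanted]


# Hypothetical low-cardinality column, e.g. a country code.
column = ["IN", "US", "IN", "AR", "US", "IN"]
dictionary, codes = dict_encode(column)
matches = scan_equals(dictionary, codes, "IN")
```

In this toy case the six strings shrink to a three-entry dictionary plus six small integers, and a lookup for a value that is not in the dictionary returns without touching any row.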
>> 2. Just to confirm: for a composite primary key, my understanding is that
>> prefix encoding can help on the first column, or the first few columns,
>> where many values share the same prefix, or perhaps on a column such as
>> the webpage URL in clickstream logs. Is that right?
> For the case of string columns at the beginning of a composite key, you're
> right that prefix encoding is often a good choice. Note that internally
> Kudu synthesizes a "composite key" column (not exposed to the user) which
> concatenates your PK columns, and that _always_ uses PREFIX encoding,
> regardless of what you've selected for the columns themselves.
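As a rough illustration of why prefix encoding suits sorted keys and URL-like strings, here is a small Python sketch of generic front coding. It is an assumption-laden toy, not Kudu's on-disk format, and the names prefix_encode and prefix_decode are hypothetical. Each key is stored as the number of leading characters it shares with the previous key, plus the remaining suffix.

```python
from typing import List, Tuple


def prefix_encode(sorted_keys: List[str]) -> List[Tuple[int, str]]:
    """Store each key as (shared prefix length with previous key, suffix)."""
    encoded: List[Tuple[int, str]] = []
    prev = ""
    for key in sorted_keys:
        shared = 0
        limit = min(len(prev), len(key))
        while shared < limit and prev[shared] == key[shared]:
            shared += 1
        encoded.append((shared, key[shared:]))
        prev = key
    return encoded


def prefix_decode(encoded: List[Tuple[int, str]]) -> List[str]:
    """Rebuild the original keys by reusing the previous key's prefix."""
    keys: List[str] = []
    prev = ""
    for shared, suffix in encoded:
        key = prev[:shared] + suffix
        keys.append(key)
        prev = key
    return keys


# Hypothetical sorted URL values, as might appear in clickstream logs.
keys = [
    "example.com/cart",
    "example.com/cart/checkout",
    "example.com/home",
]
encoded = prefix_encode(keys)
```

Because the keys are sorted, neighbours tend to share long prefixes, so only the short suffixes need to be stored; decoding walks the entries in order to rebuild each key.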
>> 3. According to the release notes for Dictionary encoding:
>> "If the column values of a given row set are unable to be compressed
>> because the number of unique values is too high, Kudu will transparently
>> fall back to plain encoding for that row set"
>> Is there any way to estimate the upper limit on the number of unique
>> values that dictionary encoding can handle? And when it falls back to
>> plain encoding as stated, does that apply only to the records inserted
>> after the limit is exceeded, or will Kudu automatically convert all values
>> (including existing ones) in the dictionary-encoded column to plain
>> encoding? Will there be any functional impact?
> This is all fully automatic, and the choice of encoding happens at a small
> block level, not at the entire table level. So even if you have a very
> large number of unique values globally across the table, if "nearby" rows
> (i.e., within a few MB of each other) have a low number of distinct
> elements, you will benefit from dictionary encoding.
> Dictionary compression is so often the correct choice for strings that
> I've been thinking we should probably make it the default :)
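The per-block choice described above can be sketched in Python as follows. This is only an illustration under stated assumptions: MAX_DICT_ENTRIES and the block size are invented values, not Kudu settings, and encode_block and encode_column are hypothetical names. Each block decides independently whether a dictionary is worthwhile, so a column with many distinct values overall can still be dictionary-encoded wherever nearby rows repeat.

```python
from typing import List, Tuple

# Hypothetical per-block limit, chosen only for this example.
MAX_DICT_ENTRIES = 2


def encode_block(values: List[str]) -> Tuple[str, object]:
    """Dictionary-encode one block, or fall back to plain for that block."""
    distinct = sorted(set(values))
    if len(distinct) > MAX_DICT_ENTRIES:
        # Too many distinct values in this block: keep the raw strings.
        return "plain", list(values)
    index = {v: i for i, v in enumerate(distinct)}
    return "dict", (distinct, [index[v] for v in values])


def encode_column(values: List[str], block_size: int) -> List[Tuple[str, object]]:
    """Split the column into fixed-size blocks and encode each one separately."""
    return [
        encode_block(values[i:i + block_size])
        for i in range(0, len(values), block_size)
    ]


# Eight distinct values overall, but the first two blocks repeat locally.
column = ["a", "a", "b", "a", "c", "d", "c", "c", "e", "f", "g", "h"]
blocks = encode_column(column, block_size=4)
kinds = [kind for kind, _ in blocks]
```

Here the first two blocks stay dictionary-encoded and only the third falls back to plain, and earlier blocks are not rewritten when a later block falls back.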
>> 4. Since gflags such as --cfile_do_on_finish=flush and --flush_threshold_mb
>> are now defaults in the latest versions, are there any other tuning flags
>> or configs that can help improve insert performance?
>> Also, for scans, we are using the ScanToken API and hash partitions, but
>> scan performance still seems slow. Can you please suggest anything else
>> that could be done at the configuration or implementation level to
>> improve scan performance?
> For inserts, there aren't any flags I can recommend that wouldn't have
> negative consequences. However, it's worth noting that the upcoming 1.1
> release will have a few optimizations on the write side that might increase
> your throughput substantially, especially if you're using Impala to drive
> the inserts.
> On the read path, the most important thing is to make sure you have enough
> partitions per node to get proper parallelism on the reads. But, there are
> a lot of factors. Can you quantify what you mean by "slow", and
> particularly what your point of reference is? Maybe share some sample
> queries and dataset characteristics?
> Todd Lipcon
> Software Engineer, Cloudera
Thanks & Regards,
*Amit Adhau* | Data Architect
*GLOBANT* | IND:+91 9821518132