Hi Kudu Team,

I have been testing dictionary and prefix encoding on Kudu tables. I created two tables with the same structure and the same data, and inserted 1 billion records into each, with an average record size close to 1 KB.

My observations so far:

On-disk storage: the table with encoded columns is substantially smaller than the table with non-encoded columns.

Scan performance: queries against the table with encoded columns always took less time than queries against the table with non-encoded columns.
Could you please help with the following questions?

1. Scans on encoded columns take less time. Is this expected behavior?

2. Just to confirm my understanding: with a composite primary key, prefix encoding should help on the first column, or the first few columns, where many values are likely to be the same. It might also suit a column such as the web page URL in clickstream logs. Is that correct?

3. The release note for dictionary encoding says: "If the column values of a given row set are unable to be compressed because the number of unique values is too high, Kudu will transparently fall back to plain encoding for that row set". Is there a way to estimate the upper limit on the number of unique values that dictionary encoding can handle? And when the fallback happens, does it apply only to records inserted after the limit is exceeded (so only those are stored with plain encoding), or does Kudu automatically convert all values of the dictionary-encoded column, including existing ones, to plain encoding? Is there any impact at the functional level?

4. Since gflag settings such as --cfile_do_on_finish=flush and --flush_threshold_mb are already the defaults in the latest versions, are there any other tuning flags or configuration options that can help improve insert performance? Also, on the scan side, we are using the ScanToken API and hash partitions, but scan performance still seems slow. Can you suggest anything else at the configuration or implementation level to improve it?
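To make question 2 concrete, here is a toy Python sketch of how I understand prefix encoding to work: each value in sorted order is stored as the length of the prefix it shares with the previous value plus the remaining suffix. This is only an illustration of the idea and not Kudu's actual on-disk format; the function names and the example URLs are hypothetical.

```python
import os


def prefix_encode(sorted_values):
    """Encode each string as (shared_prefix_length, suffix) relative to the previous one."""
    encoded = []
    prev = ""
    for value in sorted_values:
        # os.path.commonprefix compares the strings character by character.
        shared = len(os.path.commonprefix([prev, value]))
        encoded.append((shared, value[shared:]))
        prev = value
    return encoded


def prefix_decode(encoded):
    """Rebuild the original strings from (shared_prefix_length, suffix) pairs."""
    values = []
    prev = ""
    for shared, suffix in encoded:
        value = prev[:shared] + suffix
        values.append(value)
        prev = value
    return values


# Hypothetical clickstream URLs; sorted so that neighbours share long prefixes.
urls = sorted([
    "http://example.com/products/shoes/running",
    "http://example.com/products/shoes/walking",
    "http://example.com/products/shirts/casual",
    "http://example.com/about",
])

encoded = prefix_encode(urls)
plain_chars = sum(len(u) for u in urls)
suffix_chars = sum(len(suffix) for _, suffix in encoded)

# Characters stored without and with prefix sharing (ignoring the length fields).
print(plain_chars, suffix_chars)
```

If this picture is roughly right, the saving depends on adjacent values sharing long prefixes, which is why I expect the leading key columns and URL-like columns to benefit most.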
--
Thanks & Regards,
Amit Adhau | Data Architect
GLOBANT