Why does SASI index consume such a huge disk space?

zuochangan Tue, 02 Jan 2018 03:28:39 -0800

Hi all,

I use Zipkin (https://github.com/openzipkin/zipkin) to trace my system.


When I upgraded to the latest version (3.23, to be specific), I ran into a
problem: our monitoring keeps alerting that there is not enough disk space for
Cassandra.

After some investigation, I found that the biggest file is the index file. I
also found some blog posts on the subject, such as
http://www.doanduyhai.com/blog/?p=2058.

It says the following:


As we can see, using CONTAINS mode can increase the disk usage by x4 - x6. 
Since album titles tends to be a long text, the inflation rate is x6. It will 
be more if we chose the NonTokenizingAnalyzer because the StandardAnalyzer 
splits the text into tokens, remove stop words and perform stemming. All this 
help reducing the total size of the term.

As a conclusion, use CONTAINS mode wisely and be ready to pay the price in term 
of disk space. There is no way to avoid it. Even with efficient search engines 
like ElasticSearch or Solr, it is officially recommended to avoid substring 
search (LIKE %substring%) for the sake of performance.

 
Zipkin2 creates the index as follows:



CREATE CUSTOM INDEX IF NOT EXISTS ON zipkin2.span (annotation_query)
   USING 'org.apache.cassandra.index.sasi.SASIIndex'
   WITH OPTIONS = {
    'mode': 'CONTAINS',
    'analyzed': 'true',
    'analyzer_class': 'org.apache.cassandra.index.sasi.analyzer.NonTokenizingAnalyzer',
    'case_sensitive': 'false'
   };


I cannot understand why the index would use more disk space with
NonTokenizingAnalyzer than with StandardAnalyzer as the analyzer_class.

When I debug the code, only one term is returned when NonTokenizingAnalyzer is
used.
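
To make the question concrete, here is a rough back-of-the-envelope sketch in
Python. It assumes (my reading, not something I have confirmed in the SASI
code) that CONTAINS mode stores every suffix of each indexed term. The function
names, the sample annotation string, and the whitespace split standing in for
StandardAnalyzer are all made up for illustration; this is not how SASI is
actually implemented.

```python
def suffix_chars(term: str) -> int:
    """Total characters across all suffixes of term: n + (n-1) + ... + 1."""
    n = len(term)
    return n * (n + 1) // 2


def index_chars(terms) -> int:
    """Sum of suffix characters over the distinct terms."""
    return sum(suffix_chars(t) for t in set(terms))


# Hypothetical annotation_query value, not taken from a real trace.
value = "http.method=get error=timeout"

# Non-tokenizing: the whole lower-cased value is a single term.
whole = index_chars([value.lower()])

# Tokenizing stand-in: a crude whitespace split (assumption, far simpler
# than what StandardAnalyzer really does).
tokens = index_chars(value.lower().split())

print("single term:", whole, "tokenized:", tokens)
```

Under that assumption the cost per term grows roughly with the square of its
length, so one long term could cost more than several short ones even though
the analyzer returns only a single term.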



I need some help. Thanks a lot!
