Why does a SASI index consume so much disk space?

2018-01-02 Thread zuochangan
Hi all,

I use Zipkin (https://github.com/openzipkin/zipkin) to trace my system.

After I upgraded to the latest version (3.23, to be specific), our monitoring 
kept alerting that there was not enough disk space for Cassandra.

After some investigation, I found that the biggest file is the index file. I 
also found a blog post (http://www.doanduyhai.com/blog/?p=2058) which says the 
following:


As we can see, using CONTAINS mode can increase the disk usage by x4 - x6. 
Since album titles tends to be a long text, the inflation rate is x6. It will 
be more if we chose the NonTokenizingAnalyzer because the StandardAnalyzer 
splits the text into tokens, remove stop words and perform stemming. All this 
help reducing the total size of the term.

As a conclusion, use CONTAINS mode wisely and be ready to pay the price in term 
of disk space. There is no way to avoid it. Even with efficient search engines 
like ElasticSearch or Solr, it is officially recommended to avoid substring 
search (LIKE %substring%) for the sake of performance.

 
Zipkin2 creates the index as follows:



CREATE CUSTOM INDEX IF NOT EXISTS ON zipkin2.span (annotation_query)
   USING 'org.apache.cassandra.index.sasi.SASIIndex'
   WITH OPTIONS = {
      'mode': 'CONTAINS',
      'analyzed': 'true',
      'analyzer_class': 'org.apache.cassandra.index.sasi.analyzer.NonTokenizingAnalyzer',
      'case_sensitive': 'false'
   };


I cannot understand why the index would use more disk space when we choose 
NonTokenizingAnalyzer rather than StandardAnalyzer as the analyzer_class.

When I debugged the code, only one term was returned when using 
NonTokenizingAnalyzer.



Any help would be appreciated. Thanks a lot!








Re: Difference between drop and truncate

2017-12-20 Thread zuochangan
Thanks for the detailed explanation.

> On Dec 21, 2017, at 11:49 AM, Jeff Jirsa wrote:
> 
> Assuming you’re running 3.0 or 3.x, there’s a patch that’ll be in the next 
> releases that speeds up truncate significantly. There’s some slowish code in 
> adding the sstables to the transaction log before deleting them, but it’ll be 
> much faster. 
> 
> Truncate marks all the data as removed, and then removes it. It requires all 
> hosts be online. It removes the risk of conflicting CFIDs. It reduces races.
> 
> Dropping the table isn’t guaranteed to hit all hosts at the same time, and 
> recreating it won’t either. It’s far more likely you could have data lost if 
> you’re writing as you drop and recreate vs using truncate. 
> 
> 
> 
> -- 
> Jeff Jirsa
> 
> 
> On Dec 20, 2017, at 7:41 PM, changan zuo wrote:
> 
>> Hi guys,
>> We have a very big table. When I execute "truncate table", it takes a very 
>> long time and finally shows "request timeout".
>> If I execute drop table instead, it completes very quickly.
>> I don't see what the big difference is, since they both delete the data.
>>
>> Can anyone explain it to me? Thanks a lot.
>> 
>> 
>> 
>>