Thanks for the reply, Gopal. Very helpful.
On Thu, Aug 4, 2016 at 10:15 PM, Gopal Vijayaraghavan
wrote:
> > where res_url like '%mts.ru%'
> ...
> > where res_url like '%mts_ru%'
> ...
> > Why does the '_' wildcard decrease performance?
>
> Because it misses the fast path by just one "_".
>
> ORC vectorized reader has a zero-copy check for 3 patterns - prefix,
> suffix and middle.
>
> That means "https://%", "%.html", "%mts.ru%"
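
To make the prefix / suffix / middle distinction concrete, here is a minimal Python sketch of how a LIKE pattern might be sorted into those cases. It is only an illustration under my own assumptions, not Hive's actual code: the names classify_like and like_fast are hypothetical, and the general fallback matcher is presumed rather than shown.

```python
def classify_like(pattern: str) -> str:
    """Classify a SQL LIKE pattern. Hypothetical helper, not Hive's code."""
    # '_' (any single character) or an escape character rules out a plain string check.
    if "_" in pattern or "\\" in pattern:
        return "complex"
    literal = pattern.strip("%")
    # A '%' left in the interior, or a pattern made only of '%', is treated as complex here.
    if "%" in literal or (pattern and not literal):
        return "complex"
    leading, trailing = pattern.startswith("%"), pattern.endswith("%")
    if leading and trailing:
        return "middle"   # '%abc%': substring search
    if trailing:
        return "prefix"   # 'abc%': starts-with check
    if leading:
        return "suffix"   # '%abc': ends-with check
    return "exact"        # no wildcards at all


def like_fast(value: str, pattern: str):
    """Return True/False via a plain string check, or None if no fast path applies."""
    kind = classify_like(pattern)
    literal = pattern.strip("%")
    if kind == "middle":
        return literal in value
    if kind == "prefix":
        return value.startswith(literal)
    if kind == "suffix":
        return value.endswith(literal)
    if kind == "exact":
        return value == literal
    return None  # caller would fall back to a general (slower) matcher
```

In this sketch '%mts.ru%' is classified as "middle" and handled with a plain substring check, while '%mts_ru%' is classified as "complex" because of the single '_', so like_fast returns None and a slower general matcher would have to take over.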
I've got a Hive transactional table 'data_http' in ORC format, containing
around 100,000,000 rows.
When I execute the query:
select * from data_http
where res_url like '%mts.ru%'
it completes in 10 seconds.
But executing the query
select * from data_http
where res_url like '%mts_ru%'
takes more than