Thanks for the reply, Gopal. Very helpful.

On Thu, Aug 4, 2016 at 10:15 PM, Gopal Vijayaraghavan <gop...@apache.org> wrote:
> > where res_url like '%mts.ru%'
> ...
> > where res_url like '%mts_ru%'
> ...
> > Why does the '_' wildcard decrease performance?
>
> Because it misses the fast path by just one "_".
>
> ORC vectorized reader has a zero-copy check for 3 patterns - prefix,
> suffix and middle.
>
> That means "https://%", "%.html", "%mts.ru%" will hit the fast path -
> which uses StringExpr::equal() which JITs into the following.
>
> https://issues.apache.org/jira/secure/attachment/12748720/string-intrinsic-sse.png
>
> In Hive-2.0, you can mix these up too to get "https:%mts%.html" in a
> ChainedChecker.
>
> Anything other than these 3 cases becomes a Regex and takes the slow path.
>
> The pattern you mentioned gets rewritten into ".*mts.ru.*" and the inner
> loop has a new String() as the input to the matcher + matcher.matches() in
> it.
>
> I've put in some patches recently which rewrite it into lazy regexes like
> ".*?mts.ru.*?", so the regex DFA will be smaller (HIVE-13196).
>
> That improves the case where the pattern is found, but does nothing to
> improve the performance of the new String() GC garbage.
>
> Cheers,
> Gopal
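
For anyone finding this in the archive, here is how I picture the dispatch described above, as a rough Java sketch. This is not Hive's code: LikeFastPathSketch, classify, containsBytes, likeToRegex and regexMatch are names made up for illustration, and the sketch only models the three simple cases from the thread (no ChainedChecker, no escape handling). Anything it does not recognise is sent to the regex path, which is an assumption of the sketch rather than a statement about Hive.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

// Hypothetical illustration only; none of these names come from Hive.
final class LikeFastPathSketch {
    enum Kind { PREFIX, SUFFIX, MIDDLE, REGEX }

    // A pattern takes a simple check only if it has no '_' (or escape)
    // and its only '%' wildcards sit at the very start and/or end.
    static Kind classify(String like) {
        if (like.indexOf('_') >= 0 || like.indexOf('\\') >= 0) return Kind.REGEX;
        boolean lead = like.startsWith("%");
        boolean trail = like.endsWith("%");
        int from = lead ? 1 : 0;
        int to = trail ? like.length() - 1 : like.length();
        if (to <= from) return Kind.REGEX;               // "", "%", "%%"
        if (like.substring(from, to).indexOf('%') >= 0) return Kind.REGEX;
        if (lead && trail) return Kind.MIDDLE;
        if (trail) return Kind.PREFIX;
        if (lead) return Kind.SUFFIX;
        return Kind.REGEX;                               // no wildcard: not modelled here
    }

    // "Middle" check straight on the bytes: no String is allocated per row.
    static boolean containsBytes(byte[] buf, int start, int len, byte[] needle) {
        for (int i = start; i + needle.length <= start + len; i++) {
            int j = 0;
            while (j < needle.length && buf[i + j] == needle[j]) j++;
            if (j == needle.length) return true;
        }
        return false;
    }

    // Slow path, step 1: translate LIKE into a regex ('%' -> ".*", '_' -> ".").
    static Pattern likeToRegex(String like) {
        StringBuilder sb = new StringBuilder();
        for (char c : like.toCharArray()) {
            if (c == '%') sb.append(".*");
            else if (c == '_') sb.append('.');
            else sb.append(Pattern.quote(String.valueOf(c)));
        }
        return Pattern.compile(sb.toString(), Pattern.DOTALL);
    }

    // Slow path, step 2: every row pays for a new String plus a full regex match.
    static boolean regexMatch(Pattern p, byte[] buf, int start, int len) {
        String s = new String(buf, start, len, StandardCharsets.UTF_8);
        return p.matcher(s).matches();
    }

    public static void main(String[] args) {
        byte[] row = "http://www.mts.ru/index".getBytes(StandardCharsets.UTF_8);
        for (String like : new String[] {"%mts.ru%", "%mts_ru%"}) {
            Kind kind = classify(like);
            boolean hit = (kind == Kind.MIDDLE)
                ? containsBytes(row, 0, row.length,
                      like.substring(1, like.length() - 1).getBytes(StandardCharsets.UTF_8))
                : regexMatch(likeToRegex(like), row, 0, row.length);
            System.out.println(like + " -> " + kind + ", match=" + hit);
        }
    }
}
```

The point of the sketch is the difference between the two branches in main: the middle case compares bytes in place, while the '_' case builds a String for every row before the matcher even starts, which is the GC cost mentioned above.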