[Bug 61463] New: Gerrit search: enable secondary index to allow latest operators

bugzilla-daemon Mon, 17 Feb 2014 06:22:25 -0800

https://bugzilla.wikimedia.org/show_bug.cgi?id=61463


            Bug ID: 61463
           Summary: Gerrit search: enable secondary index to allow latest
                    operators
           Product: Wikimedia
           Version: unspecified
          Hardware: All
               URL: https://gerrit.wikimedia.org/r/#/q/comment:pginer,n,z
                OS: All
            Status: NEW
          Severity: normal
          Priority: Unprioritized
         Component: Git/Gerrit
          Assignee: [email protected]
          Reporter: [email protected]
                CC: [email protected], [email protected],
                    [email protected], [email protected],
                    [email protected], [email protected],
                    [email protected]
            Blocks: 36467, 49674
       Web browser: ---
   Mobile Platform: ---

Code Review - Error
secondary index must be enabled for comment:pginer
----

Also for operators message: and file: (hence tentatively marking as blocker of
bug 36467, bug 49674).

I hear that «as part of the secondary indexing change its no longer allowed
to run in its naive form. I think that is OK, we are going to
encourage everyone to turn on secondary indexing.»
https://groups.google.com/forum/#!topic/repo-discuss/rCS2ooVEpbg

Is this the infamous Lucene search? If yes, is this a WONTFIX or what blocks it
other than caching
(<https://code.google.com/p/gitblit/issues/detail?id=274#c9>)?

P.s.: Pasting here IRC discussion about the URL above for lack of a better
place. http://bots.wmflabs.org/~wm-bot/logs/%23wikimedia-dev/20140217.txt

[10:37:35] <ori>     Krinkle: have you seen mwgrep yet?
[10:38:23] <ori>     ssh to tin and run: mwgrep mw.user.anon
[10:40:40] <hoo>     ori: ooowwwm, super handy... thanks :)
[10:41:09] <hoo>     I've been using a hacky bash script for that until now :P
[10:41:38] <ori>     how did it work?
[10:42:06] <hoo>     ori: It basically just pulled in all JS pages I wan
interested in (mostly common.js) and ran grep on them :P
[10:42:11] <hoo>     * was
[10:42:14] <ori>     ahh
[10:48:49] <Krinkle>     ori: how performant is that (how much is search
engine, how much is local script), you think we could expose this publicly
somehow? 
[10:49:28] <Krinkle>     I wrote two separate toolserver tools over the past
years to do essentially that, all of them extremely hacky and slow.
[10:49:44] <hoo>     hoo@terbium:~$ time mwgrep 453894375dfs
[10:49:44] <hoo>     [...]
[10:49:44] <hoo>     real    0m0.176s
[10:49:52] <ori>     the local script is basically constructing the URL for a
request; it's all elasticsearch
[10:50:09] <Krinkle>     couldn't use any API or search backend for it at the
time, as it mangled code quite badly (punctuation normalisation etc.), and no
way to search multiple wikis
[10:50:11] <ori>     nik (manybubbles) said the load it generates is laughably
small
[10:50:28] <Krinkle>     yeah, looked at the source code already
[10:50:41] <Krinkle>     wasn't sure where ""_source['text'].contains('%s')" %
args.term" ran though
[10:50:42] <hoo>     ori: Does that service used in there have a public host?
[10:50:58] <hoo>     http://search.svc.eqiad.wmnet:9200
[10:51:09] <ori>     you could change it to a regexp search and as long as it
was restricted to the mw ns it would be fine (again, per nik)
[10:51:53] <ori>     it's not publicly-accessible; pathological queries will
run it into the ground, so you need to have a layer in front that restricts the
types of searches
[10:52:02] <ori>     in our case, the mediawiki search interface
[10:52:43] <Krinkle>     ori: Hm.. it currently doesn't take regex though,
right? Would need a minor change to the script. Just verifying I'm not using it
incorrectly
[10:52:56] <ori>     yeah, doesn't take a regexp at the moment
[10:53:29] <ori>     because this particular query is not risky (because ns:8,
title ends with 'js' or 'css' restricts the pool of candidates), we could have
a public front-end for it, but chad's plan is to make the entire elastic index
queryable from labs
[10:53:40] <Nemo_bis>     hoo: does this help the non-protocol-relative scripts
searches?
[10:54:07] <hoo>     Nemo_bis: Not much w/o regexp support, yet
[10:54:13] <Nemo_bis>     ok
[10:54:29] <hoo>     But I already have quite a list of stuff to clean up which
I gathered using my old bash script
[10:55:52] <ori>     Krinkle: http://p.defau.lt/?Yswucu3jR5S9QqZfFre7Vg is the
query (first line is path, everything below is the json query, which should be
sent as POST payload)
[10:56:56] <ori>     Krinkle, hoo: you can replace 'contains' with 'matches' to
make it a regexp search
[10:57:29] <Krinkle>     ori: what script language is that?
[10:57:44] <hoo>     python
[10:57:48] <Krinkle>    
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-script-filter.html
[10:57:56] <Krinkle>     ok
[10:58:24] <ori>     Krinkle: no, MVEL
[10:58:26] <ori>     <http://github.com/mvel/mvel>
[10:58:36] <ori>     context: http://p.defau.lt/?EZ6aPeZRMWdjZtwpOnFgWQ
[10:58:53] <Krinkle>     yeah, it says that
[10:59:05] <Krinkle>     <a>scripts</a> ->
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-scripting.html
[10:59:09] <Krinkle>     uses by default mvel 
[10:59:24] <ori>     ah, yep
[11:00:27] <hoo>     ori: mh... regexp don't seem to work after I did the
change you suggested...
[11:00:34] <hoo>     or aren't that perl-style regexp?
[11:02:18] <Krinkle>     Im trying to find docs on mvel regarding available
string methods
[11:02:20] <Krinkle>     no luck
[11:03:18] <Krinkle>     http://mvel.codehaus.org/MVEL+2.0+Operators shows a
different syntax ( foo contains "x", and, foo ~= '[a-z].+' )
[11:03:47] <ori>     Krinkle: "By default, like Java, MVEL provides all
java.lang.* classes pre-imported as part of the compiler and runtime."
[11:03:54] <Krinkle>     right
[11:04:23] <ori>     so since _source['text'] is a string, you can call
http://docs.oracle.com/javase/6/docs/api/java/lang/String.html#matches(java.lang.String)
[11:05:06] <ori>     I don't know how that is functionally different from using
MVEL's ~= operator
[11:05:10] <Krinkle>     Ah, this is going to be important to document and
protect against. Working 3 or 4 levels of abstraction deep to end up evaluating
java.
[11:05:22] <Krinkle>     java, mvel, json, http
[11:05:52] <ori>     elastic's http interface is designed to be a private api
tho
[11:06:00] <Krinkle>     yeah
[11:06:10] <ori>     but yeah, it is a bit like a matryoshka doll
[11:06:29] <ori>    
https://upload.wikimedia.org/wikipedia/commons/4/45/7babushkadolls.JPG
[11:06:33] <Krinkle>     except that its bigger on the inside
[11:06:34] <ori>     i'll let you decide which one is java :P
[11:06:51] <Krinkle>     http < json < mvel < java
[11:07:16] <Krinkle>     well, kinda sorta
[11:07:17] <Krinkle>     nvm
[11:07:41] <ori>     don't forget shell-style argv parsing > json
[11:08:06] <Krinkle>     yeah, but out final API isn't going to do that :P
[11:08:31] <ori>     at least i hope not
[11:08:59] <Krinkle>     ApiAwesome.php, $apiResult = `"/bin/mwgrep
escapeshellarg($_GET['q'])"` 
[11:09:25] <ori>     I'm sure we could come up with something more hideous than
that
[11:09:33] <ori>     it needs Puppet
[11:09:41] <ori>     and ideally also XML
[11:10:00] <Krinkle>     and php serialise/wakeup
[11:10:37] <ori>     see also
<http://stackoverflow.com/questions/7797217/how-to-index-source-code-with-elasticsearch>
for the "proper" way to do code searches without resulting to this sort of
trickery
[11:12:17] <ori>     basically teach lucene to run
<https://code.google.com/p/google-caja/source/browse/trunk/src/com/google/caja/parser/js/Parser.java>
against the text and index names
[11:12:35] * ori  chuckles at 'import
com.google.caja.SomethingWidgyHappenedError;'
[11:13:56] <Nemo_bis>     If ElasticSearch can be used for proper code
searches, maybe gitblit can be convinced to use it instead of Lucene?
[11:14:23] <Nemo_bis>     If yes please file a bug upstream, the request for
gitweb-like grep search was rejected.
[11:14:37] <ori>     elastic uses lucene under the hood
[11:16:04] <ori>     I've been poking at its overall performance
(<https://code.google.com/p/gitblit/issues/detail?id=274#c9>), it's hideously
slow atm
[11:18:37] <Nemo_bis>     I know it uses Lucene, hence I'd think it would be
slightly easier to convince them to adopt it. :)
[11:19:21] <Nemo_bis>     Oh, thanks for the link, I thought I had starred that
one.
[11:22:47] <Nemo_bis>     And thanks for working on that

-- 
You are receiving this mail because:
You are the assignee for the bug.
You are on the CC list for the bug.
_______________________________________________
Wikibugs-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikibugs-l

[Bug 61463] New: Gerrit search: enable secondary index to allow latest operators

Reply via email to