I forgot to mention that Drill had one sweet trick that Presto didn't re: Mongo -
I assume (w/o digging into to the code) that you must've "aliased" an un-filtered SELECT COUNT(*) to use Mongo's db.<collection>.count() (or something...) because it would return immediately with the correct answer, whereas Presto was again doing some kind of big scan or whatever and never returning. > On March 6, 2020 at 12:25 AM Ron Cecchini <[email protected]> wrote: > > > Thanks as always, Paul. > > Well, technically Drill performs *better* than the (official) Presto, because > the prestosql/presto image's mongodb connector is broken at the current > time... > > I've been using the starburstdata/presto image because it's apparently more > stable. I'm still a little confused how they achieve that because I > *thought* they were building from the same (prestosql) code base, with just > some configuration changes, but I think they must've modified code as well. > > Anyway, as for what kind of query, it's just a fairly vanilla query with a > few range tests. > > The (cleaned up and sanitized) Mongo query looks something like: > > { "$and" : > [ > { "id" : { "$lt" : 1000 } }, > { "d" : { "$gte" : { "$date" : 1578000000000 } } }, > { "d" : { "$lte" : { "$date" : 1578200000000 } } }, > { "x" : { "$gte" : 1.0 } }, > { "x" : { "$lte" : 4.0 } }, > { "y" : { "$gte" : 5.0 } }, > { "y" : { "$lte" : 8.0 } } > ] > } > > and then the SQL becomes something simple like: > > SELECT * FROM mongodb.<schema>."<table.name>" > WHERE ( id < 1000 ) > AND ( d BETWEEN '2020-01-01T12:00:00Z' AND '2020-01-02T12:00:00Z' ) > AND ( x BETWEEN 1.0 AND 4.0 ) > AND ( y BETWEEN 5.0 AND 8.0 ) > OFFSET 0 LIMIT 5000 > > > On March 5, 2020 at 2:31 AM Paul Rogers <[email protected]> wrote: > > > > > > Hi Ron, > > > > Sounds like the good news is that Drill is about as good as Presto when > > querying Mongo. Sounds like the bad news is that both are equally > > deficient. On the other hand, the other good news is that better > > performance is just a matter of adding additional planning rules (with > > perhaps some Mongo metadata.) > > > > > > The Wikipedia page for Mongo [1] suggests several features that Mongo > > (Simba) is probably using in their own JDBC driver, but which Drill > > probably does not use: > > > > * Primary and secondary indices > > * Field, range query, and regular-expression searches > > * User-defined JavaScript functions > > * Three ways to perform aggregation: the aggregation pipeline, the > > map-reduce function, and single-purpose aggregation methods. > > > > My guess is that the Mongo JDBC driver does thorough planning to exploit > > each of the above functions, while Drill may use only a few. We already > > noted other weaknesses in the filter push-down code for the Drill Mongo > > plugin. Seems fixable if we can put in the effort. > > > > > > Seems Mongo provides a Simba JDBC driver, which is proprietary, so no > > source code is available we could use as a "cheat sheet" to see what's what. > > > > > > Just out of curiosity, what is the query that works well with the Mongo > > JDBC driver, but poorly with Drill? > > > > Anybody know more about how Mongo works and what Drill might be missing? > > > > > > Thanks, > > - Paul > > > > [1] https://en.wikipedia.org/wiki/MongoDB > > > > > > > > > > > > On Wednesday, March 4, 2020, 9:28:44 PM PST, Ron Cecchini > > <[email protected]> wrote: > > > > Hi, guys. > > > > This is actually more of a Mongo question than a Drill-specific question as > > it also applies to Presto + Mongo, and the vanilla Mongo shell as well. > > > > I'm asking here, though, because, well, I'm curious, and because you're the > > database geniuses... > > > > So, I essentially get why a NoSQL database, in general, wouldn't be as > > performant as a SQL one at "relational" things. From what I gather, there > > are denormalization and optimization techniques and tricks you can use to > > speed up a Mongo query and so forth, but my question is: > > > > Why is it that any Drill/Presto + Mongo CLI or JDBC query against a large > > collection (100-200 million documents) that includes even a single WHERE > > clause, or the Mongo equivalent query made via Mongo shell, basically never > > returns and has to be killed, whereas the same (Mongo equivalent) query > > against the same collection made via *Mongo's* JDBC driver takes only a > > second or two? > > > > Is the Mongo JDBC using some indexing that the others aren't? (But how > > would that explain Mongo shell's non-performance... Why doesn't Mongo > > shell just make a JDBC call to the db...) > > > > Thank you in advance for educating me. > > > > Ron > >
