Re: Drill + Mongo

Ron Cecchini Fri, 06 Mar 2020 05:55:44 -0800

I forgot to mention that Drill had one sweet trick that Presto didn't re: Mongo 
-


I assume (w/o digging into to the code) that you must've "aliased" an 
un-filtered SELECT COUNT(*) to use Mongo's db.<collection>.count() (or 
something...) because it would return immediately with the correct answer, 
whereas Presto was again doing some kind of big scan or whatever and never 
returning.

> On March 6, 2020 at 12:25 AM Ron Cecchini <[email protected]> wrote:
> 
> 
> Thanks as always, Paul.
> 
> Well, technically Drill performs *better* than the (official) Presto, because 
> the prestosql/presto image's mongodb connector is broken at the current 
> time...
> 
> I've been using the starburstdata/presto image because it's apparently more 
> stable.  I'm still a little confused how they achieve that because I 
> *thought* they were building from the same (prestosql) code base, with just 
> some  configuration changes, but I think they must've modified code as well.
> 
> Anyway, as for what kind of query, it's just a fairly vanilla query with a 
> few range tests.
> 
> The (cleaned up and sanitized) Mongo query looks something like:
> 
> { "$and" :
>   [ 
>     { "id" : { "$lt"  : 1000 } },
>     { "d"  : { "$gte" : { "$date" : 1578000000000 } } },
>     { "d"  : { "$lte" : { "$date" : 1578200000000 } } },
>     { "x"  : { "$gte" : 1.0 } },
>     { "x"  : { "$lte" : 4.0 } },
>     { "y"  : { "$gte" : 5.0 } },
>     { "y"  : { "$lte" : 8.0 } }
>   ]
> }
> 
> and then the SQL becomes something simple like:
> 
> SELECT * FROM mongodb.<schema>."<table.name>"
> WHERE ( id < 1000 )
>   AND ( d BETWEEN '2020-01-01T12:00:00Z' AND '2020-01-02T12:00:00Z' )
>   AND ( x BETWEEN 1.0 AND 4.0 )
>   AND ( y BETWEEN 5.0 AND 8.0 )
> OFFSET 0 LIMIT 5000
> 
> > On March 5, 2020 at 2:31 AM Paul Rogers <[email protected]> wrote:
> > 
> > 
> > Hi Ron,
> > 
> > Sounds like the good news is that Drill is about as good as Presto when 
> > querying Mongo. Sounds like the bad news is that both are equally 
> > deficient. On the other hand, the other good news is that better 
> > performance is just a matter of adding additional planning rules (with 
> > perhaps some Mongo metadata.)
> > 
> > 
> > The Wikipedia page for Mongo [1] suggests several features that Mongo 
> > (Simba) is probably using in their own JDBC driver, but which Drill 
> > probably does not use:
> > 
> > * Primary and secondary indices
> > * Field, range query, and regular-expression searches
> > * User-defined JavaScript functions
> > * Three ways to perform aggregation: the aggregation pipeline, the 
> > map-reduce function, and single-purpose aggregation methods.
> > 
> > My guess is that the Mongo JDBC driver does thorough planning to exploit 
> > each of the above functions, while Drill may use only a few. We already 
> > noted other weaknesses in the filter push-down code for the Drill Mongo 
> > plugin. Seems fixable if we can put in the effort.
> > 
> > 
> > Seems Mongo provides a Simba JDBC driver, which is proprietary, so no 
> > source code is available we could use as a "cheat sheet" to see what's what.
> > 
> > 
> > Just out of curiosity, what is the query that works well with the Mongo 
> > JDBC driver, but poorly with Drill?
> > 
> > Anybody know more about how Mongo works and what Drill might be missing?
> > 
> > 
> > Thanks,
> > - Paul
> > 
> > [1] https://en.wikipedia.org/wiki/MongoDB
> > 
> > 
> > 
> >  
> > 
> >     On Wednesday, March 4, 2020, 9:28:44 PM PST, Ron Cecchini 
> > <[email protected]> wrote:  
> >  
> >  Hi, guys.
> > 
> > This is actually more of a Mongo question than a Drill-specific question as 
> > it also applies to Presto + Mongo, and the vanilla Mongo shell as well.
> > 
> > I'm asking here, though, because, well, I'm curious, and because you're the 
> > database geniuses...
> > 
> > So, I essentially get why a NoSQL database, in general, wouldn't be as 
> > performant as a SQL one at "relational" things.  From what I gather, there 
> > are denormalization and optimization techniques and tricks you can use to 
> > speed up a Mongo query and so forth, but my question is:
> > 
> > Why is it that any Drill/Presto + Mongo CLI or JDBC query against a large 
> > collection (100-200 million documents) that includes even a single WHERE 
> > clause, or the Mongo equivalent query made via Mongo shell, basically never 
> > returns and has to be killed, whereas the same (Mongo equivalent) query 
> > against the same collection made via *Mongo's* JDBC driver takes only a 
> > second or two?
> > 
> > Is the Mongo JDBC using some indexing that the others aren't?  (But how 
> > would that explain Mongo shell's non-performance...  Why doesn't Mongo 
> > shell just make a JDBC call to the db...)
> > 
> > Thank you in advance for educating me.
> > 
> > Ron
> >

Re: Drill + Mongo

Reply via email to