Re: Slow query on parquet imported from SQL Server while the external SQL server is down.

Abhishek Girish Thu, 01 Dec 2016 08:49:15 -0800

AFAIK, should apply to all queries, irrespective of the source of the data
or the plugins involved within the query. So when this issue occurs, I
would expect any query to take long to execute.


On Thu, Dec 1, 2016 at 5:47 AM John Omernik <[email protected]> wrote:

> @Abhishek,
>
> Do you think the issue is related to any storage plugin that is enabled and
> not available as it applies to all queries?  I guess if it's an issue where
> all queries are slow because the foreman is waiting to initialize ALL
> storage plugins, regardless of their applicability to the queried data,
> then that is a more general issue (that should still be resolved, does the
> foreman need to initialize all plugins before querying specific data?)
>  However, I am still concerned that the query on the CTAS parquet data is
> specifically slower because of it's source.  @Rahul could you test a
> different Parquet table, NOT loaded from the SQL server to see if the
> enabling or disabling the JDBC storage plugin (with the server unavailable)
> has any impact?  Basically, I want to ensure that data that is created as a
> Parquet table via CTAS is 100% free of any links to the source data. This
> is EXTREMELY important.
>
> John
>
>
>
> On Thu, Dec 1, 2016 at 12:46 AM, Abhishek Girish <
> [email protected]>
> wrote:
>
> > Thanks for the update, Rahul!
> >
> > On Wed, Nov 30, 2016 at 9:45 PM Rahul Raj <
> [email protected]
> > >
> > wrote:
> >
> > > Abhishek,
> > >
> > > Your observation is correct, we just verified that:
> > >
> > >    1. The queries run as expected(faster) with Jdbc plugin disabled.
> > >    2. Queries run as expected when the plugin's datasource is running.
> > >    3. With the datasource down, queries run very slow waiting for the
> > >    connection to fail
> > >
> > > Rahul
> > >
> > > On Thu, Dec 1, 2016 at 10:07 AM, Abhishek Girish <
> > > [email protected]>
> > > wrote:
> > >
> > > > @John,
> > > >
> > > > I agree that this should work. While I am not certain, I don't think
> > the
> > > > issue is specific to a particular plugin, but the way in a query's
> > > > lifecycle, the foreman attempts to initialize every enabled storage
> > > plugin
> > > > before proceeding to execute the query. So when a particular plugin
> > isn't
> > > > configured correctly or the underlying datasource is not up, this
> could
> > > > drastically slow down the query execution time.
> > > >
> > > > I'll look up to see if we have a JIRA for this already - if not will
> > file
> > > > one.
> > > >
> > > > On Wed, Nov 30, 2016 at 8:12 AM, John Omernik <[email protected]>
> > wrote:
> > > >
> > > > > So just my opinion in reading this thread.  (sorry for swooping in
> an
> > > > > opining)
> > > > >
> > > > > If a CTAS is done from any data source into Parquet files.... there
> > > > should
> > > > > be NO dependency on the original data source to query the resultant
> > > > Parquet
> > > > > files.   As a Drill user, as a Drill admin, this breaks the concept
> > of
> > > > > least surprise.  If I take data from one source, and create Parquet
> > > files
> > > > > in a distributed file system, it should just work.  If there are
> > > "issues"
> > > > > with JDBC plugins or the HBase/Hive plugins in a similar manner,
> > these
> > > > > needs to be hunted down by a large group of villages with
> pitchforks
> > > and
> > > > > torches.  I just can't see how this could be acceptable at any
> level.
> > > The
> > > > > whole idea of Parquet files is they are self describing, schema
> > > included
> > > > > files.... thus a read of a directory of Parquet files should have
> NO
> > > > > dependancies on anything but the parquet files... even the Parquet
> > > > > "additions" (such as the METADATA Cache) should be a fail open
> > thing...
> > > > if
> > > > > it exists great, use it, speed things up, but if it doesn't read
> the
> > > > > parquet files as normal (Which I believe is how it operates)
> > > > >
> > > > > John
> > > > >
> > > > > On Wed, Nov 30, 2016 at 12:12 AM, Abhishek Girish <
> > > > > [email protected]
> > > > > > wrote:
> > > > >
> > > > > > Can you attempt to disable to jdbc plugin (configured with
> > SQLServer)
> > > > and
> > > > > > try the query (on parquet) when SQL Server is offline?
> > > > > >
> > > > > > I've seen a similar issue previously when the HBase / Hive plugin
> > was
> > > > > > enabled but either the plugin configuration was wrong or the
> > > underlying
> > > > > > data source was down.
> > > > > >
> > > > > > On Fri, Nov 25, 2016 at 3:21 AM, Rahul Raj
> > > > <rahul.raj@option3consulting.
> > > > > > com>
> > > > > > wrote:
> > > > > >
> > > > > > > I have created a parquet file using CTAS from a MS SQL Server.
> > The
> > > > > query
> > > > > > on
> > > > > > > parquet is getting stuck in STARTING state for a long time
> before
> > > > > > returning
> > > > > > > the results.
> > > > > > >
> > > > > > > We could see that drill was trying to connect to the MS SQL
> > server
> > > > from
> > > > > > > which the data was imported. The MSSQL server was down, drill
> > threw
> > > > an
> > > > > > > exception "Failure while attempting to load JDBC schema", and
> > then
> > > > > > returned
> > > > > > > the results. While SQL server is running, the query executes
> > > without
> > > > > > > issues.
> > > > > > >
> > > > > > > Why is drill querying the DB metadata externally and not the
> > > imported
> > > > > > > parquets?
> > > > > > >
> > > > > > > Rahul.
> > > > > > >
> > > > > > > --
> > > > > > > **** This email and any files transmitted with it are
> > confidential
> > > > and
> > > > > > > intended solely for the use of the individual or entity to whom
> > it
> > > is
> > > > > > > addressed. If you are not the named addressee then you should
> not
> > > > > > > disseminate, distribute or copy this e-mail. Please notify the
> > > sender
> > > > > > > immediately and delete this e-mail from your system.****
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> > > --
> > > **** This email and any files transmitted with it are confidential and
> > > intended solely for the use of the individual or entity to whom it is
> > > addressed. If you are not the named addressee then you should not
> > > disseminate, distribute or copy this e-mail. Please notify the sender
> > > immediately and delete this e-mail from your system.****
> > >
> >
>

Re: Slow query on parquet imported from SQL Server while the external SQL server is down.

Reply via email to