Re: Slow query on parquet imported from SQL Server while the external SQL server is down.

Abhishek Girish Thu, 01 Dec 2016 10:59:54 -0800

Thanks for confirming Padma.

I've filed DRILL-5089 <https://issues.apache.org/jira/browse/DRILL-5089> to
track this issue.


On Thu, Dec 1, 2016 at 9:50 AM, Padma Penumarthy <[email protected]>
wrote:

> Yes, for every query, we build schema tree by trying to initialize
> all storage plugins and workspaces in them, regardless of schema
> configuration
> and/or applicability to data being queried. Go ahead and file a JIRA.
> We are looking into fixing this.
>
> Thanks,
> Padma
>
>
> > On Dec 1, 2016, at 8:48 AM, Abhishek Girish <[email protected]>
> wrote:
> >
> > AFAIK, should apply to all queries, irrespective of the source of the
> data
> > or the plugins involved within the query. So when this issue occurs, I
> > would expect any query to take long to execute.
> >
> > On Thu, Dec 1, 2016 at 5:47 AM John Omernik <[email protected]> wrote:
> >
> >> @Abhishek,
> >>
> >> Do you think the issue is related to any storage plugin that is enabled
> and
> >> not available as it applies to all queries?  I guess if it's an issue
> where
> >> all queries are slow because the foreman is waiting to initialize ALL
> >> storage plugins, regardless of their applicability to the queried data,
> >> then that is a more general issue (that should still be resolved, does
> the
> >> foreman need to initialize all plugins before querying specific data?)
> >> However, I am still concerned that the query on the CTAS parquet data is
> >> specifically slower because of it's source.  @Rahul could you test a
> >> different Parquet table, NOT loaded from the SQL server to see if the
> >> enabling or disabling the JDBC storage plugin (with the server
> unavailable)
> >> has any impact?  Basically, I want to ensure that data that is created
> as a
> >> Parquet table via CTAS is 100% free of any links to the source data.
> This
> >> is EXTREMELY important.
> >>
> >> John
> >>
> >>
> >>
> >> On Thu, Dec 1, 2016 at 12:46 AM, Abhishek Girish <
> >> [email protected]>
> >> wrote:
> >>
> >>> Thanks for the update, Rahul!
> >>>
> >>> On Wed, Nov 30, 2016 at 9:45 PM Rahul Raj <
> >> [email protected]
> >>>>
> >>> wrote:
> >>>
> >>>> Abhishek,
> >>>>
> >>>> Your observation is correct, we just verified that:
> >>>>
> >>>>   1. The queries run as expected(faster) with Jdbc plugin disabled.
> >>>>   2. Queries run as expected when the plugin's datasource is running.
> >>>>   3. With the datasource down, queries run very slow waiting for the
> >>>>   connection to fail
> >>>>
> >>>> Rahul
> >>>>
> >>>> On Thu, Dec 1, 2016 at 10:07 AM, Abhishek Girish <
> >>>> [email protected]>
> >>>> wrote:
> >>>>
> >>>>> @John,
> >>>>>
> >>>>> I agree that this should work. While I am not certain, I don't think
> >>> the
> >>>>> issue is specific to a particular plugin, but the way in a query's
> >>>>> lifecycle, the foreman attempts to initialize every enabled storage
> >>>> plugin
> >>>>> before proceeding to execute the query. So when a particular plugin
> >>> isn't
> >>>>> configured correctly or the underlying datasource is not up, this
> >> could
> >>>>> drastically slow down the query execution time.
> >>>>>
> >>>>> I'll look up to see if we have a JIRA for this already - if not will
> >>> file
> >>>>> one.
> >>>>>
> >>>>> On Wed, Nov 30, 2016 at 8:12 AM, John Omernik <[email protected]>
> >>> wrote:
> >>>>>
> >>>>>> So just my opinion in reading this thread.  (sorry for swooping in
> >> an
> >>>>>> opining)
> >>>>>>
> >>>>>> If a CTAS is done from any data source into Parquet files.... there
> >>>>> should
> >>>>>> be NO dependency on the original data source to query the resultant
> >>>>> Parquet
> >>>>>> files.   As a Drill user, as a Drill admin, this breaks the concept
> >>> of
> >>>>>> least surprise.  If I take data from one source, and create Parquet
> >>>> files
> >>>>>> in a distributed file system, it should just work.  If there are
> >>>> "issues"
> >>>>>> with JDBC plugins or the HBase/Hive plugins in a similar manner,
> >>> these
> >>>>>> needs to be hunted down by a large group of villages with
> >> pitchforks
> >>>> and
> >>>>>> torches.  I just can't see how this could be acceptable at any
> >> level.
> >>>> The
> >>>>>> whole idea of Parquet files is they are self describing, schema
> >>>> included
> >>>>>> files.... thus a read of a directory of Parquet files should have
> >> NO
> >>>>>> dependancies on anything but the parquet files... even the Parquet
> >>>>>> "additions" (such as the METADATA Cache) should be a fail open
> >>> thing...
> >>>>> if
> >>>>>> it exists great, use it, speed things up, but if it doesn't read
> >> the
> >>>>>> parquet files as normal (Which I believe is how it operates)
> >>>>>>
> >>>>>> John
> >>>>>>
> >>>>>> On Wed, Nov 30, 2016 at 12:12 AM, Abhishek Girish <
> >>>>>> [email protected]
> >>>>>>> wrote:
> >>>>>>
> >>>>>>> Can you attempt to disable to jdbc plugin (configured with
> >>> SQLServer)
> >>>>> and
> >>>>>>> try the query (on parquet) when SQL Server is offline?
> >>>>>>>
> >>>>>>> I've seen a similar issue previously when the HBase / Hive plugin
> >>> was
> >>>>>>> enabled but either the plugin configuration was wrong or the
> >>>> underlying
> >>>>>>> data source was down.
> >>>>>>>
> >>>>>>> On Fri, Nov 25, 2016 at 3:21 AM, Rahul Raj
> >>>>> <rahul.raj@option3consulting.
> >>>>>>> com>
> >>>>>>> wrote:
> >>>>>>>
> >>>>>>>> I have created a parquet file using CTAS from a MS SQL Server.
> >>> The
> >>>>>> query
> >>>>>>> on
> >>>>>>>> parquet is getting stuck in STARTING state for a long time
> >> before
> >>>>>>> returning
> >>>>>>>> the results.
> >>>>>>>>
> >>>>>>>> We could see that drill was trying to connect to the MS SQL
> >>> server
> >>>>> from
> >>>>>>>> which the data was imported. The MSSQL server was down, drill
> >>> threw
> >>>>> an
> >>>>>>>> exception "Failure while attempting to load JDBC schema", and
> >>> then
> >>>>>>> returned
> >>>>>>>> the results. While SQL server is running, the query executes
> >>>> without
> >>>>>>>> issues.
> >>>>>>>>
> >>>>>>>> Why is drill querying the DB metadata externally and not the
> >>>> imported
> >>>>>>>> parquets?
> >>>>>>>>
> >>>>>>>> Rahul.
> >>>>>>>>
> >>>>>>>> --
> >>>>>>>> **** This email and any files transmitted with it are
> >>> confidential
> >>>>> and
> >>>>>>>> intended solely for the use of the individual or entity to whom
> >>> it
> >>>> is
> >>>>>>>> addressed. If you are not the named addressee then you should
> >> not
> >>>>>>>> disseminate, distribute or copy this e-mail. Please notify the
> >>>> sender
> >>>>>>>> immediately and delete this e-mail from your system.****
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>> --
> >>>> **** This email and any files transmitted with it are confidential and
> >>>> intended solely for the use of the individual or entity to whom it is
> >>>> addressed. If you are not the named addressee then you should not
> >>>> disseminate, distribute or copy this e-mail. Please notify the sender
> >>>> immediately and delete this e-mail from your system.****
> >>>>
> >>>
> >>
>
>

Re: Slow query on parquet imported from SQL Server while the external SQL server is down.

Reply via email to