Re: Slow query on parquet imported from SQL Server while the external SQL server is down.

Padma Penumarthy Thu, 01 Dec 2016 09:50:46 -0800

Yes, for every query, we build the schema tree by trying to initialize
all storage plugins and the workspaces in them, regardless of schema configuration
and/or applicability to the data being queried. Go ahead and file a JIRA.
We are looking into fixing this.
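
To illustrate the behavior described above, here is a minimal sketch in Python. It is not Drill's actual code: the Plugin class, the function names, and the plugin names "dfs" and "mssql" are all hypothetical, and the 0.3 second sleep merely stands in for a connection attempt that has to time out. It contrasts building the schema tree eagerly (every enabled plugin) with initializing only the plugins a query references.

```python
import time


class Plugin:
    """Hypothetical stand-in for a storage plugin; not Drill's real class."""

    def __init__(self, name, init_seconds):
        self.name = name
        self.init_seconds = init_seconds  # simulated schema registration time

    def register_schemas(self):
        # Simulates, for example, waiting for a JDBC connection to time out.
        time.sleep(self.init_seconds)
        return self.name


def build_schema_tree_eager(plugins):
    """Initialize every enabled plugin, whether or not the query uses it."""
    return [p.register_schemas() for p in plugins.values()]


def build_schema_tree_lazy(plugins, referenced):
    """Initialize only the plugins that the query actually references."""
    return [plugins[name].register_schemas() for name in referenced]


def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


# Assumed setup: a fast file-system plugin and an unreachable SQL Server.
plugins = {
    "dfs": Plugin("dfs", 0.0),
    "mssql": Plugin("mssql", 0.3),
}

eager_result, eager_secs = timed(build_schema_tree_eager, plugins)
lazy_result, lazy_secs = timed(build_schema_tree_lazy, plugins, ["dfs"])
print(f"eager: {eager_secs:.2f}s, lazy: {lazy_secs:.2f}s")
```

In this sketch a query that only touches "dfs" still pays for the "mssql" timeout under eager initialization, which matches the slowdown reported in this thread.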


Thanks,
Padma


> On Dec 1, 2016, at 8:48 AM, Abhishek Girish <[email protected]> wrote:
> 
> AFAIK, this should apply to all queries, irrespective of the source of the data
> or the plugins involved within the query. So when this issue occurs, I
> would expect any query to take a long time to execute.
> 
> On Thu, Dec 1, 2016 at 5:47 AM John Omernik <[email protected]> wrote:
> 
>> @Abhishek,
>> 
>> Do you think the issue is caused by any storage plugin that is enabled but
>> not available, so that it applies to all queries?  I guess if it's an issue
>> where all queries are slow because the foreman is waiting to initialize ALL
>> storage plugins, regardless of their applicability to the queried data,
>> then that is a more general issue (that should still be resolved: does the
>> foreman need to initialize all plugins before querying specific data?).
>> However, I am still concerned that the query on the CTAS parquet data is
>> specifically slower because of its source.  @Rahul, could you test a
>> different Parquet table, NOT loaded from the SQL server, to see if
>> enabling or disabling the JDBC storage plugin (with the server unavailable)
>> has any impact?  Basically, I want to ensure that data that is created as a
>> Parquet table via CTAS is 100% free of any links to the source data. This
>> is EXTREMELY important.
>> 
>> John
>> 
>> 
>> 
>> On Thu, Dec 1, 2016 at 12:46 AM, Abhishek Girish <[email protected]> wrote:
>> 
>>> Thanks for the update, Rahul!
>>> 
>>> On Wed, Nov 30, 2016 at 9:45 PM Rahul Raj <[email protected]> wrote:
>>> 
>>>> Abhishek,
>>>> 
>>>> Your observation is correct, we just verified that:
>>>> 
>>>>   1. The queries run as expected (faster) with the JDBC plugin disabled.
>>>>   2. Queries run as expected when the plugin's datasource is running.
>>>>   3. With the datasource down, queries run very slowly, waiting for the
>>>>   connection to fail.
>>>> 
>>>> Rahul
>>>> 
>>>> On Thu, Dec 1, 2016 at 10:07 AM, Abhishek Girish <[email protected]> wrote:
>>>> 
>>>>> @John,
>>>>> 
>>>>> I agree that this should work. While I am not certain, I don't think the
>>>>> issue is specific to a particular plugin, but rather to the way that, in a
>>>>> query's lifecycle, the foreman attempts to initialize every enabled storage
>>>>> plugin before proceeding to execute the query. So when a particular plugin
>>>>> isn't configured correctly or the underlying datasource is not up, this
>>>>> could drastically increase the query execution time.
>>>>> 
>>>>> I'll look to see if we have a JIRA for this already; if not, I will file
>>>>> one.
>>>>> 
>>>>> On Wed, Nov 30, 2016 at 8:12 AM, John Omernik <[email protected]> wrote:
>>>>> 
>>>>>> So just my opinion in reading this thread.  (Sorry for swooping in and
>>>>>> opining.)
>>>>>> 
>>>>>> If a CTAS is done from any data source into Parquet files, there should
>>>>>> be NO dependency on the original data source to query the resultant
>>>>>> Parquet files.  As a Drill user, as a Drill admin, this breaks the
>>>>>> principle of least surprise.  If I take data from one source, and create
>>>>>> Parquet files in a distributed file system, it should just work.  If there
>>>>>> are "issues" with JDBC plugins or the HBase/Hive plugins in a similar
>>>>>> manner, these need to be hunted down by a large group of villagers with
>>>>>> pitchforks and torches.  I just can't see how this could be acceptable at
>>>>>> any level.  The whole idea of Parquet files is that they are
>>>>>> self-describing, schema-included files, thus a read of a directory of
>>>>>> Parquet files should have NO dependencies on anything but the Parquet
>>>>>> files.  Even the Parquet "additions" (such as the METADATA cache) should
>>>>>> be a fail-open thing: if it exists, great, use it, speed things up, but if
>>>>>> it doesn't, read the Parquet files as normal (which I believe is how it
>>>>>> operates).
>>>>>> 
>>>>>> John
>>>>>> 
>>>>>> On Wed, Nov 30, 2016 at 12:12 AM, Abhishek Girish <[email protected]> wrote:
>>>>>> 
>>>>>>> Can you attempt to disable the JDBC plugin (configured with SQL Server)
>>>>>>> and try the query (on parquet) when SQL Server is offline?
>>>>>>> 
>>>>>>> I've seen a similar issue previously when the HBase / Hive plugin was
>>>>>>> enabled but either the plugin configuration was wrong or the underlying
>>>>>>> data source was down.
>>>>>>> 
>>>>>>> On Fri, Nov 25, 2016 at 3:21 AM, Rahul Raj <rahul.raj@option3consulting.com> wrote:
>>>>>>> 
>>>>>>>> I have created a parquet file using CTAS from a MS SQL Server. The
>>>>>>>> query on parquet is getting stuck in STARTING state for a long time
>>>>>>>> before returning the results.
>>>>>>>> 
>>>>>>>> We could see that Drill was trying to connect to the MS SQL server from
>>>>>>>> which the data was imported. The MS SQL server was down; Drill threw
>>>>>>>> an exception, "Failure while attempting to load JDBC schema", and then
>>>>>>>> returned the results. While SQL server is running, the query executes
>>>>>>>> without issues.
>>>>>>>> 
>>>>>>>> Why is Drill querying the DB metadata externally and not the imported
>>>>>>>> parquets?
>>>>>>>> 
>>>>>>>> Rahul.
>>>>>>>> 
