Re: Removing "external data source"

Matthew Jacobs Wed, 07 Feb 2018 11:12:06 -0800

I added the external data source code a number of years back, and we
chose not to document it because we wanted to have the opportunity to
iterate on the interface, even though it was considered production-ready.
In retrospect I think we could have documented it as the API does have
a mechanism for versioning, and arguably we still might want to doc it
as I've seen it used a few times over the years, even without
documentation. My experience with the code was that it wasn't getting
in the way, i.e. it wasn't expensive to maintain and has test coverage
for what it currently offers.


An alternative approach to removing this code might be: review the
original design/API, create more specific improvement tasks (smaller
items may be good for aspiring contributors), and document it. Seeing as
I have some context on this work, I'd be happy to help with this.

-mj

On Wed, Feb 7, 2018 at 9:18 AM, Marcel Kornacker <marc...@gmail.com> wrote:
> I agree with Shant, I think this feature has the potential to add some
> interesting new functionality, even without parallelization. As an
> example, an "information schema" feature does not require high
> throughput.
>
> Dan, could you expand on why it's a prototype? What is gained by
> removing the code?
>
> On Wed, Feb 7, 2018 at 9:08 AM, Shant Hovsepian <sh...@arcadiadata.com> wrote:
>> My two cents.
>>
>> I have used it for testing and prototyping little things, for example a
>> Twitter firehose datasource, or even a generic JDBC wrapper. It makes for cool
>> demos but is not something one would use for data-intensive workloads. It
>> definitely has issues: defining and extracting a schema is tedious, and it
>> does not parallelize, but that is generally a hard problem. I do think it
>> would be cool to document better and see if the community would come up with
>> fun datasources. It's one feature that SparkSQL and Drill kind of do well
>> that I'd wish to see better support in Impala for. If it is not too much
>> overhead to maintain, it might be worth keeping.
>>
>> On Wed, Feb 7, 2018 at 8:48 AM Daniel Hecht <dhe...@cloudera.com> wrote:
>>>
>>> As it is implemented today, it doesn't have much value. It never really
>>> passed the prototype stage in terms of functionality.  For instance, it's
>>> not parallelized -- it runs on a single node only.
>>>
>>> On Tue, Feb 6, 2018 at 8:47 PM, Jim Apple <jbap...@cloudera.com> wrote:
>>>>
>>>> Is there an argument for documenting it and keeping it? Did it not meet
>>>> the need it was added for in the first place, or has that need decreased in
>>>> importance?
>>>>
>>>> On Tue, Feb 6, 2018 at 7:29 PM Philip Zeyliger <phi...@cloudera.com>
>>>> wrote:
>>>>>
>>>>> Hi folks,
>>>>>
>>>>> I want to bring your attention to http://gerrit.cloudera.org:8080/9192,
>>>>> "IMPALA-6204: Remove external DataSource". This is functionality that was
>>>>> never publicly documented and, to my knowledge, is not in use by anyone.
>>>>> We'd like to remove it to reduce complexity.
>>>>>
>>>>> Please let me know if you've got concerns!
>>>>>
>>>>> Thanks,
>>>>>
>>>>> -- Philip
