[jira] [Commented] (SQOOP-365) Proposal for next major revision of Sqoop.

Aaron Kimball (Commented) (JIRA) Sun, 16 Oct 2011 16:00:37 -0700

    [ 
https://issues.apache.org/jira/browse/SQOOP-365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13128536#comment-13128536
 ]


Aaron Kimball commented on SQOOP-365:
-------------------------------------

This proposal looks like a good start! Here are some questions I have about it:

* One of the main advantages of Sqoop in its current form is its ease of 
deployment by end-users. Like Pig, it can be installed on a client machine 
without burdening cluster operators.
** How will we maintain this ease-of-deployment in the face of the web-based 
app? Can/will Sqoop come with a self-contained server (e.g. Jetty?) to support 
'localhost' execution of the web app?
** I like the idea of pre-defined connections. But will Sqoop still support the 
ability to use the existing 'ad hoc' connection mechanism? For users who 
already have a username/password they can use to connect to a database, it may 
be useful for them to get started easily with their existing credentials, 
without requiring an operator to configure a connection.
* Many production deployments count on running Sqoop in command-line mode using 
the existing command-line arguments to specify the job. Will Sqoop2 be 
backwards-compatible with these arguments?
* How and where does Sqoop store information about Connections, resource 
limits, etc?
** How, if at all, do we guard against end-users starting a second Sqoop server 
to get around resource limits? Are the resource limits and temporary locking 
info, etc, stored in the target database itself? (If so, how do we guard 
against stale locks..?)

I also don't believe that it's productive for the command-line client to use 
the REST API directly. Starting a server (even on localhost) as a pre-req for 
running a command-line tool seems overly complicated to me.

I think a better architecture may be to define a number of Operations 
internally. Each Operation can have a programmatic (Java) API that executes it. 
Each Operation can also be bound to a REST API endpoint. But this way a user 
can still simply run the command-line application without configuring an entire 
server. The command-line app would run the Operation directly, as opposed to 
running it in the address space of a separate process somewhere. This would 
reduce the number of layers of complexity when debugging what goes wrong. 
Involving the network (even loopback) where none is needed seems like asking 
for trouble.
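
To make the Operation idea concrete, here is a minimal Java sketch. Every name 
in it (Operation, OperationRegistry, ListTablesOperation, the "/operations/" 
path prefix) is made up for illustration and is not taken from Sqoop or from 
the proposal; the REST side is a plain method standing in for a real endpoint.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: none of these names exist in Sqoop.
final class OperationSketch {

    /** A unit of work with a plain Java entry point. */
    interface Operation {
        String name();
        String execute(Map<String, String> args);
    }

    /** Placeholder Operation; a real one would talk to a database. */
    static final class ListTablesOperation implements Operation {
        @Override public String name() { return "list-tables"; }
        @Override public String execute(Map<String, String> args) {
            return "would list tables for " + args.getOrDefault("connect", "<none>");
        }
    }

    /** Lookup table shared by both front ends. */
    static final class OperationRegistry {
        private final Map<String, Operation> ops = new LinkedHashMap<>();
        void register(Operation op) { ops.put(op.name(), op); }
        Operation lookup(String name) {
            Operation op = ops.get(name);
            if (op == null) {
                throw new IllegalArgumentException("Unknown operation: " + name);
            }
            return op;
        }
    }

    static OperationRegistry defaultRegistry() {
        OperationRegistry registry = new OperationRegistry();
        registry.register(new ListTablesOperation());
        return registry;
    }

    /** Command-line front end: parses "name --key value ..." and runs in-process. */
    static String runFromCommandLine(OperationRegistry registry, String[] argv) {
        if (argv.length == 0) {
            throw new IllegalArgumentException("Missing operation name");
        }
        Map<String, String> parsed = new LinkedHashMap<>();
        for (int i = 1; i + 1 < argv.length; i += 2) {
            parsed.put(argv[i].replaceFirst("^--", ""), argv[i + 1]);
        }
        return registry.lookup(argv[0]).execute(parsed);
    }

    /** REST front-end stand-in: maps "/operations/<name>" to the same Operation. */
    static String handleRest(OperationRegistry registry, String path,
                             Map<String, String> params) {
        String prefix = "/operations/";
        if (!path.startsWith(prefix)) {
            throw new IllegalArgumentException("Unknown path: " + path);
        }
        return registry.lookup(path.substring(prefix.length())).execute(params);
    }

    public static void main(String[] args) {
        OperationRegistry registry = defaultRegistry();
        // Both calls reach the same Operation; only the front end differs.
        System.out.println(runFromCommandLine(registry,
                new String[] {"list-tables", "--connect", "jdbc:example://db"}));
        System.out.println(handleRest(registry, "/operations/list-tables",
                Map.of("connect", "jdbc:example://db")));
    }
}
```

The point of the sketch is that the command-line path never touches a socket: 
it looks up the Operation and calls it in the same JVM, while a server would 
only add a thin mapping from request paths to the same objects.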

Finally, on the front of API compatibility: Arvind, in an offline discussion, 
we talked about having a separate API package of interfaces that would have 
"api level" versioning (a la the Servlet API) that is distinct from the 
implementation version. Is that still part of your vision for Sqoop 2? I don't 
see it described in this proposal.

I looked through the proposed source layout for this. Without a README 
specifying what goes in which directories, it's hard for me to understand what 
you're trying to accomplish. What's the "infra" project for?

I think based on what I said above about Operations, etc, there should be a 
"libsqoop" project that corresponds to the guts of the project. The "server" 
should just be a REST API implementation (perhaps w/ an embedded Jetty server, 
but also perhaps deployable as a WAR on a fully-administered Tomcat instance) 
that embeds libsqoop to perform the Operations. And the client, similarly, is a 
thin command-line-arg parsing shell that embeds libsqoop to perform Operations 
directly.

Is infra ~= libsqoop in this idea? Or is that about independent testing of 
connectors, etc?

I think there should also be a plugin-api library (libsqoopapi?) which the 
connector/*/ projects link against, rather than libsqoop itself. This API would 
also be used by third-party SqoopTool implementations.
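
As a rough illustration of how a separately versioned plugin API might look, 
here is a small Java sketch. The names (Connector, ConnectorLoader, API_LEVEL) 
and the compatibility rule (accept connectors built against the current or an 
older API level) are my assumptions for the example, not something the 
proposal specifies.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: none of these names exist in Sqoop.
final class PluginApiSketch {

    /** Assumed API level of the plugin-api library, independent of the release version. */
    static final int API_LEVEL = 2;

    /** Would live in the plugin-api library that connectors link against. */
    interface Connector {
        String name();
        int requiredApiLevel();
    }

    /** Would live in the core library; checks the API level before loading. */
    static final class ConnectorLoader {
        private final List<Connector> loaded = new ArrayList<>();

        /** Assumed rule: reject connectors that need a newer API than we provide. */
        boolean tryLoad(Connector connector) {
            if (connector.requiredApiLevel() > API_LEVEL) {
                return false;
            }
            loaded.add(connector);
            return true;
        }

        List<String> loadedNames() {
            List<String> names = new ArrayList<>();
            for (Connector c : loaded) {
                names.add(c.name());
            }
            return names;
        }
    }

    /** Convenience factory for the example connectors below. */
    static Connector connector(String name, int level) {
        return new Connector() {
            @Override public String name() { return name; }
            @Override public int requiredApiLevel() { return level; }
        };
    }

    public static void main(String[] args) {
        ConnectorLoader loader = new ConnectorLoader();
        System.out.println(loader.tryLoad(connector("built-for-level-1", 1)));
        System.out.println(loader.tryLoad(connector("built-for-level-3", 3)));
        System.out.println(loader.loadedNames());
    }
}
```

With something like this, a connector only needs to compile against the small 
API library, and the core can change freely as long as the API level it 
advertises stays honest.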

This document's off to a great start -- this is definitely in line with the 
next evolution of Sqoop as a first-class mechanism for getting data into 
Hadoop. Looking forward to your answers!

Cheers,
Aaron

                
> Proposal for next major revision of Sqoop.
> ------------------------------------------
>
>                 Key: SQOOP-365
>                 URL: https://issues.apache.org/jira/browse/SQOOP-365
>             Project: Sqoop
>          Issue Type: Wish
>            Reporter: Arvind Prabhakar
>            Assignee: Arvind Prabhakar
>         Attachments: sqoop2.tar.gz
>
>
> This issue tracks the design and development of the next major revision of 
> Sqoop. The proposal has been articulated on the wiki at the following 
> location:
> https://cwiki.apache.org/confluence/display/SQOOP/Sqoop+2
> Please review the proposal and add your comments to this JIRA. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
