Sqoop does generate a class on the client machine which is then shipped to
the cluster during the processing phase. See
http://archive.cloudera.com/cdh/3/sqoop/SqoopUserGuide.html#_basic_usage for
some more details about this process.
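
For illustration (this is a hand-written mock-up, not actual Sqoop output, and
the table employees(id INT, name VARCHAR) is hypothetical), the generated class
is roughly a bean with one field per column plus Writable plumbing; the real
code Sqoop emits also implements its database-facing interfaces and is much
longer:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.Map;
import java.util.TreeMap;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

import com.cloudera.sqoop.lib.FieldMappable;

// Mock-up of a Sqoop-generated class for a hypothetical employees table.
public class Employees implements Writable, FieldMappable {
  private Integer id;
  private String name;

  public Integer get_id() { return id; }
  public String get_name() { return name; }

  // Column name -> value, as described for FieldMappable below.
  public Map<String, Object> getFieldMap() {
    Map<String, Object> map = new TreeMap<String, Object>();
    map.put("id", id);
    map.put("name", name);
    return map;
  }

  public void write(DataOutput out) throws IOException {  // null handling omitted
    out.writeInt(id);
    Text.writeString(out, name);
  }

  public void readFields(DataInput in) throws IOException {
    id = in.readInt();
    name = Text.readString(in);
  }
}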

Instances of this class may be marshaled into SequenceFiles if you'd like to
keep your data in binary form. If you're storing your data as text (the
default), the generated class is discarded after the import. Then you can
use the regular text-based loader in Pig, or TextInputFormat in MapReduce,
etc.
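
For instance, a minimal MapReduce job over a text-mode import might look like
this (a sketch only; the input path, the output path, and the comma delimiter
are assumptions to adjust for your own import):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ReadSqoopTextImport {

  // Each value is one imported row; fields are separated by Sqoop's delimiter
  // (comma by default for text imports).
  public static class RowMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text row, Context context)
        throws IOException, InterruptedException {
      String[] fields = row.toString().split(",");
      context.write(new Text(fields[0]), row);  // key each row by its first column
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "read-sqoop-text-import");
    job.setJarByClass(ReadSqoopTextImport.class);
    job.setInputFormatClass(TextInputFormat.class);  // ordinary text files
    job.setMapperClass(RowMapper.class);
    job.setNumReduceTasks(0);                        // map-only pass-through
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // the Sqoop target dir
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}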

If you want to store your data in a binary encoding (SequenceFiles) and
still use it in Pig, you'd need to write your own loader. This should be
relatively straightforward; you'd just need to read the records out of
SequenceFiles into instances of the generated class (which could be
specified as a parameter to the loader). Generated classes in Sqoop implement
the FieldMappable interface (
https://github.com/cloudera/sqoop/blob/master/src/java/com/cloudera/sqoop/lib/FieldMappable.java)
which lets you iterate over the fields in a record. I'm not a Pig expert, but
converting those fields into Pig's own map type shouldn't be hard.
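
In case it's useful, here's a rough, untested sketch of what such a loader
might look like against the Pig 0.7+ LoadFunc API. The class name and the
choice to emit each record as a single map-typed field are my own assumptions,
not anything Sqoop or Pig ship:

import java.io.IOException;
import java.util.Map;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.pig.LoadFunc;
import org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigSplit;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

import com.cloudera.sqoop.lib.FieldMappable;

public class SqoopSequenceFileLoader extends LoadFunc {

  private final TupleFactory tupleFactory = TupleFactory.getInstance();
  private RecordReader<LongWritable, ? extends FieldMappable> reader;

  @Override
  public void setLocation(String location, Job job) throws IOException {
    FileInputFormat.setInputPaths(job, location);
  }

  @SuppressWarnings({"rawtypes", "unchecked"})
  @Override
  public InputFormat getInputFormat() throws IOException {
    // Sqoop writes (LongWritable, <generated record>) pairs; the SequenceFile
    // header already names the value class, so the loader never has to.
    return new SequenceFileInputFormat();
  }

  @SuppressWarnings({"rawtypes", "unchecked"})
  @Override
  public void prepareToRead(RecordReader reader, PigSplit split) throws IOException {
    this.reader = reader;
  }

  @Override
  public Tuple getNext() throws IOException {
    try {
      if (!reader.nextKeyValue()) {
        return null;  // end of this split
      }
      FieldMappable record = reader.getCurrentValue();
      // getFieldMap() gives column name -> value; emit it as one map-typed
      // field. Real code would probably coerce SQL types (dates, BigDecimal,
      // LOB references) into Pig-friendly types at this point.
      Map<String, Object> fields = record.getFieldMap();
      Tuple tuple = tupleFactory.newTuple(1);
      tuple.set(0, fields);
      return tuple;
    } catch (InterruptedException ie) {
      throw new IOException(ie);
    }
  }
}

A script could then project individual columns out of that map field; building
a proper tuple with a schema instead would just mean walking the field map in
a fixed column order.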

Good luck
- Aaron

2010/11/5 Anze <[email protected]>

>
> > > I imagine writing a Pig 0.7+ loader for the Sqoop files would be pretty
> > > easy, since iirc Sqoop does generate an input format for you.
>
> Yes, but if I remember correctly (I have looked at Sqoop quite some time
> ago)
> Sqoop generates classes based on SQL the user provides. Unless you suggest
> using input format classes only as a starting point? That would probably
> work...
>
> > > Good project for someone looking to get started in contributing to Pig
>
> It is tempting. :)
>
> > Also, Sqoop only needs to be installed on the client machine; it doesn't
> > require modifying your Hadoop deployment on your servers anywhere. If
> > you're writing any Java MapReduce programs, or Java UDFs for Pig, it's
> > likely you've already got the JDK on this machine already.
>
> I am running Pig remotely, not from a local machine. But this will be changed
> soon so this will not be a problem for me anymore.
>
> Thanks,
>
> Anze
>
>
> On Friday 05 November 2010, Aaron Kimball wrote:
> > By default, Sqoop should load files into HDFS as delimited text. The
> > existing PigStorage should be able to work with the data loaded here.
> > Similarly, if you use PigStorage to write data back to HDFS in delimited
> > text form, Sqoop can export those files to your RDBMS.
> >
> > Also, Sqoop only needs to be installed on the client machine; it doesn't
> > require modifying your Hadoop deployment on your servers anywhere. If
> > you're writing any Java MapReduce programs, or Java UDFs for Pig, it's
> > likely you've already got the JDK on this machine already.
> >
> > - Aaron
> >
> > On Thu, Nov 4, 2010 at 3:22 PM, Dmitriy Ryaboy <[email protected]> wrote:
> > > I imagine writing a Pig 0.7+ loader for the Sqoop files would be pretty
> > > easy, since iirc Sqoop does generate an input format for you.
> > >
> > > Good project for someone looking to get started in contributing to Pig
> > > ...
> > >
> > > :)
> > >
> > > -D
> > >
> > > 2010/11/4 Anze <[email protected]>
> > >
> > > > Hi Arvind!
> > > >
> > > > Should we take this discussion off the list? It is not really
> > > > Pig-related anymore... Not sure what the custom is around here. :)
> > > >
> > > > > > process. Apart from just the speed, Sqoop offers many other
> > > > > > advantages too such as incremental loads, exporting data from HDFS
> > > > > > back to the database, automatic creation of Hive tables or
> > > > > > populating hbase etc.
> > > >
> > > > Only Pig is missing then... >:-D
> > > > Sorry, couldn't hold that back... ;)
> > > >
> > > > I would love to use Sqoop for another task (periodically importing
> > > > MySQL tables to HBase) if schema gets more or less preserved, however
> > > > I don't dare
> > > > upgrade JRE to JDK at the moment in fear of breaking things.
> > > >
> > > > > Anze - I just checked that our Sqoop packages do declare the JDK
> > > > > dependency. Which package did you see as not having this dependency?
> > > >
> > > > We are using:
> > > > -----
> > > > deb http://archive.cloudera.com/debian lenny-cdh3b1 contrib
> > > > -----
> > > > But there is no sqoop package per se, I guess it is part of hadoop
> > > > package:
> > > > -----
> > > > $ aptitude show hadoop-0.20 | grep Depends
> > > > Depends: adduser, sun-java6-jre, sun-java6-bin
> > > > -----
> > > > $ aptitude search sun-java6 | grep "jdk\|jre"
> > > > p   sun-java6-jdk                   - Sun Java(TM) Development Kit (JDK) 6
> > > > i A sun-java6-jre                   - Sun Java(TM) Runtime Environment (JRE) 6
> > > > -----
> > > >
> > > >
> > > > This is where aa...@cloudera advises that JDK is needed (instead of
> > > > JRE) for successful running of sqoop:
> > > > http://getsatisfaction.com/cloudera/topics/error_sqoop_sqoop_got_exception_running_sqoop_java_lang_nullpointerexception_java_lang_nullpointerexception-j7ziz
> > >
> > > > As I said, I am interested in Sqoop (and alternatives) as we will be
> > > > facing the problem in near future, so I appreciate your involvement in
> > > > this thread!
> > > >
> > > > Anze
> > > >
> > > > On Thursday 04 November 2010, [email protected] wrote:
> > > > > Anze - I just checked that our Sqoop packages do declare the JDK
> > > > > dependency. Which package did you see as not having this dependency?
> > > > >
> > > > > Arvind
> > > > >
> > > > > On Thu, Nov 4, 2010 at 9:25 AM, [email protected]
> > > > > <[email protected]> wrote:
> > > > > > Sqoop is Java based and you should have JDK 1.6 or higher available
> > > > > > on your system. We will add this as a dependency for the package.
> > > > > >
> > > > > > Regarding accessing MySQL from a cluster - it should not be a
> > > > > > problem if you control the number of tasks that do that. Sqoop
> > > > > > allows you to explicitly specify the number of mappers, where each
> > > > > > mapper holds a connection to the database and effectively
> > > > > > parallelizes the loading process. Apart from just the speed, Sqoop
> > > > > > offers many other advantages too such as incremental loads,
> > > > > > exporting data from HDFS back to the database, automatic creation
> > > > > > of Hive tables or populating hbase etc.
> > > > > >
> > > > > > Arvind
> > > > > >
> > > > > > 2010/11/4 Anze <[email protected]>
> > > > > >
> > > > > >> So Sqoop doesn't require JDK?
> > > > > >> It seemed weird to me too. Also, if it would require it, then JDK
> > > > > >> would probably have to be among dependencies of the package Sqoop
> > > > > >> is in.
> > > > > >>
> > > > > >> I started working on DBLoader, but the learning curve seems quite
> > > > > >> steep and I don't have enough time for it right now. Also, as
> > > > > >> Ankur said, it might not be a good idea to hit MySQL from the
> > > > > >> cluster.
> > > > > >>
> > > > > >> The ideal solution IMHO would be loading data from MySQL to HDFS
> > > > > >> from a single machine (but within LoadFunc, of course) and work
> > > > > >> with the data from there (with schema automatically converted from
> > > > > >> MySQL). But I don't know enough about Pig to do that kind of
> > > > > >> thing... yet. :)
> > > > > >>
> > > > > >> Anze
> > > > > >>
> > > > > >> On Wednesday 03 November 2010, [email protected] wrote:
> > > > > >> > Sorry that you ran into a problem. Typically, it is usually
> > > > > >> > something like missing a required option etc that could cause
> > > > > >> > this and if you were to send a mail to [email protected],
> > > > > >> > you would get prompt assistance.
> > > > > >> > Regardless, if you still have any use cases like this, I will be
> > > > > >> > glad to help you out in using Sqoop for that purpose.
> > > > > >> >
> > > > > >> > Arvind
> > > > > >> >
> > > > > >> > 2010/11/3 Anze <[email protected]>
> > > > > >> >
> > > > > >> > > I tried to run it, got NullPointerException, searched the net,
> > > > > >> > > found Sqoop requires JDK (instead of JRE) and gave up. I am
> > > > > >> > > working on a production cluster - so I'd rather not upgrade to
> > > > > >> > > JDK if not necessary. :)
> > > > > >> > >
> > > > > >> > > But I was able to export MySQL with a simple bash script:
> > > > > >> > > **********
> > > > > >> > > #!/bin/bash
> > > > > >> > >
> > > > > >> > > MYSQL_TABLES=( table1 table2 table3 )
> > > > > >> > > WHERE=/home/hadoop/pig
> > > > > >> > >
> > > > > >> > > for i in ${MYSQL_TABLES[@]}
> > > > > >> > > do
> > > > > >> > >
> > > > > >> > >  mysql -BAN -h <mysql_host> -u <username> --password=<pass> <database> \
> > > > > >> > >    -e "select * from $i;" --skip-column-names > $WHERE/$i.csv
> > > > > >> > >
> > > > > >> > >  hadoop fs -copyFromLocal $WHERE/$i.csv /pig/mysql/
> > > > > >> > >  rm $WHERE/$i.csv
> > > > > >> > >
> > > > > >> > > done
> > > > > >> > > **********
> > > > > >> > >
> > > > > >> > > Of course, in my case the tables were small enough so I could
> > > > > >> > > do it. And of course I lost schema in process.
> > > > > >> > >
> > > > > >> > > Hope it helps someone else too...
> > > > > >> > >
> > > > > >> > > Anze
> > > > > >> > >
> > > > > >> > > On Wednesday 03 November 2010, [email protected] wrote:
> > > > > >> > > > Anze,
> > > > > >> > > >
> > > > > >> > > > Did you get a chance to try out Sqoop? If not, I would
> > > > > >> > > > encourage you to do so. Here is a link to the user guide:
> > > > > >> > > > http://archive.cloudera.com/cdh/3/sqoop/SqoopUserGuide.html
> > > > > >> > > >
> > > > > >> > > > Sqoop allows you to easily move data across from relational
> > > > > >> > > > databases and other enterprise systems to HDFS and back.
> > > > > >> > > >
> > > > > >> > > > Arvind
> > > > > >> > > >
> > > > > >> > > > 2010/11/3 Anze <[email protected]>
> > > > > >> > > >
> > > > > >> > > > > Alejandro, thanks for answering!
> > > > > >> > > > >
> > > > > >> > > > > I was hoping it could be done directly from Pig, but... :)
> > > > > >> > > > >
> > > > > >> > > > > I'll take a look at Sqoop then, and if that doesn't help,
> > > > > >> > > > > I'll just write a simple batch to export data to TXT/CSV.
> > > > > >> > > > > Thanks for the pointer!
> > > > > >> > > > >
> > > > > >> > > > > Anze
> > > > > >> > > > >
> > > > > >> > > > > On Wednesday 03 November 2010, Alejandro Abdelnur wrote:
> > > > > >> > > > > > Not a 100% Pig solution, but you could use Sqoop to get
> > > > > >> > > > > > the data in as a pre-processing step. And if you want to
> > > > > >> > > > > > handle all as a single job, you could use Oozie to create
> > > > > >> > > > > > a workflow that does Sqoop and then your Pig processing.
> > > > > >> > > > > >
> > > > > >> > > > > > Alejandro
> > > > > >> > > > > >
> > > > > >> > > > > > On Wed, Nov 3, 2010 at 3:22 PM, Anze <[email protected]>
> > > > > >> > > > > > wrote:
> > > > > >> > > > > > > Hi!
> > > > > >> > > > > > >
> > > > > >> > > > > > > Part of data I have resides in MySQL. Is there a
> > > > > >> > > > > > > loader that would allow loading directly from it?
> > > > > >> > > > > > >
> > > > > >> > > > > > > I can't find anything on the net, but it seems to me
> > > > > >> > > > > > > this must be a quite common problem.
> > > > > >> > > > > > > I checked piggybank but there is only DBStorage (and
> > > > > >> > > > > > > no DBLoader).
> > > > > >> > > > > > >
> > > > > >> > > > > > > Is some DBLoader out there too?
> > > > > >> > > > > > >
> > > > > >> > > > > > > Thanks,
> > > > > >> > > > > > >
> > > > > >> > > > > > > Anze
>
>
