> > I imagine writing a Pig 0.7+ loader for the Sqoop files would be pretty
> > easy, since iirc Sqoop does generate an input format for you.

Yes, but if I remember correctly (it has been quite some time since I
looked at Sqoop), Sqoop generates classes based on SQL the user provides.
Unless you suggest using the input format classes only as a starting point?
That would probably work...
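
For what it's worth, the generation step I have in mind looks roughly like
this (a sketch only; the connect string, credentials, and table name are
placeholders, and I have not re-tested this against a current Sqoop):
**********
# Ask Sqoop to generate its record class for one table; the .java
# source is written to --outdir and compiled for use by the job.
sqoop codegen \
  --connect jdbc:mysql://<mysql_host>/<database> \
  --username <username> --password <pass> \
  --table <table> \
  --outdir /tmp/sqoop-gen
**********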

> > Good project for someone looking to get started in contributing to Pig

It is tempting. :)

> Also, Sqoop only needs to be installed on the client machine; it doesn't
> require modifying your Hadoop deployment on your servers anywhere. If
> you're writing any Java MapReduce programs, or Java UDFs for Pig, it's
> likely you've already got the JDK on this machine already.

I am running Pig remotely, not from a local machine. But this will change
soon, so it will not be a problem for me anymore.

Thanks,

Anze

On Friday 05 November 2010, Aaron Kimball wrote:
> By default, Sqoop should load files into HDFS as delimited text. The
> existing PigStorage should be able to work with the data loaded here.
> Similarly, if you use PigStorage to write data back to HDFS in delimited
> text form, Sqoop can export those files to your RDBMS.
>
> Also, Sqoop only needs to be installed on the client machine; it doesn't
> require modifying your Hadoop deployment on your servers anywhere. If
> you're writing any Java MapReduce programs, or Java UDFs for Pig, it's
> likely you've already got the JDK on this machine already.
>
> - Aaron
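
Concretely, the round trip Aaron describes might look like this (the
connect string, table names, field names, and paths are invented
placeholders; double-check the flags against your Sqoop release):
**********
# 1. Import one table as comma-delimited text files in HDFS:
sqoop import \
  --connect jdbc:mysql://<mysql_host>/<database> \
  --username <username> --password <pass> \
  --table <table> \
  --fields-terminated-by ',' \
  --target-dir /pig/mysql/<table>

# 2. Read it back with the stock PigStorage loader and store it again:
pig -e "a = LOAD '/pig/mysql/<table>' USING PigStorage(',')
            AS (id:int, total:double);
        STORE a INTO '/pig/out/<table>' USING PigStorage(',');"

# 3. Export the PigStorage output back to the database:
sqoop export \
  --connect jdbc:mysql://<mysql_host>/<database> \
  --username <username> --password <pass> \
  --table <export_table> \
  --input-fields-terminated-by ',' \
  --export-dir /pig/out/<table>
**********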

> On Thu, Nov 4, 2010 at 3:22 PM, Dmitriy Ryaboy <[email protected]> wrote:
> > I imagine writing a Pig 0.7+ loader for the Sqoop files would be pretty
> > easy, since iirc Sqoop does generate an input format for you.
> >
> > Good project for someone looking to get started in contributing to Pig
> > ...
> >
> > :)
> >
> > -D
> >
> > 2010/11/4 Anze <[email protected]>
> > > Hi Arvind!
> > >
> > > Should we take this discussion off the list? It is not really
> > > Pig-related anymore... Not sure what the custom is around here. :)
> > >
> > > > process. Apart from just the speed, Sqoop offers many other
> > > > advantages too, such as incremental loads, exporting data from
> > > > HDFS back to the database, automatic creation of Hive tables, or
> > > > populating HBase, etc.
> > >
> > > Only Pig is missing then... >:-D
> > > Sorry, couldn't hold that back... ;)
> > >
> > > I would love to use Sqoop for another task (periodically importing
> > > MySQL tables into HBase) if the schema is more or less preserved,
> > > but I don't dare upgrade the JRE to a JDK at the moment for fear of
> > > breaking things.
> > >
> > > > Anze - I just checked that our Sqoop packages do declare the JDK
> > > > dependency. Which package did you see as not having this
> > > > dependency?
> > >
> > > We are using:
> > > -----
> > > deb http://archive.cloudera.com/debian lenny-cdh3b1 contrib
> > > -----
> > > But there is no sqoop package per se; I guess it is part of the
> > > hadoop package:
> > > -----
> > > $ aptitude show hadoop-0.20 | grep Depends
> > > Depends: adduser, sun-java6-jre, sun-java6-bin
> > > -----
> > > $ aptitude search sun-java6 | grep "jdk\|jre"
> > > p   sun-java6-jdk - Sun Java(TM) Development Kit (JDK) 6
> > > i A sun-java6-jre - Sun Java(TM) Runtime Environment (JRE) 6
> > > -----
> > >
> > > This is where aa...@cloudera advises that the JDK (instead of the
> > > JRE) is needed for successfully running Sqoop:
> > > http://getsatisfaction.com/cloudera/topics/error_sqoop_sqoop_got_exception_running_sqoop_java_lang_nullpointerexception_java_lang_nullpointerexception-j7ziz
> > >
> > > As I said, I am interested in Sqoop (and alternatives), as we will
> > > be facing this problem in the near future, so I appreciate your
> > > involvement in this thread!
> > >
> > > Anze
> > >
> > > On Thursday 04 November 2010, [email protected] wrote:
> > > > Anze - I just checked that our Sqoop packages do declare the JDK
> > > > dependency. Which package did you see as not having this
> > > > dependency?
> > > >
> > > > Arvind
> > > >
> > > > On Thu, Nov 4, 2010 at 9:25 AM, [email protected]
> > > > <[email protected]> wrote:
> > > > > Sqoop is Java based and you should have JDK 1.6 or higher
> > > > > available on your system. We will add this as a dependency for
> > > > > the package.
> > > > >
> > > > > Regarding accessing MySQL from a cluster - it should not be a
> > > > > problem if you control the number of tasks that do it. Sqoop
> > > > > allows you to explicitly specify the number of mappers, where
> > > > > each mapper holds a connection to the database and effectively
> > > > > parallelizes the loading process. Apart from just the speed,
> > > > > Sqoop offers many other advantages too, such as incremental
> > > > > loads, exporting data from HDFS back to the database, automatic
> > > > > creation of Hive tables, or populating HBase, etc.
> > > > >
> > > > > Arvind
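
Concretely, the knobs Arvind mentions map to flags along these lines (the
connect string, table, check column, and values are invented placeholders;
availability of the incremental flags depends on the Sqoop release):
**********
# Hold the parallel DB connections to 4 mappers, and only pull rows
# whose id is greater than the value seen on the previous run:
sqoop import \
  --connect jdbc:mysql://<mysql_host>/<database> \
  --username <username> --password <pass> \
  --table <table> \
  --num-mappers 4 \
  --incremental append --check-column id --last-value <last_id>

# The same import can create and populate a Hive table instead:
sqoop import \
  --connect jdbc:mysql://<mysql_host>/<database> \
  --username <username> --password <pass> \
  --table <table> --hive-import
**********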

> > > > >
> > > > > 2010/11/4 Anze <[email protected]>
> > > > > > So Sqoop doesn't require a JDK? That seemed weird to me too.
> > > > > > Also, if it did require one, the JDK would probably have to
> > > > > > be among the dependencies of the package Sqoop is in.
> > > > > >
> > > > > > I started working on a DBLoader, but the learning curve seems
> > > > > > quite steep and I don't have enough time for it right now.
> > > > > > Also, as Ankur said, it might not be a good idea to hit MySQL
> > > > > > from the cluster.
> > > > > >
> > > > > > The ideal solution IMHO would be loading data from MySQL to
> > > > > > HDFS from a single machine (but within a LoadFunc, of course)
> > > > > > and working with the data from there (with the schema
> > > > > > automatically converted from MySQL). But I don't know enough
> > > > > > about Pig to do that kind of thing... yet. :)
> > > > > >
> > > > > > Anze
> > > > > >
> > > > > > On Wednesday 03 November 2010, [email protected] wrote:
> > > > > > > Sorry that you ran into a problem. Typically it is
> > > > > > > something like a missing required option that causes this,
> > > > > > > and if you sent a mail to [email protected], you
> > > > > > > would get prompt assistance. Regardless, if you still have
> > > > > > > any use cases like this, I will be glad to help you out in
> > > > > > > using Sqoop for that purpose.
> > > > > > >
> > > > > > > Arvind
> > > > > > >
> > > > > > > 2010/11/3 Anze <[email protected]>
> > > > > > > > I tried to run it, got a NullPointerException, searched
> > > > > > > > the net, found that Sqoop requires a JDK (instead of a
> > > > > > > > JRE) and gave up. I am working on a production cluster -
> > > > > > > > so I'd rather not upgrade to a JDK if not necessary. :)
> > > > > > > >
> > > > > > > > But I was able to export MySQL with a simple bash script:
> > > > > > > > **********
> > > > > > > > #!/bin/bash
> > > > > > > >
> > > > > > > > MYSQL_TABLES=( table1 table2 table3 )
> > > > > > > > WHERE=/home/hadoop/pig
> > > > > > > >
> > > > > > > > for i in "${MYSQL_TABLES[@]}"
> > > > > > > > do
> > > > > > > >     # dump the table as tab-separated text, no header row
> > > > > > > >     mysql -BAN -h <mysql_host> -u <username> \
> > > > > > > >         --password=<pass> <database> \
> > > > > > > >         -e "select * from $i;" --skip-column-names \
> > > > > > > >         > "$WHERE/$i.csv"
> > > > > > > >     # push the dump into HDFS, then drop the local copy
> > > > > > > >     hadoop fs -copyFromLocal "$WHERE/$i.csv" /pig/mysql/
> > > > > > > >     rm "$WHERE/$i.csv"
> > > > > > > > done
> > > > > > > > **********
> > > > > > > > Of course, in my case the tables were small enough that
> > > > > > > > I could do it this way. And of course I lost the schema
> > > > > > > > in the process.
> > > > > > > >
> > > > > > > > Hope it helps someone else too...
> > > > > > > >
> > > > > > > > Anze
> > > > > > > >
> > > > > > > > On Wednesday 03 November 2010, [email protected]
> > > > > > > > wrote:
> > > > > > > > > Anze,
> > > > > > > > >
> > > > > > > > > Did you get a chance to try out Sqoop? If not, I would
> > > > > > > > > encourage you to do so. Here is a link to the user
> > > > > > > > > guide:
> > > > > > > > > http://archive.cloudera.com/cdh/3/sqoop/SqoopUserGuide.html
> > > > > > > > >
> > > > > > > > > Sqoop allows you to easily move data from relational
> > > > > > > > > databases and other enterprise systems to HDFS and
> > > > > > > > > back.
> > > > > > > > >
> > > > > > > > > Arvind
> > > > > > > > >
> > > > > > > > > 2010/11/3 Anze <[email protected]>
> > > > > > > > > > Alejandro, thanks for answering!
> > > > > > > > > >
> > > > > > > > > > I was hoping it could be done directly from Pig,
> > > > > > > > > > but... :)
> > > > > > > > > >
> > > > > > > > > > I'll take a look at Sqoop then, and if that doesn't
> > > > > > > > > > help, I'll just write a simple batch job to export
> > > > > > > > > > the data to TXT/CSV. Thanks for the pointer!
> > > > > > > > > >
> > > > > > > > > > Anze
> > > > > > > > > >
> > > > > > > > > > On Wednesday 03 November 2010, Alejandro Abdelnur
> > > > > > > > > > wrote:
> > > > > > > > > > > Not a 100% Pig solution, but you could use Sqoop
> > > > > > > > > > > to get the data in as a pre-processing step. And
> > > > > > > > > > > if you want to handle it all as a single job, you
> > > > > > > > > > > could use Oozie to create a workflow that does
> > > > > > > > > > > Sqoop and then your Pig processing.
> > > > > > > > > > >
> > > > > > > > > > > Alejandro
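
A minimal sketch of Alejandro's two-step pipeline as a plain shell script
(table name, paths, and the Pig script name are placeholders; Oozie would
coordinate the same two steps as workflow actions instead of a script):
**********
#!/bin/bash
# Step 1: the Sqoop pre-processing step - pull the table into HDFS.
sqoop import \
  --connect jdbc:mysql://<mysql_host>/<database> \
  --username <username> --password <pass> \
  --table <table> --target-dir /pig/mysql/<table> \
  || exit 1

# Step 2: run the Pig script over the freshly imported data.
pig -param INPUT=/pig/mysql/<table> process.pig
**********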

> > > > > > > > > > >
> > > > > > > > > > > On Wed, Nov 3, 2010 at 3:22 PM, Anze
> > > > > > > > > > > <[email protected]> wrote:
> > > > > > > > > > > > Hi!
> > > > > > > > > > > >
> > > > > > > > > > > > Part of the data I have resides in MySQL. Is
> > > > > > > > > > > > there a loader that would allow loading directly
> > > > > > > > > > > > from it?
> > > > > > > > > > > >
> > > > > > > > > > > > I can't find anything on the net, but it seems
> > > > > > > > > > > > to me this must be quite a common problem. I
> > > > > > > > > > > > checked piggybank, but there is only DBStorage
> > > > > > > > > > > > (and no DBLoader).
> > > > > > > > > > > >
> > > > > > > > > > > > Is some DBLoader out there too?
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks,
> > > > > > > > > > > >
> > > > > > > > > > > > Anze
