Hi Arvind! Should we take this discussion off the list? It is not really Pig-related anymore... Not sure what the custom is around here. :)
> > process. Apart from just the speed, Sqoop offers many other advantages
> > too, such as incremental loads, exporting data from HDFS back to the
> > database, automatic creation of Hive tables, populating HBase, etc.

Only Pig is missing then... >:-D Sorry, couldn't hold that back... ;)

I would love to use Sqoop for another task (periodically importing MySQL
tables into HBase) if the schema gets more or less preserved; however, I
don't dare upgrade the JRE to a JDK at the moment for fear of breaking
things.
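For the record, the kind of invocation I have in mind looks roughly like
this (an untested sketch pieced together from the Sqoop user guide; the
HBase-related flags are my assumption and may not exist in the Sqoop
release our CDH ships):
-----
# Hypothetical one-shot import of one MySQL table into an HBase table.
# <mysql_host>, <database> etc. are placeholders, as in my script below.
sqoop import \
  --connect jdbc:mysql://<mysql_host>/<database> \
  --username <username> --password <pass> \
  --table table1 \
  --hbase-table table1 --column-family data --hbase-create-table \
  -m 1
-----
(Running it periodically would then just be a cron job around that
command.)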
> Anze - I just checked that our Sqoop packages do declare the JDK
> dependency. Which package did you see as not having this dependency?

We are using:
-----
deb http://archive.cloudera.com/debian lenny-cdh3b1 contrib
-----

But there is no sqoop package per se; I guess it is part of the hadoop
package:
-----
$ aptitude show hadoop-0.20 | grep Depends
Depends: adduser, sun-java6-jre, sun-java6-bin
-----
$ aptitude search sun-java6 | grep "jdk\|jre"
p   sun-java6-jdk  - Sun Java(TM) Development Kit (JDK) 6
i A sun-java6-jre  - Sun Java(TM) Runtime Environment (JRE) 6
-----

This is where aa...@cloudera advises that the JDK (instead of the JRE) is
needed to run Sqoop successfully:
http://getsatisfaction.com/cloudera/topics/error_sqoop_sqoop_got_exception_running_sqoop_java_lang_nullpointerexception_java_lang_nullpointerexception-j7ziz

As I said, I am interested in Sqoop (and alternatives) as we will be
facing this problem in the near future, so I appreciate your involvement
in this thread!

Anze
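P.S. For anyone finding this thread later: the parallel import Arvind
describes below would, if I read the user guide correctly, look something
like this (again an untested sketch; -m / --num-mappers sets the number
of map tasks, each holding its own connection to MySQL):
-----
# Import one table into HDFS with 4 parallel mappers; Sqoop splits the
# work on the table's primary key by default.
sqoop import \
  --connect jdbc:mysql://<mysql_host>/<database> \
  --username <username> --password <pass> \
  --table table1 \
  --target-dir /pig/mysql/table1 \
  -m 4
-----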
On Thursday 04 November 2010, [email protected] wrote:
> Anze - I just checked that our Sqoop packages do declare the JDK
> dependency. Which package did you see as not having this dependency?
>
> Arvind
>
> On Thu, Nov 4, 2010 at 9:25 AM, [email protected] <[email protected]> wrote:
> > Sqoop is Java based and you should have JDK 1.6 or higher available
> > on your system. We will add this as a dependency for the package.
> >
> > Regarding accessing MySQL from a cluster - it should not be a problem
> > if you control the number of tasks that do that. Sqoop allows you to
> > explicitly specify the number of mappers, where each mapper holds a
> > connection to the database and effectively parallelizes the loading
> > process. Apart from just the speed, Sqoop offers many other
> > advantages too, such as incremental loads, exporting data from HDFS
> > back to the database, automatic creation of Hive tables, populating
> > HBase, etc.
> >
> > Arvind
> >
> > 2010/11/4 Anze <[email protected]>
> > > So Sqoop doesn't require JDK?
> > > It seemed weird to me too. Also, if it required it, the JDK would
> > > probably have to be among the dependencies of the package Sqoop is
> > > in.
> > >
> > > I started working on DBLoader, but the learning curve seems quite
> > > steep and I don't have enough time for it right now. Also, as Ankur
> > > said, it might not be a good idea to hit MySQL from the cluster.
> > >
> > > The ideal solution IMHO would be loading data from MySQL to HDFS
> > > from a single machine (but within a LoadFunc, of course) and working
> > > with the data from there (with the schema automatically converted
> > > from MySQL). But I don't know enough about Pig to do that kind of
> > > thing... yet. :)
> > >
> > > Anze
> > >
> > > On Wednesday 03 November 2010, [email protected] wrote:
> > > > Sorry that you ran into a problem. Typically it is something like
> > > > a missing required option that causes this, and if you were to
> > > > send a mail to [email protected], you would get prompt
> > > > assistance.
> > > >
> > > > Regardless, if you still have any use cases like this, I will be
> > > > glad to help you out in using Sqoop for that purpose.
> > > >
> > > > Arvind
> > > >
> > > > 2010/11/3 Anze <[email protected]>
> > > > > I tried to run it, got a NullPointerException, searched the
> > > > > net, found that Sqoop requires the JDK (instead of the JRE) and
> > > > > gave up. I am working on a production cluster - so I'd rather
> > > > > not upgrade to the JDK if not necessary. :)
> > > > >
> > > > > But I was able to export MySQL with a simple bash script:
> > > > > **********
> > > > > #!/bin/bash
> > > > >
> > > > > MYSQL_TABLES=( table1 table2 table3 )
> > > > > WHERE=/home/hadoop/pig
> > > > >
> > > > > for i in "${MYSQL_TABLES[@]}"
> > > > > do
> > > > >   # dump the table as tab-separated text without column names
> > > > >   mysql -BAN -h <mysql_host> -u <username> --password=<pass> <database> \
> > > > >     -e "select * from $i;" --skip-column-names > $WHERE/$i.csv
> > > > >   # copy into HDFS and remove the local file
> > > > >   hadoop fs -copyFromLocal $WHERE/$i.csv /pig/mysql/
> > > > >   rm $WHERE/$i.csv
> > > > > done
> > > > > **********
> > > > >
> > > > > Of course, in my case the tables were small enough that I could
> > > > > do it. And of course I lost the schema in the process.
> > > > >
> > > > > Hope it helps someone else too...
> > > > >
> > > > > Anze
> > > > >
> > > > > On Wednesday 03 November 2010, [email protected] wrote:
> > > > > > Anze,
> > > > > >
> > > > > > Did you get a chance to try out Sqoop? If not, I would
> > > > > > encourage you to do so. Here is a link to the user guide:
> > > > > > http://archive.cloudera.com/cdh/3/sqoop/SqoopUserGuide.html
> > > > > >
> > > > > > Sqoop allows you to easily move data from relational
> > > > > > databases and other enterprise systems to HDFS and back.
> > > > > >
> > > > > > Arvind
> > > > > >
> > > > > > 2010/11/3 Anze <[email protected]>
> > > > > > > Alejandro, thanks for answering!
> > > > > > >
> > > > > > > I was hoping it could be done directly from Pig, but... :)
> > > > > > >
> > > > > > > I'll take a look at Sqoop then, and if that doesn't help,
> > > > > > > I'll just write a simple batch to export the data to
> > > > > > > TXT/CSV. Thanks for the pointer!
> > > > > > >
> > > > > > > Anze
> > > > > > >
> > > > > > > On Wednesday 03 November 2010, Alejandro Abdelnur wrote:
> > > > > > > > Not a 100% Pig solution, but you could use Sqoop to get
> > > > > > > > the data in as a pre-processing step. And if you want to
> > > > > > > > handle it all as a single job, you could use Oozie to
> > > > > > > > create a workflow that does the Sqoop import and then
> > > > > > > > your Pig processing.
> > > > > > > >
> > > > > > > > Alejandro
> > > > > > > >
> > > > > > > > On Wed, Nov 3, 2010 at 3:22 PM, Anze <[email protected]>
> > > > > > > > wrote:
> > > > > > > > > Hi!
> > > > > > > > >
> > > > > > > > > Part of the data I have resides in MySQL. Is there a
> > > > > > > > > loader that would allow loading directly from it?
> > > > > > > > >
> > > > > > > > > I can't find anything on the net, but it seems to me
> > > > > > > > > this must be quite a common problem.
> > > > > > > > > I checked piggybank but there is only DBStorage (and
> > > > > > > > > no DBLoader).
> > > > > > > > >
> > > > > > > > > Is some DBLoader out there too?
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > >
> > > > > > > > > Anze
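P.P.S. And since "exporting data from HDFS back to the database" was one
of the advantages mentioned above: the user guide describes an export
tool for that direction, which I understand to work along these lines
(untested sketch; the column order in the files has to match the target
table):
-----
# Push the delimited files under /pig/mysql/table1 back into an
# existing MySQL table.
sqoop export \
  --connect jdbc:mysql://<mysql_host>/<database> \
  --username <username> --password <pass> \
  --table table1 \
  --export-dir /pig/mysql/table1
-----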
