Hi Krishnan,

Thanks for the clear explanation.
In that case, can I use a SQL query to filter the data on the PostgreSQL side and push only the filtered rows to Hive? At the moment, every time I delete the HDFS folder and re-import, the full 1.12 GB is copied from PostgreSQL back into Hadoop. I am trying to minimize the network traffic, and I hope you can understand my requirement. Thanks again.

On Mon, Apr 16, 2012 at 1:49 PM, Krishnan K <[email protected]> wrote:

> Hi Roshan,
>
> Yes, because Sqoop first pulls the data into HDFS before pushing it into the Hive table.
> The "FileAlreadyExistsException" occurs at the HDFS level, not at the Hive end.
>
> You may want to write a shell script that --
> 1) first deletes the existing directory in HDFS and then
> 2) executes the sqoop command
>
> You can then run the shell script to perform your import operation from PostgreSQL to Hive.
>
> Regards,
> Krishnan
>
>
> On Mon, Apr 16, 2012 at 6:39 AM, Roshan Pradeep <[email protected]> wrote:
>
>> Hi
>>
>> Thanks for the quick reply.
>>
>> Regarding my incremental import, do I need to remove the directory every time I want to re-import the updated data from PostgreSQL?
>>
>> I am working with several GB of data, so I want to confirm this.
>>
>> Thanks.
>>
>>
>> On Mon, Apr 16, 2012 at 9:48 AM, Krishnan K <[email protected]> wrote:
>>
>>> Hi Roshan,
>>>
>>> Once you have run the sqoop command, it creates an output directory in HDFS even if the job fails.
>>> You can delete this folder (users) and then try running the sqoop command once again --
>>>
>>> *hadoop dfs -rmr users* and then execute the import command.
>>>
>>> Sqoop first imports the data from PostgreSQL into HDFS and then moves it into the Hive default directory (/user/hive/warehouse/<tablename>).
>>>
>>> For Sqoop to copy the data into HDFS first, you must ensure that the output directory does not already exist.
>>>
>>> -Krishnan
>>>
>>>
>>> On Mon, Apr 16, 2012 at 4:49 AM, Roshan Pradeep <[email protected]> wrote:
>>>
>>>> Hi All
>>>>
>>>> I want to import the updated data from my source (PostgreSQL) into Hive based on a column (lastmodifiedtime) in PostgreSQL.
>>>>
>>>> *The command I am using*
>>>>
>>>> /app/sqoop/bin/sqoop import --hive-table users --connect jdbc:postgresql:/<server_url>/<database> --table users --username XXXXXXX --password YYYYYY --hive-home /app/hive --hive-import --incremental lastmodified --check-column lastmodifiedtime
>>>>
>>>> With the above command, I am getting the error below:
>>>>
>>>> 12/04/13 16:31:21 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-root/compile/11ce8600a5656ed49e631a260c387692/users.jar
>>>> 12/04/13 16:31:21 INFO tool.ImportTool: Incremental import based on column "lastmodifiedtime"
>>>> 12/04/13 16:31:21 INFO tool.ImportTool: Upper bound value: '2012-04-13 16:31:21.865429'
>>>> 12/04/13 16:31:21 WARN manager.PostgresqlManager: It looks like you are importing from postgresql.
>>>> 12/04/13 16:31:21 WARN manager.PostgresqlManager: This transfer can be faster! Use the --direct
>>>> 12/04/13 16:31:21 WARN manager.PostgresqlManager: option to exercise a postgresql-specific fast path.
>>>> 12/04/13 16:31:21 INFO mapreduce.ImportJobBase: Beginning import of users
>>>> 12/04/13 16:31:23 ERROR tool.ImportTool: Encountered IOException running import job: org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory users already exists
>>>>         at org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:123)
>>>>         at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:770)
>>>>         at org.apache.hadoop.mapreduce.Job.submit(Job.java:432)
>>>>         at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:447)
>>>>         at org.apache.sqoop.mapreduce.ImportJobBase.runJob(ImportJobBase.java:141)
>>>>         at org.apache.sqoop.mapreduce.ImportJobBase.runImport(ImportJobBase.java:201)
>>>>         at org.apache.sqoop.manager.SqlManager.importTable(SqlManager.java:413)
>>>>         at org.apache.sqoop.manager.PostgresqlManager.importTable(PostgresqlManager.java:102)
>>>>         at org.apache.sqoop.tool.ImportTool.importTable(ImportTool.java:380)
>>>>         at org.apache.sqoop.tool.ImportTool.run(ImportTool.java:453)
>>>>         at org.apache.sqoop.Sqoop.run(Sqoop.java:145)
>>>>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>>>         at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:181)
>>>>         at org.apache.sqoop.Sqoop.runTool(Sqoop.java:220)
>>>>         at org.apache.sqoop.Sqoop.runTool(Sqoop.java:229)
>>>>         at org.apache.sqoop.Sqoop.main(Sqoop.java:238)
>>>>         at com.cloudera.sqoop.Sqoop.main(Sqoop.java:57)
>>>>
>>>> According to the above stack trace, Sqoop identifies the updated data in PostgreSQL, but it reports that the output directory already exists. Could someone please help me correct this issue?
>>>>
>>>> *I am using*
>>>>
>>>> Hadoop - 0.20.2
>>>> Hive - 0.8.1
>>>> Sqoop - 1.4.1-incubating
>>>>
>>>> Thanks.
>>>
>>>
>>
>
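For reference, a minimal sketch of the wrapper script Krishnan describes in the quoted reply above: delete the leftover HDFS output directory, then re-run the same import. The table name, paths, and credentials are taken from the command earlier in the thread (the connection string is written here in the standard jdbc:postgresql://host/database form); the placeholders would be replaced with real values.

    #!/bin/sh
    # 1) Remove the intermediate HDFS directory left behind by the previous run,
    #    so the import does not fail with FileAlreadyExistsException.
    hadoop dfs -rmr users

    # 2) Re-run the incremental import from PostgreSQL into Hive.
    /app/sqoop/bin/sqoop import \
        --connect jdbc:postgresql://<server_url>/<database> \
        --table users --username XXXXXXX --password YYYYYY \
        --hive-import --hive-table users --hive-home /app/hive \
        --incremental lastmodified --check-column lastmodifiedtime

The hadoop dfs -rmr form matches the Hadoop 0.20.2 syntax used in the thread; newer Hadoop releases spell it hadoop fs -rm -r.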
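On the question of limiting network traffic, one possibility (not suggested in the thread, just a hedged sketch built on the same flags) is to let Sqoop's incremental mode do the filtering on the PostgreSQL side by passing --last-value, so only rows whose lastmodifiedtime is newer than the previous run's upper bound are transferred instead of the full 1.12 GB.

    # The previous run's upper bound, as printed in the import log above
    # ("Upper bound value: '2012-04-13 16:31:21.865429'"); in practice the
    # wrapper script would record this after each run.
    LAST_VALUE='2012-04-13 16:31:21.865429'

    /app/sqoop/bin/sqoop import \
        --connect jdbc:postgresql://<server_url>/<database> \
        --table users --username XXXXXXX --password YYYYYY \
        --hive-import --hive-table users --hive-home /app/hive \
        --incremental lastmodified --check-column lastmodifiedtime \
        --last-value "$LAST_VALUE"

An explicit --where "lastmodifiedtime > '...'" clause on the sqoop import command would push the same filter down to PostgreSQL.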
