Hi Krishnan

Thanks for the clear explanation.

In that case, can I use a SQL condition to filter the data from PostgreSQL and push
only that subset to hive? At the moment, every time I delete the HDFS folder and
re-import, the full 1.12GB of data is pulled back into hadoop from postgreSQL.

I am trying to minimize the network traffic, and I hope you can understand my
requirement.
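
For example, would something along these lines be the right approach? (Just a
rough sketch -- the --last-value timestamp is only a placeholder, and I have not
checked how --incremental lastmodified interacts with --hive-import on Sqoop
1.4.1.)

/app/sqoop/bin/sqoop import --hive-table users --connect
jdbc:postgresql://<server_url>/<database> --table users --username XXXXXXX
--password YYYYYY --hive-home /app/hive --hive-import --incremental
lastmodified --check-column lastmodifiedtime --last-value '2012-04-13 16:31:21'

That way only the rows modified since the previous run would be transferred,
instead of the full 1.12GB.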

Thanks again.

On Mon, Apr 16, 2012 at 1:49 PM, Krishnan K <[email protected]> wrote:

> Hi Roshan,
>
> Yes, because Sqoop first pulls the data into HDFS before pushing it
> into the hive table.
> The "FileAlreadyExistsException" occurs at the HDFS level, not at the
> hive end.
>
> You may want to write a shell script that --
> 1) first deletes the existing directory in HDFS, and then
> 2) executes the sqoop command.
>
> You can then run the shell script whenever you need to perform the import from
> PostgreSQL to Hive.
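>
> Something like this, for example (just a rough sketch -- the connection details
> are the placeholders from your own command, and "users" is the directory that
> Sqoop creates in HDFS for the users table):
>
> #!/bin/sh
> # remove the staging directory left in HDFS by the previous run
> hadoop dfs -rmr users
> # re-run the import from PostgreSQL into the hive table
> /app/sqoop/bin/sqoop import --hive-table users --connect \
>     jdbc:postgresql://<server_url>/<database> --table users --username XXXXXXX \
>     --password YYYYYY --hive-home /app/hive --hive-import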
>
> Regards,
> Krishnan
>
>
>
> On Mon, Apr 16, 2012 at 6:39 AM, Roshan Pradeep <[email protected]> wrote:
>
>> Hi
>>
>> Thanks for the quick reply.
>>
>> Regarding my incremental import, do I need to remove the directory every
>> time when I want to re-import the updated data from PostgreSQL?
>>
>> Because I am working with some GB size data and want to confirm this.
>>
>> Thanks.
>>
>>
>> On Mon, Apr 16, 2012 at 9:48 AM, Krishnan K <[email protected]> wrote:
>>
>>> Hi Roshan,
>>>
>>> Once you have run the sqoop command, it creates an output directory in
>>> HDFS even if the job fails.
>>> You can delete this folder (users) and then try running the sqoop
>>> command once again:
>>>
>>> *hadoop dfs -rmr users*
>>>
>>> Sqoop first imports the data from PostgreSQL into HDFS and
>>> then moves it into the hive default warehouse directory
>>> (/user/hive/warehouse/<tablename>).
>>>
>>> For Sqoop to copy the data into HDFS first, you must ensure that the
>>> output directory does not already exist.
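>>>
>>> For example, you could verify and clear the staging directory before
>>> re-running the import (just an illustration -- "users" is the directory name
>>> Sqoop uses for your table):
>>>
>>> hadoop dfs -ls users                        # staging directory from the last run
>>> hadoop dfs -rmr users                       # delete it so the next import can create it
>>> hadoop dfs -ls /user/hive/warehouse/users   # where the data lands after the hive load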
>>>
>>> -Krishnan
>>>
>>>
>>> On Mon, Apr 16, 2012 at 4:49 AM, Roshan Pradeep <[email protected]> wrote:
>>>
>>>> Hi All
>>>>
>>>> I want to import the updated data from my source (PostgreSQL) to hive
>>>> based on a column (lastmodifiedtime) in postgreSQL
>>>>
>>>> *The command I am using*
>>>>
>>>> /app/sqoop/bin/sqoop import --hive-table users --connect
>>>> jdbc:postgresql://<server_url>/<database> --table users --username XXXXXXX
>>>> --password YYYYYY --hive-home /app/hive --hive-import --incremental
>>>> lastmodified --check-column lastmodifiedtime
>>>>
>>>> With the above command, I am getting the below error
>>>>
>>>> 12/04/13 16:31:21 INFO orm.CompilationManager: Writing jar file:
>>>> /tmp/sqoop-root/compile/11ce8600a5656ed49e631a260c387692/users.jar
>>>> 12/04/13 16:31:21 INFO tool.ImportTool: Incremental import based on
>>>> column "lastmodifiedtime"
>>>> 12/04/13 16:31:21 INFO tool.ImportTool: Upper bound value: '2012-04-13
>>>> 16:31:21.865429'
>>>> 12/04/13 16:31:21 WARN manager.PostgresqlManager: It looks like you are
>>>> importing from postgresql.
>>>> 12/04/13 16:31:21 WARN manager.PostgresqlManager: This transfer can be
>>>> faster! Use the --direct
>>>> 12/04/13 16:31:21 WARN manager.PostgresqlManager: option to exercise a
>>>> postgresql-specific fast path.
>>>> 12/04/13 16:31:21 INFO mapreduce.ImportJobBase: Beginning import of
>>>> users
>>>> 12/04/13 16:31:23 ERROR tool.ImportTool: Encountered IOException
>>>> running import job: org.apache.hadoop.mapred.FileAlreadyExistsException:
>>>> Output directory users already exists
>>>>         at
>>>> org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:123)
>>>>         at
>>>> org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:770)
>>>>         at org.apache.hadoop.mapreduce.Job.submit(Job.java:432)
>>>>         at
>>>> org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:447)
>>>>         at
>>>> org.apache.sqoop.mapreduce.ImportJobBase.runJob(ImportJobBase.java:141)
>>>>         at
>>>> org.apache.sqoop.mapreduce.ImportJobBase.runImport(ImportJobBase.java:201)
>>>>         at
>>>> org.apache.sqoop.manager.SqlManager.importTable(SqlManager.java:413)
>>>>         at
>>>> org.apache.sqoop.manager.PostgresqlManager.importTable(PostgresqlManager.java:102)
>>>>         at
>>>> org.apache.sqoop.tool.ImportTool.importTable(ImportTool.java:380)
>>>>         at org.apache.sqoop.tool.ImportTool.run(ImportTool.java:453)
>>>>         at org.apache.sqoop.Sqoop.run(Sqoop.java:145)
>>>>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>>>         at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:181)
>>>>         at org.apache.sqoop.Sqoop.runTool(Sqoop.java:220)
>>>>         at org.apache.sqoop.Sqoop.runTool(Sqoop.java:229)
>>>>         at org.apache.sqoop.Sqoop.main(Sqoop.java:238)
>>>>         at com.cloudera.sqoop.Sqoop.main(Sqoop.java:57)
>>>>
>>>> According to the above stack trace, Sqoop identifies the updated data
>>>> in postgreSQL, but it reports that the output directory already exists. Could
>>>> someone please help me to correct this issue?
>>>>
>>>> *I am using *
>>>>
>>>> Hadoop - 0.20.2
>>>> Hive - 0.8.1
>>>> Sqoop - 1.4.1-incubating
>>>>
>>>>
>>>> Thanks.
>>>>
>>>
>>>
>>
>
