Unfortunately, Venkat is correct about import + --hcatalog with Parquet; however, in case others following this thread are not aware, it is possible to import Parquet data into Hive.
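For reference, the unsupported combination Venkat describes in his reply below would look roughly like this (a sketch only, reusing the connection variables and table names from the working example that follows; per HIVE-7502, HCatalog cannot write Parquet, so this command is expected to fail rather than run):

# Sketch of the combination HIVE-7502 blocks; t1_parquet is assumed to be an
# existing Hive table stored as Parquet.
$ sqoop import --connect $MYCONN --username $MYUSER --password $MYPSWD --table t1 --hcatalog-database default --hcatalog-table t1_parquet --num-mappers 1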
A working example, by contrast, first imports into a new Parquet-backed Hive table and then appends a second copy of the rows:

$ sqoop import --connect $MYCONN --username $MYUSER --password $MYPSWD --table t1 --target-dir /user/<hdfs_user>/t1 --delete-target-dir --hive-import --hive-database default --hive-table t1_parquet --num-mappers 1 --as-parquetfile

$ sqoop import --connect $MYCONN --username $MYUSER --password $MYPSWD --table t1 --target-dir /user/hive/warehouse/t1_parquet --num-mappers 1 --as-parquetfile --append

$ impala-shell -i "<impala_daemon_host>:21000" -q "use default; select c_int, from_unixtime(cast(c_date/1000 as bigint), 'yyyy-MM-dd') as c_date, from_unixtime(cast(c_timestamp/1000 as bigint), 'yyyy-MM-dd HH:mm:ss.S') as c_timestamp from t1_parquet"
+-------+------------+-----------------------+
| c_int | c_date     | c_timestamp           |
+-------+------------+-----------------------+
| 1     | 2016-10-26 | 2016-10-26 21:30:33.0 |
| 1     | 2016-10-26 | 2016-10-26 21:30:33.0 |
+-------+------------+-----------------------+
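To confirm that the resulting table really is Parquet-backed, Hive's DESCRIBE FORMATTED shows the SerDe in use (a sketch; output trimmed to the relevant line, which matches the ParquetHiveSerDe visible in the DESCRIBE EXTENDED output later in this thread):

# Verify the storage format of the imported table (output abbreviated).
$ hive -e "describe formatted default.t1_parquet"
...
SerDe Library: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
...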
Markus Kemper
Customer Operations Engineer

On Tue, Oct 25, 2016 at 4:35 PM, Venkat Ranganathan <[email protected]> wrote:

> Thanks for testing and answering that the export works. One thing to note is that imports are not allowed into Parquet tables with HCatalog (see HIVE-7502). It has been a long-standing issue since we created the Sqoop HCatalog integration a few years ago.
>
> Thanks
>
> Venkat
>
> *From: *Markus Kemper <[email protected]>
> *Reply-To: *"[email protected]" <[email protected]>
> *Date: *Tuesday, October 25, 2016 at 12:01 PM
> *To: *"[email protected]" <[email protected]>
> *Subject: *Re: Sqoop export from Hive table stored as Parquet
>
> Awesomeness, and thank you for helping with the other forums.
>
> Markus Kemper
> Customer Operations Engineer
>
> On Tue, Oct 25, 2016 at 2:56 PM, Douglas Spadotto <[email protected]> wrote:
>
> Hi Markus,
>
> It worked fine end to end. Here is the edited output:
>
> *SOURCE:*
>
> hive> describe extended teste1;
> OK
> id      int
> nome    string
>
> Detailed Table Information  Table(tableName:teste1, dbName:default, owner:cloudera, createTime:1477408342, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:id, type:int, comment:null), FieldSchema(name:nome, type:string, comment:null)], location:hdfs://quickstart.cloudera:8020/user/hive/warehouse/teste1, *inputFormat:org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat,* compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe, parameters:{serialization.format=1}), bucketCols:[], sortCols:[], parameters:{}, skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[], skewedColValueLocationMaps:{}), storedAsSubDirectories:false), partitionKeys:[], parameters:{numFiles=2, transient_lastDdlTime=1477408508, COLUMN_STATS_ACCURATE=true, totalSize=645, numRows=3, rawDataSize=6}, viewOriginalText:null, viewExpandedText:null, tableType:MANAGED_TABLE)
> Time taken: 0.08 seconds, Fetched: 4 row(s)
>
> hive> select * from teste1;
> OK
> ...
> 1       Douglas
> 2       Spadotto
> 3       Doug
> Time taken: 0.323 seconds, Fetched: 3 row(s)
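Note that sqoop export does not create the destination table. The DDL is not shown in this thread, but judging from the column metadata Sqoop logs below (JDBC type 4 = INTEGER; type 12 = VARCHAR with precision 50), the PostgreSQL table was presumably created along these lines (an assumption, for illustration only):

# Hypothetical DDL for the export target, inferred from the logged column types.
$ psql -d postgres -c "create table test1 (id integer, nome varchar(50))"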
>
> *SQOOP:*
>
> [cloudera@quickstart ~]$ sqoop export --connect jdbc:postgresql://localhost/postgres --table test1 --hcatalog-table teste1
> Warning: /usr/lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
> Please set $ACCUMULO_HOME to the root of your Accumulo installation.
> 16/10/25 11:42:19 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6-cdh5.8.0
> 16/10/25 11:42:19 INFO manager.SqlManager: Using default fetchSize of 1000
> 16/10/25 11:42:19 INFO tool.CodeGenTool: Beginning code generation
> 16/10/25 11:42:19 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM "test1" AS t LIMIT 1
> 16/10/25 11:42:19 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/lib/hadoop-mapreduce
> Note: /tmp/sqoop-cloudera/compile/50329ee88534a5c3b6df8b37aadf2131/test1.java uses or overrides a deprecated API.
> Note: Recompile with -Xlint:deprecation for details.
> 16/10/25 11:42:20 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-cloudera/compile/50329ee88534a5c3b6df8b37aadf2131/test1.jar
> 16/10/25 11:42:20 INFO mapreduce.ExportJobBase: Beginning export of test1
> 16/10/25 11:42:21 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
> 16/10/25 11:42:21 INFO Configuration.deprecation: mapred.map.max.attempts is deprecated. Instead, use mapreduce.map.maxattempts
> 16/10/25 11:42:21 INFO mapreduce.ExportJobBase: Configuring HCatalog for export job
> 16/10/25 11:42:21 INFO hcat.SqoopHCatUtilities: Configuring HCatalog specific details for job
> 16/10/25 11:42:21 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM "test1" AS t LIMIT 1
> 16/10/25 11:42:21 INFO hcat.SqoopHCatUtilities: Database column names projected : [id, nome]
> 16/10/25 11:42:21 INFO hcat.SqoopHCatUtilities: Database column name - info map :
> id : [Type : 4,Precision : 10,Scale : 0]
> nome : [Type : 12,Precision : 50,Scale : 0] *<-- Got the structure here!*
>
> 16/10/25 11:42:21 INFO hive.metastore: Trying to connect to metastore with URI thrift://quickstart.cloudera:9083
> 16/10/25 11:42:21 INFO hive.metastore: Opened a connection to metastore, current connections: 1
> 16/10/25 11:42:21 INFO hive.metastore: Connected to metastore.
> 16/10/25 11:42:21 INFO hcat.SqoopHCatUtilities: HCatalog full table schema fields = [id, nome]
> ...
> 16/10/25 11:42:22 WARN hcat.SqoopHCatUtilities: No files under /usr/lib/hive-hcatalog/share/hcatalog/storage-handlers to add to distributed cache for hcatalog job
> 16/10/25 11:42:22 INFO hcat.SqoopHCatUtilities: Configuring HCatalog for export job
> 16/10/25 11:42:22 INFO hcat.SqoopHCatUtilities: Ignoring configuration request for HCatalog info
> 16/10/25 11:42:22 INFO Configuration.deprecation: mapred.reduce.tasks.speculative.execution is deprecated. Instead, use mapreduce.reduce.speculative
> 16/10/25 11:42:22 INFO Configuration.deprecation: mapred.map.tasks.speculative.execution is deprecated. Instead, use mapreduce.map.speculative
> 16/10/25 11:42:22 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
> 16/10/25 11:42:22 INFO client.RMProxy: Connecting to ResourceManager at quickstart.cloudera/192.168.26.129:8032
> 16/10/25 11:42:49 INFO Configuration.deprecation: mapred.input.dir is deprecated. Instead, use mapreduce.input.fileinputformat.inputdir
> 16/10/25 11:42:49 INFO mapred.FileInputFormat: Total input paths to process : 2
> 16/10/25 11:42:50 INFO mapreduce.JobSubmitter: number of splits:2
> ...
> 16/10/25 11:43:07 INFO mapreduce.ExportJobBase: Transferred 15.9258 KB in 45.1568 seconds (361.1415 bytes/sec)
> 16/10/25 11:43:07 INFO mapreduce.ExportJobBase: Exported 3 records.
> 16/10/25 11:43:07 INFO hive.metastore: Closed a connection to metastore, current connections: 0
>
> *DESTINATION:*
>
> [cloudera@quickstart ~]$ psql -d postgres -c "select * from test1"
>  id |   nome
> ----+----------
>   1 | Douglas
>   2 | Spadotto
>   3 | Doug
> (3 rows)
>
> Now I'll go off into the Internet and answer everyone who posted this same question on other forums. :)
>
> Regards,
>
> Douglas
>
> On Tue, Oct 25, 2016 at 4:00 PM, Markus Kemper <[email protected]> wrote:
>
> Glad to hear it. Let us know how it goes.
>
> Markus Kemper
> Customer Operations Engineer
>
> On Tue, Oct 25, 2016 at 1:55 PM, Douglas Spadotto <[email protected]> wrote:
>
> Hi Markus,
>
> Thank you!
>
> I tried this (the hcatalog options) myself a few minutes after I hit "send" on the e-mail. It worked fine; Sqoop was able to read the Parquet structure. My MapReduce job crashed, but that was due to my unstable environment.
>
> It looks like I'm on the way to the solution.
>
> Cheers,
>
> Douglas
>
> On Tue, Oct 25, 2016 at 3:32 PM, Markus Kemper <[email protected]> wrote:
>
> Hello Douglas,
>
> The only workaround that I am aware of is to use the Sqoop --hcatalog options, for example:
>
> sqoop export --connect <jdbc_connection_string> --table <rdbms_table> --hcatalog-database <hive_database> --hcatalog-table <hive_table>
>
> Markus Kemper
> Customer Operations Engineer
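Filled in with the names used in this thread, that generic command becomes the one Douglas ran above. The --hcatalog-database flag is included here for completeness; when it is omitted, as in Douglas's command, Sqoop falls back to the Hive database named "default":

# The workaround with this thread's table names; --hcatalog-database default is implied if omitted.
$ sqoop export --connect jdbc:postgresql://localhost/postgres --table test1 --hcatalog-database default --hcatalog-table teste1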
>
> On Tue, Oct 25, 2016 at 1:21 PM, Douglas Spadotto <[email protected]> wrote:
>
> Hello everyone,
>
> I saw in the past few months quite a few messages about Parquet support in Sqoop, all about importing. Some of them worked well.
>
> But for exporting, I'm receiving this error when trying to export from a Hive table stored as Parquet to PostgreSQL:
>
> [cloudera@quickstart ~]$ sqoop export --connect jdbc:postgresql://localhost/postgres --table test1 --export-dir /user/hive/warehouse/teste1
> Warning: /usr/lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
> Please set $ACCUMULO_HOME to the root of your Accumulo installation.
> 16/10/25 09:19:09 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6-cdh5.8.0
> 16/10/25 09:19:09 INFO manager.SqlManager: Using default fetchSize of 1000
> 16/10/25 09:19:09 INFO tool.CodeGenTool: Beginning code generation
> 16/10/25 09:19:10 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM "test1" AS t LIMIT 1
> 16/10/25 09:19:10 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/lib/hadoop-mapreduce
> Note: /tmp/sqoop-cloudera/compile/019c3435216213411e2de14c483af692/test1.java uses or overrides a deprecated API.
> Note: Recompile with -Xlint:deprecation for details.
> 16/10/25 09:19:11 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-cloudera/compile/019c3435216213411e2de14c483af692/test1.jar
> 16/10/25 09:19:11 INFO mapreduce.ExportJobBase: Beginning export of test1
> 16/10/25 09:19:12 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
> 16/10/25 09:19:12 INFO Configuration.deprecation: mapred.map.max.attempts is deprecated. Instead, use mapreduce.map.maxattempts
> 16/10/25 09:19:13 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM "test1" AS t LIMIT 1
> 16/10/25 09:19:13 ERROR sqoop.Sqoop: Got exception running Sqoop: org.kitesdk.data.DatasetNotFoundException: Descriptor location does not exist: hdfs://quickstart.cloudera:8020/user/hive/warehouse/teste1/.metadata
> org.kitesdk.data.DatasetNotFoundException: Descriptor location does not exist: hdfs://quickstart.cloudera:8020/user/hive/warehouse/teste1/.metadata
>     at org.kitesdk.data.spi.filesystem.FileSystemMetadataProvider.checkExists(FileSystemMetadataProvider.java:562)
>     at org.kitesdk.data.spi.filesystem.FileSystemMetadataProvider.find(FileSystemMetadataProvider.java:605)
>     at org.kitesdk.data.spi.filesystem.FileSystemMetadataProvider.load(FileSystemMetadataProvider.java:114)
>     at org.kitesdk.data.spi.filesystem.FileSystemDatasetRepository.load(FileSystemDatasetRepository.java:197)
>     at org.kitesdk.data.Datasets.load(Datasets.java:108)
>     at org.kitesdk.data.Datasets.load(Datasets.java:140)
>     at org.kitesdk.data.mapreduce.DatasetKeyInputFormat$ConfigBuilder.readFrom(DatasetKeyInputFormat.java:92)
>     at org.kitesdk.data.mapreduce.DatasetKeyInputFormat$ConfigBuilder.readFrom(DatasetKeyInputFormat.java:139)
>     at org.apache.sqoop.mapreduce.JdbcExportJob.configureInputFormat(JdbcExportJob.java:84)
>     at org.apache.sqoop.mapreduce.ExportJobBase.runExport(ExportJobBase.java:432)
>     at org.apache.sqoop.manager.SqlManager.exportTable(SqlManager.java:931)
>     at org.apache.sqoop.tool.ExportTool.exportTable(ExportTool.java:81)
>     at org.apache.sqoop.tool.ExportTool.run(ExportTool.java:100)
>     at org.apache.sqoop.Sqoop.run(Sqoop.java:143)
>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>     at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:179)
>     at org.apache.sqoop.Sqoop.runTool(Sqoop.java:218)
>     at org.apache.sqoop.Sqoop.runTool(Sqoop.java:227)
>     at org.apache.sqoop.Sqoop.main(Sqoop.java:236)
>
> I saw a recent JIRA opened about this (https://issues.apache.org/jira/browse/SQOOP-2907) and am wondering if there is any workaround for it.
>
> Thanks in advance,
>
> Douglas
>
> --
> Visite: http://canseidesercowboy.wordpress.com/
> Siga: @dougspadotto ou @excowboys
> -----
> Frodo: "I wish none of this had happened."
> Gandalf: "So do all who live to see such times, but that is not for them to decide. All we have to decide is what to do with the time that is given to us."
> -- Lord of the Rings: The Fellowship of the Ring (2001)
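A closing note on the root cause: with --export-dir, Sqoop's Parquet handling goes through the Kite SDK (visible in the org.kitesdk frames above), and Kite expects a dataset descriptor in a .metadata subdirectory next to the data files. A table created with plain Hive DDL has no such descriptor, hence the DatasetNotFoundException. Listing the table directory makes this easy to spot (a sketch; for a Hive-created table, expect only Parquet data files and no .metadata entry):

# Check for the Kite descriptor the failing export is looking for.
$ hdfs dfs -ls /user/hive/warehouse/teste1

The --hcatalog options avoid Kite altogether by reading the table through the Hive metastore, which is why the workaround above succeeds.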
