Hi Igor,

Hive can read compressed files directly. If you are receiving a lot of external files, it makes sense to compress them and store them in a staging HDFS directory.
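As a quick sanity check before the steps below: bzip2 is lossless, so compressing the CSVs before staging does not alter their content. A minimal sketch using a made-up sample file (the file name and contents are illustrative only):

```shell
# Create a tiny sample CSV (contents are illustrative only)
printf 'INVOICENUMBER,PAYMENTDATE,NET,VAT,TOTAL\n1,01/01/2011,100.00,20.00,120.00\n' > nw_sample.csv

# Compress with -k to keep the original for comparison
bzip2 -kf nw_sample.csv

# Decompress to stdout and compare byte-for-byte with the original
bzcat nw_sample.csv.bz2 | cmp - nw_sample.csv && echo "round trip ok"
```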
1) Download say these CSV files into your local file system and use bzip2 to compress them as part of ETL:

ls -l
total 68
-rw-r--r-- 1 hduser hadoop 7334 Apr 25 11:29 nw_2011.csv.bz2
-rw-r--r-- 1 hduser hadoop 6235 Apr 25 11:29 nw_2012.csv.bz2
-rw-r--r-- 1 hduser hadoop 5476 Apr 25 11:29 nw_2013.csv.bz2
-rw-r--r-- 1 hduser hadoop 2725 Apr 25 11:29 nw_2014.csv.bz2
-rw-r--r-- 1 hduser hadoop 1868 Apr 25 11:29 nw_2015.csv.bz2
-rw-r--r-- 1 hduser hadoop  693 Apr 25 11:29 nw_2016.csv.bz2

Then put these files in a staging directory on HDFS using a shell script:

for FILE in `ls *.* | grep -v .ksh`
do
  echo "Bzipping ${FILE}"
  /usr/bin/bzip2 ${FILE}
  hdfs dfs -copyFromLocal ${FILE}.bz2 ${TARGETDIR}
done

OK, now the files are saved in ${TARGETDIR}.

Now create the external table pointing at this staging directory. No need to tell Hive that these files are compressed; it knows how to handle them. They are stored as textfiles:

DROP TABLE IF EXISTS stg_t2;
CREATE EXTERNAL TABLE stg_t2
(
 INVOICENUMBER string
,PAYMENTDATE string
,NET string
,VAT string
,TOTAL string
)
COMMENT 'from csv file from excel sheet nw_10124772'
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
STORED AS TEXTFILE
LOCATION '/data/stg/accounts/nw/10124772'
TBLPROPERTIES ("skip.header.line.count"="1");

Now create the Hive table internally. Note that I want this data to be compressed, so tell Hive to compress the table with ZLIB or SNAPPY:

DROP TABLE IF EXISTS t2;
CREATE TABLE t2
(
 INVOICENUMBER INT
,PAYMENTDATE date
,NET DECIMAL(20,2)
,VAT DECIMAL(20,2)
,TOTAL DECIMAL(20,2)
)
COMMENT 'from csv file from excel sheet nw_10124772'
CLUSTERED BY (INVOICENUMBER) INTO 256 BUCKETS
STORED AS ORC
TBLPROPERTIES ("orc.compress"="ZLIB");

Put data in the target table.
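The INSERT that follows cleans the staged strings with regexp_replace before casting to DECIMAL. The same stripping of non-numeric characters can be tried locally with sed (the sample value is made up):

```shell
# Hive's regexp_replace(total, '[^\d\.]', '') keeps only digits and dots;
# sed's [^0-9.] character class does the same (sample value is made up)
echo '£1,234.56' | sed 's/[^0-9.]//g'    # prints 1234.56
```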
The INSERT below does the conversion and ignores empty rows:

INSERT INTO TABLE t2
SELECT
  INVOICENUMBER
, TO_DATE(FROM_UNIXTIME(UNIX_TIMESTAMP(paymentdate,'dd/MM/yyyy'),'yyyy-MM-dd')) AS paymentdate
, CAST(REGEXP_REPLACE(net,'[^\\d\\.]','') AS DECIMAL(20,2))
, CAST(REGEXP_REPLACE(vat,'[^\\d\\.]','') AS DECIMAL(20,2))
, CAST(REGEXP_REPLACE(total,'[^\\d\\.]','') AS DECIMAL(20,2))
FROM stg_t2
WHERE
-- INVOICENUMBER > 0 AND
   CAST(REGEXP_REPLACE(total,'[^\\d\\.]','') AS DECIMAL(20,2)) > 0.0 -- Exclude empty rows
;

So pretty straightforward. Now to your question "it will affect performance, correct?"

Compression is a well-established technique and has been around in databases for a long time. Almost all RDBMSs (Oracle, Sybase etc.) can compress data and backups through an option. Compression is more CPU-intensive than going without it, but the database handles converting the data from compressed to uncompressed when you read it. So yes, there is a performance price to pay, albeit a small one: extra CPU to uncompress the data and present it. That is a small price for the reduction in storage cost.

HTH

Dr Mich Talebzadeh

LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com


On 6 June 2016 at 23:18, Igor Kravzov <igork.ine...@gmail.com> wrote:

> Mich, will Hive automatically detect and unzip zipped files? Or is there a
> special option in the table configuration?
> It will affect performance, correct?
>
> On Mon, Jun 6, 2016 at 4:14 PM, Mich Talebzadeh <mich.talebza...@gmail.com>
> wrote:
>
>> Hi Sandeep.
>>
>> I tend to use Hive external tables as staging tables, but I will still
>> require write access to HDFS.
>>
>> Zip files work OK as well.
>> For example our CSV files are zipped using bzip2 to save space.
>>
>> However, as a temporary workaround you may disable permission checking in
>> $HADOOP_HOME/etc/hadoop/hdfs-site.xml:
>>
>> <property>
>>   <name>dfs.permissions</name>
>>   <value>false</value>
>> </property>
>>
>> There are other ways as well. Check this:
>>
>> http://stackoverflow.com/questions/11593374/permission-denied-at-hdfs
>>
>> HTH
>>
>> Dr Mich Talebzadeh
>>
>> On 6 June 2016 at 21:00, Igor Kravzov <igork.ine...@gmail.com> wrote:
>>
>>> I see the files have the extension .gz. Are these zipped?
>>> Did you try with unzipped files?
>>> Maybe in order to read the data Hive needs to unzip the files but does
>>> not have write permission? Just a wild guess...
>>>
>>> On Tue, May 31, 2016 at 4:20 AM, Sandeep Giri <sand...@cloudxlab.com>
>>> wrote:
>>>
>>>> Yes, when I run hadoop fs it gives results correctly.
>>>>
>>>> hadoop fs -ls /data/SentimentFiles/SentimentFiles/upload/data/tweets_raw/
>>>> Found 30 items
>>>> -rw-r--r--   3 hdfs hdfs    6148 2015-12-04 15:19 /data/SentimentFiles/SentimentFiles/upload/data/tweets_raw/.DS_Store
>>>> -rw-r--r--   3 hdfs hdfs  803323 2015-12-04 15:19 /data/SentimentFiles/SentimentFiles/upload/data/tweets_raw/FlumeData.1367523670393.gz
>>>> -rw-r--r--   3 hdfs hdfs  284355 2015-12-04 15:19 /data/SentimentFiles/SentimentFiles/upload/data/tweets_raw/FlumeData.1367523670394.gz
>>>> ....
>>>>
>>>> On Tue, May 31, 2016 at 1:42 PM, Mich Talebzadeh <
>>>> mich.talebza...@gmail.com> wrote:
>>>>
>>>>> Is this location correct and valid?
>>>>>
>>>>> LOCATION '/data/SentimentFiles/SentimentFiles/upload/data/tweets_raw/'
>>>>>
>>>>> Dr Mich Talebzadeh
>>>>>
>>>>> On 31 May 2016 at 08:50, Sandeep Giri <sand...@cloudxlab.com> wrote:
>>>>>
>>>>>> Hi Hive Team,
>>>>>>
>>>>>> As per my understanding, in Hive you can create two kinds of tables:
>>>>>> managed and external.
>>>>>>
>>>>>> In the case of a managed table, you own the data, and hence when you
>>>>>> drop the table the data is deleted.
>>>>>>
>>>>>> In the case of an external table, you don't have ownership of the
>>>>>> data, and hence when you delete such a table the underlying data is
>>>>>> not deleted; only the metadata is deleted.
>>>>>>
>>>>>> Now, recently I have observed that you cannot create an external
>>>>>> table over a location on which you don't have write (modification)
>>>>>> permissions in HDFS. I completely fail to understand this.
>>>>>>
>>>>>> Use case: it is quite common that the data you are churning is huge
>>>>>> and read-only. So, to churn such data via Hive, will you have to copy
>>>>>> this huge data to a location on which you have write permissions?
>>>>>>
>>>>>> Please help.
>>>>>>
>>>>>> My data is located in an HDFS folder
>>>>>> (/data/SentimentFiles/SentimentFiles/upload/data/tweets_raw/) on
>>>>>> which I only have read-only permission.
>>>>>> And I am trying to execute the following command:
>>>>>>
>>>>>> CREATE EXTERNAL TABLE tweets_raw (
>>>>>>   id BIGINT,
>>>>>>   created_at STRING,
>>>>>>   source STRING,
>>>>>>   favorited BOOLEAN,
>>>>>>   retweet_count INT,
>>>>>>   retweeted_status STRUCT<
>>>>>>     text:STRING,
>>>>>>     users:STRUCT<screen_name:STRING,name:STRING>>,
>>>>>>   entities STRUCT<
>>>>>>     urls:ARRAY<STRUCT<expanded_url:STRING>>,
>>>>>>     user_mentions:ARRAY<STRUCT<screen_name:STRING,name:STRING>>,
>>>>>>     hashtags:ARRAY<STRUCT<text:STRING>>>,
>>>>>>   text STRING,
>>>>>>   user1 STRUCT<
>>>>>>     screen_name:STRING,
>>>>>>     name:STRING,
>>>>>>     friends_count:INT,
>>>>>>     followers_count:INT,
>>>>>>     statuses_count:INT,
>>>>>>     verified:BOOLEAN,
>>>>>>     utc_offset:STRING, -- was INT but nulls are strings
>>>>>>     time_zone:STRING>,
>>>>>>   in_reply_to_screen_name STRING,
>>>>>>   year int,
>>>>>>   month int,
>>>>>>   day int,
>>>>>>   hour int
>>>>>> )
>>>>>> ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
>>>>>> WITH SERDEPROPERTIES ("ignore.malformed.json" = "true")
>>>>>> LOCATION '/data/SentimentFiles/SentimentFiles/upload/data/tweets_raw/'
>>>>>> ;
>>>>>>
>>>>>> It throws the following error:
>>>>>>
>>>>>> FAILED: Execution Error, return code 1 from
>>>>>> org.apache.hadoop.hive.ql.exec.DDLTask.
>>>>>> MetaException(message:java.security.AccessControlException: Permission
>>>>>> denied: user=sandeep, access=WRITE,
>>>>>> inode="/data/SentimentFiles/SentimentFiles/upload/data/tweets_raw":hdfs:hdfs:drwxr-xr-x
>>>>>>   at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:319)
>>>>>>   at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:219)
>>>>>>   at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:190)
>>>>>>   at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1771)
>>>>>>   at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1755)
>>>>>>   at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPathAccess(FSDirectory.java:1729)
>>>>>>   at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkAccess(FSNamesystem.java:8348)
>>>>>>   at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.checkAccess(NameNodeRpcServer.java:1978)
>>>>>>   at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.checkAccess(ClientNamenodeProtocolServerSideTranslatorPB.java:1443)
>>>>>>   at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>>>>>>   at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
>>>>>>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
>>>>>>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2151)
>>>>>>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2147)
>>>>>>   at java.security.AccessController.doPrivileged(Native Method)
>>>>>>   at javax.security.auth.Subject.doAs(Subject.java:422)
>>>>>>   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
>>>>>>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2145)
>>>>>>
>>>>>> --
>>>>>> Regards,
>>>>>> Sandeep Giri,
>>>>>> +1-(347) 781-4573 (US)
>>>>>> +91-953-899-8962 (IN)
>>>>>> www.CloudxLab.com (A Hadoop cluster for practicing)
>>>>
>>>> --
>>>> Regards,
>>>> Sandeep Giri,
>>>> +1-(347) 781-4573 (US)
>>>> +91-953-899-8962 (IN)
>>>> www.CloudxLab.com