Hi Igor,

Hive can read compressed (for example bzip2) files directly. If you are
receiving a lot of external files, it makes sense to compress them and store
them in a staging HDFS directory

1) Download, say, these CSV files into your local file system and use bzip2
to compress them as part of ETL:

 ls -l
total 68
-rw-r--r-- 1 hduser hadoop 7334 Apr 25 11:29 nw_2011.csv.bz2
-rw-r--r-- 1 hduser hadoop 6235 Apr 25 11:29 nw_2012.csv.bz2
-rw-r--r-- 1 hduser hadoop 5476 Apr 25 11:29 nw_2013.csv.bz2
-rw-r--r-- 1 hduser hadoop 2725 Apr 25 11:29 nw_2014.csv.bz2
-rw-r--r-- 1 hduser hadoop 1868 Apr 25 11:29 nw_2015.csv.bz2
-rw-r--r-- 1 hduser hadoop  693 Apr 25 11:29 nw_2016.csv.bz2

Then put these files in a staging directory on HDFS using a shell script
along these lines:


TARGETDIR="/data/stg/accounts/nw/10124772"   # HDFS staging directory (same as the external table LOCATION below)

for FILE in `ls *.* | grep -v .ksh`          # skip the script itself
do
  echo "Bzipping ${FILE}"
  /usr/bin/bzip2 ${FILE}
  hdfs dfs -copyFromLocal ${FILE}.bz2 ${TARGETDIR}
done

OK now the files are saved in ${TARGETDIR}
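
Before creating any tables, it is worth a quick check that the compressed
files actually landed in the staging directory and are readable. A minimal
sanity check along these lines (it assumes bzcat is available on the local
machine; the path is the same staging directory used for ${TARGETDIR}):

hdfs dfs -ls /data/stg/accounts/nw/10124772
# peek at one file to confirm it is valid bzip2 and still has its header row
# (bzcat here is an assumption about what is installed locally)
hdfs dfs -cat /data/stg/accounts/nw/10124772/nw_2011.csv.bz2 | bzcat | head -3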

Now create the external table pointing at this staging directory. *No need
to tell Hive that these files are compressed*; it knows how to handle them.
They are stored as text files


DROP TABLE IF EXISTS stg_t2;
CREATE EXTERNAL TABLE stg_t2 (
 INVOICENUMBER string
,PAYMENTDATE string
,NET string
,VAT string
,TOTAL string
)
COMMENT 'from csv file from excel sheet nw_10124772'
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
STORED AS TEXTFILE
LOCATION '/data/stg/accounts/nw/10124772'
TBLPROPERTIES ("skip.header.line.count"="1");
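
Once the external table exists, a quick way to confirm that Hive decompresses
the bz2 files transparently is to pull back a handful of rows (just a sanity
check, not part of the load itself):

-- Hive should read the .bz2 text files without any extra settings
SELECT invoicenumber, paymentdate, total FROM stg_t2 LIMIT 5;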

Now create the internal (managed) Hive table. Note that I want this data to
be compressed, so you tell Hive to compress the table with ZLIB or SNAPPY:


DROP TABLE IF EXISTS t2;
CREATE TABLE t2 (
 INVOICENUMBER          INT
,PAYMENTDATE            date
,NET                    DECIMAL(20,2)
,VAT                    DECIMAL(20,2)
,TOTAL                  DECIMAL(20,2)
)
COMMENT 'from csv file from excel sheet nw_10124772'
CLUSTERED BY (INVOICENUMBER) INTO 256 BUCKETS
STORED AS ORC
TBLPROPERTIES ("orc.compress"="ZLIB");
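
If you want to double-check the compression setting once the table exists,
SHOW TBLPROPERTIES lists it:

SHOW TBLPROPERTIES t2;
-- orc.compress should come back as ZLIB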

Put the data into the target table, doing the type conversions and ignoring
empty rows:

INSERT INTO TABLE t2
SELECT
          INVOICENUMBER
        , TO_DATE(FROM_UNIXTIME(UNIX_TIMESTAMP(paymentdate,'dd/MM/yyyy'),'yyyy-MM-dd')) AS paymentdate
        , CAST(REGEXP_REPLACE(net,'[^\\d\\.]','') AS DECIMAL(20,2))
        , CAST(REGEXP_REPLACE(vat,'[^\\d\\.]','') AS DECIMAL(20,2))
        , CAST(REGEXP_REPLACE(total,'[^\\d\\.]','') AS DECIMAL(20,2))
FROM stg_t2
WHERE
--      INVOICENUMBER > 0 AND
        CAST(REGEXP_REPLACE(total,'[^\\d\\.]','') AS DECIMAL(20,2)) > 0.0  -- exclude empty rows
;
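
A rough post-load check is to compare row counts between the staging and
target tables; the target should have at most as many rows as the staging
table, since the empty rows are filtered out:

SELECT COUNT(1) FROM stg_t2;
SELECT COUNT(1) FROM t2;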

So pretty straightforward.

Now to your question:

"it will affect performance, correct?"


Compression is a well-established technique and has been around in databases
for a long time. Almost all RDBMSs (Oracle, Sybase etc.) can compress data in
the database and in backups through an option. Compression uses more CPU than
storing the data uncompressed, but the database handles the decompression
transparently when you read the data. So yes, there is a performance price to
pay, namely the extra CPU used to decompress the data and present it, but it
is usually a small price compared with the saving in storage cost.
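
If you want to see that trade-off for yourself, compare the size of the raw
staging files with the size of the ORC table directory. A rough sketch,
assuming the default warehouse location and the default database (adjust the
paths to your setup):

hdfs dfs -du -h /data/stg/accounts/nw/10124772
# warehouse path below assumes defaults; yours may differ
hdfs dfs -du -h /user/hive/warehouse/t2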

HTH

Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 6 June 2016 at 23:18, Igor Kravzov <igork.ine...@gmail.com> wrote:

> Mich, will Hive automatically detect and unzip zipped files? Or is there a
> special option in the table configuration?
> it will affect performance, correct?
>
> On Mon, Jun 6, 2016 at 4:14 PM, Mich Talebzadeh <mich.talebza...@gmail.com
> > wrote:
>
>> Hi Sandeep.
>>
>> I tend to use Hive external tables as staging tables but still I will
>> require write access to hdfs.
>>
>> Zip files work OK as well. For example our CSV files are zipped using
>> bzip2 to save space
>>
>> However, as a temporary workaround you can disable permission checking in
>> $HADOOP_HOME/etc/hadoop/hdfs-site.xml
>>
>> <property>
>>     <name>dfs.permissions</name>
>>     <value>false</value>
>> </property>
>>
>> There are other ways as well.
>>
>> Check this
>>
>> http://stackoverflow.com/questions/11593374/permission-denied-at-hdfs
>>
>> HTH
>>
>>
>>
>>
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 6 June 2016 at 21:00, Igor Kravzov <igork.ine...@gmail.com> wrote:
>>
>>> I see file are with extension .gz. Are these zipped?
>>> Did you try with unzipped files?
>>> Maybe in order to read the data hive needs to unzip files but does not
>>> have write permission?
>>> Just a wild guess...
>>>
>>> On Tue, May 31, 2016 at 4:20 AM, Sandeep Giri <sand...@cloudxlab.com>
>>> wrote:
>>>
>>>> Yes, when I run hadoop fs it gives results correctly.
>>>>
>>>> hadoop fs -ls /data/SentimentFiles/SentimentFiles/upload/data/tweets_raw/
>>>> Found 30 items
>>>> -rw-r--r--   3 hdfs hdfs       6148 2015-12-04 15:19 /data/SentimentFiles/SentimentFiles/upload/data/tweets_raw/.DS_Store
>>>> -rw-r--r--   3 hdfs hdfs     803323 2015-12-04 15:19 /data/SentimentFiles/SentimentFiles/upload/data/tweets_raw/FlumeData.1367523670393.gz
>>>> -rw-r--r--   3 hdfs hdfs     284355 2015-12-04 15:19 /data/SentimentFiles/SentimentFiles/upload/data/tweets_raw/FlumeData.1367523670394.gz
>>>> ....
>>>>
>>>>
>>>>
>>>>
>>>> On Tue, May 31, 2016 at 1:42 PM, Mich Talebzadeh <
>>>> mich.talebza...@gmail.com> wrote:
>>>>
>>>>> is this location correct and valid?
>>>>>
>>>>> LOCATION '/data/SentimentFiles/SentimentFiles/upload/data/tweets_raw/'
>>>>>
>>>>> Dr Mich Talebzadeh
>>>>>
>>>>>
>>>>>
>>>>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>>
>>>>>
>>>>>
>>>>> http://talebzadehmich.wordpress.com
>>>>>
>>>>>
>>>>>
>>>>> On 31 May 2016 at 08:50, Sandeep Giri <sand...@cloudxlab.com> wrote:
>>>>>
>>>>>> Hi Hive Team,
>>>>>>
>>>>>> As per my understanding, in Hive, you can create two kinds of tables:
>>>>>> Managed and External.
>>>>>>
>>>>>> In case of managed table, you own the data and hence when you drop
>>>>>> the table the data is deleted.
>>>>>>
>>>>>> In case of external table, you don't have ownership of the data and
>>>>>> hence when you delete such a table, the underlying data is not deleted.
>>>>>> Only metadata is deleted.
>>>>>>
>>>>>> Now, recently i have observed that you can not create an external
>>>>>> table over a location on which you don't have write (modification)
>>>>>> permissions in HDFS. I completely fail to understand this.
>>>>>>
>>>>>> Use case: It is quite common that the data you are churning is huge
>>>>>> and read-only. So, to churn such data via Hive, will you have to copy 
>>>>>> this
>>>>>> huge data to a location on which you have write permissions?
>>>>>>
>>>>>> Please help.
>>>>>>
>>>>>> My data is located in a hdfs folder
>>>>>> (/data/SentimentFiles/SentimentFiles/upload/data/tweets_raw/)  on which I
>>>>>> only have readonly permission. And I am trying to execute the following
>>>>>> command
>>>>>>
>>>>>> CREATE EXTERNAL TABLE tweets_raw (
>>>>>>         id BIGINT,
>>>>>>         created_at STRING,
>>>>>>         source STRING,
>>>>>>         favorited BOOLEAN,
>>>>>>         retweet_count INT,
>>>>>>         retweeted_status STRUCT<
>>>>>>         text:STRING,
>>>>>>         users:STRUCT<screen_name:STRING,name:STRING>>,
>>>>>>         entities STRUCT<
>>>>>>         urls:ARRAY<STRUCT<expanded_url:STRING>>,
>>>>>>         user_mentions:ARRAY<STRUCT<screen_name:STRING,name:STRING>>,
>>>>>>         hashtags:ARRAY<STRUCT<text:STRING>>>,
>>>>>>         text STRING,
>>>>>>         user1 STRUCT<
>>>>>>         screen_name:STRING,
>>>>>>         name:STRING,
>>>>>>         friends_count:INT,
>>>>>>         followers_count:INT,
>>>>>>         statuses_count:INT,
>>>>>>         verified:BOOLEAN,
>>>>>>         utc_offset:STRING, -- was INT but nulls are strings
>>>>>>         time_zone:STRING>,
>>>>>>         in_reply_to_screen_name STRING,
>>>>>>         year int,
>>>>>>         month int,
>>>>>>         day int,
>>>>>>         hour int
>>>>>>         )
>>>>>>         ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
>>>>>>         WITH SERDEPROPERTIES ("ignore.malformed.json" = "true")
>>>>>>         LOCATION '/data/SentimentFiles/SentimentFiles/upload/data/tweets_raw/'
>>>>>>         ;
>>>>>>
>>>>>> It throws the following error:
>>>>>>
>>>>>> FAILED: Execution Error, return code 1 from
>>>>>> org.apache.hadoop.hive.ql.exec.DDLTask.
>>>>>> MetaException(message:java.security.AccessControlException: Permission
>>>>>> denied: user=sandeep, access=WRITE,
>>>>>> inode="/data/SentimentFiles/SentimentFiles/upload/data/tweets_raw":hdfs:hdfs:drwxr-xr-x
>>>>>>         at
>>>>>> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:319)
>>>>>>         at
>>>>>> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:219)
>>>>>>         at
>>>>>> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:190)
>>>>>>         at
>>>>>> org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1771)
>>>>>>         at
>>>>>> org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1755)
>>>>>>         at
>>>>>> org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPathAccess(FSDirectory.java:1729)
>>>>>>         at
>>>>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkAccess(FSNamesystem.java:8348)
>>>>>>         at
>>>>>> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.checkAccess(NameNodeRpcServer.java:1978)
>>>>>>         at
>>>>>> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.checkAccess(ClientNamenodeProtocolServerSideTranslatorPB.ja
>>>>>> va:1443)
>>>>>>         at
>>>>>> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProto
>>>>>> s.java)
>>>>>>         at
>>>>>> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
>>>>>>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
>>>>>>         at
>>>>>> org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2151)
>>>>>>         at
>>>>>> org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2147)
>>>>>>         at java.security.AccessController.doPrivileged(Native Method)
>>>>>>         at javax.security.auth.Subject.doAs(Subject.java:422)
>>>>>>         at
>>>>>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
>>>>>>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2145)
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Regards,
>>>>>> Sandeep Giri,
>>>>>> +1-(347) 781-4573 (US)
>>>>>> +91-953-899-8962 (IN)
>>>>>> www.CloudxLab.com  (A Hadoop cluster for practicing)
>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Regards,
>>>> Sandeep Giri,
>>>> +1-(347) 781-4573 (US)
>>>> +91-953-899-8962 (IN)
>>>> www.CloudxLab.com
>>>>
>>>
>>>
>>
>
