Hive requires you to provide a table schema even when you create a table over a folder of ORC files (an ORC file already carries its schema internally). That is ironic, because ORC was originally a Hive-internal project.
Spark can create a table from ORC or Parquet files automatically, without asking you to provide a schema:

val myTableDf = getSqlContext.read.format("orc").load("s3n://mycompany/hive/warehouse/my_table_orc")
myTableDf.printSchema()
myTableDf.show()

You do not even need to register a table to run SQL-based ETL against the files (added in Spark 1.6):

SPARK-11197 "SQL Queries on Files" - concise syntax for running SQL queries over files of any supported format without registering a table.
https://issues.apache.org/jira/browse/SPARK-11197

I think it is now clearer why so many companies are moving to Spark for their ETL.

On Fri, Jan 15, 2016 at 3:06 PM, Alexander Pivovarov <apivova...@gmail.com> wrote:

> Probably Bryan can try both Hive and Spark and decide which one works
> better for him.
>
> The fact is, lots of companies are migrating from Hadoop/Hive to Spark.
>
> If you like writing ETL with the Spark API, you can use map, reduceByKey,
> groupByKey, join, distinct, etc.
> If you prefer SQL, you can run sqlContext.sql("select ...").
> In addition to SQL, you can also use the DataFrame API.
>
> Hive only allows you to use SQL.
>
> BTW, most Hive UDFs are available in Spark, and Spark also lets you create
> a UDF on the fly right in your script, e.g.
>
> sqlContext.udf.register("cube", (in: java.lang.Long) => in * in * in)
>
> sqlContext.sql("select cube(4) c").show()
>
> +---+
> |  c|
> +---+
> | 64|
> +---+
>
> On Fri, Jan 15, 2016 at 3:03 PM, Mich Talebzadeh <m...@peridale.co.uk>
> wrote:
>
>> Ok, but I believe there are other, similar approaches.
>>
>> I can take a raw csv file and condition it with existing shell commands
>> such as sed, awk, cut, and grep, for example getting rid of blank lines
>> or replacing stray characters.
>>
>> Bottom line: I want to "eventually" store that csv file in a Hive table,
>> in a format that I can run SQL queries against.
>>
>> Is that a viable alternative?
>> Thanks
>>
>> Dr Mich Talebzadeh
>>
>> LinkedIn:
>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>
>> Sybase ASE 15 Gold Medal Award 2008
>> A Winning Strategy: Running the most Critical Financial Data on ASE 15
>> http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf
>> Author of the book "A Practitioner's Guide to Upgrading to Sybase ASE 15",
>> ISBN 978-0-9563693-0-7.
>> Co-author of "Sybase Transact SQL Guidelines Best Practices",
>> ISBN 978-0-9759693-0-4
>> Publications due shortly:
>> "Complex Event Processing in Heterogeneous Environments", ISBN: 978-0-9563693-3-8
>> "Oracle and Sybase, Concepts and Contrasts", ISBN: 978-0-9563693-1-4,
>> volume one out shortly
>>
>> http://talebzadehmich.wordpress.com
>>
>> NOTE: The information in this email is proprietary and confidential. This
>> message is for the designated recipient only; if you are not the intended
>> recipient, you should destroy it immediately. Any information in this
>> message shall not be understood as given or endorsed by Peridale
>> Technology Ltd, its subsidiaries or their employees, unless expressly so
>> stated. It is the responsibility of the recipient to ensure that this
>> email is virus free; therefore neither Peridale Technology Ltd, its
>> subsidiaries nor their employees accept any responsibility.
>>
>> From: Marcin Tustin [mailto:mtus...@handybook.com]
>> Sent: 15 January 2016 21:51
>> To: user@hive.apache.org
>> Subject: Re: Loading data containing newlines
>>
>> You can open a file as an RDD of lines and map whatever custom
>> tokenisation function you want over it; alternatively, you can partition
>> down to a reasonable size and use map_partitions to map the standard
>> Python csv parser over the partitions.
>> In general, the advantage of Spark is that you can do anything you like,
>> rather than being limited to a specific set of primitives.
>>
>> On Fri, Jan 15, 2016 at 4:42 PM, Mich Talebzadeh <m...@peridale.co.uk>
>> wrote:
>>
>> Hi Marcin,
>>
>> Can you be specific about the ways in which Spark is better suited to
>> this operation than Hive?
>>
>> Dr Mich Talebzadeh
>> From: Marcin Tustin [mailto:mtus...@handybook.com]
>> Sent: 15 January 2016 21:39
>> To: user@hive.apache.org
>> Subject: Re: Loading data containing newlines
>>
>> I second this. I've generally found anything else to be disappointing
>> when working with data that is at all funky.
>>
>> On Wed, Jan 13, 2016 at 8:13 PM, Alexander Pivovarov
>> <apivova...@gmail.com> wrote:
>>
>> Time to use Spark and Spark SQL in addition to Hive? It's probably going
>> to happen sooner or later anyway.
>>
>> I sent you a Spark solution yesterday. (You just need to write an
>> unbzip2AndCsvToListOfArrays(file: String): List[Array[String]] function
>> using BZip2CompressorInputStream and the Super CSV API.)
>>
>> You can download Spark, open spark-shell, and run/debug the program on a
>> single computer, then run it on a cluster if needed (e.g. Amazon EMR can
>> spin up a Spark cluster in about 7 minutes).
>>
>> On Wed, Jan 13, 2016 at 4:13 PM, Gerber, Bryan W <bryan.ger...@pnnl.gov>
>> wrote:
>>
>> 1. hdfs dfs -copyFromLocal /incoming/files/*.bz2 hdfs://host.name/data/stg/table/
>> 2. CREATE EXTERNAL TABLE stg_<table> (cols...) ROW FORMAT SERDE
>>    'org.apache.hadoop.hive.serde2.OpenCSVSerde' STORED AS TEXTFILE
>>    LOCATION '/data/stg/table/'
>> 3. CREATE TABLE <table> (cols...) STORED AS ORC
>>    TBLPROPERTIES ("orc.compress"="ZLIB");
>> 4. INSERT INTO TABLE <table> SELECT cols, udf1(cola), udf2(colb),
>>    functions(), etc. FROM stg_<table>
>> 5. Delete the files from hdfs://host.name/data/stg/table/
>>
>> This had been working quite well, until our newest data, which contains
>> fields with embedded newlines.
>>
>> We are now looking into options further up the pipeline, to see if we
>> can condition the data earlier in the process.
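[Editor's note: Alexander's suggested decompress-and-parse helper (the unbzip2AndCsvToListOfArrays sketch above, built on BZip2CompressorInputStream and Super CSV) has a compact standard-library equivalent in plain Python; the file name and data below are made up for illustration.]

```python
import bz2
import csv
import os
import tempfile

def bz2_csv_records(path):
    """Yield parsed rows from a bz2-compressed CSV file.

    bz2.open decompresses on the fly, and the csv module handles
    quoted fields (including embedded newlines), so the file never
    has to be unpacked or pre-conditioned on disk.
    """
    with bz2.open(path, mode="rt", newline="") as fh:
        for row in csv.reader(fh):
            yield row

# Round-trip demo with made-up data; one field spans two lines.
payload = '1,"first line\nsecond line"\n2,plain\n'
path = os.path.join(tempfile.mkdtemp(), "demo.csv.bz2")
with open(path, "wb") as out:
    out.write(bz2.compress(payload.encode("utf-8")))

records = list(bz2_csv_records(path))
# The embedded newline survives as part of a single field.
```

Because the parser is quote-aware, the newline inside the first record stays part of one field; the Super CSV or Spark version of this pipeline would rely on exactly the same property.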
>> From: Mich Talebzadeh [mailto:m...@peridale.co.uk]
>> Sent: Wednesday, January 13, 2016 10:34 AM
>> To: user@hive.apache.org
>> Subject: RE: Loading data containing newlines
>>
>> Thanks Bryan.
>>
>> Just to clarify, do you use something like the below?
>>
>> 1. hdfs dfs -copyFromLocal /var/tmp/t.bcp hdfs://rhes564.hedat.net:9000/misc/t.bcp
>> 2. CREATE EXTERNAL TABLE <table> (col1 INT, col2 STRING, ...)
>>    COMMENT 'load from bcp file'
>>    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
>>    STORED AS ORC
>>
>> Cheers,
>>
>> Dr Mich Talebzadeh
>> From: Gerber, Bryan W [mailto:bryan.ger...@pnnl.gov]
>> Sent: 13 January 2016 18:12
>> To: user@hive.apache.org
>> Subject: RE: Loading data containing newlines
>>
>> We are pushing the compressed text files into an HDFS directory for the
>> Hive EXTERNAL table, then using an INSERT into the ORC-storage table. We
>> are letting Hive handle the ORC file creation process.
>>
>> From: Mich Talebzadeh [mailto:m...@peridale.co.uk]
>> Sent: Tuesday, January 12, 2016 4:41 PM
>> To: user@hive.apache.org
>> Subject: RE: Loading data containing newlines
>>
>> Hi Bryan,
>>
>> As a matter of interest, are you loading the text files into local
>> directories in encrypted form at all, and then pushing them into
>> HDFS/Hive as ORC?
>>
>> Thanks
>>
>> Dr Mich Talebzadeh
>> From: Gerber, Bryan W [mailto:bryan.ger...@pnnl.gov]
>> Sent: 12 January 2016 17:41
>> To: user@hive.apache.org
>> Subject: Loading data containing newlines
>>
>> We are attempting to load CSV text files (compressed to bz2) containing
>> newlines in fields, using EXTERNAL tables and INSERT/SELECT into
>> ORC-format tables. Data volume is ~1 TB/day, so we are really trying to
>> avoid unpacking the files just to condition the data.
>>
>> A few days of research has us ready to implement custom input/output
>> formats to handle the ingest. Any other suggestions that might be less
>> effort, with low impact on load times?
>>
>> Thanks,
>> Bryan G.
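[Editor's note: the failure mode in Bryan's question is easy to reproduce. A line-oriented reader, which is how a TEXTFILE table splits records, tears a quoted field apart at the embedded newline, while a quote-aware CSV parser keeps it whole. A minimal sketch with made-up data:]

```python
import csv
import io

# Made-up two-column data; the "note" field of row 1 contains a newline.
raw = 'id,note\n1,"line one\nline two"\n2,plain\n'

# Naive newline splitting tears the quoted field apart: four "lines".
naive_rows = raw.strip().split("\n")

# A quote-aware parser keeps the field intact: three records
# (header plus two data rows).
parsed_rows = list(csv.reader(io.StringIO(raw)))

print(len(naive_rows))    # 4
print(len(parsed_rows))   # 3
```

This is the crux of the thread: record boundaries in such data cannot be recovered by newline splitting alone, which is what motivates either a custom input format in Hive or a quote-aware parser upstream.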
>> Want to work at Handy? Check out our culture deck and open roles
>> <http://www.handy.com/careers>
>> Latest news <http://www.handy.com/press> at Handy
>> Handy just raised $50m
>> <http://venturebeat.com/2015/11/02/on-demand-home-service-handy-raises-50m-in-round-led-by-fidelity/>
>> led by Fidelity