Hive requires you to provide a table schema even when you create a table over a folder of ORC files (an ORC file already carries its schema internally). That is ironic, because ORC was originally a Hive-internal project.
Spark can create a table from ORC or Parquet files automatically, without asking you to provide a schema:

val myTableDf = getSqlContext.read.format("orc").load("s3n://mycompany/hive/warehouse/my_table_orc")
myTableDf.printSchema()
myTableDf.show()

You do not even need to register a table to run SQL-based ETL against the files (added in Spark 1.6):

SPARK-11197 "SQL Queries on Files" - concise syntax for running SQL queries over files of any supported format without registering a table.
https://issues.apache.org/jira/browse/SPARK-11197

I think it is now clearer why so many companies are moving to Spark for their ETL.

On Fri, Jan 15, 2016 at 3:06 PM, Alexander Pivovarov <apivova...@gmail.com> wrote:

> Probably Bryan can try both Hive and Spark and decide which one works
> better for him.
>
> The fact is, lots of companies are migrating from Hadoop/Hive to Spark.
>
> If you like writing ETL with the Spark API, you can use map, reduceByKey,
> groupByKey, join, distinct, etc.
> If you prefer SQL, you can run sqlContext.sql("select ...").
> In addition to SQL, you can also use the DataFrame API.
>
> Hive only allows you to use SQL.
>
> BTW, most Hive UDFs are available in Spark, and Spark also lets you create
> a UDF on the fly right in your script, e.g.
>
> sqlContext.udf.register("cube", (in: java.lang.Long) => in * in * in)
>
> sqlContext.sql("select cube(4) c").show()
>
> +---+
> |  c|
> +---+
> | 64|
> +---+
>
> On Fri, Jan 15, 2016 at 3:03 PM, Mich Talebzadeh <m...@peridale.co.uk>
> wrote:
>
>> Ok, but I believe there are other, similar approaches.
>>
>> I can take a raw csv file and condition it with existing shell commands
>> such as sed, awk, cut, and grep, for example getting rid of blank lines
>> or replacing stray characters.
>>
>> Bottom line: I want to "eventually" store that csv file in a Hive table,
>> in a format that I can run SQL queries against.
>>
>> Is that a viable alternative?
>> Thanks
>>
>> Dr Mich Talebzadeh
>>
>> LinkedIn:
>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>
>> Sybase ASE 15 Gold Medal Award 2008
>> A Winning Strategy: Running the most Critical Financial Data on ASE 15
>> http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf
>> Author of the book "A Practitioner's Guide to Upgrading to Sybase ASE 15",
>> ISBN 978-0-9563693-0-7.
>> Co-author of "Sybase Transact SQL Guidelines Best Practices",
>> ISBN 978-0-9759693-0-4
>> Publications due shortly:
>> "Complex Event Processing in Heterogeneous Environments", ISBN: 978-0-9563693-3-8
>> "Oracle and Sybase, Concepts and Contrasts", ISBN: 978-0-9563693-1-4,
>> volume one out shortly
>>
>> http://talebzadehmich.wordpress.com
>>
>> NOTE: The information in this email is proprietary and confidential. This
>> message is for the designated recipient only; if you are not the intended
>> recipient, you should destroy it immediately. Any information in this
>> message shall not be understood as given or endorsed by Peridale
>> Technology Ltd, its subsidiaries or their employees, unless expressly so
>> stated. It is the responsibility of the recipient to ensure that this
>> email is virus free; therefore neither Peridale Technology Ltd, its
>> subsidiaries nor their employees accept any responsibility.
>>
>> From: Marcin Tustin [mailto:mtus...@handybook.com]
>> Sent: 15 January 2016 21:51
>> To: user@hive.apache.org
>> Subject: Re: Loading data containing newlines
>>
>> You can open a file as an RDD of lines and map whatever custom
>> tokenisation function you want over it; alternatively, you can partition
>> down to a reasonable size and use map_partitions to map the standard
>> Python csv parser over the partitions.
>> In general, the advantage of Spark is that you can do anything you like,
>> rather than being limited to a specific set of primitives.
>>
>> On Fri, Jan 15, 2016 at 4:42 PM, Mich Talebzadeh <m...@peridale.co.uk>
>> wrote:
>>
>> Hi Marcin,
>>
>> Can you be specific about the ways in which Spark is better suited to
>> this operation than Hive?
>>
>> Dr Mich Talebzadeh
>> From: Marcin Tustin [mailto:mtus...@handybook.com]
>> Sent: 15 January 2016 21:39
>> To: user@hive.apache.org
>> Subject: Re: Loading data containing newlines
>>
>> I second this. I've generally found anything else to be disappointing
>> when working with data that is at all funky.
>>
>> On Wed, Jan 13, 2016 at 8:13 PM, Alexander Pivovarov
>> <apivova...@gmail.com> wrote:
>>
>> Time to use Spark and Spark SQL in addition to Hive? It's probably going
>> to happen sooner or later anyway.
>>
>> I sent you a Spark solution yesterday. (You just need to write an
>> unbzip2AndCsvToListOfArrays(file: String): List[Array[String]] function
>> using BZip2CompressorInputStream and the Super CSV API.)
>>
>> You can download Spark, open spark-shell, and run/debug the program on a
>> single computer, then run it on a cluster if needed (e.g. Amazon EMR can
>> spin up a Spark cluster in about 7 minutes).
>>
>> On Wed, Jan 13, 2016 at 4:13 PM, Gerber, Bryan W <bryan.ger...@pnnl.gov>
>> wrote:
>>
>> 1. hdfs dfs -copyFromLocal /incoming/files/*.bz2 hdfs://host.name/data/stg/table/
>> 2. CREATE EXTERNAL TABLE stg_<table> (cols...) ROW FORMAT SERDE
>>    'org.apache.hadoop.hive.serde2.OpenCSVSerde' STORED AS TEXTFILE
>>    LOCATION '/data/stg/table/'
>> 3. CREATE TABLE <table> (cols...) STORED AS ORC
>>    TBLPROPERTIES ("orc.compress"="ZLIB");
>> 4. INSERT INTO TABLE <table> SELECT cols, udf1(cola), udf2(colb),
>>    functions(), etc. FROM stg_<table>
>> 5. Delete the files from hdfs://host.name/data/stg/table/
>>
>> This had been working quite well, until our newest data, which contains
>> fields with embedded newlines.
>>
>> We are now looking into options further up the pipeline, to see if we
>> can condition the data earlier in the process.
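[Editor's note: Alexander's suggested decompress-and-parse helper (the unbzip2AndCsvToListOfArrays sketch above, built on BZip2CompressorInputStream and Super CSV) has a compact standard-library equivalent in plain Python; the file name and data below are made up for illustration.]

```python
import bz2
import csv
import os
import tempfile

def bz2_csv_records(path):
    """Yield parsed rows from a bz2-compressed CSV file.

    bz2.open decompresses on the fly, and the csv module handles
    quoted fields (including embedded newlines), so the file never
    has to be unpacked or pre-conditioned on disk.
    """
    with bz2.open(path, mode="rt", newline="") as fh:
        for row in csv.reader(fh):
            yield row

# Round-trip demo with made-up data; one field spans two lines.
payload = '1,"first line\nsecond line"\n2,plain\n'
path = os.path.join(tempfile.mkdtemp(), "demo.csv.bz2")
with open(path, "wb") as out:
    out.write(bz2.compress(payload.encode("utf-8")))

records = list(bz2_csv_records(path))
# The embedded newline survives as part of a single field.
```

Because the parser is quote-aware, the newline inside the first record stays part of one field; the Super CSV or Spark version of this pipeline would rely on exactly the same property.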
>> From: Mich Talebzadeh [mailto:m...@peridale.co.uk]
>> Sent: Wednesday, January 13, 2016 10:34 AM
>> To: user@hive.apache.org
>> Subject: RE: Loading data containing newlines
>>
>> Thanks Bryan.
>>
>> Just to clarify, do you use something like the below?
>>
>> 1. hdfs dfs -copyFromLocal /var/tmp/t.bcp hdfs://rhes564.hedat.net:9000/misc/t.bcp
>> 2. CREATE EXTERNAL TABLE <table> (col1 INT, col2 STRING, ...)
>>    COMMENT 'load from bcp file'
>>    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
>>    STORED AS ORC
>>
>> Cheers,
>>
>> Dr Mich Talebzadeh
>> From: Gerber, Bryan W [mailto:bryan.ger...@pnnl.gov]
>> Sent: 13 January 2016 18:12
>> To: user@hive.apache.org
>> Subject: RE: Loading data containing newlines
>>
>> We are pushing the compressed text files into an HDFS directory for the
>> Hive EXTERNAL table, then using an INSERT into the ORC-storage table. We
>> are letting Hive handle the ORC file creation process.
>>
>> From: Mich Talebzadeh [mailto:m...@peridale.co.uk]
>> Sent: Tuesday, January 12, 2016 4:41 PM
>> To: user@hive.apache.org
>> Subject: RE: Loading data containing newlines
>>
>> Hi Bryan,
>>
>> As a matter of interest, are you loading the text files into local
>> directories in encrypted form at all, and then pushing them into
>> HDFS/Hive as ORC?
>>
>> Thanks
>>
>> Dr Mich Talebzadeh
>> From: Gerber, Bryan W [mailto:bryan.ger...@pnnl.gov]
>> Sent: 12 January 2016 17:41
>> To: user@hive.apache.org
>> Subject: Loading data containing newlines
>>
>> We are attempting to load CSV text files (compressed to bz2) containing
>> newlines in fields, using EXTERNAL tables and INSERT/SELECT into
>> ORC-format tables. Data volume is ~1 TB/day, so we are really trying to
>> avoid unpacking the files just to condition the data.
>>
>> A few days of research has us ready to implement custom input/output
>> formats to handle the ingest. Any other suggestions that might be less
>> effort, with low impact on load times?
>>
>> Thanks,
>> Bryan G.
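[Editor's note: the failure mode in Bryan's question is easy to reproduce. A line-oriented reader, which is how a TEXTFILE table splits records, tears a quoted field apart at the embedded newline, while a quote-aware CSV parser keeps it whole. A minimal sketch with made-up data:]

```python
import csv
import io

# Made-up two-column data; the "note" field of row 1 contains a newline.
raw = 'id,note\n1,"line one\nline two"\n2,plain\n'

# Naive newline splitting tears the quoted field apart: four "lines".
naive_rows = raw.strip().split("\n")

# A quote-aware parser keeps the field intact: three records
# (header plus two data rows).
parsed_rows = list(csv.reader(io.StringIO(raw)))

print(len(naive_rows))    # 4
print(len(parsed_rows))   # 3
```

This is the crux of the thread: record boundaries in such data cannot be recovered by newline splitting alone, which is what motivates either a custom input format in Hive or a quote-aware parser upstream.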
>> Want to work at Handy? Check out our culture deck and open roles
>> <http://www.handy.com/careers>
>> Latest news <http://www.handy.com/press> at Handy
>> Handy just raised $50m
>> <http://venturebeat.com/2015/11/02/on-demand-home-service-handy-raises-50m-in-round-led-by-fidelity/>
>> led by Fidelity